Azure Api Example

Use Azure API with Llama 3.1

This notebook shows examples of how to use Llama 3.1 APIs offered by Microsoft Azure. We will cover:

  • HTTP requests API usage for Llama 3.1 instruct models in CLI
  • HTTP requests API usage for Llama 3.1 instruct models in Python
  • Plug the APIs into LangChain
  • Wire the model with Gradio to build a simple chatbot with memory

Prerequisites

Before we start building with Azure Llama 3.1 APIs, there are certain steps we need to take to deploy the models:

  • Register for a valid Azure account with a subscription.
  • Read the introductory article on Azure AI Studio and follow its link to the Studio website.
  • Follow the demos in the article to create a project and a resource group.
  • For Llama 3.1 instruct models from the Model catalog, click Deploy on the model page and select "Serverless API with Azure AI Content Safety". Once deployment succeeds, you will be assigned an API endpoint and a security key for inference.
  • For Llama 3.1 pretrained models, Azure currently supports only manual deployment under a regular subscription. This means you will need to acquire a virtual machine with a managed compute resource. We won't cover that in this tutorial.

For more information, consult Azure's official documentation on model deployment and inference.

HTTP Requests API Usage in CLI

Basics

The usage and schema of the API are identical to those of the Llama 3 API hosted on Azure.

To use the REST API, you need an endpoint URL and the authentication key associated with that endpoint.
Both are available once the deployment steps above are complete.

In this chat completion example for the instruct model, we use a simple curl call for illustration. There are three major components:

  • The host-url is your endpoint URL with the completion schema appended.
  • The headers define the content type and carry your API key.
  • The payload, or data, contains your prompt details and the model hyperparameters.

The host-url must end with /v1/chat/completions, and the request payload must give a role for each message in the conversation. Here is a sample payload:

{
  "messages": [
    {
      "content": "You are a helpful assistant.",
      "role": "system"
    },
    {
      "content": "Hello!",
      "role": "user"
    }
  ],
  "max_tokens": 50
}

Here is a sample curl call for chat completion:

[ ]
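
The three components described above can also be assembled programmatically. The following is a minimal Python sketch that prints an equivalent curl command. The endpoint and key are placeholders, and sending the key as a bearer token in the Authorization header is an assumption, so check the details shown for your own deployment.

```python
import json
import shlex

# Placeholders (hypothetical): replace with the values from your deployment.
ENDPOINT = "https://your-endpoint.example.com"
API_KEY = "your-auth-key"


def build_curl_command(endpoint: str, api_key: str, payload: dict) -> str:
    """Return a shell-safe curl command for a chat completion request."""
    # 1. host-url: the endpoint plus the chat completion path
    host_url = endpoint.rstrip("/") + "/v1/chat/completions"
    args = [
        "curl", "-X", "POST", "-L", host_url,
        # 2. headers: content type and API key (bearer format is an assumption)
        "-H", "Content-Type: application/json",
        "-H", f"Authorization: Bearer {api_key}",
        # 3. payload: messages with roles, plus model hyperparameters
        "-d", json.dumps(payload),
    ]
    return shlex.join(args)


payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "max_tokens": 50,
}

print(build_curl_command(ENDPOINT, API_KEY, payload))
```

Building the payload with json.dumps avoids hand-written JSON mistakes such as trailing commas, and shlex.join takes care of shell quoting.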

Streaming

The API also supports streaming.
With streaming, generated tokens are sent as data-only server-sent events as soon as they become available.
This matters for interactive applications such as chatbots, where the user should see output right away instead of waiting for the full response.

To use streaming, set "stream":true in the request payload.
In streaming mode, the REST API response has a different format from the non-streaming response.

Here is an example:

[ ]

As you can see, the result comes back as a stream of data objects, each of which contains generated information, including a choice.
The stream is terminated by a data:[DONE]\n\n message.
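
To make the event format concrete, here is a minimal Python sketch that extracts the generated text from a captured stream. It assumes each data object follows an OpenAI-style chunk layout with the text under choices[].delta.content. That field layout is an assumption and the sample events are invented for illustration.

```python
import json


def collect_stream_text(raw_stream: str) -> str:
    """Concatenate the generated text from a server-sent event stream."""
    pieces = []
    # Events are separated by a blank line.
    for event in raw_stream.split("\n\n"):
        event = event.strip()
        if not event.startswith("data:"):
            continue  # skip empty separators and non-data lines
        data = event[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            # Assumed schema: incremental text lives in choice["delta"]["content"].
            text = (choice.get("delta") or {}).get("content")
            if text:
                pieces.append(text)
    return "".join(pieces)


# Invented sample events, for illustration only.
sample = (
    'data: {"choices": [{"delta": {"content": "Hello"}}]}\n\n'
    'data: {"choices": [{"delta": {"content": " there"}}]}\n\n'
    "data: [DONE]\n\n"
)
print(collect_stream_text(sample))
```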

Content Safety Filtering

If you enabled content filtering during deployment, Azure Llama 3.1 API endpoints have the content safety feature turned on. Both the input prompt and the output tokens are filtered by this service automatically.
To learn how this affects the request and response payloads, refer to the official guide.

If the filter detects harmful content in the model input or output, the generation errors out and the response carries additional information about the failure.

If you disabled content filtering during deployment, the Llama models still have content safety built in. The model will refuse to answer if it detects harmful content.

Here is an example prompt that triggered content safety filtering:

[ ]
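
When a request is rejected, it helps to print whatever the service returns. The sketch below formats a failed response without assuming any particular error schema. The status code and body in the demo call are hypothetical.

```python
import json


def describe_api_error(status: int, body: bytes) -> str:
    """Summarize a failed response without assuming a specific error schema."""
    text = body.decode("utf-8", errors="replace")
    try:
        # Pretty-print the body when it is JSON.
        detail = json.dumps(json.loads(text), indent=2)
    except json.JSONDecodeError:
        detail = text  # not JSON: show the raw body
    return f"Request failed with HTTP {status}:\n{detail}"


# Typical use with urllib (not executed here):
#   try:
#       urllib.request.urlopen(request)
#   except urllib.error.HTTPError as err:
#       print(describe_api_error(err.code, err.read()))

# Hypothetical status and body, for illustration only.
print(describe_api_error(400, b'{"error": {"message": "example"}}'))
```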

HTTP Requests API Usage in Python

Besides calling the API directly from command-line tools, you can also call it programmatically in Python.

Here is an example for the instruct model:

[ ]
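
A minimal sketch of such a call with the standard library's urllib.request follows. The endpoint and key are placeholders, and the bearer-token Authorization header is an assumption to verify against your deployment. The request is only built here, not sent.

```python
import json
import urllib.request


def build_chat_request(endpoint, api_key, messages, max_tokens=50):
    """Build (but do not send) a POST request for chat completion."""
    url = endpoint.rstrip("/") + "/v1/chat/completions"
    body = json.dumps({"messages": messages, "max_tokens": max_tokens}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # header format is an assumption
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send_chat_request(request, timeout=60):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Placeholder endpoint and key (hypothetical).
request = build_chat_request(
    "https://your-endpoint.example.com",
    "your-auth-key",
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(request.get_method(), request.full_url)
# result = send_chat_request(request)  # uncomment once the placeholders are real
```

Note that response.read() waits for the complete body, so adding "stream": true to the payload would not change how the result arrives with this approach.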

However, in this example the streamed content comes back as a single payload. It does not arrive as a series of data events as we wanted. To build true streaming on top of the API endpoint, we will use the requests library instead.

Streaming in Python

Requests is a simple HTTP library for Python built on urllib3. It handles keep-alive and HTTP connection pooling automatically. With the Session class, we can easily stream the result of our API calls.

Here is a quick example:

[ ]
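
The following sketch shows one way to consume the stream line by line through a Session. The function accepts any session object with a requests-compatible post method, which also keeps it easy to test. The bearer-token header and the choices[].delta.content layout of each chunk are assumptions.

```python
import json


def stream_chat(session, url, api_key, payload):
    """Yield pieces of generated text as the server sends them."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",  # header format is an assumption
    }
    body = dict(payload, stream=True)  # ask the API to stream its output
    # stream=True tells requests not to download the whole body up front.
    with session.post(url, headers=headers, json=body, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if not line.startswith(b"data:"):
                continue  # blank separator or keep-alive line
            data = line[len(b"data:"):].strip()
            if data == b"[DONE]":
                break
            chunk = json.loads(data)
            for choice in chunk.get("choices", []):
                # Assumed schema: text lives in choice["delta"]["content"].
                text = (choice.get("delta") or {}).get("content")
                if text:
                    yield text


# Usage (requires `pip install requests` and a real deployment):
#   import requests
#   with requests.Session() as session:
#       for piece in stream_chat(session, URL, KEY, {"messages": [...]}):
#           print(piece, end="", flush=True)
```

Because stream_chat is a generator, the caller can print each piece as it arrives, which is what produces the incremental effect in a chat UI.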

Use Llama 3.1 API with LangChain

In this section, we will demonstrate how to use Llama 3.1 APIs with LangChain, one of the most popular frameworks for building AI products.
A common approach is to create a customized LLM instance that you can add to various chains to complete different tasks.
In this example, we will use the AzureMLChatOnlineEndpoint class that LangChain provides to build a customized LLM instance. This class takes an Azure endpoint and API key as inputs and wraps them in HTTP calls. Under the hood, it works much like the earlier examples in which we used the urllib.request library to send REST calls to the Azure endpoint.

First, let's install dependencies:

[ ]

Once all dependencies are installed, you can create an llm instance directly from AzureMLChatOnlineEndpoint as follows:

[ ]
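
A minimal sketch of creating the instance, assuming a recent langchain-community release and a serverless deployment. The environment variable names are hypothetical, and the temperature value is only an example.

```python
import os


def read_endpoint_settings(env=None):
    """Read the endpoint URL and key from environment variables.

    The variable names are hypothetical; use whatever fits your setup.
    """
    env = os.environ if env is None else env
    return {
        "endpoint_url": env["AZURE_LLAMA_ENDPOINT_URL"],
        "endpoint_api_key": env["AZURE_LLAMA_API_KEY"],
    }


def build_llm(settings, temperature=0.8):
    """Create a chat model backed by the Azure serverless endpoint."""
    # Imported here so the module still loads where LangChain is not installed.
    from langchain_community.chat_models.azureml_endpoint import (
        AzureMLChatOnlineEndpoint,
        AzureMLEndpointApiType,
        CustomOpenAIChatContentFormatter,
    )

    return AzureMLChatOnlineEndpoint(
        # For serverless deployments the URL includes /v1/chat/completions.
        endpoint_url=settings["endpoint_url"],
        endpoint_api_type=AzureMLEndpointApiType.serverless,
        endpoint_api_key=settings["endpoint_api_key"],
        content_formatter=CustomOpenAIChatContentFormatter(),
        model_kwargs={"temperature": temperature},
    )


# Usage (requires a real deployment):
#   llm = build_llm(read_endpoint_settings())
#   print(llm.invoke("Say hello in one sentence.").content)
```

Reading the key from the environment keeps it out of the notebook itself.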

You might wonder what the CustomOpenAIChatContentFormatter is doing when the llm instance is created.
The CustomOpenAIChatContentFormatter is a handler class that transforms the request and response of an AzureML endpoint to match the required schema. Because the Azure model catalog contains many models, each may need its data handled differently.
In our case, we can use the default CustomOpenAIChatContentFormatter, which handles the Llama model schemas. If you need special handling, you can customize this class.

Once the llm is ready, you can run inference with it:

[ ]

Here is an example that creates a translator chain with the llm instance and translates English to French:

[ ]

Build a chatbot with Llama 3.1 API

In this section, we will build a simple chatbot using Azure Llama 3.1 API, LangChain and Gradio's ChatInterface with memory capability.

Gradio is a framework to help demo your machine learning model with a web interface. We also have a dedicated Gradio chatbot example built with Llama 3 on-premises with RAG.

First, let's install Gradio dependencies.

[ ]

Let's reuse the AzureMLChatOnlineEndpoint class from the previous example.
In this example, we have three major components:

  1. The chatbot UI, hosted as a web interface by Gradio. This is the UI logic that renders our model predictions.
  2. The model itself, which is the core component that ingests prompts and returns an answer.
  3. The memory component, which stores previous conversation context. In this example, we will use a conversation window buffer, which keeps only the most recent exchanges as context.

All of them are chained together using LangChain.

[ ]
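
To illustrate how the three components fit together, here is a dependency-free sketch. It stands in for LangChain's memory with a simple deque that keeps the last few exchanges, and it is not the notebook's implementation. The generate callable and the echo model are hypothetical placeholders for the Azure-backed llm.

```python
from collections import deque

SYSTEM_PROMPT = "You are a helpful assistant."


def make_respond(generate, window_size=5):
    """Build a chat callback that remembers the last `window_size` exchanges.

    `generate` is any callable that maps a list of role/content messages to a
    reply string, for example a thin wrapper around the Azure endpoint.
    """
    window = deque(maxlen=window_size)  # the oldest exchange is dropped automatically

    def respond(message, history):
        # `history` is supplied by the chat UI; this sketch keeps its own window.
        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
        for user_text, assistant_text in window:
            messages.append({"role": "user", "content": user_text})
            messages.append({"role": "assistant", "content": assistant_text})
        messages.append({"role": "user", "content": message})
        reply = generate(messages)
        window.append((message, reply))
        return reply

    return respond


def echo_model(messages):
    """Stand-in model: reports how much context it received."""
    return f"({len(messages)} messages seen) {messages[-1]['content']}"


respond = make_respond(echo_model, window_size=2)
for turn in ["one", "two", "three", "four"]:
    print(respond(turn, history=[]))

# To serve it with Gradio (requires `pip install gradio`):
#   import gradio as gr
#   gr.ChatInterface(fn=respond).launch()
```

The window is shared by every session in this sketch, which is fine for a local demo but would need to be kept per user in a real deployment.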

After the code above runs successfully, a chat interface should appear as the interactive output, or you can open the localhost URL in a browser window. As you can see, it takes only a few lines of code to build an AI chatbot.

This concludes our tutorial and examples. Here are some additional references: