In this exercise, you will learn how to use the Phi-3 ONNX Small-Language Model (SLM) as a sidecar for your application.
Phi-3 is a compact, efficient SLM designed to handle natural language processing tasks with minimal computational overhead. Unlike large language models (LLMs), SLMs like Phi-3 are optimized for scenarios with limited resources, providing quick and contextually relevant responses without high costs. Running Phi-3 on ONNX allows it to function effectively as a sidecar, bringing conversational AI capabilities to diverse environments, including Linux App Service. By integrating an SLM as a sidecar, developers can add sophisticated language features to applications, enhancing user engagement while maintaining operational efficiency.
In this exercise, you will implement a chat application that lets users inquire about products in your fashion store. Powered by Phi-3, the app will provide real-time responses, enhancing the customer experience with on-demand product information and styling suggestions.
- Open a browser and go to the Azure portal using the credentials provided.
- Click on App Service in the top navigation bar.
- From the list of apps, click on the Exercise-4 application.
- On the Overview page, you can view some properties of the app:
- The app runs on P2mv3, a Premium memory-optimized SKU offered by App Service. You can learn more about Premium SKUs in the App Service pricing documentation.
- This application is deployed with .NET 8.
- Click on Deployment Center in the left navigation.
- The application includes a Phi-3 sidecar as part of its setup.
In your lab fork, navigate to the Exercise 4 folder. Inside, you will find two projects:
- dotnetfashionassistant: This is the frontend Blazor application.
- phi-3-sidecar: A Python FastAPI app that exposes an endpoint for invoking the Phi-3 ONNX model, configured to run as a sidecar container. With the Sidecar feature, the main application and the sidecar can run on different language stacks.
- Open the phi-3-sidecar project.
- Navigate to model_api.py:
- Initialize the model in the constructor:
```python
import onnxruntime_genai as og  # from the onnxruntime-genai package

model_path = "/app/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"
model = og.Model(model_path)
```
- Phi-3 ONNX is an offline model, accessible from the filesystem. Here, the 4K CPU-based model is utilized.
- Initialize the system message and prompt template:
```python
chat_template = '<|system|>\nYou are an AI assistant that helps people find concise information about products. Keep your responses brief and focus on key points. Limit the number of product key features to no more than three.<|end|>\n<|user|>\n{input} <|end|>\n<|assistant|>'
input = f"{data.user_message} Product: {data.product_name}. Description: {data.product_description}"
prompt = chat_template.format(input=input)
```
- The API method predict accepts three parameters – the user prompt, product name, and product description.
- Pass the prompt to the model and stream the response token by token (a consolidated sketch of the full flow follows this list):
```python
generator = og.Generator(model, params)
.....
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    generated_text += tokenizer_stream.decode(new_token)
    yield tokenizer_stream.decode(new_token)
```
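To see how these pieces fit together, here is a minimal consolidated sketch of the sidecar API, based on the excerpts above. The /predict route, the request-model field names (user_message, product_name, product_description), and the max_length value are assumptions inferred from the snippets in this exercise; the actual model_api.py in the lab repository may differ in detail.

```python
# Sketch only: route name, field names, and max_length are assumptions.
import onnxruntime_genai as og
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

# Load the offline model and tokenizer once at startup; the path matches
# where the Dockerfile copies the model into the image.
model_path = "/app/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4"
model = og.Model(model_path)
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

chat_template = '<|system|>\nYou are an AI assistant that helps people find concise information about products. Keep your responses brief and focus on key points. Limit the number of product key features to no more than three.<|end|>\n<|user|>\n{input} <|end|>\n<|assistant|>'

class PredictRequest(BaseModel):  # field names assumed from the snippet above
    user_message: str
    product_name: str
    product_description: str

@app.post("/predict")  # route name is an assumption
def predict(data: PredictRequest):
    user_input = f"{data.user_message} Product: {data.product_name}. Description: {data.product_description}"
    prompt = chat_template.format(input=user_input)

    # Configure generation; max_length here is an illustrative value.
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=1024)
    params.input_ids = tokenizer.encode(prompt)
    generator = og.Generator(model, params)

    # Stream the response token by token, as in the excerpt above.
    def token_stream():
        while not generator.is_done():
            generator.compute_logits()
            generator.generate_next_token()
            new_token = generator.get_next_tokens()[0]
            yield tokenizer_stream.decode(new_token)

    return StreamingResponse(token_stream(), media_type="text/plain")
```

Uvicorn serves this app on port 8000, which matches the port exposed in the Dockerfile below.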
- Open the Dockerfile:
- Use the Dockerfile to create the container image. In the Dockerfile, we copy the model into the image and expose the container port.
```dockerfile
# Step 6: Copy the entire directory containing the ONNX model and its data files
COPY ./Phi-3-mini-4k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4 /app/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4

# Step 7: Copy the API script into the container
COPY ./model_api.py /app/

# Step 8: Expose the port the app runs on
EXPOSE 8000

# Step 9: Run the API using Uvicorn
CMD ["uvicorn", "model_api:app", "--host", "0.0.0.0", "--port", "8000"]
```
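If you want to verify the sidecar locally before deploying, you can build and run the container and send it a test request. The snippet below is a hypothetical smoke test: the product values are made up, and the payload keys are assumed to match the sidecar's request model (note that the frontend code later in this exercise sends user_prompt, so confirm the field names against your model_api.py).

```python
import requests

# Hypothetical payload; field names assumed to match the sidecar's request model.
payload = {
    "user_message": "Tell me more about this shirt",
    "product_name": "Linen Summer Shirt",  # made-up product
    "product_description": "A lightweight, breathable linen shirt.",
}

# stream=True lets us read the token-by-token response as it arrives,
# the same way the Blazor frontend consumes it.
with requests.post("http://localhost:8000/predict", json=payload, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
```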
- Open the dotnetfashionassistant project in VS Code.
- Open Program.cs and configure the endpoint for the sidecar application:
```csharp
builder.Services.AddScoped(sp => new HttpClient
{
    BaseAddress = new Uri(builder.Configuration["FashionAssistantAPI:Url"] ?? "http://localhost:8000/predict")
});
builder.Services.AddHttpClient();
```
- Open Home.razor and navigate to the @code section:
- Call the backend API using HttpRequestMessage, passing in the user query, product name, and description:

```csharp
var request = new HttpRequestMessage(HttpMethod.Post, configuration["FashionAssistantAPI:Url"]);
request.Headers.Add(HeaderNames.Accept, "application/json");
var queryData = new Dictionary<string, string>
{
    { "user_prompt", message },
    { "product_name", selectedItem.Name },
    { "product_description", selectedItem.Description }
};
```
- Read and stream the response:

```csharp
using (HttpResponseMessage responseMessage = await client.SendAsync(request, HttpCompletionOption.ResponseHeadersRead))
{
    responseMessage.EnsureSuccessStatusCode();
    if (responseMessage.Content is object)
    {
        using (Stream streamToReadFrom = await responseMessage.Content.ReadAsStreamAsync())
        {
            using (StreamReader reader = new StreamReader(streamToReadFrom))
            {
                char[] buffer = new char[8192];
                int bytesRead;
                while ((bytesRead = await reader.ReadAsync(buffer, 0, buffer.Length)) > 0)
                {
                    response += new string(buffer, 0, bytesRead);
                    StateHasChanged();
                }
            }
        }
    }
}
```
- Right-click on the dotnetfashionassistant project in Codespaces and select Open in Integrated Terminal.
- To publish the web app, run the following command in the opened terminal:

```
dotnet publish -c Release -o ./bin/Publish
```
- Right-click on the bin > Publish folder and select the Deploy to Web App option.
- Choose the exercise4 app.
- After deployment, wait a few minutes for the application to restart.
Once the application is live, navigate to it and try asking questions like “Tell me more about this shirt” or “How do I pair this shirt?”
End of Exercise 4.