I. Introduction
Large Language Models (LLMs) have become highly popular due to their transformative impact on various fields, especially within IT. They enable developers to create innovative software applications centered around AI interactions, offering new and sophisticated solutions to real-world problems.
LLMs can be used for AI-assisted development, which can drastically boost productivity and may change the development process for good; one could think of this as adding a layer of human-language abstraction on top of high-level programming languages. This article, however, focuses on a second impact of LLMs: using them to build software solutions that address real-world problems.
The goal is to design, build, test, deploy, and observe a software application centered around LLM capabilities. We will walk through the entire journey, from initial conceptualization to hands-on implementation, deployment, and monitoring of the application.
This blog post aims to provide a low-barrier-to-entry introduction to building LLM-based applications, sharing both findings and the challenges encountered along the way.
II. Requirements
This project aims to explore LLM interaction mechanics firsthand, prioritizing understanding over extensive features. The requirements will be kept minimal to focus on the core LLM interaction process.
As an introductory example of utilizing an LLM, we will implement a basic Book Summarizer. The application will allow users to interact with a model to get a summarized version of a book. To keep the scope concise and centered on LLM interactions, we will only include requirements that necessitate interaction with the AI model. Here is the refined list of requirements:
- Get Book Summary: Users should be able to input a book title and receive a summary generated by the LLM.
- Adjust Temperature Parameter: Users should be able to set the temperature parameter to influence the creativity and variability of the output.
- Add Custom Parameters: Users should be able to add custom parameters to the query for more tailored responses.
This minimalist approach will demonstrate essential LLM integration in a web application.
III. Architectural Overview
Starting from this narrow list of requirements, we can already outline the application's architecture.
Figure 1: Architectural overview of the book summarizer application
As demonstrated in Figure 1, the summary (which can be abstracted to any desired output) is generated following this workflow:
- User Interaction: The user enters the book title in a text field and adds parameters if desired. This payload is then sent to the backend application.
- Backend Processing: The backend application processes the prompt and parameters to generate a more comprehensive prompt, enriched with metadata that is crucial to the quality of the output. The backend has no GenAI capabilities of its own but interacts with an LLM. In this instance, a self-hosted model is used, although a hosted model such as OpenAI’s or Anthropic’s would work just as well.
- LLM Interaction: The Ollama server receives the payload and begins generating a response stream based on the pulled model.
- Frontend Update: The response stream is chunked and sent to the frontend application to update its UI, specifically the field where the result should be displayed.
FastAPI serves as a backend intermediary between the frontend and the LLM. While direct frontend-LLM interaction is possible, this structure mimics real-world applications that require additional backend functionality.
IV. Implementation
Diving into the granular implementation details may not serve the primary goal of this project. Therefore, we will explore the application components, focusing on the parts relevant to interacting with the LLM.
1. Frontend with React
The React frontend collects LLM input and displays output, notably handling the response as a stream rather than static JSON.
In order to efficiently handle and display real-time updates from the LLM, we use EventSource. This allows us to manage server-sent events (SSE) and handle streaming data directly in the frontend, providing a more interactive user experience. Instead of waiting for the entire response to be fully generated, we can display it gradually, so the user doesn't have to wait long for the response to be ready, which improves responsiveness and the overall user experience.
Concretely, receiving the response stream can occur in a useEffect() hook, as illustrated in the simplified code snippet below:
//...
const SummaryComponent = ({ isConnected, isLoading, isPaused }) => {
  const [summary, setSummary] = useState('');
  // other states...
  const [isResponseComplete, setIsResponseComplete] = useState(false);

  useEffect(() => {
    let stream: EventSource | null = null;
    const args = new URLSearchParams(queryParameters);
    if (isConnected && isLoading) {
      // Open an SSE connection to the backend and append each incoming chunk to the summary
      stream = new EventSource(`${API_URL}/summarize?${args}`);
      stream.onmessage = (ev) => {
        if (!isPaused) {
          setSummary(prevSummary => prevSummary + ev.data);
        }
      };
      stream.onerror = (ev) => {
        // handle error
      };
    }
    return () => {
      if (stream) {
        stream.close(); // close the SSE connection on cleanup
        // update states
      }
    };
  }, [isConnected, isLoading, isPaused]);
//...
2. Backend with FastAPI
The main purpose of the backend is to expose an endpoint that interacts with the LLM server. Two key aspects are generating the prompt sent to the LLM server and receiving the response as a stream.
@app.get("/summarize")
...
    config = Configuration(temperature=temperature)
    prompt = SummarizerService.generate_prompt(book_title, parameters)
    response_content = StreamingResponse(
        summarizer.generate_summary(prompt, config=config),
        media_type="text/event-stream",
    )
...
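To give a better sense of how these pieces fit together, a complete endpoint could look roughly like the following sketch. The parameter names, defaults, and the way the service is instantiated are assumptions for illustration, not the project's actual code:

from fastapi import FastAPI, Query
from fastapi.responses import StreamingResponse

app = FastAPI()
summarizer = SummarizerService()  # hypothetical instantiation of the summarizer service

@app.get("/summarize")
async def summarize(
    book_title: str,                              # the book title entered by the user
    temperature: float = 0.7,                     # assumed default temperature
    parameters: list[str] = Query(default=None),  # optional custom parameters
):
    config = Configuration(temperature=temperature)
    prompt = SummarizerService.generate_prompt(book_title, parameters)
    # Stream the generated summary back to the frontend as server-sent events
    return StreamingResponse(
        summarizer.generate_summary(prompt, config=config),
        media_type="text/event-stream",
    )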
Generating the prompt
The generate_prompt function is responsible for constructing a detailed query to be processed by the LLM. This involves taking user inputs, such as the book title and additional parameters, and adding any necessary metadata to enhance the quality and relevance of the response. The assembled prompt ensures that the LLM understands the context and the specific requirements of the query. The following implementation concretizes what has been described so far:
def generate_prompt(book_title: str, params: list[str] = None) -> str:
    core_prompt = f"summarize the book '{book_title}'\n"
    default_params = """
    - Summary should not exceed 20 lines
    - Summary should not have markdown
    - Summary should not start with 'The book ...'. Summarize as if you are telling a story
    - Summary should be written in well defined paragraphs without titles
    - Summary should be suitable to deliver directly to my end-user."""
    default_prompt = f"{core_prompt}{default_params}"
    if not params:
        return default_prompt
    custom_prompt = core_prompt + "\n".join(params)
    return custom_prompt
An interesting question that arises is, “How were the default parameters selected?” This was a process of trial and error. For example, the directive “Summary should not start with 'The book ...'. Summarize as if you are telling a story” was added after noticing that the initial results consistently began with “The book [book_title] …”. I decided that a more engaging summary should read like a narrative, hence the addition of this directive. This approach extends to other directives as well.
One common method is to include metadata that accurately describes the context and to use placeholders in the prompt that can be replaced with user data as needed. This ensures that the LLM has the necessary context to generate a high-quality and relevant response.
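As an illustration, such a template with placeholders could look like the following sketch; the template text and variable names are made up for demonstration and are not the exact ones used in this project:

from string import Template

# Hypothetical prompt template: static context and metadata plus placeholders
# ($book_title, $constraints) that are filled with user data at request time.
SUMMARY_TEMPLATE = Template(
    "You are a literary assistant writing summaries for end-users.\n"
    "Summarize the book '$book_title'.\n"
    "Constraints:\n$constraints"
)

def build_prompt(book_title: str, constraints: list[str]) -> str:
    # Replace the placeholders with the concrete user input.
    return SUMMARY_TEMPLATE.substitute(
        book_title=book_title,
        constraints="\n".join(f"- {c}" for c in constraints),
    )

# Example: build_prompt("The Hobbit", ["Summary should not exceed 20 lines"])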
As a bonus feature, the backend allows users to download summaries locally as .txt files or save them directly to Google Cloud Storage. Currently, the data is stored in a bucket accessible only to the developers, but this implementation provided a valuable learning opportunity to understand how to integrate cloud storage solutions securely and efficiently into a web application.
3. Ollama
Ollama is a versatile, self-hosted tool designed to interface with various large language models (LLMs). Initially, Ollama has no model loaded; models such as Llama 3.2 need to be pulled first using a command like ollama pull llama3.2.
It's important to note that we could have used any generic model like gpt-4o. The application follows a component-based approach, meaning that the LLM in use can be swapped in a matter of minutes. This flexibility is achieved by encapsulating the interaction between the backend and the Ollama server within the SummarizerService. At the core of this service is the generate_summary method.
import asyncio

import ollama

LLM_MODEL = "llama3.2"  # This can be any model supported by Ollama

async def generate_summary(self, prompt: str, config: Configuration):
    # Request a streamed chat completion from the Ollama server
    stream = ollama.chat(
        model=LLM_MODEL,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True,
        options={'temperature': config.temperature},
    )
    for chunk in stream:
        # Wrap each chunk in the SSE "data: ..." format expected by the frontend
        yield f"data: {chunk['message']['content']}\n\n"
        await asyncio.sleep(0.5)
We won’t go through the details of the code since most of it is self-explanatory, but it's worth pointing out the stream=True parameter, which enables real-time streaming of responses. The LLM's response is streamed back chunk by chunk, allowing the frontend to display the summary in real time as it's being generated. Each chunk of data is prefixed with "data: ", since this is the format expected by Server-Sent Events on the frontend.
V. Testing Large Language Models
When thinking about LLMs in the context of testing, the obvious question that comes to mind is “How do we test?”. The question is particularly interesting due to the non-deterministic nature of LLM outputs. If we think about the Given-When-Then testing style, we can't define a setup that always guarantees the same result. We can mock this behavior in tests, but we can't expect it from the model itself.
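For the deterministic side, mocking the LLM call makes Given-When-Then tests reproducible. The following is a minimal sketch using pytest-style tests and unittest.mock; the import paths and the fake response are made up for illustration:

import asyncio
from unittest.mock import patch

# Hypothetical import path; adjust to the project's actual module layout.
from summarizer_service import Configuration, SummarizerService

@patch("ollama.chat")
def test_chunks_follow_the_sse_format(mock_chat):
    # Given: a deterministic, fake response stream instead of the real model
    mock_chat.return_value = iter([
        {"message": {"content": "Once upon a time, "}},
        {"message": {"content": "a hobbit lived in a hole."}},
    ])
    service = SummarizerService()

    async def collect():
        generator = service.generate_summary("some prompt", config=Configuration(temperature=0.7))
        return [chunk async for chunk in generator]

    # When: the summary is generated
    chunks = asyncio.run(collect())

    # Then: every chunk is wrapped in the SSE "data: ..." format
    assert all(chunk.startswith("data: ") for chunk in chunks)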
A crucial aspect to highlight is the differentiation between testing and evaluation. Although these terms are often used interchangeably, they refer to distinct activities serving different purposes. Testing ensures the model functions correctly within an application, while evaluation measures the model's performance and accuracy in generating content. Both are essential for the effective use of LLMs, but this blog post will focus solely on testing.
We want to perform tests within the context of the LLM application, rather than testing the model in isolation, especially when working with a generic model. Unlike classical testing, achieving high coverage in this scenario is challenging because it is not possible to deterministically define all possible outcomes. For instance, users can add a custom list of parameters, resulting in an infinite number of potential results.
Therefore, it is inefficient to attempt testing all possible cases. Instead, we can focus on testing the default parameters within the context of our application. Particularly testable parameters include:
- Summary should not exceed 20 lines
- Summary should not start with 'The book ...'.
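In practice, such checks can be expressed as ordinary pytest tests that call the running application and assert only these loose, testable properties of the output. A minimal sketch, assuming the backend runs locally, that the query parameter is named book_title, and that the requests library is available:

import requests

API_URL = "http://localhost:8000"  # assumed local address of the FastAPI backend

def fetch_summary(book_title: str) -> str:
    # Read the SSE stream and strip the "data: " prefixes to rebuild the plain text.
    response = requests.get(f"{API_URL}/summarize", params={"book_title": book_title}, stream=True)
    chunks = []
    for line in response.iter_lines(decode_unicode=True):
        if line.startswith("data: "):
            chunks.append(line[len("data: "):])
    return "".join(chunks)

def test_summary_respects_default_parameters():
    summary = fetch_summary("The Little Prince")
    # Default parameter: the summary should not exceed 20 lines.
    assert len(summary.splitlines()) <= 20
    # Default parameter: the summary should not start with "The book ...".
    assert not summary.lower().startswith("the book")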
When using a paid LLM, testing in this manner incurs costs, as each test run sends a request to the LLM, which might not be cost-efficient.
For those interested in learning about evaluation techniques, I recommend the following two blog posts: 1, 2.
While we've covered the basics of implementing an LLM-based application, there's more to explore. If you're intrigued, I encourage you to read the Bonus Chapter for additional insights on the remaining aspects of the process.
VI. Bonus
While the primary objective of this project was to gain hands-on experience in building LLM applications, I aimed to extract additional value from the process. As a junior developer and consultant, I simulated a comprehensive software development lifecycle to maximize learning opportunities and practical insights.
1. Deployment
The initial deployment strategy involved a single-container approach on Google Cloud Run. This unified setup packed the web application and Ollama server into one Docker image using a multi-stage build process. The container incorporated frontend and backend services, along with all necessary dependencies and configurations.
However, this single-container deployment strategy presented significant challenges, both in terms of performance and architectural design. Running the web app and the resource-intensive Ollama server within the same container not only created a severe resource bottleneck but also violated the principle of separation of concerns. This configuration led to slower application performance and inefficiencies as both processes competed for shared resources. Moreover, it reduced scalability and made maintenance more complex, as updates or issues in one component could potentially affect the entire system.
To address these issues, I adopted a multi-container strategy, deploying the Ollama server and web app in separate Google Cloud Run instances. This approach allowed for independent optimization and scaling of each service based on its specific resource requirements. The separation resulted in more efficient resource allocation, significantly enhancing overall application performance.
2. Continuous Integration/Continuous Delivery/Continuous Deployment
There is nothing permanent except change. — Heraclitus
This ancient wisdom perfectly encapsulates the software development lifecycle. From an initial blank slate or basic boilerplate, an application evolves through countless iterations of code additions, bug fixes, and feature enhancements. Each change, no matter how small, contributes to the application's growth towards a production-ready state.
The journey from development to production hinges on effectively managing these changes through a critical triad of processes: continuous integration, delivery, and deployment.
GitHub Actions pipelines efficiently manage this process. The Build and Test pipeline activates when changes are proposed to the main branch, typically through feature branch merges. It builds the application and runs tests, ensuring new changes don't compromise existing functionality. A successful pipeline at this stage only confirms error-free builds and passed tests.
The process advances with a Deploy pipeline, which pushes built Docker images and deploys them to Google Cloud Run. This step transitions from delivering changes to deploying them in a live environment. Given the project's exploratory nature, testing is limited and doesn't include comprehensive checks like smoke tests on the built artifact.
3. Automated dependency updates
I have always struggled with “dependency hell”, where keeping up with dependency updates becomes an overwhelming task. This challenge is particularly pronounced in the JavaScript ecosystem, where changes occur at an exceptionally rapid pace. You might wonder, why bother updating dependencies if the software is working as intended? The answer is straightforward: newer versions of dependencies often bring improvements, although this should be assessed cautiously. While it's fair to be content with a current version, the wise decision is to update to a newer one when the current version presents security vulnerabilities.
Having established the necessity of regularly updating dependencies, let's discuss the most effective method I've found to manage this task. While you could set aside time to manually update dependencies, this approach can be time-consuming and frustrating, especially when dealing with interdependent libraries where finding the correct versions can be challenging. As a developer, you likely prefer to focus on more productive and engaging tasks. The solution I've found highly effective is to use a tool that tracks and automatically updates dependencies at a configurable frequency. For this project, I've set up Renovate in my repository to handle updates for both npm packages and pip dependencies. Renovate scans the repository and automatically creates PRs to update dependencies declared in manifests such as package.json for JS/TS projects; it supports a wide range of dependency managers, and the full list is available at https://docs.renovatebot.com/modules/manager/.
Following up on the effectiveness of Renovate, another significant advantage is its high level of configurability. For instance, you can configure Renovate to delay updates until a certain amount of time has passed or until a certain adoption rate is observed. This is particularly useful because the latest version of a dependency might introduce its own issues that will be resolved in subsequent patches. In other words, it's not always optimal to update immediately, so being able to define strict rules to control update timing adds an important layer of flexibility and stability to your project.
4. Observability
When building a real-world API, it's essential to monitor API calls to track requests and identify errors. However, these metrics alone aren't sufficient for an LLM-based application, as they don't provide the detailed information needed to assess the quality or performance of the LLM. To truly understand and optimize our LLM, we need comprehensive observability that captures every relevant detail.
To achieve this, I utilized OpenLLMetry, an open-source project that allows easy monitoring and debugging of LLM app execution. Tracing is done in a non-intrusive way, built on top of OpenTelemetry. The traces can be exported to Traceloop or to another observability stack (e.g., Sentry or SigNoz).
In our project, we can track method calls by annotating them, but most importantly, we can obtain detailed information about our calls to the LLM. After configuring Traceloop, we can observe the prompts and key metrics that provide comprehensive data for evaluating our model's performance.
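To give an idea of what this looks like in code, here is a minimal sketch based on the traceloop-sdk package; the application name, the decorated function, and its placement in our service are illustrative rather than taken from the project:

from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

# Initialize once at application startup; traces go to Traceloop by default,
# and other backends can be targeted via OpenTelemetry exporter settings.
Traceloop.init(app_name="book-summarizer")

@workflow(name="summarize_book")
def summarize_book(book_title: str) -> str:
    # LLM calls made inside this workflow are captured as child spans by
    # OpenLLMetry's instrumentations, including prompts and model parameters.
    ...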
Thank you for following along and I hope these insights help you build and optimize your own LLM-based applications.