Simplifying LLM Application Development: A Newcomer's Perspective

6.12.2024 | 13 min read

I. Introduction

Large Language Models (LLMs) have become highly popular due to their transformative impact on various fields, especially within IT. They enable developers to create innovative software applications centered around AI interactions, offering new and sophisticated solutions to real-world problems.

LLMs have two major impacts on software development. First, they enable AI-assisted development, which can drastically boost productivity in some cases; one could think of it as adding a layer of human-language abstraction on top of high-level programming languages. However, this article will focus on the second impact: utilizing LLMs to develop software solutions that address real-world problems.

The goal is to leverage AI language models to design, build, test, deploy, and observe a software application centered around LLM capabilities. We will walk through the entire journey, from initial conceptualization to hands-on implementation, deployment, and monitoring of the application.

This blogpost aims to provide a low-barrier-to-entry introduction to building LLM-based applications, sharing both findings and the challenges encountered along the way.

II. Requirements

This project aims to explore LLM interaction mechanics firsthand, prioritizing understanding over extensive features. The requirements will be kept minimal to focus on the core LLM interaction process.

As an introductory example of utilizing an LLM, we will implement a basic Book Summarizer. The application will allow users to interact with a model to get a summarized version of a book. To keep the scope concise and centered on LLM interactions, we will only include requirements that necessitate interaction with the AI model. Here is the refined list of requirements:

  • Get Book Summary: Users should be able to input a book title and receive a summary generated by the LLM.
  • Adjust Temperature Parameter: Users should be able to set the temperature parameter to influence the creativity and variability of the output.
  • Add Custom Parameters: Users should be able to add custom parameters to the query for more tailored responses.

This minimalist approach will demonstrate essential LLM integration in a web application.

III. Architectural Overview

Starting from this narrow requirements list, we can already outline the application's architecture.

Figure 1: Architectural overview of the book summarizer application

As demonstrated in Figure 1, the summary (which can be abstracted to any desired output) is generated following this workflow:

  1. User Interaction: The user enters the book title in a text field and adds parameters if desired. This payload is then sent to the backend application.
  2. Backend Processing: The backend application processes the prompt and parameters to generate a more comprehensive prompt with added metadata crucial to the quality of the output. The backend has no GenAI capabilities of its own but can interact with an LLM. In this instance, a self-hosted model is used, although any other model, such as OpenAI's or Anthropic's, would work.
  3. LLM Interaction: The Ollama server receives the payload and begins generating a response stream based on the pulled model.
  4. Frontend Update: The response stream is chunked and sent to the frontend application to update its UI, specifically the field where the result should be displayed.

FastAPI serves as a backend intermediary between the frontend and the LLM. While direct frontend-LLM interaction is possible, this structure mimics real-world applications that require additional backend functionality.

IV. Implementation

Diving into the granular implementation details may not serve the primary goal of this project. Therefore, we will explore the application components, focusing on the parts relevant to interacting with the LLM.

1. Frontend with React

The React frontend collects LLM input and displays output, notably handling the response as a stream rather than static JSON.

In order to efficiently handle and display real-time updates from the LLM, we use EventSource. This allows us to manage server-sent events (SSE) and handle streaming data directly in the frontend, providing a more interactive user experience. Instead of waiting for the entire response to be generated, we display it gradually, so the user doesn't have to wait long for the response to be ready, which enhances responsiveness and the overall user experience.

Concretely, receiving the response stream can occur in a useEffect() hook as illustrated in the simplified code snippet below:

//...
const SummaryComponent = ({ isConnected, isLoading, isPaused }) => {
    const [summary, setSummary] = useState('');
    // other states...
    const [isResponseComplete, setIsResponseComplete] = useState(false);
    let stream: EventSource | null = null;

    useEffect(() => {
        const args = new URLSearchParams(queryParameters);
        if (isConnected && isLoading) {
            stream = new EventSource(`${API_URL}/summarize?${args}`);
            stream.onmessage = (ev) => {
                if (!isPaused) {
                    setSummary(prevSummary => prevSummary + ev.data);
                }
            };
            stream.onerror = (ev) => {
                // handle error
            };
        }
        return () => {
            if (stream) {
                // update states
            }
        };
    }, [isConnected, isLoading, isPaused]);
//...

2. Backend with FastAPI

The main purpose of the backend is to expose an endpoint that interacts with the LLM server. Two key aspects are generating the prompt sent to the LLM server and receiving the response as a stream.

1@app.get("/summarize")
2...
3config = Configuration(temperature=temperature)
4prompt = SummarizerService.generate_prompt(book_title, parameters)
5response_content = StreamingResponse(summarizer.generate_summary(prompt, config=config), media_type="text/event-stream")
6...
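
For orientation, here is a minimal sketch of how these elided pieces could fit together in a complete endpoint. It assumes the SummarizerService and Configuration classes from this project and uses illustrative query parameter names (book_title, temperature, parameters) matching the query string built by the frontend:

from fastapi import FastAPI, Query
from fastapi.responses import StreamingResponse
# SummarizerService and Configuration are the project's own classes shown in this section

app = FastAPI()
summarizer = SummarizerService()

@app.get("/summarize")
async def summarize(
    book_title: str,
    temperature: float = 0.7,
    parameters: list[str] | None = Query(default=None),
):
    # Build the enriched prompt and stream the LLM response back as SSE
    config = Configuration(temperature=temperature)
    prompt = SummarizerService.generate_prompt(book_title, parameters)
    return StreamingResponse(
        summarizer.generate_summary(prompt, config=config),
        media_type="text/event-stream",
    )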

Generating the prompt

The generate_prompt function is responsible for constructing a detailed query to be processed by the LLM. This involves taking user inputs, such as the book title and additional parameters, and adding any necessary metadata to enhance the quality and relevance of the response. The assembled prompt ensures that the LLM understands the context and the specific requirements of the query. The following implementation concretizes what has been stated:

def generate_prompt(book_title: str, params: list[str] = None) -> str:
    core_prompt = f"summarize the book '{book_title}'\n"
    default_params = """
        - Summary should not exceed 20 lines
        - Summary should not have markdown
        - Summary should not start with 'The book ...'. Summarize as if you are telling a story
        - Summary should be written in well defined paragraphs without titles
        - Summary should be suitable to deliver directly to my end-user."""
    default_prompt = f"{core_prompt}{default_params}"
    if not params:
        return default_prompt
    custom_prompt = core_prompt + "\n".join(params)
    return custom_prompt

An interesting question that arises is, “How were the default parameters selected?” This was a process of trial and error. For example, the directive “Summary should not start with 'The book ...'. Summarize as if you are telling a story” was added after noticing that the initial results consistently began with “The book [book_title] …”. I decided that a more engaging summary should read like a narrative, hence the addition of this directive. This approach extends to other directives as well.

One common method is to include metadata that accurately describes the context and uses placeholders in the prompt that can be replaced with user data as needed. This ensures that the LLM has the necessary context to generate a high-quality and relevant response.
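
As a purely illustrative sketch (not this project's code), such a template-with-placeholders approach might look like the following; the template text and the build_prompt helper are assumptions made for the example:

# Illustrative template: placeholders are filled with user data before the
# prompt is sent to the LLM.
PROMPT_TEMPLATE = (
    "You are a literary assistant summarizing books for end-users.\n"
    "Summarize the book '{book_title}'.\n"
    "Constraints:\n"
    "{constraints}"
)

def build_prompt(book_title: str, constraints: list[str]) -> str:
    formatted = "\n".join(f"- {c}" for c in constraints)
    return PROMPT_TEMPLATE.format(book_title=book_title, constraints=formatted)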

As a bonus feature, the backend allows users to download summaries locally as .txt files or save them directly to Google Cloud Storage. Currently, the data is stored in a bucket accessible only to the developers, but this implementation provided a valuable learning opportunity to understand how to integrate cloud storage solutions securely and efficiently into a web application.
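
To sketch the cloud-storage part, saving a finished summary could look roughly like this with the google-cloud-storage client; the bucket name and object path are placeholders, not the project's actual configuration:

from google.cloud import storage

def save_summary_to_gcs(summary: str, book_title: str, bucket_name: str = "book-summaries") -> None:
    # Uses application default credentials; the bucket must already exist
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(f"summaries/{book_title}.txt")
    blob.upload_from_string(summary, content_type="text/plain")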

3. Ollama

Ollama is a versatile, self-hosted tool designed to interface with various large language models (LLMs). Initially, Ollama has no model loaded, and models such as Llama 3.2 need to be pulled using commands like ollama pull llama3.2.

It's important to note that we could have used any generic model like GPT-4o. The application follows a component-based approach, meaning that the LLM in use can be swapped in a matter of minutes. This flexibility is achieved by encapsulating the interaction between the backend and the Ollama server within the SummarizerService. At the core of this service is the generate_summary method.

LLM_MODEL = "llama3.2" #This can be any model supported by Ollama
async def generate_summary(self, prompt: str, config: Configuration):
    stream = ollama.chat(
        model=LLM_MODEL,
        messages=[{'model': LLM_MODEL, 'role': 'user', 'content': prompt}],
        stream=True,
        options={'temperature': config.temperature},
    )
    for chunk in stream:
        yield f"data: {chunk['message']['content']}\n\n"
        await asyncio.sleep(0.5)

We won’t go through the code in detail since most of it is self-explanatory, but it's worth pointing out the stream=True parameter, which enables real-time streaming of responses. The LLM's response is streamed back chunk by chunk, allowing the frontend to display the summary in real time as it is being generated. Each chunk of data is prefixed with "data: " since this is the format expected by Server-Sent Events on the frontend.

V. Testing Large Language Models

When thinking about LLMs in the context of testing, the obvious question that comes to mind is “How to test?”. The question is particularly interesting due to the non-deterministic nature of LLM outputs. If we think about the Given-When-Then testing style, we can't construct a setup that always guarantees the same result. We can mock this behavior, but we can't expect it from the model.

A crucial aspect to highlight is the differentiation between testing and evaluation. Although these terms are often used interchangeably, they refer to distinct activities serving different purposes. Testing ensures the model functions correctly within an application, while evaluation measures the model's performance and accuracy in generating content. Both are essential for the effective use of LLMs, but this blogpost will focus solely on testing.

We want to perform tests within the context of the LLM application, rather than testing the model in isolation, especially when working with a generic model. Unlike classical testing, achieving high coverage in this scenario is challenging because it is not possible to deterministically define all possible outcomes. For instance, users can add a custom list of parameters, resulting in an infinite number of potential results.

Therefore, it is inefficient to attempt testing all possible cases. Instead, we can focus on testing the default parameters within the context of our application. Particularly testable parameters include:

  • Summary should not exceed 20 lines
  • Summary should not start with 'The book ...'.
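
As a sketch of what such a test could look like, the snippet below checks these two directives against a real model response. It assumes the SummarizerService and Configuration classes from the implementation section, a locally running Ollama server, and pytest with the pytest-asyncio plugin; the book title is arbitrary:

import pytest

@pytest.mark.asyncio
async def test_default_directives():
    service = SummarizerService()
    prompt = SummarizerService.generate_prompt("The Old Man and the Sea")
    chunks = []
    # Collect the streamed SSE chunks back into a plain string
    async for chunk in service.generate_summary(prompt, config=Configuration(temperature=0.2)):
        chunks.append(chunk.removeprefix("data: ").removesuffix("\n\n"))
    summary = "".join(chunks)
    assert len(summary.splitlines()) <= 20
    assert not summary.lstrip().startswith("The book")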

When using a paid LLM, testing in this manner incurs costs, as each test run sends a request to the LLM, which might not be cost-efficient.

For those interested in learning about evaluation techniques, I recommend the following two blog posts 1, 2

While we've covered the basics of implementing an LLM-based application, there's more to explore. If you're intrigued, I encourage you to read the Bonus Chapter for additional insights on the remaining aspects of the process.

VI. Bonus

While the primary objective of this project was to gain hands-on experience in building LLM applications, I aimed to extract additional value from the process. As a junior developer and consultant, I simulated a comprehensive software development lifecycle to maximize learning opportunities and practical insights.

1. Deployment

The initial deployment strategy involved a single-container approach on Google Cloud Run. This unified setup packed the web application and Ollama server into one Docker image using a multi-stage build process. The container incorporated frontend and backend services, along with all necessary dependencies and configurations.

However, this single-container deployment strategy presented significant challenges, both in terms of performance and architectural design. Running the web app and the resource-intensive Ollama server within the same container not only created a severe resource bottleneck but also violated the principle of separation of concerns. This configuration led to slower application performance and inefficiencies as both processes competed for shared resources. Moreover, it reduced scalability and made maintenance more complex, as updates or issues in one component could potentially affect the entire system.

To address these issues, I adopted a multi-container strategy, deploying the Ollama server and web app in separate Google Cloud Run instances. This approach allowed for independent optimization and scaling of each service based on its specific resource requirements. The separation resulted in more efficient resource allocation, significantly enhancing overall application performance.

2. Continuous Integration/Continuous Delivery/Continuous Deployment

There is nothing permanent except change. — Heraclitus

This ancient wisdom perfectly encapsulates the software development lifecycle. From an initial blank slate or basic boilerplate, an application evolves through countless iterations of code additions, bug fixes, and feature enhancements. Each change, no matter how small, contributes to the application's growth towards a production-ready state.

The journey from development to production hinges on effectively managing these changes through a critical triad of processes: continuous integration, delivery, and deployment.

GitHub Actions pipelines efficiently manage this process. The Build and Test pipeline activates when changes are proposed to the main branch, typically through feature branch merges. It builds the application and runs tests, ensuring new changes don't compromise existing functionality. A successful pipeline at this stage only confirms error-free builds and passed tests.

The process advances with a Deploy pipeline, which pushes built Docker images and deploys them to Google Cloud Run. This step transitions from delivering changes to deploying them in a live environment. Given the project's exploratory nature, testing is limited and doesn't include comprehensive checks like smoke tests on the built artifact.

3. Automated dependency updates

I have always struggled with “dependency hell”, where keeping up with dependency updates becomes an overwhelming task. This challenge is particularly pronounced in the JavaScript ecosystem, where changes occur at an exceptionally rapid pace. You might wonder, why bother updating dependencies if the software is working as intended? The answer is straightforward: newer versions of dependencies often bring improvements, although this should be assessed cautiously. While it's fair to be content with a current version, the wise decision is to update to a newer one when the current version presents security vulnerabilities.

Having established the necessity of regularly updating dependencies, let's discuss the most effective method I've found to manage this task. While you could set aside time to manually update dependencies, this approach can be time-consuming and frustrating, especially when dealing with interdependent libraries where finding the correct versions can be challenging. As a developer, you likely prefer to focus on more productive and engaging tasks. The solution I've found highly effective is to use a tool that tracks and automatically updates dependencies at a configurable frequency. For this project, I've set up Renovate in my repository to handle updates for both npm packages and pip dependencies. Renovate scans the repository and automatically creates PRs to update dependencies (such as package.json for JS/TS projects, though Renovate supports a wide range of dependency managers). You can see the full list of supported managers at https://docs.renovatebot.com/modules/manager/

Following up on the effectiveness of Renovate, another significant advantage is its high level of configurability. For instance, you can configure Renovate to delay updates until a certain amount of time has passed or until a certain adoption rate is observed. This is particularly useful because the latest version of a dependency might introduce its own issues that will be resolved in subsequent patches. In other words, it's not always optimal to update immediately, so being able to define strict rules to control update timing adds an important layer of flexibility and stability to your project.

4. Observability

When building a real-world API, it's essential to monitor API calls to track requests and identify errors. However, these metrics alone aren't sufficient for an LLM-based application, as they don't provide the detailed information needed to assess the quality or performance of the LLM. To truly understand and optimize our LLM, we need comprehensive observability that captures every relevant detail.

To achieve this, I utilized OpenLLMetry, an open-source project that allows easy monitoring and debugging of LLM app execution. Tracing is done in a non-intrusive way, built on top of OpenTelemetry. The traces can be exported to Traceloop or to another observability stack (e.g., Sentry or Signoz).

In our project, we can track method calls by annotating them, but most importantly, we can obtain detailed information about our calls to the LLM model. After configuring Traceloop, we can observe the prompts and key metrics that provide comprehensive data for evaluating our models' performance.
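
To give an idea of what this looks like, here is a minimal sketch of initializing OpenLLMetry via the Traceloop SDK and annotating a method; the app name and the annotated function are examples, not this project's exact code:

import ollama
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow

Traceloop.init(app_name="book-summarizer")

@workflow(name="summarize_book")
def summarize(book_title: str) -> str:
    # LLM calls inside the annotated workflow are traced by OpenLLMetry's
    # instrumentation (assuming the Ollama instrumentation is installed),
    # including the prompt and the generated response
    response = ollama.chat(
        model="llama3.2",
        messages=[{"role": "user", "content": f"Summarize the book '{book_title}'"}],
    )
    return response["message"]["content"]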

Thank you for following along and I hope these insights help you build and optimize your own LLM-based applications.
