Artificial Intelligence (AI), particularly Large Language Models (LLMs), has revolutionised various sectors by allowing organisations to tackle complex problems and perform tasks efficiently. However, as organisations increasingly adopt LLMs, the need to understand their behaviour in a production environment and use this understanding to improve their development has become apparent.
The solution to these challenges lies in software observability, which refers to the ability to understand an application's behaviour based on the telemetry data it generates at runtime. The complexity of modern software systems, including LLMs, necessitates using observability tools and practices to manage their complexity and unpredictability.
LLMs have transformed the way organisations approach machine learning (ML). They offer solutions for complex problems, making them an accessible tool for any product engineering team. However, the very features that make LLMs so appealing also present significant challenges, particularly regarding reliability and predictability.
The Challenges of LLMs
LLMs are essentially black boxes. Their outputs are nondeterministic and depend on natural language inputs, which are inherently broad and unpredictable. This means that users of applications with natural language inputs will inevitably do unexpected things. Debugging an LLM is also extremely difficult: unless you're an ML researcher, you're unlikely to be able to explain why an LLM produces a particular output for a given input.
Testing LLMs
Traditional unit and integration testing, which verifies that specific inputs yield specific outputs, is ineffective with LLMs. The range of possible outputs is vast, making it impossible to test all potential inputs and scenarios exhaustively. Instead, ML teams often build evaluation systems that can assess the effectiveness of a model.
De-risking Product Launch
Attempts to de-risk a product launch through early access programs or limited user testing can introduce bias and create a false sense of security. These programs rarely capture the full range of user behaviour and potential edge cases seen in real-world usage. Instead, it's better to embrace a "ship to learn" mentality and release features earlier, but you need a way to systematically "learn" from what was shipped.
Observability in LLMs
To deal with the challenges LLMs pose, engineering teams have turned to observability as a better way to debug, monitor, and use data from production to inform product improvements. Teams can manage modern systems by collecting relevant information about their applications from within their code and systematically analysing and monitoring this data. This principle can apply equally to products that use modern AI systems.
Prompt Engineering
Prompt engineering, also known as prompting, is a collection of techniques to guide a large language model (LLM) to generate desired outputs without modifying the model. The primary communication method with the model is through one or more textual inputs, which can include instructions, data, user inputs, example outputs, and more.
Consider a scenario where you use an LLM to generate a SQL query based on natural language input. The prompt for this task could include the following text, with several placeholders that parameterise the relevant data for the task:
-- You are an AI that turns natural language input into SQL queries. Given the user input, the table, and its columns, produce a SQL statement.
-- Input: Get all orders that Bob has made this year.
-- Output:
select * from orders where customer = 'Bob' and year(order_date) = '2023'
Prompts offer a flexible way to influence the behaviour and output of an LLM, allowing the generated text to be customised for specific tasks, styles, or domains. However, prompt engineering is a subtle and nuanced process. Even minor changes to the prompts can lead to significant differences in the outputs produced by the model.
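As a rough sketch of how such a prompt might be parameterised in Python (the template, helper name, and table details here are illustrative, not from any particular library):

# A minimal sketch of a parameterised text-to-SQL prompt.
PROMPT_TEMPLATE = """-- You are an AI that turns natural language input into SQL queries.
-- Given the user input, the table, and its columns, produce a SQL statement.
-- Table: {table}
-- Columns: {columns}
-- Input: {user_input}
-- Output:"""

def build_sql_prompt(table: str, columns: list[str], user_input: str) -> str:
    # Fill the placeholders with the data relevant to this request.
    return PROMPT_TEMPLATE.format(
        table=table,
        columns=", ".join(columns),
        user_input=user_input,
    )

prompt = build_sql_prompt(
    table="orders",
    columns=["id", "customer", "order_date", "total"],
    user_input="Get all orders that Bob has made this year.",
)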
OpenTelemetry and (W3C) Trace Context
OpenTelemetry, an open standard for telemetry creation, offers a unified framework for capturing and collecting observability data. This is instrumental in gathering insightful data about LLM-based products' user behaviour and system performance.
One of the primary data types supported by OpenTelemetry is a trace. A trace, also known as a distributed trace, is a set of structured logs, called spans, that are linked by a shared ID and assigned a duration. Spans can also identify a parent span, enabling the representation of a hierarchy of operations in data.
To leverage OpenTelemetry for modern AI observability, several steps need to be taken:
Automatic Instrumentation Installation: Install automatic instrumentation or a relevant instrumentation library to monitor incoming and outgoing requests. This enables tracking external API call behaviour, such as calls to OpenAI, and correlating that information with a user request to your applications.
OpenTelemetry SDK Installation: Install the appropriate OpenTelemetry Software Development Kit (SDK) for your language into your codebase. Use the OpenTelemetry APIs to create manual instrumentation that captures relevant data and operations before and after a call to a generative AI model.
By integrating OpenTelemetry's automatic tracing instrumentation capabilities with manual instrumentation, you can capture all the necessary data to start systematically analysing user behaviour. This approach allows you to understand how user behaviour influences the results produced by a generative AI model.
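To make that concrete, here is a minimal sketch of manual instrumentation in Python, assuming the opentelemetry-api package is installed and an SDK and exporter are configured elsewhere; build_sql_prompt and call_llm are hypothetical helpers standing in for your own prompt construction and vendor API call:

from opentelemetry import trace

tracer = trace.get_tracer("llm-feature")

def generate_sql(user_input: str) -> str:
    # Wrap the whole LLM-backed operation in a span so the prompt, response,
    # and any errors are correlated with the rest of the user request.
    with tracer.start_as_current_span("llm.generate_sql") as span:
        span.set_attribute("app.llm.user_input", user_input)
        prompt = build_sql_prompt(  # hypothetical helper from the earlier sketch
            table="orders",
            columns=["id", "customer", "order_date", "total"],
            user_input=user_input,
        )
        span.set_attribute("app.llm.prompt", prompt)
        try:
            response = call_llm(prompt)  # hypothetical wrapper around your vendor's API
            span.set_attribute("app.llm.response", response)
            return response
        except Exception as exc:
            # Record the failure on the span so it is visible in your observability tool.
            span.record_exception(exc)
            span.set_status(trace.Status(trace.StatusCode.ERROR))
            raise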
Cost Management in Language Model Services
When utilising external services such as OpenAI, it's crucial to understand the cost implications associated with each API call. These services typically provide mechanisms to monitor usage daily or monthly, catering to most organisations' needs. However, for those requiring more detailed insights, there are methods to track costs more granularly.
Token-Based Cost Tracking
In the context of large language models (LLMs), the concept of 'tokens' is central to understanding cost. Tokens are encoded representations of input and output text. When you provide input text to an LLM, it's encoded into a list of tokens that efficiently represents the text. The LLM responds by emitting tokens, which are then decoded into the response text. Vendors typically charge per token for both input and output, so tracking token counts per request gives you a granular view of cost.
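Vendors' tokenisers are usually available as libraries, so you can count tokens yourself. As a minimal sketch for OpenAI models, assuming the tiktoken package (other vendors expose their own tokenisers):

import tiktoken

def count_tokens(text: str, model: str = "gpt-4") -> int:
    # Encode the text with the tokeniser used by the given model
    # and return how many tokens it produces.
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

example_input = "Get all orders that Bob has made this year."
print(count_tokens(example_input))  # tokens this input will consume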
Cost Considerations
While cost tracking is essential, it's worth noting that it's rarely the primary concern for users of LLMs. The cost of using most LLMs is relatively low, and trends suggest that they will become even more affordable over time due to increased efficiency and competitive pressures.
Rate Limiting and Cost Estimation
When using an LLM, rate limiting can also be applied at the application level. This, combined with the ability to calculate costs based on token usage and vendor rate limits, simplifies estimating your monthly bill. This approach ensures that you can effectively manage your costs while leveraging the powerful capabilities of LLMs.
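A back-of-the-envelope estimate can then be derived from average token counts; the per-1,000-token prices below are placeholders to replace with your vendor's current price list:

# Hypothetical prices in USD per 1,000 tokens; check your vendor's pricing.
INPUT_PRICE_PER_1K = 0.0005
OUTPUT_PRICE_PER_1K = 0.0015

def estimate_monthly_cost(requests_per_month: int,
                          avg_input_tokens: int,
                          avg_output_tokens: int) -> float:
    # Cost per request = input tokens * input price + output tokens * output price.
    per_request = (avg_input_tokens / 1000) * INPUT_PRICE_PER_1K \
        + (avg_output_tokens / 1000) * OUTPUT_PRICE_PER_1K
    return per_request * requests_per_month

# e.g. 100,000 requests/month, averaging 500 input and 250 output tokens
print(estimate_monthly_cost(100_000, 500, 250))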
Understanding and Optimising Large Language Models (LLMs) and Generative AI Systems
Large Language Models (LLMs) and generative AI systems are complex, nondeterministic entities that can handle various inputs. To effectively monitor and optimise these systems, it is crucial to track both inputs and outputs systematically. This process involves five key components:
- User inputs
- LLM outputs
- Data values post parsing/validation of LLM outputs (assuming no errors)
- Any errors, whether from LLM output or parsing/validation of the LLM output
- User feedback (e.g., thumbs up/down responses)
This assumes that a mechanism for tracking user feedback is integrated into your telemetry. While not mandatory, this feature significantly enhances your system's observability and prompt engineering efforts.
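One way to capture these five data points is as attributes on the span that wraps the LLM operation. A rough sketch follows; the attribute names are my own choice rather than any standard convention:

from opentelemetry import trace

def record_llm_telemetry(span: trace.Span,
                         user_input: str,
                         llm_output: str,
                         parsed_output: dict | None = None,
                         error: str | None = None,
                         user_feedback: str | None = None) -> None:
    # Attach the five data points described above to the current span.
    span.set_attribute("app.llm.user_input", user_input)
    span.set_attribute("app.llm.output", llm_output)
    if parsed_output is not None:
        span.set_attribute("app.llm.parsed_output", str(parsed_output))
    if error is not None:
        span.set_attribute("app.llm.error", error)
    if user_feedback is not None:  # e.g. "thumbs_up" / "thumbs_down"
        span.set_attribute("app.llm.user_feedback", user_feedback)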
The Importance of Parsing and Validating LLM Outputs
Parsing and validating LLM outputs is a critical aspect of managing these systems. There are several reasons for this:
- Security: LLM outputs should be treated as untrusted inputs to your system. Parsing and validating these outputs can help mitigate potential security risks, such as prompt injection attacks.
- Versatility: Parsing and validating LLM outputs allows you to use these systems for various applications, not just basic chatbots. This process enables you to validate the outputs against a set of rules, which is crucial for using those outputs in other parts of an application or displaying them to users.
- Accuracy: Some prompt-engineering techniques involve having an LLM output pieces of an answer that you manually assemble into the complete answer later. This approach can reduce the complexity of a task for the LLM and increase its accuracy.
- Error Handling: Parsing and validating LLM outputs can produce specific and often correctable errors. This allows for a set of "fixups" to the data the LLM returns, which can yield impressive results.
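To make parsing and validation concrete, here is a minimal sketch for an LLM that has been asked to respond with JSON; the expected keys and error type are illustrative assumptions:

import json

class LLMOutputError(Exception):
    # Raised when an LLM output fails parsing or validation.
    pass

REQUIRED_KEYS = {"query", "explanation"}  # hypothetical expected structure

def parse_and_validate(llm_output: str) -> dict:
    # Treat the output as untrusted input: parse it, then check its shape.
    try:
        data = json.loads(llm_output)
    except json.JSONDecodeError as exc:
        raise LLMOutputError(f"output was not valid JSON: {exc}") from exc
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise LLMOutputError(f"output was missing keys: {sorted(missing)}")
    return data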
The Need to Capture Both LLM Outputs and Final Outputs
When you parse and validate LLM outputs, the final output (assuming there's no error) is often in a different format from what the LLM initially responds with. Therefore, capturing both the LLM outputs and your final outputs in your telemetry is crucial.
The Importance of Tracking All Errors
Tracking all errors, whether they arise from a network error, a timeout, the LLM itself, or the parsing and validation process, is crucial for understanding what kinds of user inputs can lead to errors. This information is vital for improving the user experience and identifying opportunities to correct an LLM output directly if it fails a parsing/validation step.
Analysing Inputs, Outputs, and Errors
With user inputs, LLM outputs, validation/parsing outputs, and errors at your disposal, you can start analysing them. Using an observability tool, write a query that groups requests by error and frequency. This will provide a prioritised list of issues to fix.
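The exact query depends on your observability tool, but the shape of the analysis is simple. As a sketch over exported telemetry records, assuming each record is a dictionary with an optional error field:

from collections import Counter

def top_errors(records: list[dict], limit: int = 10) -> list[tuple[str, int]]:
    # Group failed requests by their error attribute and rank by frequency,
    # giving a prioritised list of issues to investigate.
    errors = Counter(r["error"] for r in records if r.get("error"))
    return errors.most_common(limit)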
Monitoring API Call Performance for Large Language Models (LLMs)
Understanding the performance of API calls to Large Language Models (LLMs) is crucial for maintaining a high-quality user experience. This involves distinguishing between the latency and errors associated with these API calls and the complete operations involving an LLM.
Factors Influencing API Call Performance
Several factors can influence the latency and errors experienced with API calls to LLMs. These include:
- API Call Frequency: The number of API calls made per user request can significantly impact the performance. For instance, generating a vector embedding for each user input before calling an LLM can increase the number of API calls, potentially affecting the latency.
- Rate of API Calls: The frequency of API calls made per minute can also influence the performance. A high rate of calls can lead to rate limiting, timeouts, or errors due to resource unavailability.
- Token Count: The average number of tokens passed to an LLM per request, and the average number of tokens received in response, can affect the latency and error rate.
- Rate Limiting: The frequency of rate-limited requests can indicate whether the API calls are being throttled, leading to increased latency or errors.
- API Call Contribution to Overall Latency: Understanding the proportion of overall user-experienced latency due to an API call to an LLM can help identify areas for optimisation.
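Most of these factors can be recorded as OpenTelemetry metrics alongside your traces. A rough sketch, assuming the opentelemetry-api package and using metric names of my own choosing:

from opentelemetry import metrics

meter = metrics.get_meter("llm-feature")

# Instruments for the factors listed above.
token_counter = meter.create_counter("llm.tokens", unit="{token}")
rate_limited_counter = meter.create_counter("llm.rate_limited_requests")
call_duration = meter.create_histogram("llm.api_call.duration", unit="ms")

def record_api_call(duration_ms: float, input_tokens: int, output_tokens: int,
                    rate_limited: bool) -> None:
    # Record latency, token usage in both directions, and rate-limit hits.
    call_duration.record(duration_ms)
    token_counter.add(input_tokens, {"direction": "input"})
    token_counter.add(output_tokens, {"direction": "output"})
    if rate_limited:
        rate_limited_counter.add(1)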
Understanding and Evaluating LLM API Performance
While these factors may not always provide direct action points, they are essential for comprehending the overall behaviour of your product. They can also serve as evaluation metrics for different LLM APIs and vendors, enabling you to effectively assess their ability to service your requests.
Monitoring and Service-Level Objectives
Service-level objectives (SLOs) are critical to any system's performance monitoring strategy. They provide a quantifiable measure of the system's performance, allowing teams to identify and address issues proactively. SLOs are typically defined in terms of service-level indicators (SLIs): functions that return a binary value, true or false, based on the measurement of specific data, such as a request to OpenAI.
Key Service-Level Indicators: Latency and Error Rates
When setting up SLOs for systems that involve Large Language Models (LLMs), two fundamental SLIs to track are latency and error rates. While other SLOs could be beneficial, these two are often the most critical starting points.
Latency SLOs
Latency is the time it takes for a user to receive a result after initiating a request. It's crucial to monitor latency throughout the entire lifecycle of user interaction with a feature that uses LLMs. This includes the time taken to gather input, build or gather the prompt to an LLM, make additional API calls (such as fetching a vector embedding), make the call to an LLM, and parse/validate results.
Error Rate SLOs
The second SLO to monitor is error rates. This includes any error encountered during the process, whether it originates from an API call or parsing/validation of LLM results.
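As a sketch of what these two SLIs could look like in code, assuming each request is summarised as a record with a duration and an optional error field (the 2-second threshold is purely illustrative):

def latency_sli(record: dict, threshold_ms: float = 2000.0) -> bool:
    # True if the end-to-end operation completed within the latency target.
    return record["duration_ms"] <= threshold_ms

def error_sli(record: dict) -> bool:
    # True if the request produced no error anywhere in the pipeline:
    # API call, LLM response, or parsing/validation.
    return record.get("error") is None

# An SLO is then a target over many SLI results, e.g. 99% of requests
# over a 30-day window should satisfy both indicators.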
SLO Monitoring and Alerting
SLO alerts for LLMs should be non-urgent. They exist to inform so that a team can plan a corrective action rather than cause a team to halt everything and fix a problem. It's recommended to send these alerts to a messaging channel, such as one in Slack or Microsoft Teams, and never trigger an alert on platforms like PagerDuty. Unlike LLM SLO alerts, paging alerts should always be directly actionable.
Utilising Observability Data for Product Enhancement
Observability, an essential aspect of system monitoring, is crucial in developing and improving products, especially those utilising Large Language Models (LLMs). Given that LLMs are not traditionally debuggable, the only way to identify improvement areas is by analysing user data. When collected in sufficient quantities, this data can reveal patterns where the product fails to meet user expectations, thus providing opportunities for iteration and improvement.
Leveraging Production Data for Iteration
Using production data for iteration may seem complex, but it's pretty straightforward. All it requires is an observability tool and the ability to interpret the results of a query against that data. The essential data points to consider are any existing error, the input provided by the user, the output of the LLM, and the output of parsing/validation, if it exists.
Addressing Correctable Errors
In many products, LLMs generate outputs that follow a specific structure rather than open-ended text. When the structure output by an LLM is incorrect, you can often intervene and correct that error directly.
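A common example of such an intervention is fixing up near-miss structured output before treating the request as failed. The sketch below assumes the LLM was asked for JSON and strips stray markdown code fences before retrying the parse:

import json

def parse_with_fixups(llm_output: str) -> dict:
    # First attempt: parse the output as-is.
    try:
        return json.loads(llm_output)
    except json.JSONDecodeError:
        pass
    # Fixup: models sometimes wrap JSON in markdown code fences; strip them and retry.
    cleaned = llm_output.strip()
    if cleaned.startswith("```"):
        cleaned = cleaned.strip("`")
        cleaned = cleaned.removeprefix("json").strip()
    return json.loads(cleaned)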
Building a Prompt Evaluation System Based on Production Data
In the long run, two primary ways to improve your product's use of an LLM are rigorous prompt engineering and fine-tuning an LLM. Both methods require a systematic way to quantify improvements in an LLM's performance, which can be achieved through an evaluation system.
Using Production Data to Power an Evaluation System
The most significant task in an evaluation system is building the dataset to be evaluated. Observability is critical to creating evaluation datasets because it gathers real-world user inputs and system outputs. The data used to power an evaluation system must be representative. If you don't have enough unique data points to evaluate, you'll only gain a false sense of confidence in your evaluations.
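A first pass at building such a dataset from captured telemetry might look like the following; the record fields and sample size are assumptions about your own export format:

import random

def build_eval_dataset(records: list[dict], sample_size: int = 500) -> list[dict]:
    # Keep one record per distinct user input so repeated or near-identical
    # requests don't dominate, then sample a manageable set for review.
    by_input = {r["user_input"]: r for r in records if r.get("user_input")}
    unique = list(by_input.values())
    sampled = random.sample(unique, min(sample_size, len(unique)))
    return [
        {"input": r["user_input"], "expected_output": r.get("parsed_output")}
        for r in sampled
    ]

# Usage (assuming exported_records were pulled from your observability tool):
# eval_set = build_eval_dataset(exported_records)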
A Collaborative Approach
Observability, particularly in the context of Large Language Models (LLMs), is not a solitary endeavour. It requires a collaborative effort from various roles within an organisation. This principle holds for all software observability. While roles such as Site Reliability Engineering (SRE), DevOps, and platform engineers often take the lead, enhancing software, including LLMs, necessitates the involvement of a diverse range of individuals.
The Role of Different Stakeholders in Observability
Regarding products that utilise LLMs, roles not traditionally associated with observability come into play. These include product managers, ML engineers, and data scientists. While it's not mandatory to have these exact job titles on your team to get started, it is crucial to have individuals who can fulfil the responsibilities typically associated with these roles.
Key Areas of Understanding for Effective LLM Implementation
Certain areas of understanding and action within your organisation are essential to optimise the use of LLMs in production. These include:
- Instrumenting your application to emit the necessary telemetry
- Analysing telemetry with the intent to enhance a feature
- Monitoring telemetry to ensure your changes are effective
- Handling end-user feedback
- Understanding user expectations for your LLM-powered feature
- Knowing when your data will be representative
- Cleaning and classifying data effectively for evaluation
- Setting up production data pipelines for continuous evaluation system improvement
- Establishing developer tools and infrastructure to support prompt engineering efforts, prompt lifecycle management, and systematic validation of prompt changes against an evaluation system
Shifting Responsibilities and Roles
The introduction of LLMs may necessitate changes in responsibilities for existing roles. Software engineers should focus more on data quality, representativeness, and working with probabilistic systems. ML engineers must adopt a more product-minded approach, understanding user interactions and intended product behaviour. Product managers must familiarise themselves with Python and Jupyter Notebook to participate in prompt-engineering experiments. LLMs are transformative, not just for products but also for the roles people play within an organisation.
Understanding User Interactions
A common theme across many organisations is that LLMs compel individuals at all levels to understand how their users interact with their products. LLMs not only fundamentally alter existing products, but they also enable entirely new categories of products and capabilities. These introduce new modalities for user interaction, and success is only possible with understanding these interactions and user expectations.
Adapting to New Responsibilities
As you begin to use an LLM in production, you don't need to immediately hire a host of people with different job titles. However, your organisation must be prepared to adapt. Individuals may need to take on responsibilities not traditionally associated with their roles. Without this adaptability, your organisation may struggle to utilise LLMs effectively in the long term.
The Future of Observability Tools and Practices
Observability tools and practices are a crucial component of modern software development, and their importance escalates when building products that use LLMs. LLMs, being nondeterministic and essentially black boxes, present unique challenges to reliability and require a different approach to development and iteration.
Teams that already practise observability will likely find that their existing tools and playbooks carry over well to making LLMs in production more reliable. However, they will face new challenges in integrating that data into development for core prompt engineering and model fine-tuning work.
Looking ahead, we can expect advancements in the following domains:
- Automated tools for LLM instrumentation
- Enhanced solutions for managing the lifecycle of prompt engineering
- Advanced utilities for transferring data from production environments to development ones
- Specialised observability tools focusing on the areas above
- Improved solutions for simplifying the fine-tuning of LLMs
- Ready-to-use evaluation frameworks that streamline the construction and operation of evaluation systems
While innovation will lead to improved tools and practices, it's unlikely that a single tool or practice will solve all the challenges involved in making LLMs more reliable. Therefore, adopting a more general approach to making software more reliable and applying it to LLMs is valuable. Software observability is that approach.