LLMOps Article Compilation: A Concise Overview

The term LLMOps stands for Large Language Model Operations. As LLMs arose for building applications for production, new tools and practices emerged as well.

Matías Battocchia
15 min readNov 3, 2023

Contents:

  1. DevOps
  2. MLOps
  3. LLMOps
  4. Inference (self-hosting)

I prepared this material for a LLMOps talk I gave at Humai. It was intended to be a review of fundamental concepts. These are the notes I used.

A pipeline. Generated with DreamStudio.

Disclaimer

I pulled the content from different sources I found the most informative. Around 90% of the content has been copied and pasted. My original contribution is the overall composition, mild to severe edition and commentaries here and there.

Main sources:

Many MLOps articles I read were inspired by

Possible the one I liked the most from a list of +20 articles I read

I liked the take on LLM evaluation from

I took many LLM serving benchmarks from

I have used many other sources in a minor proportion as well. You can follow the credits at the image tags to reach those.

ML systems

Only a small fraction of a real-world ML system is composed of the ML code.

Source: Hidden Technical Debt in Machine Learning Systems

The real challenge is not building an ML model, the challenge is building an integrated ML system and to continuously operate it in production.

We approach the problem by applying DevOps principles to ML systems.

MLOps is an ML engineering culture and practice that aims at unifying ML system development (Dev) and ML system operation (Ops). Practicing MLOps means that you advocate for automation and monitoring at all steps of ML system construction.

DevOps

DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity.

Source: AWS What is DevOps?

DevOps shorten the development cycles. To achieve this benefit, these concepts are normally used:

  • Continuous Integration (CI)
  • Continuous Delivery (CD)
  • Monitoring

Continuous Integration (CI)

Continuous integration is a practice where developers regularly merge their code changes into a central repository, after which automated builds and tests are run, typically unit and integration tests.

It entails both an automation component — a CI or build service — and a cultural component — learning to integrate frequently.

The key goal of continuous integration is to find and address bugs quicker.

Continuous Delivery (CD)

Continuous delivery is a practice where code changes are automatically prepared for a release to production.

It expands upon continuous integration by deploying all code changes to a testing environment and/or a production environment after the build and test stages.

The goal is to always have a deployment-ready build artifact that has passed through a standardized test process.

Monitoring

Organizations monitor metrics and logs to see how their products impact the experience of their users.

By capturing, categorizing, and then analyzing data generated by applications and infrastructure, insights are shed into the root causes of problems.

Active monitoring becomes important as application and infrastructure update frequency increases. Creating alerts or performing real-time analysis also helps.

MLOps

In any ML project, after you define the business use case and establish the success criteria, the process of delivering an ML model to production involves the following steps. These steps can be completed manually or can be completed by an automatic pipeline.

The description of the pipeline differs in granularity depending on the author, it usually comprises these broader groups of steps:

  • Data
  • Model
  • Serving

For MLOps tools and platforms refer to Neptune.ai Landscape in 2023: Top Tools and Platforms, a complete and nicely curated list.

Data

  • Data extraction: Select and integrate the relevant data from various data sources.
  • Data analysis: Perform exploratory data analysis (EDA) to understand the data schema and its characteristics, identify the data preparation and feature engineering.
  • Data preparation: Data cleaning, transformations and feature engineering. Data splitting into training, validation, and test sets.

Model

  • Model training: Implement different algorithms to train various models. Hyper-parameter tuning to get the best performing model.
  • Model evaluation: Evaluate on a holdout test set to assess the model quality.
  • Model validation: Confirm that the model is adequate for deployment — performance is better than a baseline.

Serving

  • Model serving: Export and deploy to a target environment.
  • Model monitoring.

DevOps vs MLOps

ML is experimental in nature. Practitioners try different features, algorithms, modeling techniques, and parameter configurations to find what works best for the problem as quickly as possible.

The challenge is tracking what worked and what did not, and maintaining reproducibility while maximizing code reusability.

ML and other software systems are similar in continuous integration of source control, unit testing, integration testing, and continuous delivery of the software module or the package. However, in ML, there are a few notable differences:

  • CI is no longer only about testing and validating code and components, but also testing and validating data and models.
  • CD is no longer about a single software package or a service, but a system — an ML training pipeline — that should automatically deploy another service — a model prediction service.
  • Continuous training (CT) is a new property, unique to ML systems, that is concerned with automatically retraining and serving the models.

Continuous training (CT)

Automating the ML pipeline lets us achieve continuous delivery of model prediction service.

To automate the process of using new data to retrain models in production, we need to introduce automated

  • data and model validation,
  • metadata management and
  • pipeline triggers.

Data and model validation

The pipeline expects new, live data to produce a new model version that is trained on the new data. It is triggered automatically, therefore, automated data validation and model validation steps are required in the production pipeline to ensure expected behavior.

  • Data distribution changes might occur.
  • New model should be at least as good as the one in production.

Feature store

An optional additional component. A feature store is a centralized repository where you standardize the definition, storage, and access of features for training and serving.

It needs to provide an API for both

  • high-throughput batch serving and
  • low-latency real-time serving

for the feature values, and to support both training and serving workloads.

Metadata management

Information about each execution of the ML pipeline is recorded in order to help with data and artifacts lineage, reproducibility, and comparisons.

ML pipeline triggers

  • On demand.
  • On a schedule.
  • On availability of new training data.
  • On model performance degradation.
  • On significant changes in the data distributions (concept drift).

LLMOps

The short definition is that LLMOps is MLOps for LLMs. It is, essentially, a new set of tools and best practices to manage the lifecycle of LLM-powered applications. A typical lifecycle has this stages:

  1. Selection
  2. Adaptation
  3. Evaluation
  4. Observability
Source: How Is LLMOps Different From MLOps?

Step 1: Selection of a foundation model

Training a foundation model from scratch is complicated, time-consuming, and extremely expensive. Hugging Face Open LLM leaderboard is a great place to look for models.

  • Proprietary or open-source. Proprietary models often come at a financial cost, but they typically offer better performance. Every open-source model is fine-tunable.
  • Commercial license. Some models are open-source but can’t be used for commercial purposes.
  • Parameters. We do see a trend towards smaller models, especially in the open-source space (models ranging from 7–40 billion) that perform well.
  • Speed. The speed of a model is influenced by its size.
  • Context window size (number of tokens). A token roughly translates to 0.75 words for English. Larger context windows can understand and generate longer sequences of text.
  • Training dataset. Some models may be trained on diverse text datasets like internet data, coding scripts, instructions, or human feedback. Others may also be trained on multimodal datasets, like combinations of text and image data.
  • Quality. Quality is context-dependent, meaning what is considered high-quality for one application might not be for another.
  • Fine-tunable. Proprietary models might sometimes offer the option of fine-tuning.

Step 2: Adaptation to downstream tasks

Once we have chosen our foundation model, we can adapt it to downstream tasks in the following ways:

  • Prompt engineering
  • Fine-tuning
  • External data

Prompt engineering

It is a technique to tweak the input so that the output matches our expectations. We can use different tricks to improve our prompt such as zero-shot, few-shot, chain of thought (CoT).

Source: W&B Understanding LLMOps: Large Language Model Operations

Fine-tuning

It can help improve our model’s performance on our specific task. Although this will increase the training efforts, it can reduce the cost of inference. The cost of LLM APIs is dependent on input and output sequence length. Thus, reducing the number of input tokens, reduces API costs because we do not have to provide examples in the prompt anymore.

It is recommended to provide 50–100 high-quality samples per class to fine-tune ChatGPT 3.5.

Source: W&B Understanding LLMOps: Large Language Model Operations

External data

Foundation models often lack contextual information. Because LLMs can hallucinate if they do not have sufficient information, we need to be able to give them access to relevant external data.

  • Retrieval augmented generation (RAG). Relies on knowledge base queries to provide context.
  • Agents. These are systems that chain multiple model calls, connect to available sources (search engines, maps, etc.) and use different tools (calculator, text formatters, etc.) to tackle complex tasks such as planning a trip.
  • Chat. Uses previous prompts and responses as context.
Source: Retrieval Augmented Generation in very few lines of code

Step 3: Evaluation

In classical MLOps, ML models are validated on a hold-out validation set with a metric that indicates the performance. But how do we evaluate the performance of an LLM?

  • On production we could run A/B tests and collect user feedback.
  • Offline methods depends on whether we have labeled test data or not.
Source: AWS FMOps/LLMOps: Operationalize generative AI and differences with MLOps

Test data available

We proceed as we do with traditional ML models.

  • Accuracy metrics. In case of discrete outputs (such as sentiment analysis), we can use standard accuracy metrics such as precision, recall, and F1-score.
  • Similarity metrics. If the output is unstructured (such as a summary), we can use similarity metrics like ROUGE and cosine similarity.

Without test data

Some use cases do not lend themselves to having one true answer, for example, “Create a short children’s story for my 5-year-old daughter”. In such cases, it becomes more challenging to evaluate the models because we do not have labeled test data.

  • Human-in-the-Loop (HIL). In this case, a team of prompt testers will review the responses from a model.
  • LLM-powered evaluation. In this scenario, the prompt testers are replaced by an LLM, ideally one that is more powerful (although perhaps slower and most costly) than the ones being tested.

One interesting approach is to create a reduced human-annotated dataset to evaluate LLM evaluations, this way we can grasp a sense of certainty and compare different LLMs for the task.

Source: Arize LLM Evaluation: The Definitive Guide

Step 4: Observability (Monitoring)

Monitoring allows us to visualize conversations and prompt chains, know the unit economics of our product, like the average cost of a user or conversation, in addition to keep an eye on usage grows and rate limits.

As explained in LLM Monitoring and Observability — A Summary of Techniques and Approaches for Responsible AI, there are three key aspects to look after.

  • API
  • Prompts
  • Responses

For metrics definitions commonly used in observability, check 7 Ways to Monitor Large Language Model Behavior.

API functional monitoring

This includes the number of requests, response time, token usage, costs, and error rates.

Prompts

Evaluator LLMs can be utilized to check for toxicity and sentiment. There are standalone metrics that measure text quality. Embedding similarity distances (such as BERTscore and SelfCheckGPT) from the reference prompts probe relevance.

Reference sets of known adversarial prompts can flag bad actors; evaluator LLMs can also classify prompts as malicious or not.

Over time, prompt monitoring let us know if our users-product interaction has changed.

Responses

Most of the metrics mentioned above can be used for response monitoring as well, albait sometimes with a different meaning. Consider relevance to spot hallucination or divergence from anticipated topics. Toxicity to notice harmful output, and sentiment to acknowledge that the model is responding in the right tone.

Prompt leakage can be discovered by comparing responses to our prompt instructions; similarity metrics work well.

If we have evaluation or reference datasets, we can periodically test our LLM application against these, which can give us a sense of accuracy over time and can alert drift.

When we discover issues, we can export datasets of underperforming output so we can fine-tune our LLM on these classes of troublesome prompts.

Tools

More can be found at Awesome LLMOps repo. Existing tools give us an idea of what is being done in LLMOps. Let us review some features.

With promptfoo, you can:

  • Systematically test prompts and models against predefined test cases.
  • Evaluate quality and catch regressions by comparing LLM outputs side-by-side.
  • Speed up evaluations with caching and concurrency.
  • Score outputs automatically by defining test cases.
  • Use as a CLI, library, or in CI/CD.

Phoenix provides MLOps and LLMOps insights:

  • Traces: Troubleshoot problems related to retrieval and tool execution.
  • Evals: Evaluate your generative model or application’s relevance, toxicity, and more.
  • Embedding analysis: Explore embedding point-clouds and identify clusters of high drift and performance degradation.
  • RAG analysis: Visualize your generative application’s search and retrieval process to improve generation.
  • Structured data analysis: Statistically analyze your structured data by performing A/B analysis, temporal drift analysis, and more.

LangKit supported metrics include:

  • Text quality. Readability, complexity and grade scores.
  • Text relevance. Similarity scores between prompts/responses, and against user-defined themes.
  • Security and privacy. Count of strings matching regex patterns. Similarity scores with respect to refusal of service responses, and known prompt injection attacks such as jailbreaking, prompt extraction, etc.
  • Sentiment and toxicity analysis.

MLOps vs LLMOps

The differences between MLOps and LLMOps are caused by the differences in how we build AI products with classical ML models versus LLMs.

  • Data management. Fine-tuning is similar to MLOps. But prompt engineering is a zero-shot or few-shot learning setting. That means we have few but hand-picked samples.
  • Experimentation. Although fine-tuning will look similar in LLMOps to MLOps, prompt engineering requires a different experimentation setup including management of prompts.
  • Evaluation. In classical MLOps, a model’s performance is evaluated on a hold-out validation set with an evaluation metric. The performance of adapted LLMs is more difficult to evaluate!
  • Cost. While the cost of traditional MLOps usually lies in data collection and model training, the cost of LLMOps lies in inference, which requires GPU-based compute instances or the use of closed-source propriety LLMs API services.
  • Latency. Latency concerns are much more prominent in LLMOps because of computational complexity, model size, and hardware limitations.

Inference (self-hosting)

Given the case we would like to serve an adapted open-source model instead of using a third-party API, there are further considerations to make.

Quantization

To give some examples of how much VRAM it roughly takes to load a model:

  • Llama-2-70b 2 * 70 = 140 GB
  • Falcon-40b 2 * 40 = 80 GB
  • MPT-30b 2 * 30 = 60 GB

Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and activations with low-precision data types like 8-bit integer (int8, occupies 1 byte) instead of the usual 16-bit floating point (float16 or bfloat16, which take 2 bytes).

  • Llama-2–70b 1 * 70 = 70 GB
  • Falcon-40b 1 * 40 = 40 GB
  • MPT-30b 1 * 30 = 30 GB

Reducing the number of bits means the resulting model requires less memory storage and operations like matrix multiplication can be performed much faster with integer arithmetic. It also allows to run models on embedded devices, which sometimes only support integer data types.

Here is a list of libraries/frameworks and quantization techniques they adopt.

It is important to remark that quantization degrades the model performance, here are some benchmarks: A direct comparison between llama.cpp, AutoGPTQ, ExLlama, and transformers perplexities.

Choosing the right inference option

  • Real-time inference. Low-latency or high-throughput online inferences.
  • Serverless inference. Intermittent or unpredictable traffic patterns. It offers a pay-as-you-use facility.
  • Asynchronous inference. Queue requests with large payloads. Can scale down to 0 when there are no requests.
  • Batch transform. Offline processing of large datasets.
Source: Neptune.ai Deploying Large NLP Models: Infrastructure Cost Optimization

Important metrics for LLM serving

  • Time To First Token (TTFT): This metric is driven by the time required to process the prompt and then generate the first output token.
  • Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system.
  • Latency: The overall time it takes for the model to generate the full response for a user. Latency = TTFT + TPOT * the number of tokens to be generated.
  • Throughput: The number of output tokens per second an inference server can generate across all users and requests.

Optimizing inference benefits from various techniques:

  • Operator fusion: Combining different adjacent operators together.
  • Quantization
  • Compression (pruning, sparsity, distillation)
  • Parallelization (data, tensor, pipeline)

Latency

As input prompts lengthen, time to generate the first token starts to consume a substantial portion of total latency. Parallelizing across multiple GPUs helps reduce this latency.

Unlike model training, scaling to more GPUs offers significant diminishing returns for inference latency.

Source: Databricks LLM Inference Performance Engineering: Best Practices

At larger batch sizes, higher parallelism leads to a more significant relative decrease in token latency.

Source: Databricks LLM Inference Performance Engineering: Best Practices

Throughput

We can trade off throughput and time per token by batching requests together.

Grouping queries increases throughput compared to processing queries sequentially, but each query will take longer to complete.

  • Static batching: Client packs multiple prompts.
  • Dynamic batching: Prompts are batched together on the fly inside the server. Does not work well when requests have different parameters.
  • Continuous batching: Currently the SOTA method. It groups sequences together at the iteration level.
Source: Databricks LLM Inference Performance Engineering: Best Practices

Batch size

How well batching works is highly dependent on the request stream. But we can get an upper bound on its performance by benchmarking static batching with uniform requests.

Source: Databricks LLM Inference Performance Engineering: Best Practices

Latency trade-off

Request latency increases with batch size. Shared inference services typically pick a balanced batch size. Users hosting their own models should decide the appropriate latency/throughput trade-off for their applications. In some applications, like chatbots, low latency for fast responses is the top priority. In other applications, like batched processing of unstructured PDFs, we might want to sacrifice the latency to process an individual document to process all of them fast in parallel.

Each line on this plot is obtained by increasing the batch size from 1 to 256.

Source: Databricks LLM Inference Performance Engineering: Best Practices

Inference frameworks

While we could use FastAPI (or Deserve) to serve our models as we have been doing it for so long, nowadays there are servers that specifically target LLMs. These servers implent features such as batching and take care of applying optimizations for us.

There is a great article that does a nice job at comparing different open-source libraries for LLM inference and serving.

Conclusion

We have done a conceptual journey through techniques and practices that enhance LLMs in production. We have done it in historical orden of appereance —DevOps, MLOps, LLMOps— since LLMOps buils upon its ancestors.

Key takeways

  • A ML product is not just a model, it is a pipeline (data, model, serving).
  • Continuous integration, delivery, training facilitate the development and operation of ML products.
  • A LLMOps pipeline has these steps: foundational model, adaptation, evaluation, inference, observability.
  • LLMOps evaluation is the hardest step and requires a human-LLM hybrid approach.

--

--

Matías Battocchia

I studied at Universidad de Buenos Aires. I live in Mendoza, Argentina. Interests: data, NLP, blockchain.