Once these components are in place, more complex LLM challenges will require nuanced approaches and considerations—from infrastructure to capabilities, risk mitigation, and talent.
Deploying LLMs as a backend
Inferencing with traditional ML models typically involves packaging a model object as a container and deploying it on an inferencing server. As the demands on the model increase—more requests and more customers require more run-time decisions (higher QPS within a latency bound)—all it takes to scale the model is to add more containers and servers. In most enterprise settings, CPUs work fine for traditional model inferencing. But hosting LLMs is a much more complex process which requires additional considerations.
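The horizontal scaling described above comes down to simple arithmetic. The sketch below is only an illustration under stated assumptions: the function name, the per-container throughput figure, and the headroom parameter are hypothetical, and it assumes each container sustains a fixed QPS within the latency bound.

```python
import math

def replicas_needed(target_qps: float, qps_per_replica: float, headroom: float = 0.2) -> int:
    """Estimate how many identical containers are needed to serve target_qps.

    qps_per_replica is an assumed, measured throughput of one container
    while it still meets the latency bound; headroom adds spare capacity.
    """
    if qps_per_replica <= 0:
        raise ValueError("qps_per_replica must be positive")
    return math.ceil(target_qps * (1 + headroom) / qps_per_replica)

# Hypothetical numbers: 1,000 QPS target, 150 QPS per container, no headroom.
print(replicas_needed(1000, 150, headroom=0.0))
```

LLM hosting breaks the assumption behind this calculation, because a single model may not fit on one server and per-request cost varies with the number of tokens generated.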
LLMs operate on tokens, the basic units of text (often pieces of words) that the model uses to generate human-like language. They generally make predictions on a token-by-token basis in an autoregressive manner, with each new token conditioned on the previously generated tokens, until a stop token is reached. The process can become cumbersome quickly: tokenizations vary based on the model, task, language, and computational resources. Engineers deploying LLMs need not only infrastructure experience, such as deploying containers in the cloud, but also knowledge of the latest techniques to keep inferencing costs manageable and meet performance SLAs.
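The token-by-token loop can be sketched in a few lines. This is a toy illustration, not a real model: `toy_next_token` is a hypothetical stand-in that looks up the next token in a fixed table, and the `<eos>` stop token is an assumption chosen for the example.

```python
from typing import Callable, List

STOP_TOKEN = "<eos>"  # hypothetical stop token for this sketch

def toy_next_token(context: List[str]) -> str:
    """Stand-in for a model: picks the next token from a fixed lookup table."""
    table = {"the": "order", "order": "has", "has": "shipped", "shipped": STOP_TOKEN}
    return table.get(context[-1], STOP_TOKEN)

def generate(prompt: List[str],
             next_token: Callable[[List[str]], str],
             max_new_tokens: int = 16) -> List[str]:
    """Autoregressive decoding: each new token depends on all tokens so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        if tok == STOP_TOKEN:  # stop once the model emits the stop token
            break
        tokens.append(tok)
    return tokens

print(generate(["the"], toy_next_token))
```

Because every generated token requires another pass through the model, the cost of a response grows with its length, which is one reason inferencing cost is harder to control than with traditional models.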
Vector databases as knowledge repositories
Deploying LLMs in an enterprise context means establishing vector databases and other knowledge bases that work together in real time with document repositories and language models to produce reasonable, contextually relevant, and accurate outputs. For example, a retailer may use an LLM to power a conversation with a customer over a messaging interface. The model needs access to a database with real-time business data to call up accurate, up-to-date information about recent interactions, the product catalog, conversation history, company policies such as returns, recent promotions and ads in the market, customer service guidelines, and FAQs. These knowledge repositories are increasingly developed as vector databases, which support fast retrieval against queries via vector search and indexing algorithms.
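At its core, vector search ranks stored documents by how close their embeddings are to the embedding of a query. The sketch below is a brute-force illustration under stated assumptions: the three-dimensional vectors, document names, and function names are made up for the example, whereas real systems use learned embeddings with hundreds or thousands of dimensions and approximate indexing algorithms rather than an exhaustive scan.

```python
import math
from typing import Dict, List, Tuple

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query: List[float], index: Dict[str, List[float]], k: int = 2) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical embeddings for three knowledge-base documents.
index = {
    "return_policy": [0.9, 0.1, 0.0],
    "current_promotions": [0.1, 0.9, 0.1],
    "store_hours": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # hypothetical embedding of a question about returns
print([doc_id for doc_id, _ in top_k(query, index)])
```

The retrieved documents would then be passed to the language model as context, so the answer reflects current business data rather than only what the model learned in training.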
Training and fine-tuning with hardware accelerators
LLMs pose an additional challenge: fine-tuning for optimal performance against specific enterprise tasks. Large enterprise language models can have billions of parameters. Training and fine-tuning at that scale requires more sophisticated infrastructure than traditional ML models do, including a persistent compute cluster with high-speed network interfaces and hardware accelerators such as GPUs (see below). Once trained, these large models also need multi-GPU nodes for inferencing, with memory optimizations and distributed computing enabled.
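A rough calculation shows why multi-GPU nodes come into play. The sketch below counts only the memory needed to hold the model weights, at the standard sizes of 4, 2, and 1 bytes per parameter for 32-bit, 16-bit, and 8-bit formats. It deliberately ignores activations, caches, and optimizer state, so real requirements are higher; the model size and the 80 GB GPU are hypothetical figures, and the function names are illustrative.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}

def weight_memory_gb(num_params: int, dtype: str = "fp16") -> float:
    """Memory needed just to store the weights, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

def min_gpus_for_weights(num_params: int, gpu_memory_gb: float, dtype: str = "fp16") -> int:
    """Lower bound on GPU count: weights only, no activations or caches."""
    return math.ceil(weight_memory_gb(num_params, dtype) / gpu_memory_gb)

# Hypothetical 70-billion-parameter model on hypothetical 80 GB GPUs.
print(weight_memory_gb(70_000_000_000))          # weights in 16-bit precision
print(min_gpus_for_weights(70_000_000_000, 80))  # lower bound on GPUs
```

Storing the same weights in 8-bit form halves the footprint, which is one example of the memory optimizations mentioned above.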
To meet computational demands, organizations will need to make more extensive investments in specialized GPU clusters or other hardware accelerators. These programmable hardware devices can be customized to accelerate specific computations such as matrix-vector operations. Public cloud infrastructure is an important enabler for these clusters.
A new approach to governance and guardrails
Risk mitigation is paramount throughout the entire lifecycle of the model. Observability, logging, and tracing are core components of MLOps processes, which help monitor models for accuracy, performance, data quality, and drift after their release. This is critical for LLMs too, but there are additional infrastructure layers to consider.
LLMs can “hallucinate,” occasionally presenting false information as if it were fact. Organizations need proper guardrails, controls that enforce a specific format or policy, to ensure LLMs in production return acceptable responses. Traditional ML models rely on quantitative, statistical approaches to apply root cause analyses to model inaccuracy and drift in production. With LLMs, this is more subjective: it may involve qualitatively scoring the LLM’s outputs, then checking them against an API with pre-set guardrails to ensure an acceptable answer.
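A guardrail that enforces a format and a policy can be as simple as a validation step between the model and the customer. The sketch below is a minimal illustration under stated assumptions: the required JSON keys, the blocked phrase, and the function name are all hypothetical, and production guardrails are typically far more extensive.

```python
import json
from typing import Tuple

REQUIRED_KEYS = {"answer", "source"}     # hypothetical response format
BLOCKED_TERMS = ("guaranteed refund",)   # hypothetical policy list

def check_response(raw: str) -> Tuple[bool, str]:
    """Return (accepted, reason) for a raw model response."""
    # Format guardrail: the response must be a JSON object with the required keys.
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(payload, dict) or not REQUIRED_KEYS.issubset(payload):
        return False, "missing required keys"
    # Policy guardrail: the answer must not contain a blocked phrase.
    answer = str(payload["answer"]).lower()
    for term in BLOCKED_TERMS:
        if term in answer:
            return False, f"blocked term: {term}"
    return True, "ok"

good = '{"answer": "Returns are accepted within 30 days.", "source": "return_policy"}'
print(check_response(good))
```

A rejected response could be regenerated, routed to a human agent, or replaced with a safe fallback message, depending on the organization's policy.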