Hosting Large Language Models (LLMs) comes with significant costs determined by factors like model size, computational precision, and hosting duration. For instance, hosting a private GPT-3.5-scale model is estimated to cost around US$115,500 per year. Before diving into private hosting, it's essential to evaluate existing model API services and tools to ensure cost-effectiveness.
As enterprises look to integrate AI into their operations, concerns arise about the privacy and security of sending data to third-party model APIs. Those with the means, particularly larger enterprises, may choose to host their own Large Language Models (LLMs). But what does doing so cost? This article examines the hosting expenses of an LLM, setting aside the intricate costs of training and fine-tuning (a topic for another article).
Factors Affecting LLM Hosting Costs
Hosting LLMs involves various computational and infrastructure-related expenses. Key determinants include:
Model size;
Model architecture;
Desired response rate (inference speed);
Anticipated traffic (throughput); and
Hosting duration.
There are also costs for the expert personnel needed to oversee, manage, and maintain the model's infrastructure. Our discussion, however, will focus on the primary computational expenses tied to LLM hosting.
Model Size
Undoubtedly, the model's size heavily influences the costs. As a rule of thumb, a 16-bit model requires GPU RAM (in GB) equal to roughly double its size in billions of parameters: a 175B model like GPT-3.5 needs about 350GB of GPU RAM, while a 7B model, such as Falcon 7B, needs about 14GB. Using Vultr as a pricing reference (prices vary, so also check AWS, GCP, and Azure), hosting a 175B model would entail 4 x Nvidia A100 80GB GPUs (US$1,750/month each) plus 1 x Nvidia A100 40GB GPU (US$875/month), summing to US$7,875/month for 360GB of GPU RAM.
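As a minimal sketch in Python (using the rule of thumb and the Vultr-style prices quoted above as assumptions), the estimate can be reproduced as follows:

```python
# Rule of thumb: GPU RAM (GB) ~= model size in billions of parameters x 2 (16-bit weights).
def gpu_ram_gb(params_billion: float) -> float:
    return params_billion * 2

print(gpu_ram_gb(175))  # 350 -> roughly 4 x A100 80GB + 1 x A100 40GB
print(gpu_ram_gb(7))    # 14  -> fits on a single modern GPU

# Monthly bill for the 175B configuration above (Vultr-style prices, assumed here):
print(4 * 1_750 + 1 * 875)  # 7875 -> US$7,875/month for 360GB of GPU RAM
```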
A comprehensive list of Nvidia GPUs is accessible on this Wikipedia page. As of this article's drafting, Nvidia's H100 tops the list but is not yet commonplace across most cloud services.
Model Floating Point Precision
Besides size, the computational precision of an LLM also affects costs. Models typically use 8-bit, 16-bit (FP16, or half precision), or 32-bit (FP32, or full precision) weights, and a model's precision can usually be found in its model card. The earlier rule-of-thumb multiplier of two applies to 16-bit models: an 8-bit model's multiplier is 1x, whereas a 32-bit model's is 4x. Advanced model compression methods, like quantization, have produced even lower-precision models (such as 4-bit ones) that cut computational needs further. For a preliminary cost estimate, it's safe to assume a model's precision is 16-bit.
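A minimal sketch of how the precision multiplier plays out for the 175B example (the 4-bit entry assumes half a byte per parameter, an approximation that ignores quantization overheads):

```python
# Bytes per parameter by precision; multiply the model size (in billions) by this factor.
BYTES_PER_PARAM = {"4-bit": 0.5, "8-bit": 1, "16-bit": 2, "32-bit": 4}

for precision, bytes_per_param in BYTES_PER_PARAM.items():
    print(f"175B model at {precision}: {175 * bytes_per_param} GB GPU RAM")
# 4-bit: 87.5 GB, 8-bit: 175 GB, 16-bit: 350 GB, 32-bit: 700 GB
```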
Sequence Length
LLMs' computational demands also depend on the input, or sequence, length. The longer the input sequence (measured in tokens, with roughly 750 words equating to about 1,000 tokens), the more memory is required. For shorter sequences (~1,000 tokens), the extra memory is often negligible for cost estimation, but with standard attention it can grow quadratically with sequence length. Thankfully, innovations like Flash Attention have reduced the memory demands of longer sequences. Generally, adding a 15-25% buffer to the GPU RAM requirement should suffice for cost projections.
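Applied to the 175B example, a minimal sketch of the suggested buffer:

```python
# Apply the suggested 15-25% sequence-length buffer to a base GPU RAM estimate.
def ram_with_buffer(base_gb: float, low: float = 0.15, high: float = 0.25):
    return base_gb * (1 + low), base_gb * (1 + high)

low_gb, high_gb = ram_with_buffer(350)     # 350GB = 175B model at 16-bit, from above
print(f"{low_gb:.1f} - {high_gb:.1f} GB")  # 402.5 - 437.5 GB
```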
Response Rate (Inference Speed) and Expected Traffic (Throughput)
Response rate compares the computation (in FLOPs) required per inference against the GPUs' capacity (in FLOP/s). As an illustration, GPT-3 requires roughly 740 teraFLOPs per inference, while an Nvidia A100 GPU delivers about 312 teraFLOP/s. With five A100s, the projected inference time is therefore about 0.5 seconds, and ten inferences processed one after another would take about 5 seconds. In practice, other factors can stretch this time. There are tactics to accelerate it, such as batching inference requests, which we'll cover in a forthcoming technical guide.
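A minimal sketch of this back-of-envelope latency estimate (the FLOP figures are the ones quoted above; real-world latency will be higher):

```python
# Back-of-envelope latency: FLOPs per inference / aggregate GPU FLOP/s.
GPT3_FLOPS_PER_INFERENCE = 740e12  # ~740 teraFLOPs per inference (figure quoted above)
A100_FLOPS_PER_SECOND = 312e12     # ~312 teraFLOP/s per A100

def latency_seconds(n_gpus: int) -> float:
    return GPT3_FLOPS_PER_INFERENCE / (n_gpus * A100_FLOPS_PER_SECOND)

per_request = latency_seconds(5)
print(f"{per_request:.2f} s per inference")         # ~0.47 s, i.e. roughly 0.5 s
print(f"{10 * per_request:.1f} s for 10 requests")  # ~4.7 s if processed sequentially
```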
Duration of Hosting
While many LLMs are continuously hosted, some scenarios demand specific hosting timeframes. Depending on your needs, factor in the above components over the intended duration to get a total cost. For indefinite hosting, a three-year period usually marks a project's expected lifespan.
Bringing It All Together
Combining the elements above, hosting an LLM can be estimated as:
Total Cost = GPU Monthly Cost x Months of Hosting
Where,
GPU Monthly Cost = monthly cost of the GPUs needed to cover the GPU RAM Requirement; and
GPU RAM Requirement (GB) = Model Size (billions of parameters) x (Floating Point Precision / 8-bit) x Buffer Multiplier (1.15-1.25)
Considering the GPT-3.5 example and a 1-year lifecycle, the cost breaks down as:
GPU RAM Requirement = 175 (model size in billions of parameters) x 2 (16-bit / 8-bit) x 1.25 (25% buffer) = 437.5 GB
Total Cost = [(5 x A100 80GB) x ($1,750/month) + (1 x A100 40GB) x ($875/month)] x 12 months
Total Cost = $9,625/month x 12 = $115,500/year
Hence, the estimated yearly computational cost for hosting a GPT-3.5-scale model is approximately US$115,500. This is merely a high-level estimate; refine it with the information available for your own use case.
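The whole estimate can be wrapped into a short script. This is a minimal sketch that assumes the Vultr-style A100 prices used above and fills the RAM requirement with 80GB cards plus, if needed, one 40GB card; it reproduces the GPT-3.5 figures:

```python
# End-to-end cost estimate following the formula above.
A100_80GB_MONTHLY, A100_40GB_MONTHLY = 1_750, 875  # USD/month (Vultr-style prices)

def gpu_ram_requirement_gb(params_billion: float, precision_bits: int = 16, buffer: float = 1.25) -> float:
    """Model size (billions) x (precision / 8-bit) x buffer multiplier (1.15-1.25)."""
    return params_billion * (precision_bits / 8) * buffer

def monthly_gpu_cost(ram_gb: float) -> int:
    """Cover the RAM requirement with 80GB A100s, topping up with one 40GB card if needed."""
    n_80gb, remainder = divmod(ram_gb, 80)
    n_80gb = int(n_80gb)
    if remainder > 40:      # remainder too big for a 40GB card: add another 80GB card
        n_80gb += 1
        remainder = 0
    n_40gb = 1 if remainder > 0 else 0
    return n_80gb * A100_80GB_MONTHLY + n_40gb * A100_40GB_MONTHLY

ram = gpu_ram_requirement_gb(175)   # 437.5 GB
monthly = monthly_gpu_cost(ram)     # 5 x 80GB + 1 x 40GB -> US$9,625/month
print(ram, monthly, monthly * 12)   # 437.5 9625 115500 (per year)
```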
Additional Notes
Hosting LLMs can be a significant expense. Organizations might find it more cost-effective to tap into existing model API services, such as the OpenAI API, Azure OpenAI Service, hosted Llama 2 APIs, or Google's PaLM API, and assess the return on investment (ROI) and practicality before deciding to host a private LLM.
Accurately gauging the necessary computational resources can be intricate. Utilities that ship with NVIDIA's drivers and CUDA Toolkit (such as nvidia-smi and Nsight), along with the memory-profiling tools built into deep learning frameworks, can assist in measuring and estimating GPU RAM needs.
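For example, here is a minimal sketch using PyTorch's CUDA memory counters (it assumes the transformers and accelerate libraries are installed and a CUDA GPU is available; the Falcon 7B checkpoint is just an illustrative choice):

```python
# Measure actual GPU memory after loading a model, instead of relying only on rules of thumb.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",         # illustrative checkpoint; swap in your own model
    torch_dtype=torch.float16,  # 16-bit weights, per the precision discussion above
    device_map="auto",
)

allocated_gb = torch.cuda.memory_allocated() / 1024**3
total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"Model weights occupy ~{allocated_gb:.1f} GB of {total_gb:.1f} GB on GPU 0")
```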
When seeking to optimize computational resources, throughput, and inference speed, it's often pragmatic to start by deploying the model on the existing infrastructure. Then, adjust the GPU allocation as needed to reach an optimal state. For a deeper technical dive into optimizing LLMs in production environments, check out this Hugging Face article.