As enterprises look to take their LLM prototypes and POCs to production this year, ensuring production-grade reliability is going to be critical. We believe that, given recent innovation in open-source models & GPU inference, the path to building a 3-sigma AI application increasingly runs through small open-source models, fine-tuned on your own enterprise data for specific use-cases like RAG or data extraction, and served on auto-scaling GPUs.
This is backed by a growing body of evidence: Mixtral Instruct places above GPT-3.5-Turbo on general instruction following, CodeBooga beats GPT-4 on simple coding benchmarks, and in subjective evaluations, models such as Airoboros already surpass GPT-4 on storytelling. Companies like Together, Baseten, and Anyscale are achieving extremely low cold-start times, high token throughput, and seamless GPU auto-scaling with scale-to-zero capability.
Despite this trend, from our conversations with many enterprises at all scales of deployment, we see that companies still largely rely on closed-source providers like OpenAI for even simple NLP tasks like abstractive summarization. They are skeptical of this progress and often unaware of how easy modern tools and cloud services have made it to test, fine-tune, deploy, and monitor open-source models.
Why consider using open source models?
- Keeping sensitive data within your cloud environment: Most internal LLM deployments in enterprises use extremely sensitive data that teams are uncomfortable sending outside their cloud environments. Running open-source models within their own cloud lets them keep data inside their trust boundary.
- Saving 20-100x on costs & 2-5x on latency: Cost is another important reason to move off of foundation model providers. Applications that use more input tokens than output tokens (e.g. RAG) can see 20x or more cost savings by moving off of OpenAI. Please refer to our calculator to see how much you can save for your application by moving to a per-hour pricing model for compute, or see the rough comparison sketched after this list.
- Improving reliability: As foundation model providers iterate on their models & experience GPU supply crunches, they disrupt pre-existing LLM pipelines with rate limiting, invisible caching, model deprecations, API outages, rapid API schema changes, and opaque model updates. Without any visibility into what's happening behind the API, building a 3-sigma pipeline with API-based models is essentially impossible, no matter how impressive the model may be. Even dedicated instances are being blocked by content policies.
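As a rough illustration of how per-token and per-hour pricing compare for an input-heavy workload, here is a minimal sketch. All prices below are placeholder assumptions, not quotes from any provider; plug in your own numbers.

```python
# Illustrative only: the per-token prices and GPU hourly rate below are
# placeholder assumptions, not quotes from any provider.

def monthly_api_cost(requests_per_day, input_tokens, output_tokens,
                     input_price_per_1k, output_price_per_1k):
    """Monthly cost of a per-token API at the given (hypothetical) prices."""
    per_request = (input_tokens / 1000) * input_price_per_1k \
                  + (output_tokens / 1000) * output_price_per_1k
    return per_request * requests_per_day * 30

def monthly_gpu_cost(gpu_hourly_rate, avg_gpus_running):
    """Monthly cost of self-hosting on per-hour, auto-scaling GPUs."""
    return gpu_hourly_rate * avg_gpus_running * 24 * 30

# Example: an input-heavy RAG workload.
api = monthly_api_cost(requests_per_day=50_000, input_tokens=4_000,
                       output_tokens=300, input_price_per_1k=0.01,
                       output_price_per_1k=0.03)
gpu = monthly_gpu_cost(gpu_hourly_rate=2.5, avg_gpus_running=2)
print(f"Per-token API: ${api:,.0f}/mo vs self-hosted GPUs: ${gpu:,.0f}/mo")
```

Under these assumed numbers the gap is roughly 20x, which is where input-heavy applications tend to land; the exact multiple depends entirely on your traffic shape and GPU utilization.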
Closed-source model providers build large, general-purpose models that serve as a great platform for building initial prototypes. But to build a production-grade application that can operate at scale, especially in privacy-sensitive domains like financial services and healthcare, you need to maintain data privacy, reliability, & cost efficiency, which is where open-source models excel.
Large foundation models like GPT-4 cannot yet be reliably replaced for complex tasks like agents. But for specialized tasks like writing assistants, summarization, or data extraction, open-source models like Mixtral 8x7B perform with a high degree of accuracy, especially when compared to models like GPT-3.5-Turbo or Claude.
At what scale do they make sense?
Given that the cost improvement of transitioning to open source is around 20x for a reasonable use-case and the time cost of making the switch is roughly 100 developer hours, it makes sense to start the transition once you are spending more than $10,000 a month on inference or have strict latency/reliability requirements. Depending on the cost savings available for your application, it might make sense to start even earlier, keeping in mind that sufficient inference data needs to be collected to fine-tune the models later.
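To see when the switch pays for itself, here is a quick back-of-the-envelope payback estimate using the ballpark figures above (20x cost reduction, ~100 developer hours); the engineering hourly rate is an assumption you should replace with your own.

```python
# Payback estimate for switching off a closed API. The 20x reduction and
# 100 developer hours come from the article; the $150/hr rate is an assumption.

def payback_months(monthly_api_spend, cost_reduction_factor=20,
                   switch_hours=100, hourly_rate=150):
    switch_cost = switch_hours * hourly_rate
    monthly_savings = monthly_api_spend * (1 - 1 / cost_reduction_factor)
    return switch_cost / monthly_savings

print(f"{payback_months(10_000):.1f} months to recoup the switch at $10k/mo")
# -> roughly 1.6 months under these assumptions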
Selecting the right model for your task
There are a myriad of open-source models available for commercial use today, but picking the right one for your use-case can be challenging. Before making a final decision, we recommend shortlisting the top 2-3 models suitable for your use-case so that you can then evaluate their performance over a larger dataset and select the best one.
Popular benchmarks have become too noisy to be considered reliable. Communities are shifting to evaluating models with carefully curated riddles and back-and-forth conversations ("vibe checks") to differentiate good models from bad ones; the Sally test is one such example. For storytelling and other creative tasks, manual evaluation still reigns supreme. At the end of the day, evaluation is a task-specific endeavor that involves people reviewing some of the model responses. We have created a leaderboard of popular leaderboards to cut through the noise and help with the initial process of picking your first set of models, and we will be keeping this sheet up-to-date manually.
There are two overarching considerations to think about when picking your initial open-source LLMs:
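Whichever evaluation method you land on, the mechanics are the same: run each shortlisted model over a curated sample set and aggregate task-specific scores. Below is a minimal, provider-agnostic sketch; the `generate` and `score` functions are placeholders you would wire to your own inference stack and grading logic (human review, exact match, LLM-as-judge, etc.).

```python
# A minimal harness for comparing shortlisted models on your own data.
# `generate` and `score` are placeholders, not any particular SDK's API.

from statistics import mean

def evaluate(models, dataset, generate, score):
    """Run each candidate model over the dataset and report a mean score."""
    results = {}
    for model in models:
        scores = []
        for example in dataset:
            output = generate(model, example["prompt"])
            scores.append(score(output, example.get("reference")))
        results[model] = mean(scores)
    return results

# Example usage with your own functions wired in:
candidates = ["mixtral-8x7b-instruct", "llama-2-70b-chat"]
dataset = [{"prompt": "Summarize: ...", "reference": "..."}]
# results = evaluate(candidates, dataset, generate=my_generate, score=my_score)
```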
- What type of task are we going to use this model for? Model performance varies significantly across use-cases, based on the kind of data the model was initially trained on. Some models are explicitly trained and further fine-tuned for conversational, chat-based interfaces, like Meta's Llama v2 Chat. Others are trained to generate code (e.g. DeepSeek, CodeBooga), generate SQL queries (e.g. SQLCoder, NSQL), follow instructions (e.g. Mixtral Instruct), generate engaging stories (e.g. MythoMax, Synthia), perform function calling (e.g. DeepSeek, WizardCoder), extract data (e.g. Airoboros-L2, Phi), etc.
- Do we have sufficient data for fine-tuning a smaller model? For most reasoning, data extraction, or generative tasks (like writing assistants & copilots), you will likely need to fine-tune your chosen open-source models on 1,000 to 10,000 well-annotated data samples, either collected from your existing application or manually curated and labeled (a sketch of what such samples can look like follows after this list). The more fine-tuning data you have, the smaller the model you can pick to fine-tune.
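For reference, well-annotated fine-tuning samples are often stored as JSONL instruction/response pairs, as in the sketch below. The exact schema and field names depend on your fine-tuning framework; the ones here are purely illustrative.

```python
# Illustrative fine-tuning samples as JSONL instruction/response pairs.
# Field names are an assumption; adapt them to your fine-tuning framework.

import json

samples = [
    {
        "instruction": "Extract the invoice number and total from the text.",
        "input": "Invoice #A-1042 ... Total due: $1,250.00",
        "output": '{"invoice_number": "A-1042", "total": "1250.00"}',
    },
    # ... 1,000 to 10,000 examples collected from your application logs
    # or manually curated and labeled.
]

with open("finetune_data.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```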
Once you know the model versions and sizes, you can further filter among the task-specific variants based on:
- Context window size: For retrieval-augmented generation, summarization, and long-format stories, you need models with large context windows. Examples of large context window models include CodeLlama v2 100k, LongAlpaca 64k, and MPT Storywriter 64k.
- Language support: Some open-source models have been trained to support languages other than English; depending on the audience of your application, this could be a key consideration in your final choice. Examples of non-English models include Manticore for Spanish, Cerbero for Italian, and SauerkrautLM for German.
- GPU requirements: Depending on the kind of GPUs you are able to afford, it makes sense to pick a smaller quantization. 70 billion parameters with 4-bit quantization is a good performance-vs-cost tradeoff point for most simple use-cases, while models smaller than 34B parameters are often hit-or-miss for production-scale usage. Please refer to this calculator to check which quantization of a model can run on a specific GPU (a rough estimate is sketched after this list).
- Available quantization formats: GGUF & GGML are popular model quantization formats supported by different LLM runtimes (vLLM, ollama, llama.cpp, etc.). Depending on which runtime you are using, you need to ensure the right quantization format is available to load the model in that runtime.
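As a rough rule of thumb (a proper calculator should be used for real sizing decisions), quantized weights take approximately parameters × bits / 8 bytes of VRAM, plus headroom for the KV cache and activations. The 20% overhead factor below is an assumption.

```python
# Back-of-the-envelope VRAM estimate for a quantized model: weights take
# roughly (parameters * bits / 8) bytes, plus headroom for KV cache and
# activations. The 1.2x overhead factor is a rough assumption.

def approx_vram_gb(params_billion, bits, overhead=1.2):
    weight_gb = params_billion * 1e9 * bits / 8 / 1e9
    return weight_gb * overhead

for bits in (16, 8, 4):
    print(f"70B @ {bits}-bit ~ {approx_vram_gb(70, bits):.0f} GB")
# 16-bit ~ 168 GB, 8-bit ~ 84 GB, 4-bit ~ 42 GB under these assumptions
```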
Lastly, it is important to note that most open-source models require different prompt formats than OpenAI's models, varying based on the data the model was initially trained on. For example, Llama v2 Chat has been trained on instruction/response pairs, which means you need to prompt the model by prefixing it with “<s>[INST] <<SYS>>{{ system_prompt }}<</SYS>>” to get the best responses. Please refer to the HuggingFace model card for your candidate models to get these details. It’s important to follow the specified format; otherwise model outputs will be too erratic for a serious evaluation.
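For instance, a minimal helper that assembles a single-turn Llama v2 Chat prompt in the format above might look like the sketch below; check the model card for the exact template your model expects, and note that multi-turn conversations add further [INST] blocks.

```python
# Minimal single-turn Llama v2 Chat prompt builder. Multi-turn conversations
# and other model families use different templates; see the model card.

def llama2_chat_prompt(system_prompt: str, user_message: str) -> str:
    return (
        "<s>[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(llama2_chat_prompt(
    "You are a concise assistant for summarizing support tickets.",
    "Summarize the following ticket: ...",
))
```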
HoneyHive for model selection & evaluation
We now provide access to 100+ open-source models in our Playground through integrations with GPU clouds like Together, Replicate, etc. You can use HoneyHive to test these models, evaluate and benchmark performance over your own enterprise data, and select the right model for your use-case.
If interested, you can join our waitlist to get started with HoneyHive for free. If you’re an enterprise leader looking to learn more about how to evaluate open-source models, get in touch with us.