Late 2023 and early 2024 have been busy for us. We welcomed many new customers, expanded our team, and supported a wide range of complex use-cases, from advanced RAG to multi-agent systems operating at scale. We are immensely thankful to our customers for their amazing feedback over these past few months.
Today, we are excited to unveil some of our biggest product and infrastructure updates since we started the company. These updates are aimed at making HoneyHive even more powerful for our customers and supporting them as they build increasingly ambitious applications.
Here's a preview of what we're launching today:
- New Data Model: We migrated to a trace-and-span native data model which is optimized for RAG and multi-agent systems
- More Powerful Evaluators: We added advanced capabilities like Filtering and Sampling to our Custom Evaluators, enabling evaluations on traces and specific spans
- Session Monitoring: Improved our monitoring interface, added Session analytics, and improved our chart query builder, enabling powerful RAG and agent monitoring
- Datasets: Launched Datasets, a new way for domain experts to curate, label, and explore datasets for fine-tuning and evaluation
- Self-Hosting: Started supporting VPC deployments for enterprises in regulated sectors like Financial Services
What’s New
New Data Model
In the early days of LLM app development, when `text-davinci-002` was considered state-of-the-art, application architectures were quite simple. LLM backends mostly consisted of a single API request to a model provider like OpenAI, without any pre/post-processing steps or advanced techniques like RAG, ReAct, routing, etc.
Fast forward to 2024, and applications have evolved to include more complex architectures with multiple pre- and post-processing steps. Most products now utilize some form of RAG, developers are building complex orchestration frameworks for autonomous agents, and model routing and guardrails are now standard. Although the complexity of LLM application architectures has increased significantly over the last 18 months, developer tooling has not kept pace.
In response to this shift, we overhauled our data model completely. Our underlying data model is now trace and span native. The base unit of data in HoneyHive is an `event`, representing a span in a trace. A root event in a trace is of the type `session`, while all non-root events in a trace can be of three core types:
- `model` events: Used to track the execution of any LLM requests.
- `tool` events: Used to track the execution of any deterministic functions, like requests to vector DBs, requests to an external API, regex parsing, document reranking, and more.
- `chain` events: Used to group together multiple `model` and `tool` events into composable units that can be evaluated and monitored independently.
A trace from an AI search assistant built by our team.
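To make the event types above concrete, here is a minimal sketch of how a single RAG trace could be represented as a tree of events. The class, field names, and values below are illustrative assumptions for this post, not HoneyHive's actual schema or SDK.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative event record; field names are assumptions, not HoneyHive's wire format.
@dataclass
class Event:
    event_id: str
    event_type: str              # "session" (root), "model", "tool", or "chain"
    name: str
    parent_id: Optional[str] = None
    metadata: dict = field(default_factory=dict)

# A toy RAG trace: a session root, a retrieval chain containing a tool event,
# and a model event for the final completion.
trace = [
    Event("evt-1", "session", "search_assistant_session"),
    Event("evt-2", "chain", "retrieval", parent_id="evt-1"),
    Event("evt-3", "tool", "vector_db_query", parent_id="evt-2",
          metadata={"top_k": 5, "latency_ms": 42}),
    Event("evt-4", "model", "answer_generation", parent_id="evt-1",
          metadata={"model": "gpt-4", "latency_ms": 1200}),
]
```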
This new data model provides an unprecedented level of visibility and granularity for our customers and has enabled us to develop powerful new capabilities like agent trajectory evaluations that are truly industry-first.
More Powerful Evaluators
Filtering
Filters are powerful if you’re building agents or RAG applications and need to test specific steps in your pipeline, such as your vector database, reranking step, synthesis step, or tool-use step.
Our new data model allows you to run evaluators over specific events, collections of events, chains, or the entire session, depending on your needs. You can break your LLM pipeline into smaller units that are unit tested individually, optimizing each component independently while still understanding how it affects your overall application's performance.
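As a rough illustration of the idea, the sketch below shows an evaluator that only runs on `tool` events from the retrieval step. The function names and the `retrieved_chunks` metadata field are hypothetical, and this is plain Python rather than the HoneyHive SDK; it reuses the illustrative `Event` dataclass from the earlier sketch.

```python
# Hypothetical filtered evaluator: scores only retrieval tool events in a trace.

def retrieval_precision(event) -> float:
    """Score a retrieval event by the fraction of returned chunks marked relevant."""
    chunks = event.metadata.get("retrieved_chunks", [])
    if not chunks:
        return 0.0
    relevant = sum(1 for c in chunks if c.get("relevant"))
    return relevant / len(chunks)

def run_filtered_evaluator(trace, evaluator, event_type, name):
    """Apply an evaluator only to events matching a type/name filter."""
    return {
        e.event_id: evaluator(e)
        for e in trace
        if e.event_type == event_type and e.name == name
    }

# Example (using the toy trace from the data model sketch):
# scores = run_filtered_evaluator(trace, retrieval_precision, "tool", "vector_db_query")
```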
Online Sampling
HoneyHive already supported online evaluators for live production monitoring, but running LLM evaluators became quite costly for some of our customers, especially at production scale.
You can now randomly sample a specific percentage of requests for evaluation with HoneyHive. This helps control costs while still enabling you to evaluate your live production traffic and monitor failures as they happen.
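The idea behind sampling is simple; here is a minimal sketch assuming a per-request random draw (the exact sampling strategy HoneyHive uses under the hood may differ):

```python
import random

SAMPLE_RATE = 0.10  # evaluate roughly 10% of production requests

def maybe_evaluate(event, evaluator, sample_rate=SAMPLE_RATE):
    """Run a (potentially expensive) LLM evaluator on only a random subset of requests."""
    if random.random() < sample_rate:
        return evaluator(event)   # evaluated and logged
    return None                   # skipped to control cost
```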
Session Monitoring and Advanced Queries
Our new data model enables you to analyze your traces independently from specific steps in your LLM pipeline. This allows you to monitor specific steps (e.g., your vector database queries) and calculate metrics such as `Median User Rating per Session` or `P99 Retrieval Latency`, providing much more detailed and granular insight into your application.
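For instance, once every span carries its session ID, latency, and any feedback, both of those metrics reduce to simple aggregations. The snippet below is an illustrative computation over toy records, not how the HoneyHive query engine is implemented.

```python
import statistics

# Illustrative event records; field names are assumptions for this example.
events = [
    {"session_id": "s1", "event_type": "tool", "name": "vector_db_query", "latency_ms": 38},
    {"session_id": "s1", "event_type": "session", "user_rating": 4},
    {"session_id": "s2", "event_type": "tool", "name": "vector_db_query", "latency_ms": 95},
    {"session_id": "s2", "event_type": "session", "user_rating": 5},
]

# Median User Rating per Session, taken from the root session events
ratings = [e["user_rating"] for e in events if e["event_type"] == "session"]
median_rating = statistics.median(ratings)

# P99 Retrieval Latency across vector DB tool events
latencies = sorted(e["latency_ms"] for e in events if e.get("name") == "vector_db_query")
p99_latency = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]
```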
As part of this shift, we are excited to introduce 3 core chart types today:
- Session charts: Help you observe how users interact with your app over the course of a session and monitor key metrics such as `Avg Number of User Turns`, `Avg Session Duration`, `Median User Rating per Session`, and more.
- Completion charts: Help you specifically monitor all LLM requests. This includes key metrics like cost, latency, token usage, and API errors, as well as any specific evaluators you may have defined (e.g., `Keyword Assertions`, `Answer Faithfulness`, `JSON Validity`, etc.).
- Event charts: Help you monitor specific `chain` or `tool` events of interest. Examples include independently monitoring reranking and synthesis steps in a RAG pipeline, tracking `Context Relevance` across retrieved chunks to gauge retriever performance, and more.
These updates are especially powerful for customers building RAG-based assistants, enabling them to optimize not just their prompt or model, but also subcomponents of their pipeline such as the chunking strategy, retrieval architecture, tool use, and more.
Datasets
As LLM applications become more customized for specific industries and use-cases, domain experts and PMs are increasingly involved in LLM application development. They play a crucial role in data annotation and labeling to create domain-specific datasets that are used to train custom models, especially in industries like Financial Services, Law, and others where domain knowledge is necessary.
To support this emerging developer persona and their unique needs, we are thrilled to announce a new feature, Datasets, which enables domain experts to curate, explore, and label logs within HoneyHive and build golden datasets for fine-tuning and evaluation. Domain experts can easily provide corrections to requests logged in production in a no-code interface and turn them into high-quality datasets for fine-tuning custom models.
Our most ambitious customers have used Datasets to:
- Curate Training Datasets: Allow domain experts to curate and label fine-tuning datasets for optimizing their custom LLMs.
- Implement Active Learning Pipelines: Set up active learning pipelines to periodically query datasets stored in HoneyHive for fine-tuning and automatic validation.
- Set up CI Testing: Use reference evaluation datasets for evaluating their application across common scenarios on a daily/weekly basis.
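As a rough illustration of the second pattern above, an active learning job might periodically pull the latest curated examples and hand them to a fine-tuning pipeline. The base URL, dataset ID, and response shape below are purely hypothetical placeholders, not the actual HoneyHive API; consult the docs for the real endpoints.

```python
import requests

# Hypothetical placeholders only; not HoneyHive's real API surface.
API_BASE = "https://api.example.com/v1"   # placeholder base URL
DATASET_ID = "curated-corrections"        # placeholder dataset ID
API_KEY = "hh-..."                        # placeholder key

def fetch_curated_dataset():
    """Pull the latest labeled examples from a curated dataset (illustrative only)."""
    resp = requests.get(
        f"{API_BASE}/datasets/{DATASET_ID}/examples",
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # assumed shape: list of {"input": ..., "expected_output": ...}

def run_active_learning_cycle(fine_tune_fn):
    """Called periodically (e.g., by a cron job) to fine-tune on newly labeled data."""
    examples = fetch_curated_dataset()
    if examples:
        fine_tune_fn(examples)  # hand off to your own fine-tuning pipeline
```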
We are excited to support our customers' data curation and labeling workflows and look forward to seeing what other automations they build with HoneyHive!
Support for Private-Cloud Deployments
Recognizing the increasing need for stringent data privacy and security measures, particularly within regulated sectors such as Financial Services and Healthcare, we now support Virtual Private Cloud (VPC) deployments, starting today with AWS and GCP, and soon to be expanded to Azure.
VPC deployments offer a dedicated and isolated environment for hosting HoneyHive within the customer's own cloud, ensuring that sensitive data processed by the platform remains within the customer's controlled network environment. This level of isolation allows customers to benefit from HoneyHive's advanced automations and insights without compromising on data privacy and compliance. This initiative underscores our commitment to keeping our customers' data secure and private.