Start building for free, and only pay when you scale your app.
Free
No credit card required
10k events per month
30d log retention
Up to 2 users
Full evaluation and observability suite
Let's chat
Ideal for scaling teams
Custom usage limits
SSO & SAML
VPC self-hosting add-on
Dedicated support and SLA
An event refers to a single trace span, structured log, or metric label combination sent to our API as OTLP or JSON. It captures any relevant data from your system, including all context fields generated by your application's instrumentation.
Automated Evaluators: An automated evaluator is a function (code or LLM) that tests any event, or combination of events, and produces a measurable score (plus an explanation, in the case of LLM evaluators). Common examples include Context Relevance, Answer Faithfulness, ROUGE, and BERTScore. We provide many common evaluators out of the box and let you define custom evaluators within the platform.
Human Evaluators: We strongly encourage a hybrid-evaluation approach, i.e. combining automated techniques with human oversight. This helps you account for evaluation criteria bias and better align your evaluators with your domain experts' scoring rubric. To enable this, you can define custom scoring rubrics in HoneyHive for domain experts to use when evaluating outputs.
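To make the idea of a code-based automated evaluator concrete, here is a minimal sketch of a ROUGE-1 recall scorer written as a plain Python function. It is an illustration only, not HoneyHive's evaluator interface: the `EvalResult` type, the `rouge1_recall` function name, and the `output` and `reference` event fields are all assumptions made for this example.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalResult:
    # Hypothetical result shape: a numeric score plus a short explanation.
    score: float
    explanation: str


def _tokens(text: str) -> list[str]:
    # Lowercase and split into alphanumeric word tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_recall(event: dict) -> EvalResult:
    """Fraction of reference unigrams that also appear in the output.

    Assumes the event carries "output" and "reference" text fields
    (hypothetical names chosen for this sketch).
    """
    ref = Counter(_tokens(event["reference"]))
    out = Counter(_tokens(event["output"]))
    total = sum(ref.values())
    if total == 0:
        return EvalResult(0.0, "empty reference")
    # Clipped overlap: a token counts at most as often as it occurs in the output.
    overlap = sum(min(count, out[tok]) for tok, count in ref.items())
    return EvalResult(overlap / total, f"{overlap}/{total} reference unigrams matched")


result = rouge1_recall({
    "output": "The cat sat on the mat",
    "reference": "The cat lay on the mat",
})
print(result.score, result.explanation)
```

An LLM-based evaluator would have the same shape, with the scoring logic replaced by a model call that returns both the score and its explanation.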
HoneyHive allows you to filter and curate datasets from your production logs. These datasets can be annotated by domain experts within the platform and exported programmatically for fine-tuning models.
You can export datasets curated within HoneyHive using our SDK and use your preferred third-party provider to fine-tune custom models. Our DSL's flexibility also supports more advanced use cases like Active Learning. Contact us to learn more.
All data is secure and encrypted at rest and in transit. We are SOC-2 compliant, conduct regular penetration tests, and provide flexible hosting solutions (multi-tenant SaaS or self-hosted in VPC) to meet your security and privacy needs.
Yes. We support logging up to 5MB per span, which translates to ~1.7M tokens per span. This is designed to support most current-generation long-context models such as Claude and Gemini.
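The stated figures imply roughly 3 bytes per token (5 MB divided by about 1.7M tokens). As a rough client-side guard, the sketch below estimates a span's serialized size before it is sent. It assumes the limit is measured on the UTF-8 encoded JSON payload and reads "5MB" as 5 MiB; both are assumptions, and the `MAX_SPAN_BYTES`, `span_size_bytes`, and `fits_span_limit` names are hypothetical helpers, not part of any SDK.

```python
import json

# Assumption: the 5MB limit is interpreted as 5 MiB of UTF-8 encoded JSON.
MAX_SPAN_BYTES = 5 * 1024 * 1024


def span_size_bytes(span: dict) -> int:
    # Size of the span once serialized to JSON and encoded as UTF-8.
    return len(json.dumps(span, ensure_ascii=False).encode("utf-8"))


def fits_span_limit(span: dict, limit: int = MAX_SPAN_BYTES) -> bool:
    # True when the serialized span is within the assumed per-span limit.
    return span_size_bytes(span) <= limit


small_span = {"inputs": {"prompt": "Summarize this document."}}
print(span_size_bytes(small_span), fits_span_limit(small_span))
```

A span that fails this check could be truncated or split into child spans before logging; the right strategy depends on your application.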
By default, we do not proxy your requests via our servers. Instead, we store prompts as YAML configurations, which can be deployed and fetched in your application logic using the GET Configuration API.
You can log traces using our SDKs and API endpoints, or asynchronously via our batch ingestion endpoint. We offer native SDKs in Python and TypeScript with OpenTelemetry support, and provide automatic integrations with popular frameworks like LangChain/LangGraph and LlamaIndex.
If you use another language, you can send your OpenTelemetry traces to our OTel collector or manually instrument your application using our APIs.