
Understanding the DeepEval Framework: A New Approach to LLM Evaluation

  • Preethi P Unnikrishnan
  • Dec 23, 2024
  • 5 min read


Background

At ExaThought, we are committed to enhancing the efficiency and reliability of AI-powered chatbots. 


Our latest innovation, the ExaGen AI Chatbot, allows users to interact with unstructured data by uploading PDF documents as a source for questions.


However, testing responses from large language models (LLMs) is uniquely challenging due to their non-deterministic nature — meaning the same input can yield different outputs. To ensure accuracy, relevance, and factual correctness of the chatbot's responses, we explored various testing approaches.


After evaluating several frameworks, we identified DeepEval as the most suitable solution. DeepEval offers a pre-built suite of 14+ LLM evaluation metrics, making it ideal for production use cases.

What is the DeepEval Framework?


Before we dive into the evaluation setup, let's quickly look at what DeepEval is.


DeepEval is an open-source framework designed to evaluate LLM-generated responses for accuracy, consistency, and relevance. It provides a plug-and-play solution with ready-to-use metrics that are easy to customize. This framework is widely used in AI testing, particularly for LLM-powered applications like chatbots and generative AI systems.


Key Features of DeepEval:

  • Pre-built metrics: Answer relevance, coherence, hallucination, bias, and more.

  • Customizable criteria: Tailor the evaluation metrics to specific business use cases.

  • Seamless integration: Easy to integrate with existing AI pipelines.
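
As a quick illustration of this plug-and-play approach, here is a minimal sketch of a single pre-built metric applied to one test case. The question, answer, and context values are placeholders, an OpenAI API key is assumed for the evaluation model, and exact defaults may vary by DeepEval version:

```python
# A minimal DeepEval test: score one chatbot response with a pre-built metric.
# Assumes `pip install deepeval` and an OPENAI_API_KEY for the evaluation model.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

def test_answer_relevance():
    test_case = LLMTestCase(
        input="What type of engine does the car described in the document use?",  # placeholder query
        actual_output="The car uses a 2.0-litre turbocharged petrol engine.",     # placeholder answer
        retrieval_context=["The vehicle is fitted with a 2.0-litre turbocharged petrol engine."],
    )
    # The test fails if the relevance score falls below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.8)])
```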


Our Approach to the Evaluation (The Methodology)


To ensure the ExaGen AI Chatbot delivers coherent, factual, relevant, hallucination-free, and unbiased responses, we set up an automated testing framework using G-Eval, an advanced evaluation technique that works in conjunction with DeepEval. G-Eval uses Chain-of-Thought (CoT) reasoning to assess LLM output, offering the flexibility to evaluate any custom criteria with human-like precision.
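
For example, a custom coherence criterion can be expressed with G-Eval in a few lines. This is a sketch assuming DeepEval's GEval class; the criteria wording and the 0.8 threshold are illustrative:

```python
# A custom G-Eval metric: the judge model reasons step by step (CoT) over the
# stated criteria before producing a 0-1 score for each test case.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams

coherence_metric = GEval(
    name="Coherence",
    criteria=(
        "Determine whether the actual output is logically consistent, "
        "well structured, and easy to follow given the input question."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8,  # the quality bar we use throughout
)
```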


How We Tested the ExaGen AI Chatbot


Our testing process followed a structured methodology to evaluate key parameters like accuracy, relevance, hallucination, and bias. Here’s a step-by-step overview of the approach:


  1. Question Set Preparation: We compiled a diverse set of questions based on content extracted from a PDF (example: content related to cars) using ChatGPT.

  2. Model Selection: We tested multiple models, including GPT-4o-mini, GPT-4o, and GPT-3.5 Turbo, setting a threshold score of 0.8 for quality.

  3. Custom Criteria Definition: Custom evaluation criteria were defined for metrics such as hallucination detection, coherence, and bias.


  4. Evaluation Parameters: We set the evaluation parameters, namely the input, the actual output, and the context (ground truth), and collected the response from the LLM.

  5. Metric Calculation: Responses were scored using predefined DeepEval metrics like relevance, coherence, faithfulness, hallucination, and bias.


Key Metrics We Evaluated


DeepEval provides comprehensive insights into LLM performance by measuring several key metrics. For the ExaGen AI Chatbot, we focused on:

  • Relevance: Does the response directly answer the user's query?

  • Accuracy: Are the facts provided correct and verifiable?

  • Hallucination: Does the LLM introduce information not in the source?

  • Coherence: Is the response logically consistent and clear?

  • Bias: Are responses free from any form of bias?
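
For reference, these conceptual metrics map roughly onto DeepEval's built-in metric classes as sketched below. Accuracy has no standalone class here, so it is expressed as a G-Eval criterion, coherence reuses the G-Eval sketch shown earlier, and class names and defaults may differ across DeepEval versions:

```python
# Rough mapping of the metrics above onto DeepEval metric objects.
# Coherence is expressed as a custom G-Eval criterion (see the earlier sketch).
from deepeval.metrics import (
    AnswerRelevancyMetric,  # relevance: does the answer address the query?
    HallucinationMetric,    # hallucination: contradiction of the supplied context
    FaithfulnessMetric,     # grounding of claims in the retrieved context
    BiasMetric,             # bias: lower scores are better
    GEval,
)
from deepeval.test_case import LLMTestCaseParams

THRESHOLD = 0.8  # quality bar used across our tests

# "Accuracy" has no single built-in class, so we express it as a G-Eval criterion.
accuracy_metric = GEval(
    name="Accuracy",
    criteria=(
        "Every factual claim in the actual output must be correct and "
        "verifiable against the provided context."
    ),
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.CONTEXT],
    threshold=THRESHOLD,
)

metric_suite = [
    AnswerRelevancyMetric(threshold=THRESHOLD),
    FaithfulnessMetric(threshold=THRESHOLD),
    HallucinationMetric(threshold=0.5),  # maximum: lower hallucination is better
    BiasMetric(threshold=0.5),           # maximum: lower bias is better
    accuracy_metric,
]
```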

Sample Test Case: Detecting Hallucinations


  • Test Setup: We extracted content about cars using ChatGPT and uploaded it to the ExaGen AI Chatbot for testing.

  • Objective: Identify if the LLM introduces hallucinated content not found in the source PDF.

  • Process:

    • Send user queries to the API.

    • Collect the response from the LLM and match it against the ground truth (source content).

    • Calculate the score for hallucination detection using DeepEval.


Example implementation and execution:


We define the hallucination test cases as shown below. This helper is called after we receive a response from the API; that response contains the user query, the LLM answer, and the context given to the LLM.
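
A sketch of such a helper, assuming DeepEval's HallucinationMetric; the function name is illustrative, and note that for this metric the threshold is an upper bound (lower hallucination scores are better):

```python
# Hallucination check: compare the chatbot's answer against the ground-truth
# context that was supplied to the LLM. The function name is illustrative.
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

def check_hallucination(user_query: str, llm_response: str, context: list[str]) -> None:
    test_case = LLMTestCase(
        input=user_query,            # the user's question
        actual_output=llm_response,  # what the chatbot answered
        context=context,             # ground-truth passages from the source PDF
    )
    # The hallucination score measures contradiction of the context, so the
    # threshold is a maximum: the test passes when the score stays below it.
    assert_test(test_case, [HallucinationMetric(threshold=0.5)])
```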


We make a direct call to the API to capture the context and the LLM response for each user question, as shown below.
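
A sketch of that call; the endpoint URL, payload fields, and response keys below are hypothetical placeholders standing in for the actual ExaGen API, which is not shown here:

```python
# Hypothetical sketch of the chatbot API call. The endpoint URL, payload
# fields, and response keys are placeholders.
import requests

CHATBOT_URL = "https://example.com/exagen/chat"  # placeholder endpoint

def ask_chatbot(user_query: str) -> tuple[str, list[str]]:
    response = requests.post(CHATBOT_URL, json={"question": user_query}, timeout=60)
    response.raise_for_status()
    data = response.json()
    # Assumed response shape: {"answer": "...", "context": ["...", "..."]}
    return data["answer"], data["context"]
```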


For each user query, we then calculate the pre-defined DeepEval metrics on the captured response and context, as shown below.
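
A sketch of that loop, reusing the `ask_chatbot` helper sketched above (the question list is illustrative; `evaluate` prints a per-metric report for every test case):

```python
# For every question in the prepared set, build a test case from the captured
# answer and context, then score it against the pre-defined metric suite.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, BiasMetric
from deepeval.test_case import LLMTestCase

questions = [  # illustrative questions drawn from the car PDF
    "What safety features does the car described in the document have?",
    "What fuel efficiency does the document quote?",
]

metrics = [
    AnswerRelevancyMetric(threshold=0.8),
    FaithfulnessMetric(threshold=0.8),
    BiasMetric(threshold=0.5),  # bias score is a maximum: lower is better
]

test_cases = []
for question in questions:
    answer, context = ask_chatbot(question)  # helper sketched above
    test_cases.append(
        LLMTestCase(
            input=question,
            actual_output=answer,
            retrieval_context=context,  # FaithfulnessMetric grounds on this
            context=context,
        )
    )

evaluate(test_cases=test_cases, metrics=metrics)
```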



Run the test case using the following command:

deepeval test run test_chatbot_response.py

The scores are then calculated and reported for each run (Run 1 and Run 2).


Actionable Insights


DeepEval bridges the gap between large, complex LLMs and actionable evaluation strategies. By incorporating this testing framework, we gained deeper insights into model performance, which led to several important realizations:


  • Real-World Scenario Testing: DeepEval allows us to create scenario-specific evaluations, such as testing for chatbot interactions that mimic real-world conversations.

  • Bias Detection: DeepEval highlights areas where models exhibit bias, enabling us to train AI systems that are more ethical and fair.

  • Continuous Monitoring: Continuous testing is essential as LLMs evolve over time. By running iterative tests, we ensure that our AI systems maintain optimal performance.

  • RAG Pipeline Evaluation: For retrieval-augmented generation (RAG) pipelines, DeepEval can measure contextual precision, recall, and relevance of retrievers, which is crucial for RAG-based chatbots.
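
As a brief illustration of those retriever-focused metrics, here is a sketch; the test-case values are placeholders, and the precision and recall metrics additionally require an expected output:

```python
# Sketch: scoring the retrieval step of a RAG pipeline rather than the answer.
from deepeval import evaluate
from deepeval.metrics import (
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    ContextualRelevancyMetric,
)
from deepeval.test_case import LLMTestCase

rag_case = LLMTestCase(
    input="What warranty does the car come with?",
    actual_output="The car comes with a five-year warranty.",
    expected_output="A five-year manufacturer warranty.",  # needed by precision/recall
    retrieval_context=["The vehicle includes a five-year manufacturer warranty."],
)

evaluate(
    test_cases=[rag_case],
    metrics=[
        ContextualPrecisionMetric(threshold=0.8),
        ContextualRecallMetric(threshold=0.8),
        ContextualRelevancyMetric(threshold=0.8),
    ],
)
```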


Why Use DeepEval?


We found that by incorporating DeepEval into our workflow, we gained actionable insights that make every iteration of an LLM smarter, faster, and more reliable.


Who Benefits from DeepEval?

DeepEval is a vital tool for teams relying on AI-driven LLM applications. Businesses that need reliable, precise, and contextually relevant responses from their LLM-powered chatbots, virtual assistants, or customer service platforms will find this framework indispensable.


Industry Use Cases

DeepEval is ideal for industries where chatbot interactions are critical, including:


  • Healthcare: Virtual health assistants and patient support bots.

  • Banking & Finance: AI-driven customer service for banking inquiries.

  • E-commerce & Retail: Virtual shopping assistants and product recommendation engines.

  • Enterprise Services: Automated customer support, HR assistants, and internal helpdesk bots.


Our Learnings & Findings


  1. Customizable Metrics: G-Eval allows users to define and implement custom evaluation criteria, giving us the flexibility to measure aspects specific to our use cases and requirements.

  2. Human-like Assessment: By using Chain-of-Thought (CoT) reasoning, G-Eval ensures that LLM outputs are evaluated similarly to how humans would assess them.

  3. Comprehensive Scoring: The use of metrics like answer relevancy, coherence, faithfulness, and coverage provides a well-rounded evaluation of AI responses.

  4. Benchmarking AI Models: With G-Eval, we can benchmark different models such as GPT-4o and GPT-3.5 Turbo to identify the best fit for our business needs (see the sketch after this list).

  5. Ethical AI Development: By identifying and mitigating hallucination, bias, and factual inaccuracies, we support the development of trustworthy and ethical AI models.
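
A minimal sketch of that benchmarking idea: run the same question set and metric suite against outputs from two candidate backend models and compare the reports. The `generate_answer` helper is a hypothetical stand-in for switching the chatbot's backend model:

```python
# Sketch: benchmark two candidate backend models on the same questions/metrics.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

def generate_answer(model_name: str, question: str) -> tuple[str, list[str]]:
    """Hypothetical helper: query the chatbot configured with `model_name`
    and return (answer, retrieved_context)."""
    raise NotImplementedError

questions = ["What is the top speed listed in the document?"]  # illustrative
metrics = [AnswerRelevancyMetric(threshold=0.8), FaithfulnessMetric(threshold=0.8)]

for model_name in ["gpt-4o", "gpt-3.5-turbo"]:
    cases = []
    for q in questions:
        answer, context = generate_answer(model_name, q)
        cases.append(LLMTestCase(input=q, actual_output=answer, retrieval_context=context))
    print(f"--- Results for {model_name} ---")
    evaluate(test_cases=cases, metrics=metrics)
```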


In Conclusion


Testing responses from LLMs is more complex than testing traditional software, where outcomes are deterministic. LLMs are probabilistic, meaning responses can vary significantly. This necessitates a comprehensive testing strategy.


At ExaThought, we use a blend of unit testing, functional testing, performance testing, and responsibility testing to ensure that the ExaGen AI Chatbot remains consistent, ethical, and efficient. The DeepEval framework provides a comprehensive approach to assessing the quality of LLM responses, offering a structured method to measure accuracy, relevance, and hallucination.


By integrating DeepEval and G-Eval, we continuously improve our LLMs, ensuring they remain aligned with business needs. Ongoing monitoring and testing enable us to build smarter, faster, and more reliable AI-driven systems.


If you’re building LLM-based applications, consider exploring the DeepEval framework. It offers a structured and effective approach to ensure your models deliver high-quality, factual, and ethical responses. Start with a Proof of Concept (PoC) to see how it aligns with your business objectives.
