Datasets

Quick Summary

In deepeval, an evaluation dataset, or just dataset, is a collection of LLMTestCases. There are two approaches to evaluating datasets in deepeval:

  1. using @pytest.mark.parametrize and assert_test
  2. using evaluate

Create An Evaluation Dataset

Your original dataset can be in any format, such as CSV or JSON. Taking JSON as an example, the first step is to create a list of LLMTestCases from your original dataset.

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

original_dataset = [
    {
        "input": "What are your operating hours?",
        # Replace with your LLM application output
        "actual_output": "...",
        "context": [
            "Our company operates from 10 AM to 6 PM, Monday to Friday.",
            "We are closed on weekends and public holidays.",
            "Our customer service is available 24/7.",
        ],
    },
    {
        "input": "Do you offer free shipping?",
        # Replace with your LLM application output
        "actual_output": "...",
        "expected_output": "Yes, we offer free shipping on orders over $50.",
    },
    {
        "input": "What is your return policy?",
        # Replace with your LLM application output
        "actual_output": "...",
    },
]

test_cases = []
for datapoint in original_dataset:
    input = datapoint.get("input", None)
    actual_output = datapoint.get("actual_output", None)
    expected_output = datapoint.get("expected_output", None)
    context = datapoint.get("context", None)

    test_case = LLMTestCase(
        input=input,
        actual_output=actual_output,
        expected_output=expected_output,
        context=context
    )
    test_cases.append(test_case)

dataset = EvaluationDataset(test_cases=test_cases)

You can also append a test case to an EvaluationDataset through the test_cases instance variable, or via the add_test_case() method:

...

dataset.test_cases.append(test_case)
# or
dataset.add_test_case(test_case)
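
For instance, here is a minimal sketch of appending a test case built from a live LLM call at runtime. Note that my_chatbot() is a hypothetical stand-in for your own LLM application, not part of deepeval:

from deepeval.test_case import LLMTestCase
from deepeval.dataset import EvaluationDataset

def my_chatbot(query: str) -> str:
    # Hypothetical placeholder for your LLM application
    return "..."

dataset = EvaluationDataset()

query = "What are your operating hours?"
test_case = LLMTestCase(input=query, actual_output=my_chatbot(query))
dataset.add_test_case(test_case)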

Load an Existing Dataset

deepeval offers support for loading datasets stored in JSON files, CSV files, and Hugging Face datasets into an EvaluationDataset as test cases.

From JSON

You can add test cases into your EvaluationDataset by supplying a file_path to your .json file. Your .json file should contain an array of objects (or list of dictionaries).

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_test_cases_from_json_file(
    # file_path is the absolute path to your .json file
    file_path="example.json",
    input_key_name="query",
    actual_output_key_name="actual_output",
    expected_output_key_name="expected_output",
    context_key_name="context",
)
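
For reference, here is a minimal sketch of what example.json might look like, written out with Python's json module. The key names ("query", "actual_output", "expected_output", "context") are assumptions chosen to match those passed to add_test_cases_from_json_file above:

import json

# Hypothetical contents for example.json: an array of objects whose keys
# correspond to the key names passed to add_test_cases_from_json_file.
example_records = [
    {
        "query": "What are your operating hours?",
        "actual_output": "We are open from 10 AM to 6 PM, Monday to Friday.",
        "expected_output": "We operate from 10 AM to 6 PM, Monday to Friday.",
        "context": ["Our company operates from 10 AM to 6 PM, Monday to Friday."],
    }
]

with open("example.json", "w") as f:
    json.dump(example_records, f, indent=2)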

From CSV

You can add test cases into your EvaluationDataset by supplying a file_path to your .csv file. Your .csv file should contain rows that can be mapped into LLMTestCases through their column names. Remember, context should be a list of strings, so for CSV files you must also supply a context_col_delimiter argument to tell deepeval how to split your context cells into a list of strings.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_test_cases_from_csv_file(
    # file_path is the absolute path to your .csv file
    file_path="example.csv",
    input_col_name="query",
    actual_output_col_name="actual_output",
    expected_output_col_name="expected_output",
    context_col_name="context",
    context_col_delimiter=";"
)
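
As a rough sketch (not part of deepeval's API), the snippet below writes an example.csv with columns matching the names above; the two context strings are joined with ";" so that context_col_delimiter=";" can split them back into a list:

import csv

# Hypothetical contents for example.csv; the "context" cell holds two
# strings joined by ";" so deepeval can split them back into a list.
rows = [
    {
        "query": "Do you offer free shipping?",
        "actual_output": "Yes, on orders over $50.",
        "expected_output": "Yes, we offer free shipping on orders over $50.",
        "context": "Free shipping applies to orders over $50.;Shipping is otherwise a flat $5.",
    }
]

with open("example.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["query", "actual_output", "expected_output", "context"]
    )
    writer.writeheader()
    writer.writerows(rows)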

From Hugging Face

You can also load test cases from a Hugging Face dataset by specifying the dataset name, the split to use, and which fields map to LLMTestCase parameters.

from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset()
dataset.add_test_cases_from_hf_dataset(
    dataset_name="Example HF dataset name",
    input_field_name="query",
    actual_output_field_name="actual_output",
    expected_output_field_name="expected_output",
    context_field_name="context",
    split="train",
)
note

Since expected_output and context are optional parameters for an LLMTestCase, the expected output and context fields are similarly optional when adding test cases from an existing dataset.
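
As a quick illustration, an LLMTestCase can be constructed with just an input and an actual_output:

from deepeval.test_case import LLMTestCase

# expected_output and context are simply omitted
minimal_test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="You can return any item within 30 days.",
)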

Evaluate Your Dataset With Pytest

Before we begin, we highly recommend logging into Confident AI to keep track of all evaluation results on the cloud:

deepeval login

deepeval utilizes the @pytest.mark.parametrize decorator to loop through entire datasets.

test_bulk.py
import pytest
import deepeval
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset


dataset = EvaluationDataset(test_cases=[...])

@pytest.mark.parametrize(
    "test_case",
    dataset,
)
def test_customer_chatbot(test_case: LLMTestCase):
    hallucination_metric = HallucinationMetric(threshold=0.3)
    answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)
    assert_test(test_case, [hallucination_metric, answer_relevancy_metric])


@deepeval.on_test_run_end
def function_to_be_called_after_test_run():
    print("Test finished!")

You can also use the @deepeval.on_test_run_end decorator to define a hook that will be executed after the deepeval test run has completed.

To run several test cases at once in parallel, use the optional -n flag followed by a number (which determines the number of processes used) when executing deepeval test run:

deepeval test run test_bulk.py -n 3

Evaluate Your Dataset Without Pytest

Alternatively, you can use deepeval's evaluate function to evaluate datasets. This approach avoids the CLI but does not allow for parallel test execution.

from deepeval import evaluate
from deepeval.metrics import HallucinationMetric, AnswerRelevancyMetric
from deepeval.dataset import EvaluationDataset

dataset = EvaluationDataset(test_cases=[...])

hallucination_metric = HallucinationMetric(threshold=0.3)
answer_relevancy_metric = AnswerRelevancyMetric(threshold=0.5)

dataset.evaluate([hallucination_metric, answer_relevancy_metric])

# You can also call the evaluate() function directly
evaluate(dataset, [hallucination_metric, answer_relevancy_metric])