
Survey of datasets for evaluating closed-domain RAG chatbots for LLM hallucinations

This post highlights the limitations of current datasets and proposes the development of a new benchmark tailored to the unique needs of evaluating RAG systems, aimed at improving their accuracy and reliability in real-world applications.

With the widespread availability of LLMs, a new class of conversational applications has emerged: Retrieval Augmented Generation (RAG) based chatbots. Such applications are configured with a knowledge base as the source of grounding information. They retrieve relevant information and feed it to an LLM via the prompt, and the LLM uses the supplied information to generate a response. These responses are not always accurate, because all LLMs hallucinate occasionally, some more than others, with smaller open source LLMs generally performing worse.
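To make that flow concrete, here is a minimal sketch of the retrieve-then-generate loop. The `embed`, `vector_store`, and `llm_complete` callables are placeholders for whatever embedding model, vector index, and LLM API a given system uses; this is not the API of any particular framework.

```python
# Minimal retrieve-then-generate sketch (illustrative only).
# `embed`, `vector_store`, and `llm_complete` stand in for whatever
# embedding model, vector index, and LLM API a real system would use.
from typing import List

def retrieve(question: str, vector_store, embed, top_k: int = 4) -> List[str]:
    """Return the top_k knowledge chunks most similar to the question."""
    query_vec = embed(question)
    return vector_store.search(query_vec, top_k=top_k)

def answer(question: str, vector_store, embed, llm_complete) -> str:
    """Ground the LLM on retrieved chunks; it may still hallucinate."""
    chunks = retrieve(question, vector_store, embed)
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm_complete(prompt)
```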



Overview of a RAG system with typical hallucination rates

Today, we see a vibrant landscape of open-domain RAG applications like Gemini, ChatGPT, Microsoft Copilot, Perplexity, and HuggingChat. Frameworks like LlamaIndex, Langchain, and most recently GPT Store help in the development of RAG applications for closed-domain or specific knowledge bases. These pipelines power chatbots, virtual assistants, and other interactive systems.


At Got It AI, we have focused our efforts on developing a RAG platform for real-world, closed-domain enterprise use cases. How can we evaluate the performance (both correctness and hallucinations) of our RAG platform? So far, we have primarily used proprietary datasets, which cannot be used to compare performance with other RAG systems for various reasons. So the question arises: are there open source datasets and benchmarks available for comparing closed-domain RAG systems?


In this blog, we survey existing open source datasets for evaluating closed-domain RAG systems. We find that no existing dataset or benchmark covers the full feature breadth of such systems, especially for evaluating hallucinations. We conclude that a new benchmark is needed and list the features that such a benchmark should evaluate.



An example of a hallucination in a RAG system. The line underlined in red is a hallucination: the internet is filled with answers to similar questions, but not exactly this one, and the LLM gets confused by the retrieved results and generates an inconsistent answer.

Traditional Hallucination Datasets Fall Short

The AI community has made several strides to make LLMs produce fewer hallucinations, and improve their reasoning skills. To support these efforts, several hallucination datasets with diverse objectives have been conceived. For one, there is a growing trend of using old datasets as seeds, and augmenting them as needed, to serve the newer objective of measuring hallucinations in foundational LLMs. Popular examples are Halueval (uses HotpotQA, OpenDialKG and CNN/Daily Mail as seeds) and FaithDial (uses Wizard of Wikipedia as the seed). The HuggingFace Hallucination leaderboard does a great job of compiling several hallucination datasets.


Fava is an effort that introduces a novel taxonomy for hallucination detection, paving the way to classify hallucinations into more informative categories and to perform targeted correction. A leaderboard by Vectara focuses on the related task of summarization with a limited context window of 512 tokens.


FinanceBench is a closed-domain dataset, primarily focused on single-turn QA pairs spread across a mix of difficulty levels. Most recently, LlamaIndex introduced llama datasets to evaluate open and closed-domain RAG applications end-to-end. This is a great effort to systematically evaluate the efficacy of RAG applications by evaluating each part of the pipeline. They measure Context Similarity, Faithfulness, Relevance, and Correctness through an LLM evaluation that uses a reference answer as the ground truth. However, their approach relies on reference responses to benchmark different RAG systems and is not particularly focused on hallucination detection, which would ideally include examples of different types of hallucinations.
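As a rough illustration of this reference-based, LLM-as-judge style of evaluation (not the actual llama datasets implementation), the sketch below asks a judge LLM to grade a generated answer. The rubric wording, the 1-5 scale, and the `llm_judge` callable are our own illustrative assumptions.

```python
import json

# Illustrative LLM-as-judge scoring, loosely in the spirit of reference-based
# RAG evaluation. The rubric, scale, and `llm_judge` callable are assumptions,
# not the actual llama datasets implementation.
JUDGE_PROMPT = """You are grading a RAG system's answer.
Context:
{context}

Question: {question}
Reference answer: {reference}
Generated answer: {generated}

Return JSON with two integer fields, each 1 (worst) to 5 (best):
  "faithfulness": is the generated answer supported by the context?
  "correctness": does it agree with the reference answer?
"""

def judge(llm_judge, context: str, question: str,
          reference: str, generated: str) -> dict:
    """Ask a (stronger) judge LLM to grade faithfulness and correctness."""
    prompt = JUDGE_PROMPT.format(
        context=context, question=question,
        reference=reference, generated=generated,
    )
    raw = llm_judge(prompt)   # e.g. a call to any chat-completion API
    return json.loads(raw)    # e.g. {"faithfulness": 4, "correctness": 5}
```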


Table 1: A comparison of popular hallucination detection datasets for QA and dialogue agents with Got It AI’s proprietary dataset. Green highlighted cells are desirable features.


Existing hallucination datasets have several limitations:

  • Open-domain: Most hallucination datasets are based on open-domain data, since they aim to assess hallucinations in foundational LLMs.

  • Lack of dialog context: TruthfulQA, a popular benchmark, while valuable, lacks dialogue context and relies on a single text snippet as the knowledge source. Halueval and FaithDial go a step further and show hallucinations in a dialogue context.

  • Question Type: RAG applications deployed in a conversational context face a diverse set of user questions. Most hallucination datasets only include knowledge-seeking QA pairs, which do not capture the full range of queries: chitchat, comparison questions, off-topic questions, queries needing multi-step retrieval (e.g., "How do I use product X and where can I buy it?"), and queries where the user intent requires clarification.

  • Single knowledge chunks: TruthfulQA, FaithDial and Halueval use a single knowledge chunk. In production RAG systems, you often get the best performance when using several chunks. Retrieved chunks are often similar in nature and can create ambiguity and conflict when the LLM generates a response, introducing another dimension of complexity.

  • Knowledge structure: Most hallucination datasets use simple text passages. Contrast this with the complex structured data interspersed with plain text found in many real knowledge bases.

  • Length of knowledge chunk: Chunk size makes a large difference to the efficacy of RAG systems. Ideally, we want a combination of small and large chunks (roughly 300-500 tokens) while preserving the context of each chunk (see the first sketch after this list).

  • Annotations: We find that for the vast majority of RAG applications, LLM hallucinations occur because the generated response was either unfaithful to the knowledge or irrelevant in the conversational context. However, most hallucination detection datasets contain only faithfulness labels. In addition, a reason associated with each annotation is helpful for building hallucination detection systems (see the second sketch after this list).

  • Dataset size: A dataset of roughly 1k examples is needed to train and test LLMs. Synthetic hallucinations are useful but not sufficient for building robust RAG systems.

  • Contamination: Since the vast majority of leading LLMs do not reveal the source of their training data, even a well-performing model might be "cheating" by simply memorizing "test" examples from these hallucination datasets.
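On the chunking point above, here is a small sketch of overlapping chunking that can produce either small or large chunks depending on the target size. Whitespace-split words stand in for real tokenizer tokens, and the default sizes are illustrative assumptions rather than a recommendation from any specific framework.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> List[str]:
    """Split text into ~chunk_size-token chunks overlapping by ~overlap tokens.

    Whitespace tokens are a rough stand-in for real tokenizer tokens;
    the 300-500 token range mentioned above maps to chunk_size values
    in that range, and the overlap preserves some context across chunks.
    """
    words = text.split()
    chunks, start = [], 0
    step = max(chunk_size - overlap, 1)
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += step
    return chunks
```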
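And on the annotation point, a per-response record for such a benchmark could carry both a faithfulness and a relevance label plus a free-text reason. The field names and example below are hypothetical, not a published schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HallucinationAnnotation:
    """Hypothetical per-response record for a hallucination benchmark."""
    dialog_context: List[str]     # prior user/assistant turns
    retrieved_chunks: List[str]   # all chunks fed to the generator
    response: str                 # the generated answer being judged
    faithful: bool                # supported by the retrieved chunks?
    relevant: bool                # on-topic for the conversational context?
    reason: str = ""              # annotator's free-text justification

example = HallucinationAnnotation(
    dialog_context=["User: How do I reset my router?"],
    retrieved_chunks=["Hold the reset button for 10 seconds..."],
    response="Hold the reset button for 30 seconds.",
    faithful=False,
    relevant=True,
    reason="Duration contradicts the knowledge chunk (10 vs 30 seconds).",
)
```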


Next steps

The above analysis indicates that no suitable dataset is available for properly evaluating RAG systems, including their hallucinations. As a result, we have decided to compile a brand-new dataset and publish it for the community to use. The characteristics of the new dataset will be influenced by the “Got It AI (Proposed)” column in Table 1 above.


For readers who want to go deeper into hallucinations, we recommend the awesome-hallucination-detection GitHub repo as a starting point. It summarizes the majority of the hallucination detection work by the open-source community.
