LLMs hallucinate, and smaller open-source LLMs hallucinate more. In a knowledge-grounded RAG chatbot scenario, we found that 8% of the responses produced by Mixtral8x7B included hallucinations. We demonstrate here that our TruthChecker model filtered out about 5 percentage points of those hallucinations, leaving only 3% net hallucinations reaching the end user. This approximates the truthfulness of GPT-4 for the same RAG pipeline, the same chatbot task, and the same dataset. The dataset came from a live customer of ours and their service desk knowledge base of hundreds of articles. Note that about 10% of responses were deflected as hallucinations, and users were shown related content instead of the generated response, which we consider a reasonable cost for the reduced hallucination rate. Having achieved this milestone, we now look forward to producing and publishing a public dataset with the same benchmarking techniques and challenges.
Improvements to Large Language Models (LLMs) continue to arrive at a blistering pace. For the first time, an open-source LLM, Mixtral8x7B (“Mixtral”), is reported to perform similarly to larger closed models like OpenAI’s GPT-3.5, opening new opportunities for development with LLMs without the constraints of a commercial platform.
Still, the problem of LLM hallucinations remains an impediment to adopting these models for business use. At Got It AI, we have a keen focus on addressing the critical challenge of hallucinations in LLMs. This blog post explores how Mixtral, when enhanced with our Hallucination Management platform, achieves very low hallucination rates, rivaling even the current LLM leader, GPT-4.
8% base hallucination rate for Mixtral8x7B
We evaluated Mixtral for hallucinations, continuing our efforts to evaluate various LLMs for a closed-domain, RAG-based knowledge chatbot use case. This evaluation is part of our ongoing commitment to enhancing the reliability and accuracy of AI-driven conversations, especially in enterprise applications where the precision and privacy of information are crucial.
Method: We used a proprietary dataset consisting of 530 rows, derived from real-world usage of ArticleBot, our RAG knowledge chatbot product. This dataset includes the types of queries and responses that would be encountered in a real-world enterprise knowledge chatbot setting for customer support or similar applications, which tend to be more challenging than the cases represented in research datasets such as HotpotQA and TruthfulQA. Each row of the dataset includes a user query and a set of relevant knowledge snippets. The snippets were retrieved from a knowledge base containing 190 articles related to a set of consumer electronic products.
We configured our ArticleBot RAG pipeline to use Mixtral as the response generator and generated responses for each of the 530 user queries. The responses were manually annotated for relevance and groundedness, with over 90% inter-annotator agreement. If a response is either not relevant or not grounded, we consider it to be a hallucination.
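The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration, not the actual ArticleBot annotation tooling: the `Annotation` class, its field names, and the toy data are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """One manually annotated response (hypothetical schema)."""
    relevant: bool  # does the response address what the user asked?
    grounded: bool  # is the response supported by the retrieved snippets?


def is_hallucination(a: Annotation) -> bool:
    # A response counts as a hallucination if it fails either check.
    return not (a.relevant and a.grounded)


def hallucination_rate(annotations: list[Annotation]) -> float:
    if not annotations:
        raise ValueError("need at least one annotated response")
    return sum(is_hallucination(a) for a in annotations) / len(annotations)


# Toy data: 8 good responses, 1 ungrounded, 1 irrelevant.
toy = [Annotation(True, True)] * 8 + [
    Annotation(relevant=True, grounded=False),
    Annotation(relevant=False, grounded=True),
]
rate = hallucination_rate(toy)
print(f"hallucination rate: {rate:.0%}")
```

Treating the two checks as a single pass/fail label keeps the headline metric simple, at the cost of not distinguishing between irrelevant and ungrounded responses.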
Results: The evaluation revealed that Mixtral had a hallucination rate of about 8%, i.e. about 8 out of 100 generated responses did at least one of the following:
included false claims,
made risky use of general knowledge,
made unsupported assumptions about the user’s intent, or
were not relevant to what the user asked.
Previously, we found that GPT-3.5-Turbo-1106 performs similarly, with a hallucination rate of about 8%, while GPT-4-Turbo-1106 has a hallucination rate of about 3%.
Mixtral is the most accurate open-source model we have evaluated so far. Still, Mixtral produces a significant number of hallucinations, which makes it challenging to use in critical business applications such as automated customer support chatbots.
Is there a way to bring the hallucination rate down in line with the best-in-class model currently available, GPT-4?
Hallucination rate (ArticleBot, 530 row chatbot dataset)
8.7 ± 2 %
2.6 ± 2 %
8.3 ± 2 %
3.4 ± 2 %
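A margin of roughly ± 2 percentage points is about what sampling uncertainty alone would give for a rate near 8% measured on 530 rows. The sketch below assumes a 95% normal-approximation (Wald) interval for a proportion; the post does not state how its margins were computed, so this is only a plausibility check, and the function name is hypothetical.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation interval for a proportion.

    p: observed rate (e.g. 0.08), n: sample size, z: normal quantile
    (1.96 corresponds to a 95% interval).
    """
    if n <= 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n > 0 and 0 <= p <= 1")
    return z * math.sqrt(p * (1.0 - p) / n)


moe_530 = margin_of_error(0.08, 530)    # about 0.023, i.e. ~2 points
moe_2500 = margin_of_error(0.08, 2500)  # about 0.011, i.e. ~1 point
print(f"n=530:  +/- {moe_530:.3f}")
print(f"n=2500: +/- {moe_2500:.3f}")
```

The normal approximation is crude for rates close to zero, so for the lower rates a Wilson or exact interval would be a safer choice.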
Mixtral8x7B + TruthChecker → 3% hallucination rate!
We trained a hallucination detection model, “TruthChecker” (TC), which is a Flan-T5 model fine-tuned on a synthetically generated dataset of hallucinations. The dataset is based on a large corpus of knowledge base articles and on the specific knowledge base that the chatbot will use to generate responses. TruthChecker predicts the probability that a given response contains a hallucination, in the context of the user utterance, the retrieved knowledge base snippets, and the conversation history. The threshold of the TruthChecker model was set so that it tags about 10% of responses as hallucinations.
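One simple way to set a threshold that tags a target share of responses is to take a quantile of the predicted probabilities on a held-out set. The sketch below illustrates that idea only; it is not a description of how TruthChecker's threshold was actually chosen, and the function names and synthetic scores are assumptions.

```python
def threshold_for_flag_rate(scores: list[float], flag_rate: float) -> float:
    """Return a threshold so that roughly `flag_rate` of scores are >= it."""
    if not scores or not 0.0 < flag_rate < 1.0:
        raise ValueError("need scores and 0 < flag_rate < 1")
    ranked = sorted(scores, reverse=True)
    k = max(1, round(flag_rate * len(ranked)))  # number of responses to flag
    return ranked[k - 1]


def is_flagged(score: float, threshold: float) -> bool:
    # Tied scores at the threshold are all flagged, so the realized
    # flag rate can slightly exceed the target.
    return score >= threshold


# Synthetic hallucination probabilities: 0.00, 0.01, ..., 0.99.
scores = [i / 100 for i in range(100)]
threshold = threshold_for_flag_rate(scores, flag_rate=0.10)
flagged = [s for s in scores if is_flagged(s, threshold)]
print(threshold, len(flagged))
```

Raising the target flag rate catches more hallucinations but also deflects more correct responses, so the value is a product decision as much as a modeling one.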
We evaluated Mixtral-generated responses using TruthChecker, using manual annotations as the ground truth. TruthChecker was able to catch about 75% of the 8% hallucinations produced by Mixtral.
The 10% of responses marked as hallucinations would not be shown to the end user. Instead, relevant search result snippets are presented, so that the user sees only factually accurate information. These flagged responses consist of about 5% actual hallucinations and 5% false positives, both measured as a share of all responses.
The 90% of responses not tagged as hallucinations would be shown to the user. These include about 3% hallucinations (false negatives), which is similar to the roughly 3% hallucinations users would see in GPT-4-generated responses.
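The arithmetic behind these figures can be made explicit with a small confusion-matrix calculation. The counts below are the rounded per-100-response figures from the text (5 caught hallucinations, 5 false positives, 3 missed hallucinations, and the remaining 87 correct responses shown), not the raw annotation counts; the function name is hypothetical.

```python
def detector_summary(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Summarize a hallucination detector from confusion-matrix counts.

    tp: hallucinations flagged, fp: good responses flagged,
    fn: hallucinations shown to the user, tn: good responses shown.
    """
    total = tp + fp + fn + tn
    flagged, shown, hallucinated = tp + fp, fn + tn, tp + fn
    if min(total, flagged, shown, hallucinated) <= 0:
        raise ValueError("counts must give non-empty groups")
    return {
        "base_rate": hallucinated / total,    # pre-detector hallucination rate
        "flag_rate": flagged / total,         # share of responses deflected
        "recall": tp / hallucinated,          # share of hallucinations caught
        "precision": tp / flagged,            # share of flags that are correct
        "residual_rate_all": fn / total,      # missed, as share of all responses
        "residual_rate_shown": fn / shown,    # missed, as share of shown responses
    }


summary = detector_summary(tp=5, fp=5, fn=3, tn=87)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

With these rounded inputs the catch rate comes out near 62%, lower than the roughly 75% quoted above, which was presumably computed from unrounded counts.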
Compared to GPT-4, Mixtral can be deployed at a significantly lower cost and can be run fully within your own infrastructure, without any outbound API calls that may leak sensitive information. This combination of Mixtral and TruthChecker represents a major advancement in conversational AI for enterprise applications, where the accuracy of information is paramount. It will allow enterprises to leverage the power of Mixtral8x7B in a cost-effective manner while maintaining high standards of accuracy and reliability in their AI-driven communication use cases.
Pre-TC hallucination rate: about 8%
TC rejection rate: about 10%
Post-TC residual hallucination rate: about 3%
GPT-4 hallucination rate: about 3%
Open dataset: Here we evaluated Mixtral+TC performance on our own RAG dataset. We are now exploring whether we can evaluate on a suitable academic or industry dataset. We want to evaluate on the real-world use case of a RAG chatbot based on a single knowledge base, and so far we have not found public datasets that align with that use case. Typically, academic datasets like TruthfulQA and HotpotQA are structured so that each example has its own grounding document, instead of all examples referring to a shared set of documents.
Larger dataset: The 530 row dataset used in this experiment is small and contains few hallucinations. So, we plan to repeat this experiment with a larger, 2500 row dataset.
Fine-tuning: We anticipate further improvement in the hallucination rate if Mixtral is fine-tuned on the knowledge base. We have already demonstrated this behavior for other models such as Llama2 7B and Flan-UL2, and we are curious to see whether we can go beyond GPT-4 performance.
Response regeneration: When TruthChecker marks a response as a hallucination, ideally the original response is discarded and a new response gets generated that is factually accurate. We are working on a response regeneration model that would perform this task.