
Putting OpenAI's Custom GPT Assistants to the Test



Retrieval augmented generation (RAG) pipelines have become an increasingly popular approach for leveraging powerful LLM technology to manage large amounts of unstructured data. Much has been written about this topic elsewhere, but to summarize: RAG pairs generative question answering from a powerful LLM with search over a specific set of documents, so that answers are grounded in that content. This creates a versatile system that has proven useful across various fields, from acting as customer support to navigating medical text.
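For readers who want the mechanics made concrete, here is a minimal sketch of the retrieve-then-generate loop using the OpenAI Python SDK. The chunk list, similarity search, and prompt wording are simplified placeholders for illustration, not a description of either pipeline tested below.

```python
# Minimal RAG sketch (assumes the OpenAI Python SDK v1+ and pre-chunked text);
# real pipelines add document parsing, a vector store, re-ranking, and citations.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def answer(question, chunks, top_k=3):
    # 1. Retrieve: rank document chunks by cosine similarity to the question.
    chunk_vecs, q_vec = embed(chunks), embed([question])[0]
    sims = chunk_vecs @ q_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    context = "\n\n".join(chunks[i] for i in sims.argsort()[::-1][:top_k])

    # 2. Generate: ask the LLM to answer only from the retrieved context.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Answer using only the provided context. If the context "
                        "does not contain the answer, say so."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content
```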

OpenAI recently unveiled their version of a RAG pipeline, offering a user-friendly Assistant API available for testing in their playground. Setting up an Assistant is straightforward: provide a prompt, link your knowledge base documents, and start asking questions.
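The same setup can also be driven programmatically. The sketch below reflects the beta Assistants API available at the time of writing (the "retrieval" tool and file_ids parameter); those details may change, and the file names, instructions, and model choice here are placeholders.

```python
# Sketch of an Assistant set up over a few PDFs via the beta Assistants API;
# file names, instructions, and the model are illustrative placeholders.
import time
from openai import OpenAI

client = OpenAI()

# Upload the knowledge-base documents.
file_ids = [
    client.files.create(file=open(path, "rb"), purpose="assistants").id
    for path in ["plan_1.pdf", "plan_2.pdf", "plan_3.pdf"]
]

assistant = client.beta.assistants.create(
    name="Support agent",
    instructions="Answer questions using only the attached documents.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=file_ids,
)

# Ask a question: create a thread, post the message, run the assistant, poll.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="What is Life Insurance Plan 1?"
)
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)
while run.status not in ("completed", "failed", "expired"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)
```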

Our testing showed that while it excels in simplicity and delivers quality responses to basic inquiries, complications arise as both the knowledge base and user questions become more complex, revealing limits on the quality of assistance it can maintain.

Now, let's walk through our experience.

The Experiment

To simulate a customer support scenario, we programmed an OpenAI Assistant with three life insurance plan brochures in PDF format*. We aimed for it to function as a knowledgeable agent, capable of responding to queries based solely on the information contained in these documents.

To draw comparisons, we replicated this setup using our in-house ArticleBot pipeline. Though it shares the same end goal of acting as a support agent, it features enhanced document processing capabilities.

Using GPT-4 as our model to generate responses, we then asked each bot a series of questions of varying complexity to see how well they performed.

*Note: We've chosen to remove or redact company names from the results shown here and replaced them with “Life Insurance Plan”.

Screenshot of the Assistant created in the OpenAI playground.

Simple Questions

Both bots performed very well with explicit, FAQ-style questions. Consider the following example: when asked, "What is Life Insurance Plan 1?", both the Assistant and ArticleBot offered up accurate information that was congruent with the brochure content. While we saw very little difference in the accuracy of each bot's responses, the OpenAI Assistant tended to give more extended and detailed responses.


OpenAI Assistant response to "What is Life Insurance Plan 1?"

Got It AI ArticleBot response to the same question.

Excerpt from the brochure for Life Insurance Plan 1.

Complex Questions

Most knowledge workers need to answer more complex questions that involve significant breakdown of content and a more nuanced response strategy. This is where the Assistant API falls short.

For example, when asked to compare two types of charges in a specific plan, the Assistant mistakenly pulls its response from another plan's brochure, giving an incorrect answer: it states that the plan has a “flat rate of 10%”.

OpenAI Assistant response to a plan comparison question.

From the “Life Insurance Plan 2” brochure, this table describes the surrender charge, which is not a flat rate, but a rate based on payment term and policy year.

ArticleBot also has difficulty answering the question, but rather than guessing, it follows up with clarifying questions and suggestions for topics it can address.

Got It AI's ArticleBot response to the same comparison question.

The value of follow-up conversation is highlighted when a user's query lacks specificity about which product they’re asking about, particularly in a knowledge base with multiple similar products.

In this example, the user asks, “Tell me about the plan’s coverage amount.” The OpenAI Assistant provides an accurate answer for Life Insurance Plan 2, but the other plans also contain this information, which the response ignores. A better approach is to first clarify which plan the user is referring to.

OpenAI Assistant response to question about entry ages.

Got It AI's ArticleBot response to the same entry age question.
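One way a pipeline can produce this kind of follow-up is to notice when the retrieved passages span several products even though the question names none, and ask a clarifying question instead of answering. The sketch below illustrates that idea in isolation; it is not how ArticleBot implements its follow-up behavior, and the chunk metadata is assumed.

```python
# Illustrative disambiguation guard: ask which plan the user means when the
# question names no plan but retrieval returns chunks from several plans.
from dataclasses import dataclass

@dataclass
class Chunk:
    plan: str   # plan the chunk was extracted from, e.g. "Life Insurance Plan 2"
    text: str

def clarifying_question(question: str, retrieved: list[Chunk]) -> str | None:
    """Return a clarifying question if needed, otherwise None."""
    plans = {c.plan for c in retrieved}
    names_a_plan = any(p.lower() in question.lower() for p in plans)
    if not names_a_plan and len(plans) > 1:
        return f"Which plan are you asking about: {', '.join(sorted(plans))}?"
    return None

# "Tell me about the plan's coverage amount" matches several brochures,
# so the bot should ask which plan the user means before answering.
retrieved = [Chunk("Life Insurance Plan 1", "..."), Chunk("Life Insurance Plan 2", "...")]
print(clarifying_question("Tell me about the plan's coverage amount", retrieved))
```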

Managing Out of Scope Information

One of the major advantages of LLMs like those that power OpenAI’s Assistants and ChatGPT is the breadth of general knowledge they can draw on and synthesize into helpful guidance. However, most agent-assistant use cases need to focus on a narrow scope of knowledge specific to the work the team is doing. In the best case, letting the bot volunteer too much general knowledge crowds out the specific information users are trying to reach. In the worst case, the bot provides information that is problematic or conflicts with your business interests.
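A common mitigation, sketched below purely as an illustration rather than a description of either bot's internals, is a post-generation grounding check: ask the model whether its draft answer is actually supported by the retrieved passages, and fall back to an in-scope reply if it is not. The fallback wording here is invented for the example.

```python
# Illustrative grounding check: accept the draft answer only if a second model
# call judges it to be supported by the retrieved context.
from openai import OpenAI

client = OpenAI()

def is_grounded(answer: str, context: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Reply YES if the answer is fully supported by the "
                        "context, otherwise reply NO."},
            {"role": "user", "content": f"Context:\n{context}\n\nAnswer:\n{answer}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def guarded_answer(draft: str, context: str) -> str:
    if is_grounded(draft, context):
        return draft
    # Out-of-scope fallback keeps the conversation on the documents.
    return ("The brochures don't cover that directly. I can answer questions "
            "about the plans' coverage, charges, and eligibility.")
```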

We explored this issue by asking the bots, “What happens if I die in an accident?”, a relevant and practical question when considering a life insurance purchase. The Assistant’s response incorporates general advice for handling the logistics around an accidental death. It is potentially good advice, but in this context it is mostly a distraction from the useful information contained in the documents. ArticleBot handles the question by referring to the two plans whose brochures cover accidental death.

OpenAI Assistant response to the accidental death question.

Got It AI ArticleBot response to the same question.


Summary of Results

We tested and compared 22 questions across 8 categories.

Note: a single question may span multiple categories, for example, we categorized the comparing charges example above under both Table Access and Comparison.

Here’s How They Stack Up

This was a difficult test set for both bots, but ArticleBot’s additional capabilities for processing the documents along with the user message allowed it to provide more useful responses.

The plot below shows the number of acceptable responses to the test set for the Assistant (64% acceptable) and ArticleBot (86% acceptable). Unacceptable answers were categorized as well. Answers often missed key information contained in the documents (the most common issue for both ArticleBot and the OpenAI Assistant) or relied on general knowledge instead of information in the documents (another common error for the Assistant). The OpenAI Assistant also showed instances where it hallucinated (provided false information), failed to disambiguate when needed, or provided irrelevant information.


We also compared each response across bots and found that even where both bots produced acceptable answers, ArticleBot consistently provided a better response.


Final Thoughts

We found the OpenAI Assistant API to be a useful tool for quickly creating an assistant over a small knowledge base that can answer very basic questions. However, a pipeline purpose-built to search over documents, compare their contents, and stay on topic is a much more valuable tool for knowledge workers who need access to the detailed, accurate information buried in these complex documents.

Stay tuned for more content comparing other popular RAG pipelines and details on how we achieved better results.





