RAG Cheatsheet

Ever wondered why you sometimes get misleading answers from generic LLMs? It's like trying to get directions from a confused stranger, right? This can happen for many reasons: the LLM may be trained on data that is out of date, it may fail at math or calculations, or it may simply be hallucinating.

Challenges with LLMs

In LLMs, the phenomenon of generating incorrect or misleading information is known as 'hallucination'. One way to mitigate an LLM's tendency to hallucinate is to provide it with additional context at inference time.
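To make this concrete, here is a minimal sketch of what "providing additional context at inference time" can look like. The prompt wording, the retrieved snippet, and the `ask_llm` function are illustrative placeholders, not part of the paper or any specific library.

```python
# A minimal sketch: the same question asked with and without retrieved context.
# `ask_llm` is a hypothetical placeholder for whatever LLM client you use.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

question = "What are the current copyright laws in India for AI-generated content?"

# 1) Bare prompt: the model answers purely from its (possibly stale) training data.
bare_prompt = question

# 2) Context-augmented prompt: retrieved, up-to-date text is placed in the prompt,
#    and the model is instructed to answer only from that text.
retrieved_context = "...text retrieved from a trusted, current legal source..."
augmented_prompt = (
    "Answer the question using ONLY the context below. "
    "If the context is insufficient, say so.\n\n"
    f"Context:\n{retrieved_context}\n\n"
    f"Question: {question}"
)
```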

In this article, we'll dive into the reasons behind this phenomenon and explore a solution called Retrieval Augmented Generation (RAG), while summarizing the recent research paper that sheds light on this issue and offers practical insights.

Why do LLMs Hallucinate?

There are several factors that can contribute to LLM hallucinations. Here are some of the key reasons:

  1. Overfitting: LLMs are trained on massive datasets, but they can become overly reliant on patterns in their training data, leading to inaccurate responses for new or unfamiliar inputs. 

    1. An LLM might memorize that "Paris is the capital of France" frequently appears alongside "The Eiffel Tower is 324 meters tall" in its training data. When asked about Paris, it might always mention the Eiffel Tower's height, even when irrelevant.

    2. If trained on many movie reviews where "incredible" often appears with "masterpiece," it might automatically pair these words even when inappropriate, like "The documentary about tax policies was an incredible masterpiece."

  2. Lack of Comprehensive Understanding: While LLMs can process and generate human-like text, they may not fully grasp the nuances of language and context. This can lead to misunderstandings or fabrications.

    1. When asked "What's heavier, a pound of feathers or a pound of lead?" the LLM might say "lead" because it associates lead with heaviness, missing the fundamental concept that a pound is a pound.

    2. If asked "Can fish drown?", it might confidently say "no" because it associates fish with breathing underwater, not understanding that fish can actually drown in oxygen-depleted water.

  3. Bias: Biases present in the training data can influence LLM outputs, leading to biased or discriminatory responses.

    1. When asked to generate a story about a CEO, the LLM might automatically use male pronouns due to gender bias in training data.

    2. If asked about "successful entrepreneurs," it might disproportionately mention Western or Silicon Valley examples, overlooking entrepreneurs from other regions due to training data skew.

  4. Complexity of Language: Natural language is complex and can be ambiguous, making it challenging for LLMs to interpret and generate accurate responses.

    1. For the sentence "The chickens are ready to eat," the LLM might not recognize whether the chickens are ready to be eaten or ready to have their meal.

    2. With "I saw her duck," it might struggle to determine if someone saw a woman lower her head or saw her pet waterfowl.

  5. Model Limitations: LLMs are inherently limited by their architecture and training data, which can contribute to hallucinations.

    1. If asked to solve a complex math problem requiring multiple steps, it might get the early steps right but lose track of intermediate results due to limited working memory.

    2. When asked about current events beyond its training cutoff date, it might combine patterns from its training data to make plausible but incorrect statements about what happened.

Why LLMs Hallucinate

These factors lead to hallucinated LLM responses, which is why you should always check an LLM's output. This matters most in the context of public, general-purpose LLMs, and for many of us it is not a big issue. But if you work on critical real-world use cases, such as the judiciary, defense, or research sectors (medical, space, geographical planning, etc.), you need your LLM to provide accurate information fetched from the latest valid sources.

To get more insight into this problem and how to solve it, the research paper mentioned here (Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools) discusses hallucination across different available LLMs, along with its rate of occurrence in each, using the real-world scenario of LLMs in the legal profession. There, the accuracy and reliability of responses are of the utmost priority, because a wrong interpretation of a judgment or an inaccurate or misguided source reference can be a serious problem. Here is the summary of hallucination across different available LLM models.


Figure: Comparison of hallucinated and incomplete answers across generative legal research tools (LLMs) - Reference Link

Here hallucinated responses are those that include false statements or falsely assert a source supports a statement. Incomplete responses are those that fail to either address the user’s query or provide proper citations for factual claims.

RAG, a solution to hallucination

Further on, the research paper proposes building a system in which the final output is generated from information retrieved from a valid source relevant to the input query (similar to consulting our own knowledge) together with the user query, so that there is no misleading, inaccurate, or outdated information. This system is called Retrieval Augmented Generation (RAG).

How to build a RAG-enabled LLM?

Let's see how we can build such a system and what key steps and concepts we need to adopt.

RAG: Retrieval Augmented Generation

This diagram shows an overview of the steps involved in building a RAG system. Let's go deeper and understand each step and its significance.

Let's take an example to understand what exactly happens at each step when a user asks something like: "What are the current copyright laws in India for AI-generated content?"

Data Preparation for Vector Store RAG

  1. Query Understanding - This is the first step, where the system analyzes the user's query to understand its intent, context, and key concepts, and converts the query into embeddings (numerical vector representations of the user query and of the documents in the knowledge base) so that relevant data is easier to find in the knowledge base. In our example, the system identifies the keywords "copyright laws," "India," and "AI-generated content" (steps 1 and 2 are illustrated in the first code sketch after this list).

  2. Search - Here we work with a knowledge base (a source that contains all the valid information) and use embeddings to search it for documents that are semantically similar to the query. In our example, the system recognizes "India" as a country and "AI-generated content" as a specific type of content.

    • Knowledge Base: The system accesses a knowledge base containing legal documents, news articles, and government reports related to Indian law, intellectual property, and artificial intelligence.

    • Embedding Similarity: The system compares the query's embedding to the embeddings of documents in the knowledge base to find documents that are semantically similar.

  3. Ranking/Filtering - Here we filter and rank the retrieved documents based on relevance and other criteria (e.g., recency, authority). In our example, the system determines that the user is seeking information about legal regulations (see the second sketch after this list). Here is what happens in this stage.

    • Relevance Scoring: Documents are ranked based on their similarity to the query, considering factors like keyword matching and semantic similarity.

    • Recency: More recent documents are given higher priority as they are more likely to contain up-to-date information.

    • Authority: Documents from reputable sources, such as government agencies or academic institutions, are given higher weight.

  4. Generation - In this step we combine the retrieved documents with the original query to create a comprehensive context, and use an LLM (e.g., GPT-3) to generate a response based on the retrieved information and the query's intent. The final response is far more accurate, because it is grounded in valid sources that are updated frequently and carry the latest information. In our example, the retrieved documents on Indian copyright law and AI-generated content are combined with the query, and the LLM produces an answer grounded in those sources (see the final sketch after this list). Here are the steps that take place during generation.

    • Context Creation: The retrieved documents are combined with the original query to create a comprehensive context.

    • Response Generation: The LLM (e.g., GPT-3) processes the context and generates a response that addresses the user's query. The response incorporates information from the retrieved documents, such as relevant laws, case studies, and expert opinions.

    • Fact Checking: The generated response can be further refined using techniques like fact checking or summarization to ensure accuracy and coherence.
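Below is a minimal sketch of steps 1 and 2 (query understanding and search), assuming the sentence-transformers package and a small in-memory document list. The model name and documents are illustrative only; a production system would use a real vector store.

```python
# Sketch of steps 1-2: embed the query and documents, then retrieve by cosine similarity.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any embedding model works here

documents = [
    "Indian Copyright Act overview and recent amendments ...",
    "Government report on AI-generated content and intellectual property in India ...",
    "News article on an unrelated regulation ...",
]

def search(query: str, docs: list[str], top_k: int = 2) -> list[tuple[int, float]]:
    """Return (doc_index, similarity) pairs for the top_k most similar documents."""
    query_vec = model.encode([query])[0]
    doc_vecs = model.encode(docs)
    # Cosine similarity between the query vector and each document vector.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    ranked = np.argsort(-sims)[:top_k]
    return [(int(i), float(sims[i])) for i in ranked]

hits = search("What are the current copyright laws in India for AI-generated content?", documents)
```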
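And here is a sketch of step 3 (ranking/filtering). The weights, the recency decay, and the per-source authority scores are purely illustrative assumptions; real systems tune these against their own data.

```python
# Sketch of step 3: re-rank retrieved documents by similarity, recency, and source authority.
from dataclasses import dataclass

@dataclass
class Retrieved:
    text: str
    similarity: float   # from the search step, roughly in [0, 1]
    age_days: int       # how old the document is
    source: str         # e.g., "government", "academic", "news", "blog"

# Illustrative authority weights: reputable sources are given more weight.
AUTHORITY = {"government": 1.0, "academic": 0.9, "news": 0.7, "blog": 0.4}

def score(doc: Retrieved) -> float:
    recency = 1.0 / (1.0 + doc.age_days / 365)   # newer documents score higher
    authority = AUTHORITY.get(doc.source, 0.5)
    return 0.6 * doc.similarity + 0.2 * recency + 0.2 * authority

def rerank(docs: list[Retrieved], top_k: int = 3) -> list[Retrieved]:
    return sorted(docs, key=score, reverse=True)[:top_k]
```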
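Finally, a sketch of step 4 (generation): the top-ranked passages are packed into a prompt together with the user's query, and that prompt is sent to an LLM. The `complete` function stands in for whichever LLM API you use; it is a placeholder, not a real client call.

```python
# Sketch of step 4: assemble the retrieved context and the query into one grounded prompt.
# `complete` is a placeholder for your LLM client (OpenAI, local model, etc.).

def complete(prompt: str) -> str:
    raise NotImplementedError("call your LLM of choice here")

def generate_answer(query: str, passages: list[str]) -> str:
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "You are a legal research assistant. Answer the question using ONLY the "
        "numbered sources below, and cite them as [1], [2], ... "
        "If the sources do not contain the answer, say you don't know.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return complete(prompt)
```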

By following these steps, a RAG system can provide users with accurate and informative responses to legal queries, even in complex and rapidly evolving areas like AI-generated content.

That covers building the RAG system. In the next section, let's discuss the key findings of the paper.

Key Findings

The researchers conducted thorough experiments on the different available models, evaluating performance and the rate of hallucination occurrence. These are the key findings:

  • AI-driven legal research tools are not entirely free of hallucinations, meaning they can generate incorrect or misleading information.

  • Evaluating the performance of proprietary AI-driven legal research tools can be difficult due to the complexity of legal research tasks and the proprietary nature of these systems.

  • Human judgment(oversight) remains essential to interpret and apply legal research effectively, even when using AI-driven tools.

  • There are ethical implications to consider when using AI-driven legal research tools, such as liability and transparency.

    Overall, the findings in the paper highlight the importance of using AI-driven legal research tools with caution and in conjunction with human expertise, to ensure accuracy and reliability in decision-critical applications like legal advice.

Results:

  • Reduced Hallucinations: RAG systems can significantly reduce the occurrence of hallucinations in AI-powered legal research tools.

  • Improved Accuracy: By relying on a knowledge base of reliable information, RAG systems can generate more accurate and informative responses.

Why it Matters:

  • Legal Profession: In the legal profession, where accuracy and reliability are paramount, RAG systems can help to improve the quality of legal research and decision-making.

  • Trust and Confidence: By reducing the risk of hallucinations, RAG systems can help to build trust and confidence in AI-powered legal tools.

Conclusion: Our Thoughts

  • RAG systems offer a promising solution to the problem of hallucinations in AI-powered legal research tools.

  • By combining the power of LLMs with a knowledge base, RAG systems can generate more accurate and reliable responses.

  • However, it is important to note that human oversight remains essential for interpreting and applying legal research effectively.

  • As AI technology continues to evolve, RAG systems are likely to play an increasingly important role in the legal profession.


What is RAG? The Cheatsheet

Thank you for reading, I hope you found this helpful. Subscribe to my newsletter for more such content... Keep Learning, Keep Growing.
