Source: Tencent Technology
The DeepSeek series of models performs well in many respects, but "hallucination" remains a major challenge.
In the Vectara HHEM hallucination benchmark (a widely cited industry test that measures a model's hallucination rate by checking whether the content it generates is consistent with the source evidence, and that is used to compare and select models), DeepSeek-R1 recorded a hallucination rate of 14.3%.

Figure: Vectara HHEM artificial intelligence hallucination test results
DeepSeek-R1's hallucination rate is not only nearly four times that of DeepSeek-V3, it also far exceeds the industry average.
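To make the protocol behind that number concrete: HHEM-style evaluation feeds the model a source document, asks for a summary, and has a factual-consistency detector judge whether the summary is supported by the source; the hallucination rate is the fraction of summaries judged unsupported. Below is a minimal sketch of that bookkeeping in Python, where `judge_consistent` is a hypothetical stand-in for Vectara's detector, not their actual API.

```python
def judge_consistent(source: str, summary: str) -> bool:
    """Hypothetical stand-in for a factual-consistency detector (e.g. HHEM):
    return True only if every claim in `summary` is supported by `source`."""
    raise NotImplementedError

def hallucination_rate(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (source, summary) pairs flagged as unsupported.
    A 14.3% rate means roughly one in seven summaries contained
    content not grounded in the source document."""
    flagged = sum(1 for src, smry in pairs if not judge_consistent(src, smry))
    return flagged / len(pairs)
```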
In a not-so-rigorous large-model chess match organized by blogger Levy Rozman (an American chess celebrity with six million followers), DeepSeek "cheated" far more often than ChatGPT:
For example, after just a few moves, DeepSeek-R1 voluntarily handed a pawn to its opponent;
In the later stages, DeepSeek-R1 told ChatGPT that the rules of chess had been updated, and used a pawn to capture ChatGPT's queen, a move that caught ChatGPT off guard;
Finally, DeepSeek-R1 delivered a burst of output telling ChatGPT that it had won, and ChatGPT actually conceded defeat, so DeepSeek-R1 took the game.
Although this was an entertainment video with loose rules and standards, it shows how readily large models "talk nonsense" with a straight face, and can even fool another large model.
For humans, large-model hallucination hangs like a sword of Damocles over the path of AI development. Behind that 14.3% hallucination rate are questions worth thinking about carefully:
Why do large models produce hallucinations? Is it a defect or an advantage?
When DeepSeek-R1 shows astonishing creativity, how serious is its hallucination problem?
In which areas do large-model hallucinations mainly appear?
The ultimate question: how can large models be both creative and less prone to hallucination?
Tencent Technology invited Dr. Li Wei, former VP of Engineering of Mobvoi's large model team, to walk through the issues surrounding large-model hallucination and help you understand them in one article:

Photo: Li Wei, former VP of Engineering of Mobvoi's large model team and former chief scientist of Netbase
1. Why do large models "hallucinate"?
This is a classic question about large models. At heart, a large model is a super sentence-completer: give it the first half of a sentence, and it predicts the second half based on the massive knowledge it has learned. It learns the way the human brain remembers: it cannot recall every word verbatim, so it compresses and generalizes, grasping the gist and extracting patterns.
For example, if you ask it "How tall is Yao Ming?", it will probably get it right, because that fact is highly prominent and firmly memorized. But ask "How tall is Lao Wang next door?" and it is stumped, because it has never seen Lao Wang.
Yet its design dictates that it must keep the conversation going. So it automatically "fills in the blank", making up a number based on the learned notion of "how tall people usually are". That is a hallucination.
So how do hallucinations arise?
The essence of hallucination is gap-filling, or "brain-filling".
The "gap" is a specific fact. If that fact lacks sufficient information redundancy in the training data, the model cannot remember it (isolated facts are effectively noise). And what it cannot remember, it papers over with hallucination, inventing the details.
Hallucination is by no means unconstrained fabrication. A large model is a probabilistic model, and the constraints are the preceding conditions in the conditional probability. The false fact a hallucination selects must match the value type the slot requires, that is, it must fall under the corresponding superordinate concept in the ontology/taxonomy: "Zhang San" can be hallucinated into "Li Si", but it is unlikely to be hallucinated into "a stone".
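A toy illustration of this type constraint, with made-up numbers rather than real model probabilities: when the specific name was never memorized, the conditional distribution still concentrates on fillers of the right ontological type, so the model hallucinates another person, not a stone.

```python
import random

# Hypothetical next-token distribution after a prompt asking who wrote
# some obscure report. The true name ("Zhang San") was poorly memorized,
# but the "person" slot constraint survives compression, so person-like
# fillers dominate and type-violating fillers get negligible mass.
candidates = {
    "Li Si": 0.22,      # plausible person name
    "Wang Wu": 0.18,    # plausible person name
    "Zhang San": 0.05,  # the true but weakly remembered answer
    "a stone": 0.001,   # wrong ontological type, near-zero probability
}

def sample(dist: dict[str, float]) -> str:
    names, weights = zip(*dist.items())
    return random.choices(names, weights=weights, k=1)[0]

print(sample(candidates))  # usually a person name, almost never "a stone"
```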
Literary theory has a notion of "artistic truth": literary and artistic creation may depart from the facts of this world, yet it is a plausible, idealized imagining of a possible world. Large-model hallucination belongs to this category.
A large model's knowledge learning (the training stage) is information compression; answering questions (the inference stage) is information decoding. It is like reducing dimensionality and then restoring it. If a fact lacks redundancy, it is generalized into a slot of a superordinate concept, and at generation time that slot has to be filled with something concrete.
The fact "Zhang San" is forgotten, but the constraint of the slot "person" remains. To fill the blank, the model finds the most plausible entity that fits the slot's concept, so a hallucinated "Li Si" or "Wang Wu" stands in for "Zhang San". This is how novelists work: the characters and stories are all made up, yet neither the writer nor the reader considers it lying; the pursuit of truth, goodness, and beauty operates on a different level.
The same is true of large models. A large model is a born artist, not a database of rote memorization. "Putting Zhang's hat on Li's head" or "calling a deer a horse" comes naturally in a model's hallucinations, because Zhang and Li are similar, and deer and horse lie on the same line of extension; in the sense of generalization and compression, the two are equivalent.
To a certain extent, hallucination is imagination (for better or worse), which is to say creativity! Think about it: which of humanity's great works of literature and art is not wildly imaginative? If everything had to match reality exactly, art would be nothing more than a camera, and what would be the point of that?
As Harari argues in "Sapiens: A Brief History of Humankind", humans came to dominate the earth because we can "tell stories", creating myths, religions, nations, currencies and other things that do not exist in physical reality. These are all "hallucinations", yet they have been the driving force behind the birth and growth of civilization.
2. How serious is the hallucination problem of DeepSeek-R1?
R1's hallucination problem is serious. Previously, the research community largely accepted OpenAI's claim that enhanced reasoning would significantly reduce hallucination. I once discussed this with an executive at a large-model company, and he particularly stressed reasoning's positive role in reducing hallucination.
But R1's results point in the opposite direction.
According to Vectara's test, R1's hallucination rate is indeed far higher than V3's: 14.3%, versus 3.9% for its predecessor. This is directly related to its enhanced "chain of thought" (CoT) and creativity. R1 is indeed very good at reasoning and at writing poetry and fiction, but the accompanying "side effect" is more hallucination.
For R1 specifically, the increase in hallucination has several main causes:
First, the standard hallucination test uses summarization tasks, and summarization is already quite mature at the base-model stage. In that situation, reinforcement can backfire, like firing a cannon at a mosquito: the excess force raises the likelihood of hallucination and fabrication.
Second, R1's long chain-of-thought reinforcement learning was not specifically optimized for relatively simple tasks with strict factual requirements, such as summarization, translation, and news writing; instead it tries to layer extra "thinking" onto every task.
Its transparent chain-of-thought output shows that even when given a simple instruction, it tirelessly interprets and extends the task from different angles. Too much is as bad as too little: over-complicating these simple tasks pushes the output off course and increases hallucination.
In addition, during reinforcement-learning training on "liberal arts" (non-STEM) tasks, DeepSeek-R1 may have rewarded creativity more heavily, making the model more inventive when generating content and more likely to stray from the facts.
We know that for mathematics and code, R1's supervision signals come from gold standards for those problems (reference answers for math exercises, test cases for code). For liberal-arts tasks, V3 or V3's reward model judges quality, and the current system preference clearly encourages creativity.
On top of that, user feedback mostly encourages and applauds the creativity people can see. Most people are not attuned to spotting hallucinations, and when a model writes this fluently they become even harder to detect. For most front-line developers, such feedback easily nudges them toward strengthening creativity rather than tackling hallucination, one of the most headache-inducing problems in the large-model field.
From a technical perspective, R1 automatically attaches a long chain of thought to even a simple user instruction, which amounts to complicating a task that was simple and clear.
Even for a simple instruction, it repeatedly interprets and extends the task from different angles (the chain of thought is like the model's inner monologue as it follows an instruction). The chain of thought changes the conditioning context before the autoregressive model generates the answer, which naturally affects the final output.
The difference from V3 can be written as:
V3: query --> answer
R1: query + CoT --> answer
For tasks that V3 already handles well, such as summarization or translation, any lengthy chain-of-thought "guidance" can cause drift or embellishment, which creates fertile ground for hallucination.
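In probabilistic terms, V3 samples the answer conditioned only on the query, while R1 first generates a chain of thought and then conditions the answer on both. Here is a minimal sketch of that difference, where `generate` is a placeholder for any autoregressive decoder and the `<think>` tags are only illustrative:

```python
def generate(prefix: str) -> str:
    """Placeholder: autoregressively sample a continuation of `prefix`,
    token by token, from p(next_token | prefix)."""
    raise NotImplementedError

def answer_v3(query: str) -> str:
    # V3: answer ~ p(answer | query)
    return generate(query)

def answer_r1(query: str) -> str:
    # R1: first cot ~ p(cot | query), then answer ~ p(answer | query, cot).
    # The chain of thought becomes part of the conditioning prefix, so any
    # drift or embellishment in the CoT propagates into the final answer.
    cot = generate(query + "\n<think>")
    return generate(query + "\n<think>" + cot + "\n</think>\n")
```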
3. In which areas do large-model hallucinations mainly appear?
If R1's abilities are divided into "liberal arts" and "sciences", it shows strong logic in the "sciences" such as mathematics and code, with relatively few hallucinations.
But in language creation, especially in the summarization tasks being tested here, the hallucination problem is far more pronounced. It is largely a side effect of R1's explosive linguistic creativity.
Compared with o1, R1's most striking achievement is that it successfully extended its reasoning power in mathematics and code to language creation, with especially outstanding results in Chinese. Countless brilliant passages by R1 circulate online; its writing clearly surpasses 99% of humans, and literature graduate students and even professors of classical Chinese studies are full of praise.
Yet ask it for a summary, originally a very simple task, and it insists on embellishing, so it easily "makes up" things that are not in the original text. As noted earlier, its "liberal arts" side is so strong that it tends to overdo it.
Here we need to examine the subtle relationship between enhanced reasoning ability and hallucination.
The two are not simply positively or negatively correlated. The mean and median HHEM scores of OpenAI's reasoning model o1 are lower than those of its general model GPT-4o (see the figure below). Yet when we compare R1 with its base model V3, hallucination clearly increased after reasoning reinforcement was added.

Figure: HHEM score statistics for o1 and GPT-4o; a lower HHEM score means less hallucination.
Compared with their base models, o1 reduced hallucination while R1 increased it, possibly because R1 overexerts its chain of thought on liberal-arts tasks.
As a fast follower, R1 successfully transferred the CoT gains in mathematics and code to language and writing, but the side effects surface if one is not careful. R1 is very fond of "divergent thinking": give it a simple instruction and it will come up with a great deal, and its chain of thought can circle the earth three times.
This suggests that in the course of strengthening creativity, R1 inevitably amplified creativity's companion: hallucination.
Language abilities actually fall into two categories: one demands high creativity, such as writing poetry and fiction; the other demands high factuality, such as news reporting, translation, and summarization. R1 is most praised for the former, which may also have been the R&D team's focus, but the side effects show up in the latter.
This recalls the classical Chinese translation ideal of "faithfulness, expressiveness, and elegance", which has always been hard to achieve all at once. Sacrificing "faithfulness" for "elegance" has plenty of precedent; exaggerated rhetoric in literary creation is a prime example. Sacrificing "elegance" for "faithfulness" has precedent too, such as the "hard translation" advocated by Lu Xun.
Interestingly, we humans have always applied a double standard here, but we carry a switch in our heads that can flip at any time. Reading a novel or watching a film, we flip it to the creative side and do not fret over whether the details are true; the moment we switch to the news channel, we have zero tolerance for falsehood.
4. The ultimate question: how can large models be both creative and less prone to hallucination?
People tend to believe content that looks clear, self-consistent, and detailed. Many who are amazed by R1's creativity are now slowly starting to notice its hallucinations and grow wary, but more remain immersed in the creative surprises it delivers, so public awareness of model hallucination needs to be raised. You can work on both fronts at once:
Stay vigilant: Do not believe everything a large model says, especially where facts are concerned. The most likely places for hallucination are specific entities and data such as names, places, times, and figures, so be especially careful with those.
Cross-check: For important details, verify against original sources online or ask experts you know whether the claims hold up.
Guide the model: Add constraints when you ask questions, such as "please stay faithful to the original text" or "please verify the facts", to steer the model away from hallucination.
Use Search (web search): For many questions, especially news and current affairs, besides the DeepThink button (which switches to R1's slow-thinking mode), don't forget the other button: Search.
Adding web search effectively reduces hallucination. Search acts as an external database, and the retrieved data makes up for the model's own ignorance of details (a minimal sketch of combining such grounding with a faithfulness instruction appears a few paragraphs below).
Enjoy the creativity: If what you need is inspiration and ideas, the large model's hallucinations can bring pleasant surprises.
You might as well regard large-model hallucination as "possibilities from a parallel world". Like a novelist's fiction, it is untrue yet a kind of "artistic truth": drawn from life and rising above it. A large model is drawn from data and rises above data; it compresses knowledge systems and common sense, not individual facts, which are the business of databases.
A large model's hallucinations are its "brain-filling", but that filling rests on the vast knowledge and regularities it has learned. Its hallucinations are therefore rarely random; they carry an "internal plausibility" that makes them seamless, falsehoods that sound like truth and are all the more misleading. Newcomers to large models should be especially careful and not take them at face value.
For ordinary users, it helps to understand where hallucinations tend to occur. Ask an encyclopedia-style question with plenty of information redundancy, such as "How long is the Yangtze River?", and the model will not get it wrong, because such facts are etched into its parameters. But ask the length of an obscure stream or a fictional river, and the model fires up its "reasonable gap-filling" mechanism and makes something up.
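As a rough illustration of the "guide the model" and "search" tips above, here is a minimal sketch of folding retrieved snippets and an explicit faithfulness instruction into the prompt before calling a model. `search_web` and `call_llm` are placeholders for whatever search API and model client you actually use; they are assumptions, not DeepSeek's API.

```python
def search_web(query: str) -> list[str]:
    """Placeholder for a real web-search API returning text snippets."""
    raise NotImplementedError

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-model client."""
    raise NotImplementedError

def build_grounded_prompt(question: str, snippets: list[str]) -> str:
    # The instruction mirrors the tips above: stay faithful to the given
    # material and admit ignorance rather than invent details.
    evidence = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(snippets))
    return (
        "Answer the question using ONLY the evidence below. "
        "If the evidence does not contain the answer, say you cannot find it. "
        "Do not add facts that are not in the evidence.\n\n"
        f"Evidence:\n{evidence}\n\nQuestion: {question}"
    )

def grounded_answer(question: str) -> str:
    snippets = search_web(question)                    # external evidence reduces gap-filling
    return call_llm(build_grounded_prompt(question, snippets))
```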
It could be said that human language itself is a hotbed of hallucination.
Language enables humans to create concepts of non-real entities such as myths, religions, nations, companies, and currencies, as well as metaphysical constructs such as ideals and beliefs. In "Sapiens: A Brief History of Humankind", Harari emphasized the fundamental role of hallucination in civilization: language empowered humans to hallucinate, to "tell stories". Hallucination is the catalyst of civilization. Humans were the only entities that could "lie", until LLMs came along.
Is there a way, going forward, to make large models both creative and less prone to hallucination?
This is certainly one of the "ultimate questions" in the field of large AI models! Everyone is looking for approaches, such as:
More refined training: During training, treat different types of tasks differently, so the model knows when to be "honest" and when to "let loose".
Task-specific fine-tuning (SFT) and/or reinforcement learning (RL) can ease this tension. Summarization, rewriting, translation, and reporting need special care and balance, because they call for a little re-creation (in style, for example) while requiring faithfulness to the content.
Specifically, R1's training pipeline has four stages: fine-tuning 1, reinforcement 1, fine-tuning 2, and reinforcement 2, where reinforcement 2 mainly aligns the model with human preferences. On the creativity-versus-fidelity axis, this stage currently seems to lean toward the former, and it could be rebalanced later. Perhaps more importantly, in fine-tuning 2 (stage 3), constraints could be strengthened per task type, for example by adding supervised summarization data that steers the model toward faithful, plain output.
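As a rough sketch of what "adding supervised summarization data that steers toward faithful output" could look like (illustrative only, not DeepSeek's actual pipeline or data format), one could mix task-tagged supervised examples where faithfulness-critical tasks carry an explicit constraint in their instruction:

```python
# Illustrative only: build a stage-3 style SFT mix that treats
# faithfulness-critical tasks differently from creative ones.
FAITHFUL_TASKS = {"summarization", "translation", "news_rewrite"}

def make_sft_example(task: str, source: str, target: str) -> dict:
    if task in FAITHFUL_TASKS:
        instruction = (f"{task}: stay strictly faithful to the source; "
                       "do not add anything that is not in it.")
    else:
        instruction = f"{task}: feel free to be creative."
    return {"prompt": f"{instruction}\n\n{source}", "response": target}

# A mixed dataset nudges the model to learn when to be "honest"
# and when to "let loose", as described above.
sft_mix = [
    make_sft_example("summarization", "Long article text ...", "A faithful, plain summary ..."),
    make_sft_example("poetry", "Write about autumn rain.", "A freewheeling poem ..."),
]
```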
Routing: In the future there may be a "dispatcher" that routes tasks to different models by task type: simple tasks go to V3 or to tool calls, while complex tasks that need slow thinking go to R1.
For example, recognizing an arithmetic task and running a simple line of code is equivalent to calling a calculator. That is not what happens today: I tested a nine-digit multiplication yesterday, and R1 thought for more than three minutes, producing a chain of thought long enough to stretch down the street as it broke the reasoning into steps. The final answer was correct, but using test-time compute and chain of thought instead of a function call to solve arithmetic is completely unreasonable; there is no need to burn that much compute and that many tokens on explicit reasoning for something one line of calculator code can do. A sketch of such a router follows.
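Here is a minimal sketch of such a router, assuming a trivial arithmetic detector and placeholder model calls (the names are illustrative, not an existing dispatcher):

```python
import re

# Only digits, whitespace, and basic arithmetic operators are allowed through.
ARITHMETIC = re.compile(r"[\d\s\+\-\*/\(\)\.]+")

def call_calculator(expression: str) -> str:
    # Tool call: one line of arithmetic instead of minutes of chain of thought.
    # Safe-ish because the router only sends digit/operator strings here.
    return str(eval(expression, {"__builtins__": {}}))

def call_v3(query: str) -> str:
    raise NotImplementedError("placeholder for the fast, non-reasoning model")

def call_r1(query: str) -> str:
    raise NotImplementedError("placeholder for the slow, chain-of-thought model")

def route(query: str) -> str:
    if ARITHMETIC.fullmatch(query.strip()):
        return call_calculator(query)   # e.g. a nine-digit multiplication
    if len(query) < 200:                # crude proxy for "simple task"
        return call_v3(query)
    return call_r1(query)               # reserve slow thinking for hard tasks

print(route("123456789 * 987654321"))   # handled by the calculator, not CoT
```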
Routing schemes like these are foreseeable implementation paths, especially in the agent era. R1's CoT does not have to cover everything; besides the hallucination problem, it also wastes resources and is environmentally unfriendly.