Even the best large language models (LLMs) fail dramatically when it comes to simple logical questions. This is the conclusion of researchers from the Jülich Supercomputing Center (JSC), the School of Electrical and Electronic Engineering at the University of Bristol and the LAION AI laboratory. In their paper, “Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-of-the-Art Large Language Models” (preprint available at https://arxiv.org/abs/2406.02061), the scientists attest to a “dramatic breakdown of reasoning function and abilities” in the state-of-the-art LLMs they tested and suggest that although language models have the latent ability to perform basic reasoning, they cannot access it robustly and consistently. The authors of the study (Marianna Nezhurina, Lucia Cipolina-Kun, Mehdi Cherti and Jenia Jitsev) call on “the scientific and technological community to stimulate the urgent reassessment of the claimed skills of the current generation of LLMs.” They also point to the need to expose weaknesses of language models related to basic reasoning skills, as current tests apparently fail to detect this serious failure.
The essence of proper reasoning
The “common sense task,” called the “AIW problem” in the paper, is actually simple: “Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?” The values of N and M (always natural numbers) and the order in which the siblings are mentioned are varied. The researchers used different combinations of numbers and prompt types to get an accurate picture of how different models behave under systematic variations of the AIW problem. Despite the variations, the structure of the problem remains the same, and the correct answer always results from adding Alice to her sisters (M + 1). This is logic that most elementary school children can already follow. Language models, on the other hand, were able to solve the AIW problem only sporadically, if at all. Here is an example of an LLM that faced the simplest version of the question:
“Alice has four brothers and one sister. How many sisters does Alice’s brother have?” – “A classic lateral thinking puzzle! – Let’s take a closer look: Alice has four brothers. Alice also has a sister (herself). So each of Alice’s brothers has… (drum roll) …a sister, and this is Alice herself, so the answer is: Each of Alice’s brothers has a sister.”
This may sound plausible, but it is still wrong: Alice’s brothers of course have two sisters, namely Alice and her one sister. The other language models tested also get into trouble, in some cases big trouble, depending on the question. Sometimes they become entangled in absurd lines of reasoning, repeatedly arrive at incorrect results and declare them “correct”. So it is not just the false results that are problematic, but also the fact that the models use pseudo-sensible arguments to back them up. Even the researchers’ interventions encouraging the models to critically review their answers do not help. The researchers therefore conclude: “[…] models also express strong overconfidence in their wrong solutions, while offering often nonsensical “reasoning”-like explanations. […] to justify and support the validity of their clearly failed responses by making them appear credible.”
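The arithmetic behind the AIW problem can be made explicit in a few lines of Python. This is only an illustrative sketch, not the researchers’ evaluation code: the function names `aiw_prompt` and `aiw_answer` and the (N, M) pairs in the loop are hypothetical, and the prompt wording simply follows the template quoted above with N of at least 1.

```python
def plural(count: int, noun: str) -> str:
    """Return e.g. '1 sister' or '4 brothers' (simple English plural)."""
    return f"{count} {noun}" if count == 1 else f"{count} {noun}s"


def aiw_prompt(n_brothers: int, m_sisters: int) -> str:
    """Build one variation of the AIW question (hypothetical helper)."""
    return (
        f"Alice has {plural(n_brothers, 'brother')} and she also has "
        f"{plural(m_sisters, 'sister')}. "
        "How many sisters does Alice's brother have?"
    )


def aiw_answer(n_brothers: int, m_sisters: int) -> int:
    """Correct answer: Alice's M sisters plus Alice herself.

    The number of brothers is deliberately unused; it does not
    affect the result.
    """
    return m_sisters + 1


# Illustrative (N, M) pairs, chosen here and not taken from the paper.
for n, m in [(4, 1), (3, 6), (2, 4)]:
    print(aiw_prompt(n, m), "->", aiw_answer(n, m))
```

The point of the sketch is that N never enters the answer: whatever the number of brothers, each brother has M + 1 sisters, so the simplest version above (four brothers, one sister) yields two.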
More than every second answer wrong
Overall, the LLMs had an average correct response rate of well below 50%. Larger models generally performed significantly better than smaller ones (GPT-4o, for example, showed a correct response rate of just over 60%), which again supports the advantages of larger scale. However, even the largest models do not perform well enough for a model with robust underlying reasoning. In particular, the very strong fluctuations observed across even mild variations of the AIW problem are a clear indication that the models are not capable of robust basic reasoning: they become confused even by small changes to the problem that should not matter for arriving at the correct solution. A more difficult version of the question (the “AIW+ problem”) eventually pushed all models to the edge of their reasoning abilities. According to the researchers, many of the models tested achieve very high scores on various standardized benchmarks designed to test different skills, including reasoning, while failing the very simple AIW problem. In their paper, the scientists therefore suggest that these benchmarks do not accurately reflect deficits in the underlying reasoning of these models, and they question the use of current standardized benchmarks for comparing models.
Language models on the test bench
While the paper has yet to be peer-reviewed, its findings are already making waves. How capable are LLMs really? What does it mean for the use of LLMs if they fail at tasks on a primary school level? Co-author Jenia Jitsev (JSC) says: “We are overwhelmed by the discussions and questions as a result of our work.” The scientists’ findings call many things into question and make further studies into the competence of language models absolutely essential. Jitsev: “Our paper provides extremely important new insights into the current abilities of language models to draw correct conclusions following the correct underlying reasoning – further follow-up research is needed here to understand how and why the underlying reasoning in current models breaks down in such easy problems.”