Poor long-context performance, model safety, AI and Love
Today we discuss how well language models really perform when given large amounts of context, results from a Microsoft survey showing that over-reliance on genAI impacts our critical thinking, and some recent safety concerns regarding DeepSeek’s latest reasoning model, R1.
Long-context performance scores
In our previous newsletter, we mentioned CAG - a new way to perform Augmented Generation, without Retrieval.
When the context size of an LLM is large enough, could we just pass the entire knowledge base to the LLM in one go? This is exactly what CAG proposes. However, choosing CAG over RAG comes with a tradeoff: with CAG, you get 100% context recall at the expense of (potentially dramatically) lower context precision. For larger collections, this noise may become overwhelming and reduce the quality of the LLM’s answers, due to e.g. the “Lost in the middle” effect.
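To make the tradeoff concrete, here is a minimal sketch of the two prompting strategies. The helper names (`build_cag_prompt`, `build_rag_prompt`, `embed`) and the passage-level chunking are illustrative assumptions, not any specific library’s API.

```python
from typing import Callable


def build_cag_prompt(question: str, knowledge_base: list[str]) -> str:
    """CAG: stuff the entire knowledge base into the context window.
    Recall is 100% by construction, but precision drops as the collection
    grows and irrelevant passages crowd the prompt."""
    context = "\n\n".join(knowledge_base)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def build_rag_prompt(
    question: str,
    knowledge_base: list[str],
    embed: Callable[[str], list[float]],
    top_k: int = 5,
) -> str:
    """RAG: retrieve only the top-k passages most similar to the question,
    trading some recall for a much smaller, more precise context."""

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        return dot / (norm_a * norm_b + 1e-9)

    q_vec = embed(question)
    ranked = sorted(knowledge_base, key=lambda p: cosine(embed(p), q_vec), reverse=True)
    context = "\n\n".join(ranked[:top_k])
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
```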
This degradation is exactly what a recent research paper has shown: with longer contexts, performance drops.
Source: NoLiMa: Long-Context Evaluation Beyond Literal Matching
A common test, known as the “needle-in-a-haystack”, assesses an LLM’s ability to locate a specific piece of information (“needle”) within a vast amount of irrelevant data (“haystack”).
However, these tests frequently allow models to find answers through direct word matching between the question and the text. For example, given the question, “Who was the first person to land on Mars?” and a text stating, “…and after years of preparation, Sarah Chen became the first person to land on Mars…”, the model can easily locate the answer by matching the words “first person to land on Mars” in both the question and the text.
To address this limitation, the “NoLiMa” benchmark has been introduced to evaluate whether AI models can genuinely comprehend and reason over long texts without relying on superficial word matches.
In NoLiMa, a key fact (the “needle”) is inserted into a lengthy passage of irrelevant information (the “haystack”). The model is then asked a question that requires identifying and using this fact. The challenge is that the question and the key fact are phrased using different terms, requiring the model to use its “reasoning” skills.
For example:
Question: “Which character has been to Dresden?”
Hidden fact: “Yuki lives next to the Semper Opera House” (opera house in Dresden)
The AI needs to recognize that the Semper Opera House is in Dresden to make the connection. This evaluates whether it can apply real-world knowledge to bridge gaps, rather than relying on word matching.
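For illustration, here is a minimal sketch of how such a NoLiMa-style item could be constructed and scored. `ask_model`, `build_haystack`, and the filler text are hypothetical stand-ins, not the benchmark’s actual code; the point is that no keyword from the question appears in the needle.

```python
import random
from typing import Callable


def build_haystack(needle: str, filler_sentences: list[str], needle_position: float) -> str:
    """Insert the needle at a relative position (0.0 = start, 1.0 = end)
    inside a long stretch of irrelevant filler text."""
    idx = int(needle_position * len(filler_sentences))
    return " ".join(filler_sentences[:idx] + [needle] + filler_sentences[idx:])


def evaluate_item(ask_model: Callable[[str], str], filler_sentences: list[str]) -> bool:
    needle = "Yuki lives next to the Semper Opera House."
    question = "Which character has been to Dresden?"
    haystack = build_haystack(needle, filler_sentences, needle_position=random.random())
    prompt = f"{haystack}\n\nQuestion: {question}\nAnswer with the character's name."
    answer = ask_model(prompt)
    # Scored on whether the model bridges the gap to "Yuki"; since the question
    # shares no words with the needle, literal matching alone cannot succeed.
    return "yuki" in answer.lower()
```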
Results show that as context length grows, even top models like GPT-4o see accuracy drop from **99.3% to 69.7%** at 32K tokens.
Perhaps long-context AI isn’t solved after all, as it still struggles with real understanding.
Over-reliance on AI impacts our critical thinking negatively
By now most of us use AI on a daily basis in our professional and personal lives. There’s no question about its impact on our professional work, but have you ever wondered how genAI affects our critical thinking? Microsoft recently published a report based on surveys of hundreds of professionals to find out how using (and not using) tools like ChatGPT and Copilot affects their critical thinking. Unsurprisingly, using AI makes life much easier for tasks that require remembering facts, understanding ideas, and solving problems. Moreover, using AI leads to a shift in thinking: we now focus more on verifying and integrating the AI’s output. In general, the survey shows that professionals need less cognitive effort to complete their tasks.

However, the results differ depending on whether a professional places their confidence in the AI or in themselves. Confidence in the AI leads to reduced critical thinking, with people becoming over-reliant on it, especially for routine and mundane tasks. Self-confidence showed the opposite effect: by staying skeptical of the AI and constantly verifying, understanding, and challenging its outputs, professionals who trust themselves actually invest more cognitive effort and critical thinking when using genAI.
How safe is using DeepSeek R1?
DeepSeek’s reasoning models are getting more popular every day, but so is the amount of criticism of them. In our last newsletter we mentioned that reinforcement learning is used to ensure that chat and reasoning models align with our expectations and values; for instance, a model should not produce racist or harmful content. Designing prompts to bypass these kinds of safety layers is called jailbreaking, and a recent report by Cisco reveals that a simple automated jailbreak framework achieved a 100% success rate against R1. Although other models performed better, it still shows that if you really want to, you can trick an AI into generating almost any content you would like: from designing nuclear bombs and helping to pick locks, to spreading misinformation and persuading voters to vote a certain way.

DeepSeek also faced further embarrassment in the news recently, when it was revealed that over a million lines of chat logs, along with many API keys, were exposed due to a lack of proper security protocols.
What we were reading this week
- Want to experience a new kind of love story? I can’t recommend Companion enough. The movie sits at the intersection of humanity and technology and raises philosophical questions about what defines living. Or a life. Go watch it in the cinema and let me know what you think of it.
- DeepMind achieves gold-medal accuracy on math olympiad (article)
- An AI that faked being aligned with human values? (article)
- AI is shown to engage in scheming, hiding its true objectives (article)
- Not long before ChatGPT was released, researchers were already warning about the potential dangers and impact of “stochastic parrots”; four years later, the warning is more relevant than ever (article)