Can we trust AI to judge AI? And I'm speaking at DevWorld!


This week’s newsletter dives into the challenge of evaluating LLM output: how can we trust AI to judge AI? I will be speaking about this topic at the DevWorld conference in Amsterdam this week! Related to this topic: how can we make AI forget incorrect information and behavior? Plus, I’m excited to have Robert join me on this newsletter journey! Together, we’ll explore even more AI topics :)

I’m talking at DevWorld next week in Amsterdam!

I was invited to speak at DevWorld 2025, where I will talk about how we can trust AI, as well as my experiences and struggles with LLM evaluation. If you are there, don’t be shy, come say hello! :-)


The challenge of LLM Evaluation

The more AI is used, the more opportunity there is for catastrophic failure. We’ve already seen many failures in the short time that foundation models have been around. A man committed suicide after being encouraged by a chatbot.

I think evaluation is one of the hardest, if not the hardest, challenges of AI engineering.

Due to the cumbersome and unscalable nature of human evaluation, LLMs are increasingly used to evaluate LLM-generated outputs. Yet LLM evaluators simply inherit all the problems of the LLMs they evaluate, and therefore require further human validation themselves.

I am somewhat skeptical about LLM-based evaluation, but what is the alternative? If we ask humans to check every single LLM output, we are simply trading one manual task for another, killing our use case.

The question is: how do we evaluate the quality of the LLM evaluator?
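To make this concrete, here is a minimal sketch of what an LLM-as-judge call might look like. It assumes the OpenAI Python client; the model name, rubric, and 1–5 scale are illustrative placeholders, not a recommendation.

```python
# Minimal LLM-as-judge sketch. Assumes the OpenAI Python client (>= 1.0) and an
# OPENAI_API_KEY in the environment; model name and rubric are placeholders.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an answer produced by another AI system.
Question: {question}
Answer: {answer}
Rate the answer's factual correctness from 1 (wrong) to 5 (fully correct).
Reply with only the number."""

def judge(question: str, answer: str) -> int:
    """Ask the judge model to score an answer; returns the numeric rating."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

print(judge("What is the capital of France?", "Paris"))
```

The judge is itself an LLM, so its scores need to be validated against human judgment, which is exactly what the next section is about.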

Evaluate the LLM Evaluator

One way to do this is to check whether the LLM evaluator aligns with human evaluators. We compare human annotations of LLM output against automated (LLM-based) evaluations to measure how well LLM evaluation aligns with human judgment. A high alignment score gives us confidence in the quality of our LLM system.

The goal is not just to evaluate the LLM’s output, but also to evaluate how well the LLM itself can be used as an evaluator (LLM-based evaluation).
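As an illustration, here is a small sketch of how that alignment check might look, assuming we already have human labels and LLM-judge labels for the same set of outputs. It uses scikit-learn’s Cohen’s kappa as the agreement metric; other metrics may suit your task better, as the flowchart below lays out.

```python
# Sketch of measuring human/LLM-judge alignment, assuming binary pass/fail labels
# collected for the same LLM outputs. The labels here are toy data.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human_labels = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]      # human annotations (1 = good output)
llm_judge_labels = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]  # LLM-as-judge verdicts on the same outputs

# Raw agreement is easy to read, but kappa corrects for agreement by chance.
print("Agreement:", accuracy_score(human_labels, llm_judge_labels))
print("Cohen's kappa:", cohen_kappa_score(human_labels, llm_judge_labels))
```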

Eugene Yan created an excellent flowchart to guide you in choosing the right evaluation metrics for assessing LLM evaluators, shown below.


Here is the link to Eugene’s blog post. It is a long read, but you will thank yourself for reading it. Enjoy!

Unlearning AI: Making AI Forget Its Mistakes

Large language models are becoming more powerful, but so is their ability to generate convincing yet incorrect responses. Moreover, their behaviors and values might not always align with ours. Traditional training methods like reinforcement learning with value alignment help steer AI toward desirable behaviors, but they don’t remove incorrect or harmful knowledge, meaning models can still hallucinate and generate unsafe content.

But what if we could make AI forget its mistakes altogether?

A new approach called machine unlearning aims to surgically remove specific pieces of knowledge from AI models, ensuring they no longer retain or reproduce unwanted information. While unlearning has been explored in computer vision models, it’s still an emerging field for large language models.

In a recent Nature publication (free arXiv version here), Liu and colleagues outline various techniques to achieve LLM unlearning, including:

  1. Fine-tuned model editing – Retraining the model on a corrected dataset to override faulty knowledge.
  2. Negative gradient updates – Identifying harmful training examples and reversing their influence on the model (sketched after this list).
  3. Weight surgery – Modifying the model’s neurons or layers responsible for incorrect behavior.
  4. Reinforcement learning with forgetting – Penalizing AI for recalling specific unwanted facts.
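As a rough illustration of the second idea, here is a minimal gradient-ascent sketch, assuming a Hugging Face causal LM and a tiny "forget set"; the model name and forget text are placeholders, and this is not the exact procedure from the paper.

```python
# Sketch of "negative gradient" unlearning: ascend the loss on a forget set so the
# model becomes less likely to reproduce those tokens. Toy-scale, for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Tiny "forget set": text the model should no longer state.
forget_texts = ["An example of an unwanted fact the model should no longer state."]

for text in forget_texts:
    batch = tokenizer(text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])
    # Negate the language-modeling loss: stepping "uphill" on the forget set
    # pushes the model away from generating these tokens.
    (-outputs.loss).backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice this is usually balanced against a loss on a "retain set" of normal data, precisely because of the performance-degradation risk mentioned in the next paragraph.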

Despite these advances, there’s no perfect unlearning method yet. Key challenges remain, such as verifying whether unlearning was successful, preventing the model from relearning forgotten information, and ensuring selective forgetting doesn’t degrade overall performance.

Just a few days ago, Perplexity released an uncensored version of DeepSeek R1. While the original model would either refuse to answer or answer incorrectly in accordance with Chinese government censorship policy (e.g. anything related to the independence of Taiwan), the new finetuned model answers all questions willingly and truthfully. The unlearning was performed by finetuning R1 on a curated dataset of about 40,000 prompts covering over 300 sensitive topics.

Why is unlearning so important? In the words of Perplexity:

“At Perplexity, we aim to provide accurate answers to all user queries. This means that we are not able to make use of R1’s powerful reasoning capabilities without first mitigating its bias and censorship.”

What we were watching and reading this week

  • AI Engineering book by Chip Huyen. I am six chapters into this book. It discusses the process of building applications with available foundation models, covering everything from prompt engineering, RAG, and agents to evaluation. It also introduces a practical framework for developing and deploying AI applications. Full review when I am done.

  • AI as a co-scientist? Google just released a new agent to help scientists generate new ideas and draft research proposals.

Join Our Newsletter

Subscribe
