Google introduces Gemma 3, a lightweight model family ranking second on Elo score, just behind DeepSeek-R1
Posted on
Google just released Gemma 3, a family of lightweight multimodal AI models built from the same technology as Gemini 2.0 — delivering performance comparable to much larger models while running on a single GPU or TPU.

The model family comes in four sizes (1B, 4B, 12B, and 27B parameters). The 27B model ranks second on LMArena, just behind DeepSeek-R1, and outperforms much larger models like Llama-405B, DeepSeek-V3, and o3-mini, making it one of the strongest open models available.
So of course I had to test it out. I quickly built an app that extracts text from images and tested it on my personal laptop, a 2024 M3 MacBook Air. You can pull the model from Ollama and create a Streamlit app in less than 50 lines of code. If you want to create the app yourself, please see the code here. The 4B model was quite fast but still generated some inaccuracies here and there.
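The core of such an app is a single chat call to the locally pulled model. Here is a minimal sketch, assuming the Ollama Python client is installed and the `gemma3:4b` tag has been pulled (the prompt wording is my own); a Streamlit front end only needs a file uploader wrapped around the same function:

```python
import sys

PROMPT = "Extract all text you can read in this image. Return only the text."

def extract_text(image_path: str, model: str = "gemma3:4b") -> str:
    """Send an image to a locally running Ollama model and return its reply."""
    import ollama  # assumes `pip install ollama` and a running Ollama server

    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": PROMPT, "images": [image_path]}],
    )
    return response["message"]["content"]

if __name__ == "__main__":
    print(extract_text(sys.argv[1]))
```

In a Streamlit version, `st.file_uploader` would supply the image and `st.write` would display the result, which is how the whole app stays under 50 lines.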

Running on just a single GPU, these models hit a sweet spot of being open-source, powerful, multimodal, small and fast enough to be deployed across devices.
Mistral OCR - the world's best document understanding API?
Mistral OCR filled the news last week, so we tested it out so you don't have to!
It is an OCR API that understands each element of a document (media, text, tables, equations), takes images and PDFs as input, and extracts their content.
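For reference, here is a hedged sketch of what a call looks like. The client class, the `ocr.process` method, and the `mistral-ocr-latest` model tag follow Mistral's Python SDK at the time of writing and may have changed since; the per-page `markdown` field is likewise an assumption about the response shape:

```python
def ocr_pdf(url: str) -> str:
    """Run Mistral OCR on a PDF at a public URL and return its text as Markdown."""
    import os
    from mistralai import Mistral  # assumed client; `pip install mistralai`

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])
    response = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": url},
    )
    # The response is assumed to contain one Markdown string per page.
    return "\n\n".join(page.markdown for page in response.pages)
```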
Is Mistral OCR impressive? Yes, definitely! Is it the world's best document understanding API? We are not sure!
We compared Mistral OCR's performance with a few other stars: Docling, the Adobe PDF Extraction API, and Agentic Document Extraction (from LandingAI by Andrew Ng), which we mentioned in our previous newsletter.
Our test document contains lots of tables and all these packages managed to extract most of the numbers correctly, which is super impressive.
However, LandingAI, Docling, and Mistral OCR each got some numbers wrong here and there. Only my favourite, the Adobe PDF Extraction API, never lets me down! :) In the following example, I am really curious how Mistral OCR misread the number "3" exactly once in that table, while all the other numbers were read correctly :')

For image handling, both Mistral and LandingAI are supposed to extract text content from images. In my example, however, Mistral OCR did not parse the image at all, while LandingAI did a great job and extracted all the right numbers:


In any case, I think they are all super impressive. What do you think? What is your go-to tool for parsing documents?
Will 2025 be the year of AI agents?
2025 seems to be all about AI agents. Almost every day, new frameworks, tools, and dedicated APIs for agentic LLMs are announced. Agents, however, have been around for quite some time. Already in the 80s, people were thinking about agentic document retrieval, and early strategy video games in the 90s had computer opponents. More recently we have seen big advances with AlphaFold, which can accurately predict the structure and properties of proteins, and with human-like robotics from Boston Dynamics and many others.
The focus of AI agents seems to be shifting towards language models. By giving them access to various tools, we empower them to interact with a bigger environment and work towards achieving complicated goals. But again, giving a language model access to tools is not that new: OpenAI already added tool calling as a feature in the summer of 2023.
So what makes 2025 so different from the other years?
Let’s first discuss why we would want agentic language models in the first place. While self-driving cars and walking robots are examples of agentic AI, they require large amounts of training and in the end can only do what they were trained on. Language models, on the other hand, have the potential to generalize to almost any application with just a few shots of prompting.

A few other qualities of large language models make them attractive as AI agents. They have a sense of autonomy: they can come up with their own ideas without direct human intervention. The nature of prompting allows the AI to react quickly to new information. And not only can they react to new information, they can use it to take initiative by asking new questions, reasoning, and making plans.
Letting a language model interact with its environment is usually done through tooling. While empowering AI with tool usage has been around since mid-2023, AI has so far had difficulty using tools accurately: either the AI wasn’t smart enough to use the tool when needed, or hallucinations caused the tool to be used incorrectly. Two game changers last year removed this hurdle: structured outputs (where the AI is guaranteed to pass the correct format to the tool) and reasoning (think o3, R1, Sonnet 3.7).
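To make the structured-outputs point concrete, here is a minimal sketch: once the model is constrained to emit JSON matching a known schema, dispatching the tool call becomes trivial and cannot be malformed. The tool registry and the simulated model reply below are illustrative, not any specific vendor's API.

```python
import json

# A tiny tool registry; in a real agent these would be actual functions.
TOOLS = {
    "get_weather": lambda city: f"Sunny in {city}",
}

# With structured outputs, the model is constrained to emit JSON of the form
# {"tool": ..., "arguments": {...}}, so the parsing and dispatch below can
# never fail on malformed output -- the typical pre-2024 failure mode.
model_reply = '{"tool": "get_weather", "arguments": {"city": "Paris"}}'

call = json.loads(model_reply)
result = TOOLS[call["tool"]](**call["arguments"])
print(result)  # Sunny in Paris
```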
Researchers from Berkeley created a leaderboard to track how accurately AI can use tools and functions of different forms. The tests cover a wide range of tools, from SQL commands to abstract API calls, and also measure how well the AI can combine different tools to solve more complicated problems. The results are updated about every three weeks:

(In the plot, I added some jitter to show the distribution of the data; each dot in a column represents a different AI model.) The results… don’t look that exciting. Although we see a small upward trend, as of early 2025 we have yet to pass the 80% accuracy threshold.
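For the curious, the jitter trick is just small random noise added to each point's x position so that overlapping dots in a column become visible; the accuracy values below are made up for illustration.

```python
import random

# Accuracy scores that would all sit at the same x position (one column).
accuracies = [0.62, 0.62, 0.63, 0.71, 0.74]
x_center = 1.0

# Spread the dots horizontally by up to +/-0.1 so they no longer overlap;
# the y values (the accuracies themselves) stay untouched.
jittered_x = [x_center + random.uniform(-0.1, 0.1) for _ in accuracies]
```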
And although there are a few promising AI agents out there, such as Manus AI, we need more than clever web interaction and code generation to get the personal AI assistant we all wish for ;)
What we’ve been reading/watching this week
- A Disney worker downloaded an AI tool from GitHub that contained malware, leading to not only all of his personal passwords being leaked, but over 1 TB of Disney’s data as well.
- Sakana AI’s system autonomously wrote a full paper that passed peer review and got accepted for publication.