Created in March 2025
Text extraction from images with Google Gemma 3
Google just released Gemma 3, a family of lightweight multimodal models delivering performance comparable to larger models while running on a single GPU or TPU.
The model family comes in four sizes (1B, 4B, 12B, and 27B parameters). The 27B model ranks second on LMArena, just behind DeepSeek-R1 and ahead of larger models like Llama-405B, DeepSeek-V3, and o3-mini, making it one of the strongest open models available.
So of course I have to test it out. I quickly built an app that extracts text from images using gemma-3-4b and gemma-3-12b on my 2024 M3 MacBook Air.
You can pull the model from Ollama and wire up a Streamlit app in just a few lines of code. The 4B model ran rather fast but still produced quite a few inaccuracies, so definitely not something I would use.
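A minimal sketch of that setup, not my exact code: it assumes you have a local Ollama server running, have done `ollama pull gemma3:4b`, and have the `streamlit` and `ollama` Python packages installed. The prompt text is illustrative.

```python
# Minimal sketch: Streamlit front end that hands an uploaded image to a
# locally running Gemma 3 via Ollama. Run with: streamlit run app.py
import ollama
import streamlit as st

st.title("Text extraction with Gemma 3")

uploaded = st.file_uploader("Upload an image", type=["png", "jpg", "jpeg"])
if uploaded is not None:
    st.image(uploaded)
    if st.button("Extract text"):
        response = ollama.chat(
            model="gemma3:4b",  # swap in gemma3:12b if you have the RAM
            messages=[{
                "role": "user",
                "content": "Extract all text from this image, verbatim.",
                "images": [uploaded.getvalue()],  # raw bytes of the upload
            }],
        )
        st.write(response["message"]["content"])
```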
I attempted to pull the 12B model but ran into memory issues. With some patience I got it running on the MacBook (though I had to close all my apps to free up enough RAM :p), and it definitely performed better, though still not as well as I had hoped: there were still inaccuracies here and there.
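A quick back-of-envelope calculation shows why the 12B model is tight on a laptop. The bits-per-weight figure below is an assumption based on Ollama's default roughly 4-bit quantized builds, not a measured number:

```python
# Rough RAM estimate for the 12B model. The bits-per-weight figure is an
# assumption (Ollama ships ~4-bit quantized builds by default); real usage
# adds the KV cache, the vision encoder, and runtime overhead on top.
params = 12e9                  # 12B parameters
bits_per_weight = 4.5          # assumed ~4-bit quantization (e.g. Q4_K_M)
weights_gb = params * bits_per_weight / 8 / 1e9
print(f"Weights alone: ~{weights_gb:.1f} GB")  # ~6.8 GB
# Add a few GB of KV cache and overhead, and a MacBook Air sharing its
# unified memory with other open apps runs out of headroom quickly.
```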
Still, running on just a single GPU, these models hit a sweet spot: open-source, multimodal, and small and fast enough to be deployed across devices.