Podcast Episode
Google Launches Encoder-Free Gemma 4 12B, Bringing Multimodal AI to 16GB Laptops
June 4, 2026
Google DeepMind has released Gemma 4 12B, an open-source multimodal model that processes text, images, and audio without dedicated encoders. It fits within 16GB of memory, making see-and-hear AI inference possible on standard consumer hardware. Google says it nears the performance of its 26B variant at less than half the memory footprint.
Google Drops the Encoders
Google DeepMind has released Gemma 4 12B, an open-source multimodal model that handles text, images, and audio without relying on dedicated encoder modules. It is the first mid-sized open-weight model to take this approach. The 12-billion-parameter model runs within 16GB of VRAM or unified memory, putting multimodal AI inference within reach of ordinary consumer laptops rather than data-centre hardware.
The release fills a long-standing gap in the Gemma 4 family, which launched in April with four variants: the edge-optimised E2B and E4B models, a 26B Mixture of Experts configuration, and a 31B Dense version. Those earlier models leaned on vision transformer layers and conformer-based audio encoders. The new 12B variant scraps both in favour of what Google calls a "Unified" architecture.
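The 16GB figure can be sanity-checked with back-of-envelope arithmetic. The sketch below estimates the memory taken by the weights alone at a few common precisions. The article does not say which precision Google's figure assumes, so the bit widths are illustrative assumptions, and the estimate ignores activations, the KV cache, and runtime overhead.

```python
# Rough weights-only memory estimate for a 12-billion-parameter model.
# Assumption: the bit widths below are illustrative; the article does not
# state the precision behind the 16GB claim.

GIB = 2 ** 30
PARAMS = 12e9  # 12 billion parameters, per the article


def weight_gib(n_params, bits_per_param):
    """Memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / GIB


def fits(n_params, bits_per_param, budget_gib=16.0):
    """True if the weights alone fit within the memory budget."""
    return weight_gib(n_params, bits_per_param) <= budget_gib


for bits in (16, 8, 4):
    size = weight_gib(PARAMS, bits)
    verdict = "fits" if fits(PARAMS, bits) else "does not fit"
    print(f"{bits:>2}-bit weights: {size:5.1f} GiB ({verdict} in 16 GiB)")
```

Under these assumptions, 16-bit weights alone would exceed a 16GB budget, which suggests the laptop claim relies on 8-bit or lower quantisation. That is an inference from the arithmetic, not something the article states.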
How the Encoder-Free Design Works
In a conventional multimodal model, separate encoder modules translate images and audio into a usable form before the language model backbone ever sees them. Gemma 4 12B removes that step. The vision encoder, typically 15 to 27 transformer layers, is replaced by a lightweight 35-million-parameter embedding module that projects raw pixel patches straight into the model's token space using a single matrix multiplication with factorised 2D positional embeddings. Audio takes a similar path: raw 16 kHz waveforms, sliced into 40-millisecond frames, are projected directly into the same dimensional space as text tokens, bypassing any separate speech recognition encoder.
The practical payoff is twofold. Latency drops, because the model can start processing inputs without waiting for encoder pipelines to finish. Fine-tuning also gets simpler, since a single LoRA pass can update vision, audio, and text weights at the same time.
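The projection step described above can be illustrated with a toy sketch. This is not Google's implementation: the patch size, embedding width, random weights, and helper names are all hypothetical, and "factorised 2D positional embeddings" is read here as separate row and column tables added together, which is one common interpretation. Only the 16 kHz sample rate and 40-millisecond frame length come from the article. Pure Python is used for readability; a real model would use a tensor library.

```python
import random

SAMPLE_RATE_HZ = 16_000  # from the article
FRAME_MS = 40            # from the article
PATCH = 4                # hypothetical patch size (pixels per side)
D_MODEL = 8              # hypothetical token embedding width


def frame_audio(samples, sample_rate=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Split a waveform into non-overlapping frames, dropping any ragged tail."""
    frame_len = sample_rate * frame_ms // 1000  # 640 samples per frame
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


def extract_patches(image, patch=PATCH):
    """Yield (row, col, flat_pixels) for each non-overlapping image patch."""
    rows, cols = len(image) // patch, len(image[0]) // patch
    for r in range(rows):
        for c in range(cols):
            flat = [image[r * patch + y][c * patch + x]
                    for y in range(patch) for x in range(patch)]
            yield r, c, flat


def project(vec, weight):
    """One matrix multiplication: vec (n,) times weight (n, d) gives (d,)."""
    return [sum(v * w_row[j] for v, w_row in zip(vec, weight))
            for j in range(len(weight[0]))]


def embed_image(image, weight, row_emb, col_emb, patch=PATCH):
    """Project each patch, then add its row and column position vectors."""
    tokens = []
    for r, c, flat in extract_patches(image, patch):
        tok = project(flat, weight)
        tokens.append([t + a + b for t, a, b in zip(tok, row_emb[r], col_emb[c])])
    return tokens


def random_matrix(n_rows, n_cols, rng):
    """Stand-in for learned weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]


rng = random.Random(0)

# A 12x8 greyscale image becomes 3 x 2 = 6 tokens of width D_MODEL.
image = [[rng.random() for _ in range(8)] for _ in range(12)]
w_img = random_matrix(PATCH * PATCH, D_MODEL, rng)
row_emb = random_matrix(3, D_MODEL, rng)  # one vector per patch row
col_emb = random_matrix(2, D_MODEL, rng)  # one vector per patch column
image_tokens = embed_image(image, w_img, row_emb, col_emb)

# One second of 16 kHz audio becomes 25 frames of 640 samples each.
audio = [0.0] * SAMPLE_RATE_HZ
frames = frame_audio(audio)
w_audio = random_matrix(len(frames[0]), D_MODEL, rng)
audio_tokens = [project(f, w_audio) for f in frames]

print(len(image_tokens), len(audio_tokens), len(audio_tokens[0]))
```

Both paths end in vectors of the same width, which is what would let image patches and audio frames sit in the same sequence as text tokens. Storing one position vector per row and one per column, rather than one per grid cell, keeps the position table small as image size grows.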
Performance and Availability
Google says the 12B model approaches the performance of the larger 26B MoE variant on standard benchmarks at less than half the memory footprint, with reported scores of 77.2% on MMLU Pro and 78.8% on GPQA Diamond. The model ships under the permissive Apache 2.0 licence, with day-one support across llama.cpp, vLLM, MLX, Ollama, LM Studio, and Unsloth.
A Local-First Push
The launch dovetails with Google's expanding local-first tooling for macOS. An open-source Electron app called Gemma Chat runs Gemma 4 models on Apple Silicon Macs through Apple's MLX framework and now supports the 12B variant. It offers a coding agent mode and a conversational mode with voice input powered by on-device speech-to-text, keeping every prompt and generated response on the user's machine.
Published June 4, 2026 at 6:08pm