Google releases Gemma 4 12B for local multimodal agents

Google released Gemma 4 12B on June 3, 2026. It is a 12B-parameter open model designed to run multimodal and agentic workloads locally, sitting between Google’s edge-friendly Gemma 4 E4B model and its larger 26B Mixture-of-Experts model. Google says it can run on laptops with 16GB of VRAM or unified memory, supports native audio inputs, and avoids separate multimodal encoders.

That makes the release different from Google’s recent Gemini stories. Gemini 3.5 Flash is about the economics of hosted frontier inference. Gemini Omni is about generated video inside Google products. Gemma 4 12B is about moving useful multimodal capability onto local hardware that developers can actually touch.

The local target is the news

The headline is not just model size; it is the hardware target. Google says Gemma 4 12B is small enough to run locally on consumer laptops with 16GB of VRAM or unified memory, depending on the setup. That puts it in the part of the market where developers can test an agent, voice workflow, document parser, or local assistant without starting from a cloud endpoint.

The model is available through Hugging Face and Kaggle, and Google points developers toward LM Studio, Ollama, the Google AI Edge Gallery app, LiteRT-LM CLI, Transformers, llama.cpp, MLX, SGLang, vLLM, and Unsloth. That ecosystem matters because local models are adopted through tooling, not press releases. A model that requires fragile custom setup will not travel far, even if the weights are public.

The practical read is simple: Gemma 4 12B is a testable local baseline for multimodal agents. It is not a replacement for every hosted model, but it gives teams a serious option when privacy, latency, offline use, or cost make cloud inference awkward.

Unified means no separate multimodal encoders

Google calls Gemma 4 12B a unified, encoder-free multimodal model. On the model card, the 12B Unified model has 11.95B total parameters, 48 layers, a 1024-token sliding window, and a 256K-token context length. It supports text, image, and audio inputs directly; video is handled by processing frames.

The architecture detail matters because multimodal systems often bolt separate encoders onto a language model. Google says Gemma 4 12B removes the dedicated vision and audio encoders, projecting raw image patches and audio waveforms into the language model’s embedding space through lightweight linear layers. In theory, that should reduce deployment size and multimodal latency.
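To make the idea of projecting raw patches through a linear layer concrete, here is a minimal sketch in plain Python. It is not Google's implementation: the image size, patch size, embedding width, and random weights are all hypothetical toy values, and the helper names `patchify` and `linear_project` are invented for illustration. It assumes the image dimensions are divisible by the patch size.

```python
import random

def patchify(image, patch):
    """Split a 2D grayscale image (list of rows) into flattened patch vectors."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                image[top + dy][left + dx]
                for dy in range(patch)
                for dx in range(patch)
            ])
    return patches

def linear_project(vectors, weights):
    """Multiply each input vector by an (in_dim x out_dim) weight matrix."""
    out_dim = len(weights[0])
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(out_dim)]
        for v in vectors
    ]

# Hypothetical toy sizes: a 4x4 image, 2x2 patches, 8-dimensional embeddings.
rng = random.Random(0)
image = [[rng.random() for _ in range(4)] for _ in range(4)]
weights = [[rng.gauss(0.0, 0.02) for _ in range(8)] for _ in range(4)]

patches = patchify(image, patch=2)             # 4 patches of 4 pixels each
embeddings = linear_project(patches, weights)  # 4 token-like vectors of 8 values
print(len(embeddings), len(embeddings[0]))     # prints "4 8"
```

The point of the sketch is only the shape of the computation: one matrix multiply per patch, with no separate encoder network in between.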

That is Google’s claim, and it is the right claim to test. The useful question is not whether the architecture sounds elegant. It is whether local image, audio, and video tasks become fast enough and stable enough for a real workflow on the machine you actually use.

The benchmark claim is narrower than the launch line

Google says Gemma 4 12B approaches the larger 26B MoE model on standard benchmarks while using less than half the total memory footprint. The public model card backs the direction of that claim, but the gap depends on the task. On the MRCR v2 long-context test, the 12B model trails the 26B model by 0.7 points; on MMLU Pro, it trails by 5.4 points.

Benchmark	Gemma 4 E4B	Gemma 4 12B Unified	Gemma 4 26B A4B
MMLU Pro	69.4%	77.2%	82.6%
LiveCodeBench v6	52.0%	72.0%	77.1%
MMMU Pro	52.6%	69.1%	73.8%
MRCR v2 8 needle 128K	25.4%	43.4%	44.1%

Those are Google/Hugging Face model-card numbers, not independent evals. They still show why this release is interesting: the 12B model is a large step up from E4B and close enough to the 26B MoE model on several practical benchmarks to be worth local testing.
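A quick way to read the table is to compute, per benchmark, how far the 12B model sits above E4B and below 26B. The sketch below uses only the model-card percentages quoted above; the short keys (`E4B`, `12B`, `26B`) and the `gaps` helper are naming choices made for this example.

```python
# Model-card scores quoted in the table above, in percent.
scores = {
    "MMLU Pro": {"E4B": 69.4, "12B": 77.2, "26B": 82.6},
    "LiveCodeBench v6": {"E4B": 52.0, "12B": 72.0, "26B": 77.1},
    "MMMU Pro": {"E4B": 52.6, "12B": 69.1, "26B": 73.8},
    "MRCR v2 8 needle 128K": {"E4B": 25.4, "12B": 43.4, "26B": 44.1},
}

def gaps(row):
    """Return (points gained over E4B, points behind 26B) for the 12B model."""
    return round(row["12B"] - row["E4B"], 1), round(row["26B"] - row["12B"], 1)

for name, row in scores.items():
    gain, deficit = gaps(row)
    print(f"{name}: +{gain} vs E4B, -{deficit} vs 26B")
```

These are differences in percentage points on vendor-reported numbers, so they describe the model card, not independent measurements.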

Model size: 12B (12B Unified model). Source: Hugging Face model card.

Local memory target: 16GB of VRAM or unified memory, per Google. Source: Google.

Context length: 256K, listed for Gemma 4 12B Unified. Source: Hugging Face model card.

The limits are part of the product

Gemma 4 12B accepts audio, but it does not generate audio. The model card describes Gemma 4 as handling multimodal input and generating text output. It also lists practical media limits: audio up to 30 seconds, and video up to 60 seconds when processed at one frame per second.
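In practice, those limits mean an application has to decide what to do with longer media before it reaches the model. The following sketch shows one possible pre-processing step, assuming you choose to split long audio into consecutive windows and truncate long video. That strategy is an assumption of this example, not something the model card prescribes, and the function names are hypothetical.

```python
import math

MAX_AUDIO_SECONDS = 30   # model-card audio limit quoted above
MAX_VIDEO_SECONDS = 60   # model-card video limit at one frame per second
FRAMES_PER_SECOND = 1

def audio_chunks(duration_s, limit_s=MAX_AUDIO_SECONDS):
    """Split a clip into consecutive (start, end) windows no longer than limit_s."""
    count = math.ceil(duration_s / limit_s)
    return [(i * limit_s, min((i + 1) * limit_s, duration_s)) for i in range(count)]

def frame_timestamps(duration_s, limit_s=MAX_VIDEO_SECONDS, fps=FRAMES_PER_SECOND):
    """Timestamps in seconds to sample from a video, truncated to the limit."""
    usable = min(duration_s, limit_s)
    return [i / fps for i in range(int(usable * fps))]

print(audio_chunks(75))           # [(0, 30), (30, 60), (60, 75)]
print(len(frame_timestamps(90)))  # 60
```

Hard window boundaries can cut a sentence in half, so a real pipeline may want overlapping windows or silence-aware splitting instead.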

The model card also includes the usual but important caveat for open models: it is not a knowledge base. It can produce incorrect or outdated factual statements, reflect gaps in training data, and require downstream safeguards for sensitive use cases. Local execution does not remove responsibility from the developer; it moves more of that responsibility into the application.

That matters most for agentic use. Function calling, long context, and multimodal inputs make a better local agent possible, but they also create more ways for a model to act on bad context or uncertain output. If a local agent can read a PDF, inspect an image, transcribe audio, call tools, and write code, the evaluation needs to cover the whole loop.

What teams should test first

The first test is memory, not benchmark score. Run the instruction-tuned checkpoint you plan to use, with the context length and media inputs your workflow actually needs. A 16GB headline does not mean every long-context, multimodal, high-throughput workload will feel good on every laptop.
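A back-of-the-envelope estimate helps frame that memory test. The sketch below multiplies the model card's 11.95B total parameters by bytes per parameter for a few common weight formats. It assumes uniform precision and counts weights only: KV cache, activations, media buffers, runtime overhead, and the per-block scale data that real quantized files carry are all ignored, so actual usage will be higher.

```python
TOTAL_PARAMS = 11.95e9  # total parameters listed on the model card

# Bytes per parameter for common weight formats (assumption: uniform precision,
# no quantization metadata).
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params, bytes_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return params * bytes_per_param / 1e9

for fmt, size in BYTES_PER_PARAM.items():
    gb = weight_gb(TOTAL_PARAMS, size)
    verdict = "under" if gb < 16 else "over"
    print(f"{fmt}: ~{gb:.1f} GB of weights, {verdict} a 16 GB budget before overhead")
```

Under these assumptions, 16-bit weights alone come to roughly 24 GB, 8-bit to about 12 GB, and 4-bit to about 6 GB, which is why the checkpoint format you pick matters as much as the headline figure.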

The second test is latency under the real toolchain. Compare LM Studio, Ollama, llama.cpp, MLX, Transformers, or vLLM on the same task before judging the model. Local inference performance is as much about the runtime as the weights.
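One way to keep that comparison fair is to time every runtime through the same small harness. The sketch below is runtime-agnostic: `generate` is any callable that takes a prompt and returns output, and `fake_runtime` is a hypothetical stand-in that you would replace with a call into LM Studio, Ollama, llama.cpp, or whichever stack you are testing.

```python
import statistics
import time

def time_runtime(generate, prompt, runs=5, warmup=1):
    """Return median and worst wall-clock seconds for generate(prompt)."""
    for _ in range(warmup):
        generate(prompt)  # untimed call so lazy loading does not skew results
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        samples.append(time.perf_counter() - start)
    return {"median_s": statistics.median(samples), "max_s": max(samples)}

# Hypothetical stand-in; replace with a call into the runtime under test.
def fake_runtime(prompt):
    time.sleep(0.01)
    return prompt.upper()

result = time_runtime(fake_runtime, "summarize this document")
print(result)
```

Reporting the median alongside the worst case is a deliberate choice here: a local assistant that is usually fast but occasionally stalls feels different from one that is consistently moderate.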

The third test is multimodal reliability. Give the model representative PDFs, screenshots, images, short audio clips, and frame-sampled videos. Check whether it extracts the right details, refuses when it lacks enough information, and handles follow-up questions without drifting.
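Those checks can start as a very small scoring script before growing into a full evaluation suite. In the sketch below, the refusal phrases, the `score_case` helper, and the three sample answers are all invented fixtures for illustration; real cases would pair your own files with answers you have verified by hand.

```python
# Hypothetical phrases that count as the model declining to answer.
REFUSAL_MARKERS = ("not enough information", "cannot determine")

def score_case(answer, must_contain=(), should_refuse=False):
    """Pass if every required detail appears, or if an expected refusal was given."""
    text = answer.lower()
    refused = any(marker in text for marker in REFUSAL_MARKERS)
    if should_refuse:
        return refused
    return not refused and all(item.lower() in text for item in must_contain)

# Invented example outputs standing in for real model responses.
cases = [
    {"answer": "Invoice 1042 totals $318.50.", "must_contain": ["1042", "318.50"]},
    {"answer": "There is not enough information in the clip.", "should_refuse": True},
    {"answer": "The total is $400.", "must_contain": ["318.50"]},
]

results = [
    score_case(c["answer"], c.get("must_contain", ()), c.get("should_refuse", False))
    for c in cases
]
print(results)  # [True, True, False]
```

Substring matching is crude and will miss paraphrases, so treat a script like this as a smoke test that flags cases for human review.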

Gemma 4 12B is not the new best model in every setting. It is a new local baseline worth adding to the shortlist. For teams choosing between hosted and local systems, that is enough to matter.
