Three abstract model blocks run through benchmark lanes on a smartphone circuit board with mobile chips around the device
MLPerf Mobile v6.0 gives on-device LLMs a real test surface

MLCommons added standardized Android LLM tests to MLPerf Mobile v6.0, covering Llama 3.2 1B Instruct, Llama 3.2 3B Instruct, and Llama 3.1 8B Instruct workloads.

about 4 hours ago

MLCommons has released MLPerf Mobile v6.0 with new generative AI benchmarks for on-device large language models. The update adds standardized Android tests for Llama 3.2 1B Instruct, Llama 3.2 3B Instruct, and Llama 3.1 8B Instruct.

That is a useful step for phone AI because the market has been heavy on demos and light on comparable tests. Vendors can show a model summarizing, translating, or answering a prompt on a device, but buyers and developers still need a way to compare latency, accuracy, memory limits, and acceleration across chips and phones.

MLPerf Mobile v6.0 does not solve every evaluation problem. It does create a more concrete surface for testing small and mid-sized language models locally, inside an app that already includes other mobile AI benchmarks.

The benchmark now includes LLM tasks

MLCommons says the new LLM tests use requests selected from TinyMMLU and IFEval to quantify performance and accuracy for on-device AI inference. TinyMMLU gives the benchmark a small knowledge-and-reasoning task set. IFEval tests whether a model follows explicit instructions.
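To make the instruction-following idea concrete, here is a minimal Python sketch of the kind of programmatically checkable constraint that IFEval-style prompts rely on, such as a word limit or a required keyword. The function names, the two example constraints, and the sample response are hypothetical illustrations. They are not taken from IFEval's code or from the MLPerf Mobile app.

```python
import re

def check_max_words(response: str, limit: int) -> bool:
    """True if the response has at most `limit` whitespace-separated words."""
    return len(response.split()) <= limit

def check_keyword_count(response: str, keyword: str, minimum: int) -> bool:
    """True if `keyword` appears as a whole word at least `minimum` times (case-insensitive)."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= minimum

def follows_all(response: str, checks) -> bool:
    """A response passes only if every attached constraint is satisfied."""
    return all(check(response) for check in checks)

# Hypothetical prompt: "Answer in at most 12 words and mention 'battery' twice."
response = "Battery life improves when the battery charges slowly."
checks = [
    lambda r: check_max_words(r, 12),
    lambda r: check_keyword_count(r, "battery", 2),
]
print(follows_all(response, checks))
```

The appeal of this style of test is that a script can score it without a human or a judge model, which suits a benchmark that runs unattended on a phone.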

Those tasks are not the whole phone-assistant workload, but they are a better start than measuring only raw token speed. A local model that produces tokens quickly but fails basic instruction-following will not be useful for device workflows. A model that is accurate but too slow or memory-heavy may also fail in practice.

The model choices are practical. Llama 3.2 1B and 3B Instruct fit the small-model tier where on-device execution is realistic for more phones. Llama 3.1 8B Instruct is heavier and creates a test for higher-end devices and acceleration paths.

NPU acceleration is part of the test

MLCommons says the LLM tests can run on the CPU, without tailored acceleration, on any device with sufficient memory. It also says v6.0 supports NPU-accelerated execution of Llama 3.1 8B Instruct on Qualcomm Snapdragon 8 Elite Gen 5 systems-on-chip. The working group plans to expand LLM acceleration support to more devices and platforms.

That distinction matters. On-device AI is not only a model-size story. It is a hardware-software integration story: memory, thermal limits, neural processing units, quantization, runtime support, and battery behavior. A benchmark that can expose whether an NPU path is available and useful gives developers better evidence than a vendor slide.
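Memory is often the first of those constraints to bite. As a rough back-of-envelope sketch, weight storage is about parameter count times bits per weight, divided by eight. The parameter counts below are the nominal sizes implied by the model names, and the bit widths are illustrative assumptions rather than the precisions MLPerf Mobile uses. KV cache, activations, and runtime overhead are ignored.

```python
# Nominal parameter counts taken from the model names (assumption: real counts differ slightly).
NOMINAL_PARAMS = {
    "Llama 3.2 1B Instruct": 1e9,
    "Llama 3.2 3B Instruct": 3e9,
    "Llama 3.1 8B Instruct": 8e9,
}

def weight_gib(params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB: params * bits / 8 bytes, converted to GiB."""
    return params * bits_per_weight / 8 / 2**30

for name, params in NOMINAL_PARAMS.items():
    # Illustrative bit widths only: 16-bit, 8-bit, and 4-bit weights.
    row = ", ".join(f"{bits}-bit: {weight_gib(params, bits):.2f} GiB" for bits in (16, 8, 4))
    print(f"{name}: {row}")
```

Under these assumptions, the 8B model at 4 bits per weight needs roughly 3.7 GiB for weights alone, which helps explain why the article treats it as a test for higher-end devices.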

The release also adds support for devices based on the MediaTek Dimensity 9500 Series and updates support for Qualcomm Snapdragon 8 Elite Gen 5 and Samsung Exynos 2600.

Phone AI needs comparable baselines

The industry is moving more AI work onto devices for privacy, latency, cost, and offline reliability. That does not mean every model should run locally. It means developers need to decide which tasks are worth keeping on device and which should still go to cloud models.

A standardized mobile benchmark helps with that routing. If a phone can handle basic instruction-following quickly and privately, a note summarizer, local classifier, keyboard helper, or simple assistant action may not need a remote model. If the device struggles, the app can reserve local inference for lower-risk tasks or use the cloud when the user allows it.
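That routing decision can be sketched in a few lines of Python. Everything here is hypothetical: the `DeviceProfile` and `Task` types, the threshold values, and the idea of feeding benchmark-style speed and accuracy figures into an app are assumptions for illustration, not part of MLPerf Mobile.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeviceProfile:
    tokens_per_second: float      # measured local generation speed (hypothetical input)
    instruction_accuracy: float   # fraction of instruction-following checks passed, 0..1

@dataclass(frozen=True)
class Task:
    name: str
    min_tokens_per_second: float  # slowest speed the feature can tolerate
    min_accuracy: float           # lowest accuracy the feature can tolerate

def route(task: Task, device: DeviceProfile, cloud_allowed: bool) -> str:
    """Prefer local inference; fall back to cloud only with user consent."""
    local_ok = (
        device.tokens_per_second >= task.min_tokens_per_second
        and device.instruction_accuracy >= task.min_accuracy
    )
    if local_ok:
        return "local"
    return "cloud" if cloud_allowed else "unavailable"

# Made-up numbers, not benchmark results.
phone = DeviceProfile(tokens_per_second=12.0, instruction_accuracy=0.70)
summarizer = Task("note summarizer", min_tokens_per_second=8.0, min_accuracy=0.60)
assistant = Task("assistant action", min_tokens_per_second=8.0, min_accuracy=0.85)

print(route(summarizer, phone, cloud_allowed=False))
print(route(assistant, phone, cloud_allowed=True))
print(route(assistant, phone, cloud_allowed=False))
```

The design choice worth noting is that the cloud path is gated on user permission rather than used as a silent default, which matches the privacy motivation for running locally in the first place.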

The benchmark is also useful for procurement. Enterprises that issue phones or build mobile apps need evidence beyond “AI-ready.” They need to know which devices can run which models, at what quality, and with what acceleration support.

Do not overread it

The release is about benchmark coverage, not a claim that phone LLMs have reached frontier quality. TinyMMLU and IFEval are useful signals, but real assistants also need tool use, retrieval, privacy controls, multilingual behavior, safety handling, personalization, and graceful fallback.

The best use of MLPerf Mobile v6.0 is comparative: which phone, chip, runtime, and model combination handles the standardized tasks better? It should not be the only test an app team runs before shipping local AI features.
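One way to read such a comparison, once results exist, is to keep only the combinations that no other combination beats on both speed and accuracy. The Python sketch below does that with a simple Pareto filter. The `Result` type, the combination labels, and every number are placeholders invented for illustration. They are not MLPerf Mobile results.

```python
from typing import NamedTuple

class Result(NamedTuple):
    combo: str                # hypothetical "phone / model" label
    tokens_per_second: float  # higher is better
    accuracy: float           # higher is better

def dominates(a: Result, b: Result) -> bool:
    """a dominates b if it is at least as good on both axes and strictly better on one."""
    at_least_as_good = a.tokens_per_second >= b.tokens_per_second and a.accuracy >= b.accuracy
    strictly_better = a.tokens_per_second > b.tokens_per_second or a.accuracy > b.accuracy
    return at_least_as_good and strictly_better

def pareto_front(results: list[Result]) -> list[Result]:
    """Keep results that no other result dominates, preserving input order."""
    return [r for r in results if not any(dominates(other, r) for other in results)]

# Placeholder values only.
results = [
    Result("phone-A / 1B", 40.0, 0.55),
    Result("phone-A / 3B", 18.0, 0.68),
    Result("phone-B / 3B", 15.0, 0.66),
    Result("phone-B / 8B", 6.0, 0.74),
]

for r in pareto_front(results):
    print(r.combo, r.tokens_per_second, r.accuracy)
```

In this made-up data, "phone-B / 3B" drops out because "phone-A / 3B" is both faster and more accurate. The remaining entries are genuine trade-offs, and choosing among them still depends on the feature being built.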

The next useful data will be public result tables and independent tests across commercial devices. Once those appear, the conversation can move from “this phone has an NPU” to “this phone can run this model at this speed and accuracy under this benchmark.”

For readers tracking model capability and deployment trade-offs, see our AI model leaderboard and AI company tracker.

The AI Feed Desk

Editorial desk

The AI Feed Desk tracks AI provider updates, model releases, agent tooling, and enterprise adoption, turning fast-moving announcements into source-linked context for builders and operators.
