MLPerf Mobile v6.0 gives on-device LLMs a real test surface

MLCommons has released MLPerf Mobile v6.0 with new generative AI benchmarks for on-device large language models. The update adds standardized Android tests for Llama 3.2 1B Instruct, Llama 3.2 3B Instruct, and Llama 3.1 8B Instruct.

That is a useful step for phone AI because the market has been heavy on demos and light on comparable tests. Vendors can show a model summarizing, translating, or answering a prompt on a device, but buyers and developers still need a way to compare latency, accuracy, memory limits, and acceleration across chips and phones.

MLPerf Mobile v6.0 does not solve every evaluation problem. It does create a more concrete surface for testing small and mid-sized language models locally, inside an app that already includes other mobile AI benchmarks.

The benchmark now includes LLM tasks

MLCommons says the new LLM tests use requests selected from TinyMMLU and IFEval to quantify performance and accuracy for on-device AI inference. TinyMMLU gives the benchmark a small knowledge-and-reasoning task set. IFEval tests whether a model follows explicit instructions.

Those tasks are not the whole phone-assistant workload, but they are a better start than measuring only raw token speed. A local model that produces tokens quickly but fails basic instruction-following will not be useful for device workflows. A model that is accurate but too slow or memory-heavy may also fail in practice.

The model choices are practical. Llama 3.2 1B and 3B Instruct fit the small-model tier where on-device execution is realistic for more phones. Llama 3.1 8B Instruct is heavier and creates a test for higher-end devices and acceleration paths.

NPU acceleration is part of the test

MLCommons says the LLM tests can run on devices with sufficient memory using the CPU without tailored acceleration. It also says v6.0 supports NPU-accelerated execution of Llama 3.1 8B Instruct on Qualcomm Snapdragon 8 Elite Gen 5 systems-on-chip (SoCs). The working group plans to expand LLM acceleration support to more devices and platforms.

The split between the generic CPU path and the NPU-accelerated path matters. On-device AI is not only a model-size story. It is a hardware-software integration story: memory, thermal limits, neural processing units, quantization, runtime support, and battery behavior. A benchmark that can expose whether an NPU path is available and useful gives developers better evidence than a vendor slide.

The release also adds support for devices based on the MediaTek Dimensity 9500 Series and updates support for Qualcomm Snapdragon 8 Elite Gen 5 and Samsung Exynos 2600.

Phone AI needs comparable baselines

The industry is moving more AI work onto devices for privacy, latency, cost, and offline reliability. That does not mean every model should run locally. It means developers need to decide which tasks are worth keeping on device and which should still go to cloud models.

A standardized mobile benchmark helps with that routing. If a phone can handle basic instruction-following quickly and privately, a note summarizer, local classifier, keyboard helper, or simple assistant action may not need a remote model. If the device struggles, the app can reserve local inference for lower-risk tasks or use the cloud when the user allows it.
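The routing decision described above can be sketched in a few lines of Python. This is a minimal illustration, not anything MLPerf Mobile ships: the `DeviceProfile` fields, the `choose_route` function, and the default thresholds are all hypothetical, and a real app would set them from its own measurements and risk policy.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LOCAL = "local"
    CLOUD = "cloud"
    DECLINE = "decline"


@dataclass(frozen=True)
class DeviceProfile:
    # Hypothetical per-device numbers an app team might record from benchmark runs.
    instruction_accuracy: float  # share of instruction-following prompts passed, 0..1
    tokens_per_second: float     # sustained generation speed on this device


def choose_route(profile: DeviceProfile, high_risk: bool, cloud_allowed: bool,
                 min_accuracy: float = 0.7,
                 min_tokens_per_second: float = 10.0) -> Route:
    """Pick where a task runs. The thresholds are illustrative assumptions."""
    capable = (profile.instruction_accuracy >= min_accuracy
               and profile.tokens_per_second >= min_tokens_per_second)
    if capable:
        # The device clears both bars, so the task can stay on the phone.
        return Route.LOCAL
    if not high_risk:
        # A struggling device still keeps lower-risk tasks local.
        return Route.LOCAL
    # High-risk task on a struggling device: use the cloud only with consent.
    return Route.CLOUD if cloud_allowed else Route.DECLINE
```

A note summarizer on a capable phone would stay local under this policy, while a high-risk action on a weaker phone would go to the cloud only if the user has opted in.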

The benchmark is also useful for procurement. Enterprises that issue phones or build mobile apps need evidence beyond “AI-ready.” They need to know which devices can run which models, at what quality, and with what acceleration support.

Do not overread it

The release is about benchmark coverage, not a claim that phone LLMs have reached frontier quality. TinyMMLU and IFEval are useful signals, but real assistants also need tool use, retrieval, privacy controls, multilingual behavior, safety handling, personalization, and graceful fallback.

The best use of MLPerf Mobile v6.0 is comparative: which phone, chip, runtime, and model combination handles the standardized tasks better? It should not be the only test an app team runs before shipping local AI features.
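One way to read that comparison, once results exist, is to filter combinations by an accuracy floor and then rank the survivors by speed. The sketch below assumes a hypothetical `Result` record and a `rank_results` helper; neither is the MLPerf Mobile result format, and the field names are invented for illustration.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    # Hypothetical record; not the MLPerf Mobile result schema.
    device: str
    model: str
    accuracy: float            # score on the standardized tasks, 0..1
    tokens_per_second: float   # generation speed on the same run


def rank_results(results: List[Result], min_accuracy: float) -> List[Result]:
    """Drop combinations below the accuracy floor, then sort fastest first."""
    eligible = [r for r in results if r.accuracy >= min_accuracy]
    return sorted(eligible, key=lambda r: r.tokens_per_second, reverse=True)
```

Filtering on accuracy before sorting on speed keeps a fast but unreliable combination from topping the list, which is the failure mode raw token-speed comparisons invite.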

The next useful data will be public result tables and independent tests across commercial devices. Once those appear, the conversation can move from “this phone has an NPU” to “this phone can run this model at this speed and accuracy under this benchmark.”

Sources

MLCommons: MLPerf Mobile v6.0 with new generative AI benchmarks