How the community trained Gemma to "Think" with Tunix and TPUs

Google Developers Blog - AI

May 28, 2026

Cheap Lexical Signals and Reward Engineering Can Replace Expensive LLM Judges for Guiding Reasoning

Science, Technology & Innovation · May 28, 2026

A third-place system showed that cheap TF-IDF lexical rewards, integrated into Tunix, can train a 2B model on an IDEA-E stepwise reasoning scaffold. It used GRPO with a CPU-based, non-blocking relevance reward plus curriculum guidance to suppress verbose, off-topic traces, and it achieved top-tier results without an expensive LLM judge in the loop.
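
To make the idea of a lexical relevance reward concrete, here is a minimal sketch in plain Python. It is not the entrant's code and it does not use any Tunix API: the function names, the smoothed IDF formula, the cosine-similarity scoring, and the thread-pool dispatch are all assumptions chosen for illustration. The reward is the TF-IDF cosine similarity between the prompt and the reasoning trace, so a trace that drifts off topic scores lower.

```python
import math
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical example texts, used only to exercise the sketch.
PROMPT = "What is the boiling point of water at sea level?"
ON_TOPIC = "Water boils at 100 degrees Celsius at sea level, so the boiling point is 100."
OFF_TOPIC = "I enjoy long walks and my favorite color is blue."


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(corpus):
    """Return (idf table, idf for unseen terms) using smoothed IDF."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    unseen_idf = math.log(1 + n_docs) + 1.0  # same formula with df = 0
    return idf, unseen_idf


def tfidf_vector(text, idf, unseen_idf):
    """Sparse TF-IDF vector as a dict from term to weight."""
    counts = Counter(tokenize(text))
    return {t: c * idf.get(t, unseen_idf) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relevance_reward(prompt, trace, idf, unseen_idf):
    """Lexical relevance of a reasoning trace to its prompt, in [0, 1]."""
    return cosine(
        tfidf_vector(prompt, idf, unseen_idf),
        tfidf_vector(trace, idf, unseen_idf),
    )


def score_batch_async(executor, prompt, traces, idf, unseen_idf):
    """Submit reward jobs to a CPU worker pool and return futures immediately."""
    return [executor.submit(relevance_reward, prompt, t, idf, unseen_idf) for t in traces]


if __name__ == "__main__":
    idf, unseen_idf = build_idf([PROMPT, ON_TOPIC, OFF_TOPIC])
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = score_batch_async(pool, PROMPT, [ON_TOPIC, OFF_TOPIC], idf, unseen_idf)
        # The caller is free to do other work here before collecting results.
        print([round(f.result(), 3) for f in futures])
```

Returning futures rather than scores is one simple way to keep reward computation from blocking the caller; a real training loop would decide where to collect the results.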


Post-Training Reasoning Capabilities Demonstrated On Low Compute Budgets By Community Efforts

Community post-training on very small budgets can yield general-purpose reasoning models. In a Google Kaggle hackathon, participants used Tunix to post-train Gemma-2-2B or Gemma-3-1B under a cap of 9 hours on a single TPU v5e-8. The event drew more than 11,000 entrants and 300+ strong submissions, showing that reasoning can be achieved through efficient post-training rather than frontier-scale compute.


Small-Model Reasoning Benefits From Pipeline Decomposition And Tooling More Than From Increased Parameter Count

A staged alignment pipeline converted a compact 1B model into a structured reasoning engine by optimizing three things separately: reasoning content (SFT chain-of-thought distillation), output structure (SimPO enforcing XML-style formatting), and hallucination control (GRPO judged by Gemini 2.0 Flash). Training-system throughput improvements (a custom SimPO loss and an async reward engine) also contributed to performance. The result suggests that small-model reasoning can depend more on modular pipelines and tooling than on raw parameter count.
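
For readers unfamiliar with SimPO, the sketch below computes the standard published SimPO objective for a single preference pair: the length-normalized log-probability of each response acts as an implicit reward, and the loss is a logistic loss on the scaled reward gap minus a target margin. This is not the entrant's custom loss, whose details the summary does not give. The function names and the default `beta` and `gamma` values are illustrative assumptions.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def avg_logprob(token_logprobs):
    """Length-normalized sequence log-probability (mean per-token log-prob)."""
    if not token_logprobs:
        raise ValueError("a response needs at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def simpo_loss(chosen_logprobs, rejected_logprobs, beta=2.0, gamma=0.5):
    """SimPO loss for one (chosen, rejected) pair of responses.

    chosen_logprobs / rejected_logprobs: per-token log-probs under the policy.
    beta scales the implicit reward; gamma is the target reward margin.
    Both defaults are illustrative, not values taken from the article.
    """
    reward_gap = beta * (avg_logprob(chosen_logprobs) - avg_logprob(rejected_logprobs))
    return -log_sigmoid(reward_gap - gamma)


if __name__ == "__main__":
    # A well-formatted response the policy already prefers yields a small loss.
    print(simpo_loss([-0.1, -0.2], [-2.0, -3.0]))
    # The reverse preference yields a large loss.
    print(simpo_loss([-2.0, -3.0], [-0.1, -0.2]))
```

Because the rewards are averaged per token, a longer response is not favored merely for its length, and no separate reference model is needed to compute the loss.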


Post-Training Structured Reasoning Traces Are Transferable Across Regulated Domains Enabling Small Models To Improve Interpretability And Solve Complex Tasks Across Medicine, Chemistry, Law, And Robotics

The article argues that reinforcing structured, step-by-step reasoning traces through post-training (e.g., GRPO) is an intervention that transfers across regulated verticals (medicine, chemistry, law, robotics), improving interpretability, multi-step problem solving, and logical consistency. It implies that small post-trained reasoning models could become commercially viable vertical products where traceability matters, as public recipes reduce development costs.


Reward Instrumentation And Judge Architecture Improve Open-Ended Task Reasoning In Reinforcement Learning

The top method used dense reward shaping on intermediate reasoning steps. G-RaR trained Gemma to emit <reasoning> tags, and a Gemma-3-12B judge turned rubric evaluations into continuous, normalized rewards; GRPO then combined format-compliance, exact-answer, and rubric scores. The approach yielded superior results on open-ended, non-verifiable tasks and efficient single-device TPU v5e-8 mesh execution.
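
One way such a combined reward could be wired up is sketched below. Everything specific here is an assumption for illustration: the `<answer>` tag, the weights, the helper names, and the judge's rubric score being passed in as a plain float in [0, 1]. The summary does not give the winning entry's actual formula. The second function shows the group-relative normalization that GRPO applies to the rewards of completions sampled for the same prompt.

```python
import re
import statistics

# Assumed output layout: <reasoning>...</reasoning> followed by <answer>...</answer>.
# Only the <reasoning> tag is named in the summary; <answer> is hypothetical.
TAGGED_RE = re.compile(
    r"<reasoning>(.+?)</reasoning>\s*<answer>(.+?)</answer>", re.DOTALL
)


def parse_completion(completion):
    """Return (reasoning, answer) if the completion follows the format, else None."""
    match = TAGGED_RE.search(completion)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def composite_reward(completion, reference_answer, rubric_score,
                     weights=(0.2, 0.3, 0.5)):
    """Weighted sum of format, exact-answer, and rubric scores.

    rubric_score stands in for a judge model's normalized rubric evaluation.
    reference_answer may be None for open-ended prompts with no exact answer.
    The weights are illustrative, not taken from the article.
    """
    parsed = parse_completion(completion)
    format_score = 1.0 if parsed is not None else 0.0
    exact_score = 0.0
    if parsed is not None and reference_answer is not None:
        if parsed[1].lower() == reference_answer.strip().lower():
            exact_score = 1.0
    rubric = min(max(rubric_score, 0.0), 1.0)  # clamp to [0, 1]
    w_format, w_exact, w_rubric = weights
    return w_format * format_score + w_exact * exact_score + w_rubric * rubric


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within one prompt's group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        ("<reasoning>2 + 2 = 4</reasoning><answer>4</answer>", 0.9),
        ("<reasoning>guessing</reasoning><answer>5</answer>", 0.3),
        ("4", 0.1),  # no tags, so it earns no format or exact-answer credit
    ]
    rewards = [composite_reward(text, "4", rubric) for text, rubric in group]
    print(rewards)
    print(group_relative_advantages(rewards))
```

A continuous rubric term gives the policy a gradient signal even when no completion in the group is exactly right, which is the situation on open-ended, non-verifiable tasks.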