Teaching Gemma-3 to Reason via GRPO

Built an end-to-end fine-tuning pipeline to improve the reasoning quality of Gemma-3 using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO).

Improved answer accuracy by 12% and reasoning quality by 47%.
Designed a custom weighted reward function targeting logical coherence, mathematical precision, and strict XML compliance.
Structured training as a staged workflow, with supervised fine-tuning preceding GRPO, for more stable optimization.
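
A weighted reward of the kind described above might look roughly like the sketch below. It is illustrative only: the tag names (`<reasoning>`, `<answer>`), the weights, and the step-counting coherence proxy are all assumptions, not the values used in the project. The last function shows the group-relative advantage normalization that GRPO applies to the rewards of completions sampled for the same prompt.

```python
import re
import statistics
from typing import Dict, List

# Hypothetical output format: the actual tag names are not given in the summary.
XML_PATTERN = re.compile(
    r"\A\s*<reasoning>(?P<reasoning>.+?)</reasoning>\s*"
    r"<answer>(?P<answer>.+?)</answer>\s*\Z",
    re.DOTALL,
)

# Hypothetical weights, chosen only for illustration.
WEIGHTS: Dict[str, float] = {"format": 0.2, "math": 0.5, "coherence": 0.3}


def format_score(completion: str) -> float:
    """1.0 if the whole completion matches the expected XML layout, else 0.0."""
    return 1.0 if XML_PATTERN.match(completion) else 0.0


def math_score(completion: str, reference: str) -> float:
    """1.0 if the extracted answer equals the reference (numerically if possible)."""
    match = XML_PATTERN.match(completion)
    if match is None:
        return 0.0
    answer = match["answer"].strip()
    try:
        return 1.0 if abs(float(answer) - float(reference)) < 1e-6 else 0.0
    except ValueError:
        # Non-numeric answers fall back to exact string comparison.
        return 1.0 if answer == reference.strip() else 0.0


def coherence_score(completion: str) -> float:
    """Crude placeholder: counts non-empty reasoning lines, saturating at 3.

    A real pipeline would likely use a stronger signal (for example a judge model).
    """
    match = XML_PATTERN.match(completion)
    if match is None:
        return 0.0
    steps = [line for line in match["reasoning"].splitlines() if line.strip()]
    return min(len(steps), 3) / 3


def weighted_reward(completion: str, reference: str) -> float:
    """Weighted sum of the three component scores, in [0, 1]."""
    scores = {
        "format": format_score(completion),
        "math": math_score(completion, reference),
        "coherence": coherence_score(completion),
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps).

    Population standard deviation is used here; implementations differ on this detail.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

In use, each prompt would be sampled several times, every completion scored with `weighted_reward`, and the resulting list passed to `group_advantages`, so that completions are rewarded relative to their siblings rather than on an absolute scale.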