Teaching Gemma-3 to Reason via GRPO

Three-stage SFT plus GRPO pipeline for reasoning improvement.

Built an end-to-end fine-tuning pipeline that improves Gemma-3's reasoning quality by combining supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO).

  • Improved answer accuracy by 12% and reasoning quality by 47%.
  • Designed a custom weighted reward function that scores logical coherence, mathematical precision, and strict compliance with the XML output format.
  • Structured training as a staged workflow for more stable optimization.
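
A minimal sketch of how a weighted reward of this kind might be assembled and turned into GRPO-style group-relative advantages. The weights, the <reasoning>/<answer> tag names, the line-count coherence proxy, and every function name here are illustrative assumptions, not the project's actual values or code.

```python
import re
import statistics

# Hypothetical weights; the project's real values are not stated.
WEIGHTS = {"coherence": 0.3, "math": 0.5, "xml": 0.2}

# Assumed output format: <reasoning>...</reasoning> followed by <answer>...</answer>.
XML_PATTERN = re.compile(
    r"\s*<reasoning>(.+?)</reasoning>\s*<answer>(.+?)</answer>\s*",
    re.DOTALL,
)


def xml_score(completion: str) -> float:
    """1.0 only if the whole completion matches the assumed XML layout."""
    return 1.0 if XML_PATTERN.fullmatch(completion) else 0.0


def math_score(completion: str, target: float) -> float:
    """1.0 if the <answer> body parses as a number equal to the target."""
    match = XML_PATTERN.fullmatch(completion)
    if match is None:
        return 0.0
    try:
        value = float(match.group(2).strip())
    except ValueError:
        return 0.0
    return 1.0 if abs(value - target) < 1e-6 else 0.0


def coherence_score(completion: str) -> float:
    """Crude stand-in for coherence: credit for up to three non-empty
    reasoning lines. A real scorer would need something much stronger."""
    match = XML_PATTERN.fullmatch(completion)
    if match is None:
        return 0.0
    steps = [line for line in match.group(1).splitlines() if line.strip()]
    return min(len(steps), 3) / 3


def weighted_reward(completion: str, target: float) -> float:
    """Weighted sum of the three component scores, in the range [0, 1]."""
    return (
        WEIGHTS["coherence"] * coherence_score(completion)
        + WEIGHTS["math"] * math_score(completion, target)
        + WEIGHTS["xml"] * xml_score(completion)
    )


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of completions sampled for the
    same prompt: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


good = (
    "<reasoning>\n2 + 2 = 4\n4 * 3 = 12\nSo the result is 12\n</reasoning>\n"
    "<answer>12</answer>"
)
bad = "The answer is 12"
rewards = [weighted_reward(c, target=12.0) for c in (good, bad)]
advantages = group_advantages(rewards)
```

Because the math and coherence scores reuse the same strict pattern, a completion that ignores the XML format earns nothing from any component, which is one simple way to make format compliance a hard requirement rather than a small bonus.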