Microsoft has introduced Phi-4-mini-flash-reasoning, a compact AI model optimized for fast on-device logical reasoning in low-latency environments such as mobile apps and edge deployments, delivering up to 10 times higher throughput and a two-to-three-times average reduction in latency compared with its predecessor.
The new model uses a 'decoder-hybrid-decoder' architecture called SambaY, which combines state-space models and sliding-window attention in its self-decoder and introduces a novel Gated Memory Unit (GMU) that shares representations between layers, improving decoding efficiency and long-context performance.
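The GMU is described as replacing expensive attention computation in parts of the cross-decoder with cheap element-wise gating over representations shared from an earlier layer. As a rough illustration only, the PyTorch sketch below shows one plausible form of such a unit; the class name, the two linear projections, and the SiLU gate are assumptions made for this example, not Microsoft's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMemoryUnit(nn.Module):
    """Hypothetical sketch of a GMU: gate a memory state shared from an
    earlier layer using the current layer's hidden states, replacing a
    full attention pass with element-wise operations."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)
        self.out_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # hidden: current-layer activations; memory: representations shared
        # from an earlier layer (both shaped [batch, seq_len, d_model])
        gate = F.silu(self.gate_proj(hidden))
        # Element-wise gating of the shared memory: no attention recomputation
        return self.out_proj(gate * memory)
```

Because the gating is purely element-wise, its cost grows linearly with sequence length, which is consistent with the claimed decoding-efficiency gains at long generation lengths.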
Phi-4-mini-flash-reasoning outperforms larger models on math-reasoning benchmarks such as AIME24/25 and Math500 while sustaining faster response times under the vLLM inference framework, making it suitable for real-time tutoring tools and adaptive learning apps.
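Since the latency and throughput figures were measured with vLLM, a minimal serving sketch with that framework is shown below. The Hugging Face model ID microsoft/Phi-4-mini-flash-reasoning and the sampling settings are assumptions to verify against the published model card.

```python
from vllm import LLM, SamplingParams

# Load the model from Hugging Face (assumed ID; check the model card).
llm = LLM(model="microsoft/Phi-4-mini-flash-reasoning", trust_remote_code=True)

# Reasoning models typically need generous token budgets for chains of thought.
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=2048)

prompt = "Solve step by step: what is the sum of the first 50 positive integers?"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)
```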
The model aligns with Microsoft’s responsible AI principles, incorporating safety measures such as supervised fine-tuning and reinforcement learning from human feedback. It is available through Azure AI Foundry, Hugging Face, and the NVIDIA API Catalog.
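For local experimentation, the Hugging Face checkpoint can presumably be loaded with the standard transformers API; the snippet below is a sketch under that assumption, reusing the same assumed model ID as above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/Phi-4-mini-flash-reasoning"  # assumed Hugging Face ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

# Chat-style prompting via the tokenizer's chat template.
messages = [{"role": "user", "content": "How many primes are below 20?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```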