Microsoft's Applied Sciences team has created an on-device small language model (SLM), dubbed Phi Silica, designed to improve battery life, inference speed, and memory efficiency on Windows 11 Copilot+ PCs equipped with a Neural Processing Unit (NPU) capable of 40 trillion operations per second. Phi Silica powers several features of the latest Copilot+ PCs and is also exposed to developers as a turnkey, pre-optimized SLM. Phi Silica's context processing consumes only 4.8 mWh of energy on the NPU, a 56% improvement in power consumption compared to running on the CPU.
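As a quick sanity check on the figures above, a 56% reduction against the CPU implies the same workload would cost roughly 10.9 mWh there (illustrative arithmetic only; the CPU figure is derived, not reported):

```python
# Back out the implied CPU energy from the reported NPU numbers:
# a 56% reduction means npu_mwh = cpu_mwh * (1 - 0.56).
npu_mwh = 4.8
reduction = 0.56
implied_cpu_mwh = npu_mwh / (1 - reduction)  # ≈ 10.9 mWh
```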
Phi Silica is based on a Cyber-EO-compliant derivative of Phi-3.5-mini, developed specifically for Windows 11. The model has a 4k context length, supports multiple languages, including Tier 1 languages, and includes key improvements necessary for in-product experiences. A language model consists of several components, such as a tokenizer, a detokenizer, an embedding model, transformer blocks, and a language model head.
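The components listed above form a simple pipeline: text is tokenized into ids, embedded into vectors, contextualized by transformer blocks, and projected back to vocabulary logits by the head. A toy sketch (all names and sizes are illustrative, not Phi Silica's actual architecture or API):

```python
import numpy as np

class TinyLanguageModel:
    """Toy model illustrating the standard LM component pipeline."""

    def __init__(self, vocab_size=32000, d_model=64, seed=0):
        rng = np.random.default_rng(seed)
        self.embedding = rng.standard_normal((vocab_size, d_model)) * 0.02  # embedding model
        self.w_head = rng.standard_normal((d_model, vocab_size)) * 0.02     # language model head

    def transformer_block(self, x):
        # Stand-in for attention + MLP; a residual identity keeps the sketch short.
        return x

    def forward(self, token_ids):
        x = self.embedding[token_ids]   # ids -> vectors
        x = self.transformer_block(x)   # contextualize
        return x @ self.w_head          # vectors -> vocabulary logits

# Tokenizer / detokenizer: a toy byte-level mapping between text and ids.
def tokenize(text):
    return list(text.encode("utf-8"))

def detokenize(ids):
    return bytes(ids).decode("utf-8")
```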
Microsoft and academic researchers created QuaRot to enable true low-precision inference by quantizing both weights and activations down to 4 bits. The result is a model that delivers comparable quality with the bulk of compute offloaded to the NPU. Key techniques Microsoft employed to optimize the model's memory efficiency include weight sharing, memory-mapped embeddings, and disabling the arena allocator.
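QuaRot's core idea is to apply orthogonal rotations to weights and activations before quantizing, which spreads outliers and makes 4-bit quantization much less lossy; because the rotation is orthogonal, it cancels exactly in the matrix product. A minimal sketch of that idea, assuming symmetric per-tensor int4 quantization and a random orthogonal rotation (QuaRot itself uses Hadamard-based rotations and a more elaborate scheme):

```python
import numpy as np

def random_rotation(n, seed=0):
    # QR decomposition of a Gaussian matrix yields a random orthogonal matrix.
    q, _ = np.linalg.qr(np.random.default_rng(seed).standard_normal((n, n)))
    return q

def quantize_int4(x):
    # Symmetric per-tensor quantization to the signed 4-bit range [-8, 7].
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -8, 7)
    return q, scale

def dequantize(q, scale):
    return q * scale

n = 8
R = random_rotation(n)
W = np.random.default_rng(1).standard_normal((n, n))  # toy weight matrix
x = np.random.default_rng(2).standard_normal(n)       # toy activation vector

# Rotate before quantizing; since R is orthogonal, (W @ R.T) @ (R @ x) == W @ x,
# so the rotation changes what gets quantized without changing the math.
qW, sW = quantize_int4(W @ R.T)
qx, sx = quantize_int4(R @ x)
y_approx = dequantize(qW, sW) @ dequantize(qx, sx)  # low-precision estimate of W @ x
```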
Expanding the context length beyond the sequence length of the context processor was crucial for enabling real-world applications. To expand the context length and enable streaming prompt processing, Microsoft employed a sliding-window approach, processing the prompt in smaller chunks of size N, with padding applied to the last chunk if necessary. Microsoft also developed a dynamic, shared KV cache for context processing, which improves memory efficiency.
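The chunking step described above can be sketched as follows; the chunk size, pad token, and the flat list standing in for per-layer key/value tensors are all illustrative assumptions, not Microsoft's implementation:

```python
def chunked_prefill(token_ids, n, pad_id=0):
    """Process a prompt in fixed-size chunks, padding the last chunk."""
    chunks = [token_ids[i:i + n] for i in range(0, len(token_ids), n)]
    if chunks and len(chunks[-1]) < n:
        chunks[-1] = chunks[-1] + [pad_id] * (n - len(chunks[-1]))

    kv_cache = []  # stand-in for per-layer key/value tensors
    for chunk in chunks:
        # A real model would attend over kv_cache + chunk here, then append
        # the chunk's keys/values so later chunks see the earlier context.
        kv_cache.extend(chunk)
    return chunks, kv_cache

chunks, cache = chunked_prefill(list(range(10)), n=4)
```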
The resulting Phi Silica model features improved first-token latency for shorter prompts and better memory efficiency, while retaining most of the power efficiency afforded by largely NPU-based operation. The floating-point model from which Phi Silica is derived has undergone safety alignment using a five-stage 'break-fix' methodology, and the Phi Silica model, the system design, and the API undergo responsible AI impact assessments and deployment safety board reviews.
Microsoft's Phi Silica pushes the boundaries of what's possible with today's NPUs in a rapidly evolving, complex technical landscape. By advancing quantization research, Microsoft achieved remarkable gains in three critical areas: memory efficiency, power efficiency, and inference latency, without compromising quality or functionality.