Phi Silica, small but mighty on-device SLM

  • Microsoft's Applied Sciences team has created an on-device small language model (SLM), dubbed Phi Silica, designed to improve battery life, inference speed, and memory efficiency on Windows 11 Copilot+ PCs, which use a Neural Processing Unit (NPU) capable of 40 trillion operations per second. Phi Silica powers several features of the latest Copilot+ PCs and is also available to developers as a turnkey, pre-optimized SLM. Phi Silica's context processing consumes only 4.8 mWh of energy on the NPU, a 56% improvement in power consumption compared to running the same workload on the CPU.
  • Phi Silica is based on a Cyber-EO-compliant derivative of Phi-3.5-mini, developed specifically for Windows 11. The model has a 4k context length, supports multiple languages including Tier 1 languages, and includes key improvements necessary for in-product experiences. Like any language model, it consists of several components: a tokenizer, a detokenizer, an embedding model, a transformer block, and a language model head (a toy sketch of this pipeline appears after this list).
  • Microsoft and academic researchers created QuaRot to enable true low-precision inference by quantizing both weights and activations down to 4 bits. The result is a model that retains quality with the bulk of the compute offloaded to the NPU (a minimal 4-bit quantization example follows this list). Key techniques Microsoft employed to optimize the model's memory efficiency include weight sharing, memory-mapped embeddings, and disabling the arena allocator.
  • Expanding the context length beyond the sequence length of the context processor was crucial for enabling real-world applications. To expand the context length and enable streaming prompt processing, Microsoft employed a sliding-window approach, processing the prompt in smaller chunks of size N and padding the last chunk if necessary (see the chunked-processing sketch after this list). Microsoft also devised a dynamic, shared KV cache for context processing, which improves memory efficiency.
  • The resulting Phi Silica model offers improved first-token latency for shorter prompts and better memory efficiency, while retaining most of the power efficiency afforded by largely NPU-based operation. The floating-point model from which Phi Silica is derived underwent safety alignment using a five-stage 'break-fix' methodology, and the Phi Silica model, its system design, and its API undergo responsible AI impact assessments and deployment safety board reviews.
  • Microsoft's Phi Silica pushes the boundaries of what's possible with today's NPUs in a rapidly evolving, complex technical landscape. By advancing quantization research, Microsoft achieved remarkable gains in three critical areas: memory efficiency, power efficiency, and inference latency, without compromising quality or functionality.
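
To make the component breakdown in the second bullet concrete, here is a minimal, hypothetical sketch of a decoder-only language model built from exactly those pieces: tokenizer, embedding model, transformer block, language model head, and detokenizer. The class names and toy vocabulary are illustrative assumptions, not Phi Silica's actual implementation.

```python
# Toy decoder-only LM showing the components named above. Illustrative only;
# not Phi Silica's architecture.
import torch
import torch.nn as nn

VOCAB = {"<pad>": 0, "hello": 1, "world": 2}   # toy vocabulary (assumption)
INV_VOCAB = {i: w for w, i in VOCAB.items()}

def tokenize(text: str) -> torch.Tensor:
    """Tokenizer: text -> token ids (whitespace split, for illustration)."""
    return torch.tensor([[VOCAB.get(w, 0) for w in text.split()]])

def detokenize(ids: torch.Tensor) -> str:
    """Detokenizer: token ids -> text."""
    return " ".join(INV_VOCAB[int(i)] for i in ids.flatten())

class ToyLM(nn.Module):
    def __init__(self, vocab_size=3, d_model=16, n_heads=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)       # embedding model
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=32,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=1)  # transformer block
        self.lm_head = nn.Linear(d_model, vocab_size)             # language model head

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        x = self.embedding(ids)
        x = self.transformer(x)        # causal masking omitted in this toy
        return self.lm_head(x)         # logits over the vocabulary

ids = tokenize("hello world")
logits = ToyLM()(ids)
next_id = logits[:, -1].argmax(dim=-1)   # greedy pick of the next token
print(detokenize(next_id))
```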
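The 4-bit quantization in the third bullet can be illustrated with plain symmetric quantization of both weights and activations. This is a toy sketch of the general idea only; QuaRot's actual contribution is rotating weights and activations (e.g., with Hadamard transforms) so outliers shrink before a uniform grid like this is applied, and that rotation step is omitted here.

```python
# Toy symmetric 4-bit quantize/dequantize in NumPy. Not QuaRot itself:
# the outlier-suppressing rotations are left out for brevity.
import numpy as np

def quantize_4bit(x: np.ndarray):
    """Map floats onto the 16-level signed grid [-8, 7]."""
    scale = np.abs(x).max() / 7.0   # one scale per tensor (per-channel in practice)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # pretend weights
a = np.random.randn(4).astype(np.float32)      # pretend activations

qw, sw = quantize_4bit(w)   # quantized weights
qa, sa = quantize_4bit(a)   # quantized activations (the hard part QuaRot targets)

# Low-precision matmul: integer product, rescaled back to float afterwards.
y = (qw.astype(np.int32) @ qa.astype(np.int32)) * (sw * sa)
print("max error vs. float matmul:", np.abs(y - w @ a).max())
```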
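The sliding-window prompt processing in the fourth bullet can be sketched as below. The chunk size N, the pad id, and the process_chunk interface are hypothetical stand-ins for the fixed-length context processor; the point is that the prompt is consumed in fixed-size chunks, only the final chunk is padded, and a growing KV cache carries earlier context into each subsequent chunk.

```python
# Illustrative sliding-window prompt processing with an accumulating KV cache.
# Names (N, PAD_ID, process_chunk) are hypothetical, not Phi Silica's API.
from typing import List, Tuple

N = 4        # context processor's fixed sequence length (toy value)
PAD_ID = 0   # padding token id

def process_chunk(chunk: List[int],
                  kv_cache: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Stand-in for one pass of the context processor: append this chunk's
    (key, value) entries. A real model would run attention over
    kv_cache + chunk here, so the chunk sees all earlier context."""
    return kv_cache + [(tok, tok) for tok in chunk if tok != PAD_ID]

def process_prompt(prompt_ids: List[int]) -> List[Tuple[int, int]]:
    kv_cache: List[Tuple[int, int]] = []
    for start in range(0, len(prompt_ids), N):
        chunk = prompt_ids[start:start + N]
        chunk += [PAD_ID] * (N - len(chunk))   # pad only the final chunk
        kv_cache = process_chunk(chunk, kv_cache)
    return kv_cache

cache = process_prompt(list(range(1, 11)))   # 10 tokens -> chunks of 4, 4, 2 (+2 pad)
print(len(cache))                            # 10 real entries cached
```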
