The significant computational demands of pretrained language models (PLMs) make efficient inference challenging, especially in multi-tenant environments where many models must be served concurrently.
HMI (Hierarchical knowledge management-based Multi-tenant Inference) is introduced as a system that serves tenants, each with a distinct PLM, in a resource-efficient manner.
HMI constructs hierarchical PLMs (hPLMs) by categorizing PLM knowledge into general, domain-specific, and task-specific levels, reducing GPU memory usage per tenant.
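The memory benefit of this hierarchy can be illustrated with a back-of-the-envelope sketch: tenants share one copy of the general backbone, and only small domain- and task-specific parameter sets are stored per tenant. The parameter counts below are hypothetical placeholders, not figures from HMI.

```python
# Illustrative sketch of hierarchical parameter sharing for multi-tenant
# PLM serving. All sizes are made-up examples, not HMI's actual numbers.

GENERAL_PARAMS = 100_000_000  # shared backbone, stored once on GPU
DOMAIN_PARAMS = 5_000_000     # per-domain knowledge delta (e.g., finance)
TASK_PARAMS = 500_000         # per-task knowledge delta (e.g., sentiment)

def dedicated_memory(num_tenants: int) -> int:
    """Baseline: each tenant keeps a full, independent PLM copy."""
    return num_tenants * GENERAL_PARAMS

def hierarchical_memory(num_domains: int, num_tasks: int) -> int:
    """hPLM-style: one shared backbone plus small per-level deltas."""
    return (GENERAL_PARAMS
            + num_domains * DOMAIN_PARAMS
            + num_tasks * TASK_PARAMS)

# 100 tenants spread over 10 domains, one task each.
baseline = dedicated_memory(100)                    # 10_000_000_000 params
shared = hierarchical_memory(num_domains=10,
                             num_tasks=100)         # 200_000_000 params
print(baseline // shared)  # → 50x rough reduction under these assumptions
```

The key design point is that the dominant cost, the general backbone, is amortized across all tenants, so per-tenant memory grows only with the small domain and task deltas.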
System optimizations such as hierarchical knowledge prefetching and parallel implementations further improve resource utilization and inference throughput in HMI.