When applied to AI computing, existing infrastructure runs into problems with resources, deployment, application orchestration, and the platform itself.
Ray is a general-purpose distributed computing engine adopted by many large companies, providing simple and intuitive distributed programming, extensive AI framework integration, and efficient scaling.
To meet the challenges of low-cost, high-throughput, highly reliable, and easy-to-use AI computing, a platform called AstraRay was built on top of Ray.
The architecture of AstraRay addresses challenges with managing million-scale pod clusters, ensuring stability with unstable resources, and simplifying application deployment.
To support scaling to millions of nodes, AstraRay adopts a shared scheduling architecture, in which multiple schedulers operate on a shared view of cluster state and resolve resource allocation conflicts with optimistic concurrency locks: commits succeed only if the state is unchanged, and conflicting schedulers retry.
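AstraRay's internal implementation is not shown in this summary; the following is a minimal single-process sketch of the optimistic-concurrency idea, where each node record carries a version number and a scheduler's claim only commits if the version it read is still current (all names here are illustrative).

```python
import threading

class ClusterState:
    """Shared cluster state; each node record carries a version for optimistic locking."""
    def __init__(self, nodes):
        self._lock = threading.Lock()  # stands in for the state store's atomic commit
        self.nodes = {n: {"free_cpu": c, "version": 0} for n, c in nodes.items()}

    def try_commit(self, node, cpu, expected_version):
        """Atomically claim resources iff the node's version is unchanged (compare-and-swap)."""
        with self._lock:
            rec = self.nodes[node]
            if rec["version"] != expected_version or rec["free_cpu"] < cpu:
                return False  # another scheduler won the race; the caller retries
            rec["free_cpu"] -= cpu
            rec["version"] += 1
            return True

def schedule(state, cpu, retries=10):
    """Each scheduler plans against a snapshot, then commits optimistically."""
    for _ in range(retries):
        # Snapshot the shared state without holding any lock during planning.
        snapshot = {n: dict(r) for n, r in state.nodes.items()}
        candidates = [n for n, r in snapshot.items() if r["free_cpu"] >= cpu]
        if not candidates:
            return None  # no node fits
        node = candidates[0]
        if state.try_commit(node, cpu, snapshot[node]["version"]):
            return node  # claim succeeded
    return None  # gave up after repeated conflicts
```

The benefit over a single pessimistic lock is that many schedulers can plan concurrently; only the brief commit is serialized, which is what makes the architecture viable at million-pod scale.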
AstraRay handles unstable nodes quickly and routes traffic efficiently by combining fast disaster-recovery scheduling with a dynamically weighted SWRR (smooth weighted round-robin) routing algorithm.
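The summary does not spell out AstraRay's exact routing code, but SWRR is a well-known algorithm (popularized by NGINX): each pick adds every node's weight to a running score, selects the highest score, then subtracts the total weight from the winner, producing smoothly interleaved rather than bursty selections. A minimal sketch, with an illustrative hook for the dynamic reweighting the text mentions:

```python
class SWRRBalancer:
    """Smooth weighted round-robin (SWRR) load balancer."""
    def __init__(self, weights):
        self.weights = dict(weights)            # effective weight per node
        self.current = {n: 0 for n in weights}  # running score per node

    def pick(self):
        total = sum(self.weights.values())
        for n, w in self.weights.items():
            self.current[n] += w                # every node accrues its weight
        node = max(self.current, key=self.current.get)
        self.current[node] -= total             # winner pays the total weight
        return node

    def update_weight(self, node, weight):
        # Dynamic reweighting hook (illustrative): e.g. lower a node's weight
        # when its latency rises or it starts failing health checks.
        self.weights[node] = weight

lb = SWRRBalancer({"a": 5, "b": 1, "c": 1})
picks = [lb.pick() for _ in range(7)]
print(picks)  # "a" is chosen 5 of every 7 picks, smoothly interleaved with "b" and "c"
```

Making the weights dynamic, as AstraRay does, lets the router shift load away from degraded nodes without any burst toward the remaining healthy ones.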
AstraRay simplifies AI application deployment through approaches such as multi-model extension, fast model distribution, multi-module extension, and multi-hardware extension.
AstraRay addresses the challenges of multi-hardware extension and diverse inference business types by building on the TFCC framework.
AstraRay has already established a solid foundation for AI applications in production environments and continues to be optimized and improved.