JobSet is an open source API introduced by Daniel Vega-Myhre, Abdullah Gharaibeh, and Kevin Hannon for representing distributed jobs, focusing on distributed ML training and HPC workloads on Kubernetes.
JobSet aims to fill gaps in the Kubernetes primitives for distributed ML training by offering a unified API for large-scale distributed HPC and ML use cases.
Key features of JobSet include Replicated Jobs, automatic headless service management, configurable success and failure policies, and exclusive placement per topology domain.
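To make the policy and placement features concrete, here is a minimal Python sketch that adds them to a JobSet manifest held as a plain dict. The field names (`failurePolicy.maxRestarts`, `successPolicy.operator`, `successPolicy.targetReplicatedJobs`) and the `alpha.jobset.sigs.k8s.io/exclusive-topology` annotation are assumed from the `v1alpha2` API and may differ in other releases. The helper `with_policies`, the JobSet name `demo`, and the group name `workers` are hypothetical.

```python
import copy

# Assumed annotation key for exclusive placement (one child Job per topology domain).
EXCLUSIVE_TOPOLOGY_ANNOTATION = "alpha.jobset.sigs.k8s.io/exclusive-topology"


def with_policies(jobset, topology_key, max_restarts, success_targets):
    """Return a copy of a JobSet manifest with placement and policy settings added."""
    out = copy.deepcopy(jobset)

    # Exclusive placement is requested through an annotation naming a node label.
    annotations = out.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[EXCLUSIVE_TOPOLOGY_ANNOTATION] = topology_key

    spec = out.setdefault("spec", {})
    # Failure policy: how many times the whole JobSet may be restarted.
    spec["failurePolicy"] = {"maxRestarts": max_restarts}
    # Success policy: which replicated jobs must finish for the JobSet to succeed.
    spec["successPolicy"] = {
        "operator": "All",
        "targetReplicatedJobs": list(success_targets),
    }
    return out


base = {
    "apiVersion": "jobset.x-k8s.io/v1alpha2",
    "kind": "JobSet",
    "metadata": {"name": "demo"},
    "spec": {"replicatedJobs": []},
}
configured = with_policies(base, "topology.kubernetes.io/zone", 3, ["workers"])
```

The helper copies the manifest rather than mutating it, so the same base definition can be reused with different policies.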
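It models a distributed batch workload as a group of Kubernetes Jobs: each ReplicatedJob is a Job template plus a desired number of replicas, from which JobSet creates and manages child Jobs, allowing users to define different pod templates for different groups of pods (for example, a leader and its workers).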
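The ReplicatedJob structure can be sketched in Python as a manifest builder. This is an illustration under assumptions, not the project's own tooling: the `jobset.x-k8s.io/v1alpha2` API version and the `replicatedJobs` layout are assumed from the JobSet API, while `make_jobset`, the group names, and the `example.com` image references are hypothetical.

```python
import json


def make_jobset(name, groups):
    """Build a JobSet manifest as a plain dict.

    groups: list of (group_name, replicas, pods_per_job, image) tuples,
    one tuple per group of pods that needs its own pod template.
    """
    replicated_jobs = []
    for group_name, replicas, pods_per_job, image in groups:
        replicated_jobs.append({
            "name": group_name,
            "replicas": replicas,  # number of child Jobs created from this template
            "template": {          # an ordinary batch/v1 Job template
                "spec": {
                    "parallelism": pods_per_job,
                    "completions": pods_per_job,
                    "backoffLimit": 0,
                    "template": {
                        "spec": {
                            "restartPolicy": "Never",
                            "containers": [{"name": group_name, "image": image}],
                        }
                    },
                }
            },
        })
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",  # assumed API version
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {"replicatedJobs": replicated_jobs},
    }


manifest = make_jobset(
    "demo",
    [
        ("leader", 1, 1, "example.com/leader:latest"),    # hypothetical image
        ("workers", 2, 4, "example.com/worker:latest"),   # hypothetical image
    ],
)
print(json.dumps(manifest, indent=2))
```

In this sketch the `leader` group yields one child Job with one pod, and the `workers` group yields two child Jobs with four pods each, every group using its own container image.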
JobSet integrates with Kueue for workload queuing, oversubscription, multi-tenancy, and more, enhancing cluster resource utilization.
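As a rough illustration of the Kueue integration, a JobSet is typically pointed at a queue through a label on its metadata. The sketch below assumes the `kueue.x-k8s.io/queue-name` label used by Kueue; the helper `with_queue` and the queue name `team-a-queue` are hypothetical, and a LocalQueue with that name would need to exist in the JobSet's namespace.

```python
import copy

# Label Kueue reads to decide which queue a workload is submitted to (assumed).
QUEUE_LABEL = "kueue.x-k8s.io/queue-name"


def with_queue(jobset, queue_name):
    """Return a copy of a JobSet manifest labelled for a Kueue queue."""
    out = copy.deepcopy(jobset)
    labels = out.setdefault("metadata", {}).setdefault("labels", {})
    labels[QUEUE_LABEL] = queue_name
    return out


jobset = {
    "apiVersion": "jobset.x-k8s.io/v1alpha2",
    "kind": "JobSet",
    "metadata": {"name": "demo"},
    "spec": {"replicatedJobs": []},
}
queued = with_queue(jobset, "team-a-queue")  # hypothetical queue name
```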
An example use case of JobSet involves distributed ML training on multiple TPU slices using JAX, showcasing its capabilities for TPU-based workloads.
Looking ahead, the project's goals include seamless integration with Kubernetes and a rich API for distributed computing tasks, building on the configurable success and failure policies already listed among its key features.
Developers and contributors are encouraged to engage with the JobSet project, offer feedback, report bugs, suggest features, and participate in its development.
For more information on JobSet, its roadmap, and how to get involved, interested parties can visit the project repository, join the mailing list, or reach out on Slack.