The inference-time resource costs of large language and vision models pose challenges in production deployments.
One proposed solution is foundation model programs: programs whose modules can invoke foundation models with varying resource costs and performance.
The presented method translates a task into such a program and learns a resource-allocation policy that selects a foundation model 'backend' for each program module.
Compared to monolithic multi-modal models, the implementation achieves up to 98% resource savings with minimal accuracy loss.
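The per-module backend selection described above can be sketched as follows. This is a minimal illustration, not the paper's actual system: the backend names, costs, and the difficulty-threshold policy are all hypothetical stand-ins for a learned allocation policy.

```python
# Hypothetical sketch of a foundation model "program" whose modules can each
# be served by backends with different cost/quality trade-offs. All names,
# costs, and the threshold policy are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Backend:
    name: str
    cost: float  # relative inference cost per call
    run: Callable[[str], str]

def small_model(x: str) -> str:   # stand-in for a cheap model
    return f"small({x})"

def large_model(x: str) -> str:   # stand-in for an expensive model
    return f"large({x})"

BACKENDS = [Backend("small", cost=1.0, run=small_model),
            Backend("large", cost=50.0, run=large_model)]

def policy(module: str, difficulty: float) -> Backend:
    """Toy allocation policy: route only hard inputs to the costly backend."""
    return BACKENDS[1] if difficulty > 0.8 else BACKENDS[0]

def run_program(steps: List[Tuple[str, str, float]]):
    """Execute each (module, input, difficulty) step with its selected backend."""
    total_cost, outputs = 0.0, []
    for module, x, difficulty in steps:
        backend = policy(module, difficulty)
        outputs.append((module, backend.name, backend.run(x)))
        total_cost += backend.cost
    return outputs, total_cost

outputs, cost = run_program([("caption", "img1", 0.3),
                             ("reason", "q1", 0.9)])
```

In this toy run the captioning module is routed to the cheap backend and only the reasoning module pays for the large model, which is the source of the resource savings relative to running a monolithic model on every step.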