Machine learning models are growing rapidly in scale to support the recommendation and content understanding use cases at Meta. To keep up with this growth, we have re-architected the entire AI Infrastructure stack, from building specialized hardware with powerful GPUs and network devices to designing optimized distributed training algorithms using PyTorch. In this talk, I will discuss the challenges we encountered and the approach we took to operate this stack in production.
– Challenges in production for ML Training Infrastructure
– How operating ML Infra differs from standard Infra services
– How to measure reliability and operate ML Infra at scale