ML Training in Production at Meta – Shivam Bharuka, Meta


Session Outline

Machine learning models at Meta are growing rapidly in scale to support recommendation and content understanding use cases. To keep up with this growth, we have re-architected the entire AI infrastructure stack, from building specialized hardware with powerful GPUs and network devices to designing optimized distributed training algorithms in PyTorch. In this talk, I will cover the challenges we encountered and the approach we took to operate this stack in production.

Key Takeaways

– Challenges in production for ML Training Infrastructure

– How operating ML Infra differs from standard Infra services

– How to measure reliability and operate ML Infra at scale
