ML Training in Production at Meta – Shivam Bharuka, Meta

Session Outline

Machine learning models at Meta are growing rapidly in scale to support recommendation and content understanding use cases. To keep up with this growth, we have re-architected the entire AI Infrastructure stack, from building specialized hardware with powerful GPUs and network devices to designing optimized distributed training algorithms using PyTorch. In this talk, I will cover the challenges we encountered and the approach we took to operate this stack in production.

Key Takeaways

– Challenges in production for ML Training Infrastructure

– How operating ML Infra differs from standard Infra services

– How to measure reliability and operate ML Infra at scale
