The Reality of Operating at 500 Billion Inferences Per Day

Everyone talks about the magic of Artificial Intelligence, but very few talk about the engineering plumbing required to keep it from collapsing under its own weight. When you are a massive global enterprise, a non-deterministic chatbot that takes a full minute to reply is more of a liability than an innovation.

In this talk from the Nordic Data Science and Machine Learning (NDSML) Summit in Stockholm, Sweden, Big Data and ML/AI product leader Sanchit Juneja shows how the operations behind Booking.com’s infrastructure actually work, and what users never see behind the website. With nearly two decades of experience across the US and Europe, he breaks down how a global giant handles one of the largest machine learning footprints in the tech industry today.

For any organization trying to bridge the gap between core research models and profitable, scalable applications, this talk serves as a practical blueprint.

The Architecture: Before and After the LLM Boom

Before the explosion of Large Language Models (LLMs), a highly scaled production environment followed a predictable cadence. Data was clean and structured, models were trained offline, and they were served online to power real-time personalization.

At Booking.com, the scale of this traditional ecosystem is staggering: at any given moment, customers are interacting with more than 450 machine learning models. The infrastructure regularly serves more than 500 billion inferences a day and scales up to 800 billion predictions during peak windows, all while maintaining a real-time latency threshold of less than 20 milliseconds.
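To put those figures in perspective, here is a rough back-of-the-envelope calculation. It assumes the 500 billion and 800 billion figures are both daily totals spread evenly over 24 hours, and it treats the 20 millisecond threshold as the time each request spends in the system. Neither assumption comes from the talk, so the results are order-of-magnitude estimates only.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def per_second(per_day: float) -> float:
    """Average request rate implied by a daily total (assumes even load)."""
    return per_day / SECONDS_PER_DAY


def in_flight(rate_per_second: float, latency_seconds: float) -> float:
    """Little's law: average concurrent requests = arrival rate x time in system."""
    return rate_per_second * latency_seconds


typical_rps = per_second(500e9)  # roughly 5.8 million inferences per second
peak_rps = per_second(800e9)     # roughly 9.3 million inferences per second

# Assumption: every request takes the full 20 ms budget (an upper bound).
in_flight_typical = in_flight(typical_rps, 0.020)
in_flight_peak = in_flight(peak_rps, 0.020)

print(f"typical: {typical_rps:,.0f} req/s, about {in_flight_typical:,.0f} in flight")
print(f"peak:    {peak_rps:,.0f} req/s, about {in_flight_peak:,.0f} in flight")
```

Even as an average, that is several million predictions every second, with on the order of a hundred thousand requests in flight at once if each one uses its full latency budget.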

But what happens when the unpredictable nature of LLMs is introduced into a well-oiled machine?

Juneja addresses the dramatic shift inside the engineering stack, detailing the transition from basic MLOps to complex Foundation Model Operations (FMOps).

The core challenge is this: if a machine learning system is to reply in a deterministic manner, it can take up to a minute to process. If it replies in five seconds, it risks spewing non-deterministic garbage at users. When a traveler’s flight is canceled at an airport, they do not need conversational noise. What they need is a precise, hard answer.

To solve this, advanced engineering teams have had to quickly construct entirely new operational layers:

  • The Middleware Wrapper: A domain-agnostic layer sitting between internal data pipelines and third-party hosted LLM services (like Amazon SageMaker or Google Vertex AI) to mask complexity.
  • The Prompt Store & Vector Management: Systems explicitly designed to inject speed and absolute determinism into chatbot frameworks.
  • Evolving Observability: Moving past traditional data logging into aggressive model guardrailing to actively prevent model and data drift.
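As a minimal sketch of how the first two layers might fit together, the Python below wraps an arbitrary LLM backend behind one gateway, renders prompts from a versioned store, caches answers so identical prompts get identical replies, and falls back to a fixed message when a guardrail check fails. Every name here (LLMBackend, PromptStore, LLMGateway, CountingStubBackend) is hypothetical and invented for illustration. None of it is Booking.com's implementation, and the stub backend stands in for a real hosted service.

```python
from dataclasses import dataclass, field
from string import Template
from typing import Callable, Protocol


class LLMBackend(Protocol):
    """Hypothetical interface: anything that turns a prompt into text."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class PromptStore:
    """Versioned prompt templates keyed by (name, version)."""

    templates: dict[tuple[str, int], Template] = field(default_factory=dict)

    def register(self, name: str, version: int, text: str) -> None:
        self.templates[(name, version)] = Template(text)

    def render(self, name: str, version: int, **params: str) -> str:
        # Raises KeyError if the template or a placeholder value is missing.
        return self.templates[(name, version)].substitute(**params)


@dataclass
class LLMGateway:
    """Domain-agnostic wrapper: callers never talk to the backend directly."""

    backend: LLMBackend
    prompts: PromptStore
    guardrail: Callable[[str], bool]
    fallback: str = "Sorry, I cannot answer that. Connecting you to an agent."
    cache: dict[str, str] = field(default_factory=dict)

    def ask(self, name: str, version: int, **params: str) -> str:
        prompt = self.prompts.render(name, version, **params)
        if prompt in self.cache:  # same prompt, same answer, no backend call
            return self.cache[prompt]
        answer = self.backend.complete(prompt)
        if not self.guardrail(answer):  # reject answers that fail the check
            answer = self.fallback
        self.cache[prompt] = answer
        return answer


class CountingStubBackend:
    """Stand-in for a hosted LLM; returns a different string on every call."""

    def __init__(self) -> None:
        self.calls = 0

    def complete(self, prompt: str) -> str:
        self.calls += 1
        return f"[stub answer #{self.calls}] {prompt}"


store = PromptStore()
store.register("flight_cancelled", 1, "Flight $flight was canceled. List rebooking options.")
backend = CountingStubBackend()
gateway = LLMGateway(backend, store, guardrail=lambda text: len(text) < 500)

first = gateway.ask("flight_cancelled", 1, flight="SK123")
second = gateway.ask("flight_cancelled", 1, flight="SK123")
print(first == second, backend.calls)
```

The stub deliberately returns a different string on each call, so the matching answers come from the cache rather than from the model. Swapping providers would mean writing another class with a complete method, while callers of the gateway stay unchanged.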

Data Fracturing and the New Hardware Battleground

As organizations reach exabyte-scale data lakes, traditional setups fracture under the pressure. The speaker addresses how massive systems split between transactional layers (OLTP) and analytical systems (OLAP), shedding light on why younger companies must leverage Hybrid Transactional Analytical Processing (HTAP) before reaching a critical breaking point.

The talk also tackles the physical constraints of modern high-performance computing. While GPUs were historically built for gaming, the post-LLM landscape has sparked an entirely new hardware ecosystem. From Amazon’s Trainium to specialized processors from companies like Groq, the talk details the massive shift toward distributed GPU infrastructures and the crucial need for specialized model optimization frameworks.

The session explores the real secret to scaling: feedback loops. Through Reinforcement Learning from Human Feedback (RLHF) and from AI feedback, production environments are moving away from stale models and steering directly toward automated label collection, unlocking continuous return on investment without massive, recurring manual overhauls.

Full Structural Strategy

The difference between enterprises that bleed capital on raw compute and those that successfully monetize the AI boom rests entirely on infrastructure design. This presentation skips the theoretical philosophy and focuses purely on high-volume reality: evaluating data drift, deciding when to run specialized Small Language Models (SLMs) on localized hardware, and designing bulletproof guardrails.
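For readers new to the idea of evaluating data drift, one common generic technique is the Population Stability Index (PSI), which compares how a feature is distributed in a baseline sample versus a recent one. The sketch below is a standard-library illustration and not the method described in the talk. The function name, bin count, and sample data are assumptions made for the example.

```python
import math
from typing import Sequence


def population_stability_index(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a baseline sample and a recent sample of one feature.

    Bins are equal-width over the baseline's range; values outside that
    range are clamped into the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def shares(values: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each share at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]       # spread evenly over 0.00 to 0.99
shifted = [0.5 + i / 200 for i in range(100)]  # squeezed into the upper half

print(population_stability_index(baseline, baseline))
print(population_stability_index(baseline, shifted))
```

A frequently quoted rule of thumb treats a PSI below 0.1 as stable and above 0.25 as a significant shift, though the right thresholds depend on the feature and the business context.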

Want to dive deep into the complete structural diagrams, performance optimization tactics, and live architectural comparisons?

Unlock this exclusive talk alongside the complete, premium NDSML archive.

Join the Leading Minds in Stockholm

The conversation doesn’t stop with a presentation. The annual NDSML Summit returns to the Filadelfia Convention Center in Stockholm, Sweden this November, gathering the most prominent Data Science, Data Engineering, and Machine Learning minds from across the Nordics and Europe.

From Models to Operational AI, NDSML is the premier Nordic place to network with industry architects, discover cutting-edge deployments in generative and agentic workflows, and gain actionable tools to mature your data function.

Explore the Official Agenda & Secure Your NDSML Summit Ticket Here.
