When Data Pipelines Start Thinking: The Rise of Multimodal AI in Modern Data Engineering

As global video streams and sensor logs outpace traditional processing, the data engineering role is shifting from simple plumbing to building systems capable of true interpretation. Shantam Mogali, Sr. Data Engineer at Google, stands at the forefront of this transition, balancing the rigor of large-scale infrastructure with the cognitive power of generative models. Here, he explores how we are moving toward an era where the pipeline itself possesses the intelligence to decode context and sentiment at scale.

Over the last decade, the data engineering landscape has undergone a massive transformation, evolving from the early days of data mining and the Big Data wave to the current Generative AI era. Shantam Mogali, Sr. Data Engineer at Google, has navigated each of these cycles, building the high-performance systems that sit at the intersection of large-scale architecture and impact-focused insights. Today, leading data strategy for Google Pixel across the Display, Safety, and AI domains, Shantam manages the full data lifecycle for millions of devices, turning raw, high-volume logs into the decision-grade stories that inform future technology roadmaps.

In this discussion, we explore a fundamental shift in how organizations analyze unstructured information like video, text, and imagery. Drawing on his experience at Google and throughout the industry, Shantam breaks down the transition from traditional, rigid computer vision methods to the flexibility of Vision-Language Models (VLMs) and zero-shot learning. From treating prompts as versioned production artifacts to the mechanics of quantifying brand exposure in real time, he provides a technical blueprint for a new era of data engineering, one where AI acts as a force multiplier for strategy and human-led innovation.

In your upcoming session at the Data Innovation Summit, you compare traditional computer vision methods with multimodal AI. What are the primary differences you’ve observed between these two approaches when it comes to analyzing unstructured data like text or video?

Shantam Mogali: Multimodal AI has unlocked new capabilities in analytics, or perhaps “streamlined” would be a better way to phrase it. Generative AI allows us to analyze unstructured data like text, audio, images, and even video. To answer your question, the transition from traditional analytics (computer vision in this case) to multimodal AI represents a fundamental shift in how we architect data intelligence. In my experience across the industry, the primary differences lie in contextual depth and the democratization of insights.

Let’s take vision analysis as an example. Traditional methods are exceptionally good at identifying what is in a frame, but multimodal AI (such as Vision-Language Models) mirrors human cognition by processing visual and textual signals simultaneously. It doesn’t just see a brand logo; it understands the sentiment of the scene.

By shifting to foundation models, we can now leverage zero-shot learning, which significantly reduces time-to-insight. What used to take months of labeling and retraining can now be done in days. These models are pre-trained on massive, diverse datasets, allowing them to understand new concepts via natural language prompts rather than intensive retraining. From a data engineering perspective, this simplifies much of the workflow development, allowing you to spend more time on deriving insights.
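As a rough illustration of why zero-shot learning shortens the cycle, the sketch below labels an image by comparing its embedding with the embeddings of candidate text prompts. The vectors are hand-made placeholders and the function names are hypothetical; in a real system a pre-trained vision-language model would produce the embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def zero_shot_label(image_embedding, prompt_embeddings):
    """Return the prompt whose embedding is closest to the image embedding."""
    return max(
        prompt_embeddings,
        key=lambda prompt: cosine(image_embedding, prompt_embeddings[prompt]),
    )


# Assumption: toy 3-dimensional embeddings standing in for model output.
PROMPT_EMBEDDINGS = {
    "a photo of a football stadium": [0.9, 0.1, 0.0],
    "a photo of a kitchen": [0.1, 0.9, 0.1],
}

stadium_frame = [0.8, 0.2, 0.1]
label = zero_shot_label(stadium_frame, PROMPT_EMBEDDINGS)

# Supporting a new concept means adding a prompt, not labeling and retraining.
extended = dict(PROMPT_EMBEDDINGS)
extended["a photo of a concert stage"] = [0.0, 0.1, 0.9]
concert_frame = [0.1, 0.0, 0.9]
new_label = zero_shot_label(concert_frame, extended)
```

The point of the sketch is the last few lines: the set of recognizable concepts is a dictionary of prompts that can be extended at any time, which is where the reduction from months to days comes from.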

Your presentation covers the evolution of data engineering workflows for unstructured data analysis such as vision analysis. Can you walk us through what a typical workflow looks like in this context, and what requirements an organization needs to have in place to support it?

Shantam Mogali: We are witnessing a seismic shift from the ‘Passive Pipelines’ era, focused on moving structured records, to the ‘Active Reasoning’ era, where the rise of AI serves as a critical enabler. A typical ETL workflow for high-volume unstructured data, such as vision or sensor analysis, follows a sophisticated, AI-enhanced path. The Extraction (E) phase has evolved to require a more diverse technical knowledge set to accommodate these complex modalities, setting the stage for a transformation layer that does more than just move data: it interprets it.

The most significant evolution is in the Transformation (T) phase, where LLM-based reasoning is directly integrated into the pipeline to ‘read’ patterns and sentiment within previously inaccessible ‘dark data.’ This transforms raw noise into structured, actionable insights at scale. Consequently, the Loading (L) phase has shifted beyond traditional validation like row counts or data type checks; semantic validation frameworks, such as LLM-as-a-judge and human-in-the-loop review, are implemented to ensure the integrity and reasoning accuracy of the data before it reaches the end user.
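One way to picture the semantic validation step is as a routing gate in front of the load. The sketch below assumes a separate judge model has already assigned each record a score between 0 and 1; the thresholds, field names, and bucket names are illustrative assumptions, not a prescribed design.

```python
from dataclasses import dataclass


@dataclass
class JudgedRecord:
    record_id: str
    payload: dict
    judge_score: float  # assumed: 0.0 to 1.0, produced by a separate judge model


def route(record, accept_at=0.9, review_at=0.6):
    """Decide what happens to a record before it is loaded."""
    if record.judge_score >= accept_at:
        return "load"
    if record.judge_score >= review_at:
        return "human_review"  # human-in-the-loop queue for borderline cases
    return "reject"


def partition(records):
    """Group record IDs by routing decision."""
    buckets = {"load": [], "human_review": [], "reject": []}
    for record in records:
        buckets[route(record)].append(record.record_id)
    return buckets


batch = [
    JudgedRecord("r1", {"sentiment": "positive"}, 0.97),
    JudgedRecord("r2", {"sentiment": "neutral"}, 0.72),
    JudgedRecord("r3", {"sentiment": "negative"}, 0.31),
]
buckets = partition(batch)
```

Only the "load" bucket would proceed to the warehouse; the review bucket goes to people, which keeps human effort focused on the cases where the judge is unsure.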

To thrive in this environment, organizations need a growth mindset. Because AI moves so fast, teams must be empowered to test and refine new frameworks in real time rather than just following a static roadmap.

One of your focus areas is using multimodal AI to find and quantify brand exposure. How does this technology actually identify a brand within a video, and what are the steps involved in turning those visual images into quantifiable data?

Shantam Mogali: In technical terms, it leverages semantic alignment by using Vision-Language Models (VLMs) that operate in a “joint embedding space” where images and text are mapped together. It starts with frame sampling to extract visual data from the video stream. These frames are passed through a vision encoder to create high-dimensional vectors. Instead of searching for a pixel-perfect match, natural language prompts are used (such as “Is brand ABC visible in this sports setting?”) to let the model reason through the scene. Because the model has a pre-trained understanding of brand identities, it can identify presence via zero-shot learning. Finally, these inferences are aggregated into a structured data layer, quantifying duration and prominence for high-volume analytics.
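The final aggregation step can be sketched in a few lines. The example below assumes frames are sampled at a fixed interval and that the model's answer for each frame has already been reduced to a visibility flag plus the fraction of the frame the brand occupies. Treating that fraction as "prominence" is an assumption made for illustration, as are all names here.

```python
from dataclasses import dataclass


@dataclass
class FrameInference:
    timestamp_s: float
    brand_visible: bool
    area_fraction: float  # assumed prominence proxy: share of frame area


def sample_timestamps(duration_s, interval_s):
    """Timestamps (in seconds) at which frames would be sampled."""
    count = int(duration_s // interval_s)
    return [i * interval_s for i in range(count)]


def summarize_exposure(inferences, interval_s):
    """Aggregate per-frame inferences into one structured row."""
    visible = [f for f in inferences if f.brand_visible]
    mean_prominence = (
        sum(f.area_fraction for f in visible) / len(visible) if visible else 0.0
    )
    return {
        "frames_sampled": len(inferences),
        # Each visible frame stands for one sampling interval of exposure.
        "exposure_seconds": len(visible) * interval_s,
        "mean_prominence": mean_prominence,
    }


interval = 0.5
stamps = sample_timestamps(duration_s=2.0, interval_s=interval)
# Hypothetical model answers for the four sampled frames.
answers = [(False, 0.0), (True, 0.25), (True, 0.75), (False, 0.0)]
inferences = [FrameInference(t, v, a) for t, (v, a) in zip(stamps, answers)]
summary = summarize_exposure(inferences, interval)
```

The sampling interval is the main trade-off: a shorter interval gives a finer duration estimate but multiplies the number of model calls per video.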

You have worked on implementing LLM-based data pipelines. How do Large Language Models interact with unstructured data in your current frameworks, and what specific tasks are they being used to solve?

Shantam Mogali: In modern data engineering, LLMs act as a semantic processing layer that bridges the gap between raw, unstructured text and structured, queryable databases. Rather than building fragile, regex-based parsers, LLMs are directly integrated into the pipeline to reason through data at scale. For text modalities, a few common use cases include sentiment and intent analysis, thematic categorization, and content summarization. For vision, they can be leveraged to generate a JSON log of every image where a specific product appears, including its context (e.g. placed on a wooden table in sunlight). By combining these capabilities, LLMs enable unified, multimodal insights that accelerate decision making and reduce manual data wrangling across both text and visual content.
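To make the JSON log idea concrete, here is a minimal sketch of the step that turns raw model responses into rows. It assumes each response is a JSON object with a `product_present` flag and an optional `context` string; those field names, and the sample responses, are hypothetical.

```python
import json


def parse_detection(raw_response, image_id):
    """Turn one raw model response into a row, or None if it is unusable."""
    try:
        data = json.loads(raw_response)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or "product_present" not in data:
        return None
    return {
        "image_id": image_id,
        "product_present": bool(data["product_present"]),
        "context": data.get("context"),
    }


def build_log(responses):
    """Collect rows where the product appears; track unparseable responses."""
    rows, failures = [], []
    for image_id, raw in responses.items():
        row = parse_detection(raw, image_id)
        if row is None:
            failures.append(image_id)
        elif row["product_present"]:
            rows.append(row)
    return rows, failures


# Hypothetical responses keyed by image ID.
responses = {
    "img_001": '{"product_present": true, "context": "on a wooden table in sunlight"}',
    "img_002": '{"product_present": false}',
    "img_003": "Sure! Here is the JSON you asked for",
}
rows, failures = build_log(responses)
```

Keeping the failures list separate, rather than silently dropping bad responses, lets the pipeline retry or inspect them later.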

You mention mastering prompt engineering specific to multi-modal analysis. How does a ‘prompt’ function within a data engineering pipeline, and what is your process for testing its effectiveness?

Shantam Mogali: This is a great question because prompting is often viewed in a cavalier manner rather than as a rigorous engineering discipline.

In practice, prompts should be first-class pipeline artifacts, much like schemas or transformation specs. They’re versioned, reviewed, and deployed as production logic, with explicit output schemas, constraints, and deterministic formatting guarantees. While model outputs are stochastic, variability is tightly bounded: structure, keys, nullability, and allowed values are fixed. If a downstream system has to guess whether it’s receiving an object or an array, the system has already failed. Bounding variability enables validation, replay, and safe downstream joins.
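A small sketch of what bounding variability can look like in practice: the contract below fixes the keys, nullability, and allowed values of a model response, and the validator reports every violation. The contract contents and function name are illustrative assumptions rather than a specific library's API.

```python
# Assumed output contract for one prompt; field names are hypothetical.
OUTPUT_CONTRACT = {
    "brand": {"type": str, "nullable": False},
    "sentiment": {
        "type": str,
        "nullable": False,
        "allowed": {"positive", "neutral", "negative"},
    },
    "confidence": {"type": float, "nullable": True},
}


def contract_violations(output, contract=OUTPUT_CONTRACT):
    """Return a list of human-readable violations (empty means valid)."""
    if not isinstance(output, dict):
        return ["output must be an object, got " + type(output).__name__]
    errors = []
    for key in sorted(set(output) - set(contract)):
        errors.append("unexpected key: " + key)
    for key, rule in contract.items():
        if key not in output:
            errors.append("missing key: " + key)
            continue
        value = output[key]
        if value is None:
            if not rule["nullable"]:
                errors.append(key + " must not be null")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(key + " has wrong type")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(key + " has disallowed value: " + str(value))
    return errors


good = {"brand": "ABC", "sentiment": "positive", "confidence": None}
bad = {"brand": None, "sentiment": "ecstatic", "extra": 1}
good_errors = contract_violations(good)
bad_errors = contract_violations(bad)
```

A response that passes this check can be joined and replayed safely; one that fails is rejected before any downstream system has to guess at its shape.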

Prompts are never hardcoded. They live in registries or codebases, referenced by ID, which enables shadow deployments, A/B tests, and historical replays. At scale, manually writing and tweaking prompts is neither scalable nor a good use of engineering time. Instead, prompt optimization itself becomes automated. A common pattern is to use a “teacher” or meta-model to propose candidate prompt variants based on observed failure modes. These candidates are then benchmarked against a manually curated golden dataset that captures edge cases, ambiguity, and high-risk scenarios. Each prompt version is scored using a defined loss function or task-specific metrics (such as F1), and only statistically superior variants are promoted.
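The promotion loop described here can be sketched as follows. Prompts live in a registry keyed by ID, each version's predictions on the golden dataset are scored with F1, and a candidate is promoted only if it beats the current version by a margin. The registry contents and predictions are made up, and the fixed margin is a deliberate simplification standing in for a proper statistical significance test.

```python
# Hypothetical registry: prompts are referenced by ID, never inlined.
PROMPT_REGISTRY = {
    "brand_v1": "Is brand ABC visible in this image? Answer yes or no.",
    "brand_v2": "Is the ABC logo or product clearly visible? Answer yes or no.",
    "brand_v3": "Does this image contain anything related to brand ABC?",
}


def f1_score(predicted, expected):
    """F1 for binary predictions against golden labels."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p and e)
    fp = sum(1 for p, e in pairs if p and not e)
    fn = sum(1 for p, e in pairs if not p and e)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def select_prompt(current_id, scores, min_gain=0.02):
    """Return the ID to deploy: best candidate if it clears the margin."""
    best_id = max(scores, key=scores.get)
    if best_id != current_id and scores[best_id] - scores[current_id] >= min_gain:
        return best_id
    return current_id


# Golden labels and per-prompt predictions (assumed, precomputed offline).
golden = [True, True, False, False, True]
predictions = {
    "brand_v1": [True, False, False, True, True],
    "brand_v2": [True, True, False, False, True],
    "brand_v3": [True, True, False, True, True],
}
scores = {pid: f1_score(preds, golden) for pid, preds in predictions.items()}
deployed = select_prompt("brand_v1", scores)
```

Because every version is addressed by ID, the same scoring run can be replayed later against a new golden dataset or a new model without touching pipeline code.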

Overall, testing prompt effectiveness has to be approached with a data engineering mindset, not a prompt-hobbyist one. Prompts are pipeline components: they are tested, benchmarked, monitored for drift, and iterated on systematically.

Looking at the evolution of these workflows, what kind of impact do you expect LLM or Multimodal AI to have on the broader field of data engineering over the next few years?

Shantam Mogali: I think core data engineering work will fundamentally remain the same, but LLMs and multimodal AI are already beginning to accelerate and augment workflows. AI can help generate code, create transformations, analyze sample datasets, and flag biases or anomalies, allowing engineers to move faster and focus on complex challenges.

Another thing is that LLMs won’t replace pipelines but will change how engineers think about and interact with data. Engineers will increasingly manage complex orchestration, validate outputs, and ensure compliance, while LLMs accelerate experimentation and insight generation.

Emerging trends like agentic AI, which can autonomously execute multi-step data workflows, and conversational BI, which allows teams to query and explore data via natural language, are starting to transform how engineers and business users interact with data as well.

The non-deterministic nature of LLM outputs means human oversight remains critical, especially for production pipelines, complex orchestration, and compliance. Over the next few years, LLMs could make data engineering even more exploratory and accessible, accelerating experimentation while ensuring rigor and quality. In essence, LLMs are a force multiplier, blurring the lines between data engineering, data science, and BI, and freeing engineers to focus on broader data strategy and impact.

To witness these methodologies translated into real-world applications, catch Shantam Mogali at the Data Innovation Summit. He will be conducting a practical, end-to-end walkthrough on how to architect LLM-based pipelines that decode brand presence within complex media streams. Attendees will gain a hands-on perspective on mastering durable prompt engineering and leveraging multimodal AI to transform unstructured “dark data” into high-impact, quantifiable results.