ETL is dead, long live ETL (for multimodal data)

Why did ELT become the most effective pattern for structured data?

A key innovation of the past decade that unlocked the modern data stack was the decoupling of storage and compute, enabled by cloud data warehouses as well as cloud data platforms like Databricks. This decoupling, and the ability to scale and prioritize compute independently, supported an effective architectural pattern for getting data into the warehouse, turning the traditional ETL pattern (extract, transform, and then load) on its head.

In the new pattern, ELT (extract, load, and then transform), data engineers used simple point-to-point extract-and-load tools to move structured data from source systems into staging areas in the data warehouse. The transform step that makes these objects useful for analytics was written in SQL and ran against the staged, raw tables in the cloud data warehouse. With the rise of analytics engineering came more rigor and more tooling, like dbt, to orchestrate the transformations that would execute within the data warehouse.
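The ELT flow above can be sketched in a few lines. This is a minimal illustration using Python's built-in sqlite3 as a stand-in for a cloud data warehouse; the table and column names are invented for the example:

```python
import sqlite3

# In-memory database standing in for the cloud data warehouse.
con = sqlite3.connect(":memory:")

# Extract + Load: raw source records go straight into a staging table,
# with no transformation in flight.
con.execute("CREATE TABLE raw_orders (id INTEGER, amount REAL, status TEXT)")
con.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 120.0, "shipped"), (2, 35.5, "cancelled"), (3, 80.0, "shipped")],
)

# Transform: SQL runs inside the warehouse against the staged raw table,
# producing the analytical object for the presentation layer. In practice
# a tool like dbt would orchestrate many such statements.
con.execute(
    """
    CREATE TABLE fct_shipped_revenue AS
    SELECT status, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM raw_orders
    WHERE status = 'shipped'
    GROUP BY status
    """
)
print(con.execute("SELECT * FROM fct_shipped_revenue").fetchall())
# → [('shipped', 2, 200.0)]
```

The point of the pattern is visible in the shape of the code: the extract-and-load step is trivially simple, and all the logic lives in SQL that the warehouse itself executes.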

The ELT pattern became more effective than ETL approaches, especially as storage became cheaper and data volumes grew, because it 1) reduced complexity, especially in the EL step, 2) leveraged the scaled compute of the cloud data warehouse rather than requiring expensive, separate middleware that also had to scale, and 3) gave data engineers the flexibility to easily create new analytical objects for the presentation layer without having to go back to source systems.

In recent years, it’s been interesting to watch this trend evolve and accelerate further as the cloud data warehouse continues to be unbundled: with open formats like Iceberg and Hudi handling table formats on cloud object storage, and in-process query engines like DuckDB or distributed engines like Trino as the compute layer.

The emergence of multimodal data and embeddings

As we’ve written about in a previous blog, as an industry we’ve been talking about the transformative potential of unstructured documents and data for over a decade, but this latest wave of breakthroughs has been monumental and put engineers in a very strong position to succeed. A lasting impact of this wave will be the emergence of embeddings as a first-class data citizen within the ecosystem. Embeddings are dense vector representations of language, images, and other content, learned in the layers of deep neural networks; they are mathematical representations of the input data.
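To make the idea concrete: because embeddings are just vectors, semantic similarity between inputs reduces to geometry, typically cosine similarity. Here is a toy sketch with hand-made 4-dimensional vectors; real embedding models emit hundreds or thousands of dimensions, and the values below are illustrative only:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors.

    1.0 means identical direction (maximally similar inputs)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings: the first two inputs are semantically close,
# the third is unrelated.
king = [0.9, 0.1, 0.4, 0.2]
queen = [0.85, 0.15, 0.45, 0.2]
banana = [0.1, 0.9, 0.05, 0.7]

print(cosine_similarity(king, queen) > cosine_similarity(king, banana))
# → True
```

This geometric property is what vector databases index and what RAG retrieval is built on.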

As enterprise adoption converges on proprietary foundational models–with open ones likely having something to say in the future–RAG and related patterns are the primary way in which enterprises will get their data to these models. Enterprises will be less concerned with model customization via fine tuning models, as this approach tends to be challenging and costly. Therefore, embedding APIs and vector databases will continue to be important picks and shovels for multimodal data engineering.

Why will ETL be important for multimodal data?

Once chunks of text are transformed into embeddings and indexed in a vector store or search index, it’s not possible to transform or enrich the data further. This is in direct contrast to the ELT pattern for structured data, which relies on the similarity of staged, raw data to finalized, presentation data for analytics, and on the applicability of SQL to execute transformations after data has landed. This means that enriching chunks of data with other metadata (such as the document a chunk came from, the date of that document, and metrics associated with that document like views) becomes a critical step that will often precede the loading step of these pipelines.
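A minimal sketch of this transform-before-load enrichment step follows. All field and function names here are hypothetical, chosen only to illustrate attaching document-level metadata to each chunk before it is embedded and loaded:

```python
def enrich_chunk(chunk_text, doc_meta, chunk_index):
    """Transform step: attach document-level metadata to a chunk
    *before* embedding and loading, since it cannot be added after."""
    return {
        "id": f"{doc_meta['doc_id']}-{chunk_index}",
        "text": chunk_text,
        "metadata": {
            "source_document": doc_meta["doc_id"],
            "document_date": doc_meta["date"],
            "views": doc_meta["views"],
        },
    }

# Illustrative document and chunks.
doc = {"doc_id": "q3-report", "date": "2024-06-01", "views": 1842}
chunks = ["Revenue grew 14% year over year.", "Churn declined in Q3."]

enriched = [enrich_chunk(c, doc, i) for i, c in enumerate(chunks)]
print(enriched[0]["id"])  # → q3-report-0
```

Only after this enrichment would each record be sent to an embedding API and loaded, with its metadata, into the vector store.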

In particular, since the metrics are associated with the chunk (and sometimes its parent), RAG apps will rely upon a mapping of chunks to vectors in order to update a vector in the database after it has been loaded, since the chunk itself is not colocated with the vector in the database.
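This chunk-to-vector bookkeeping can be sketched as follows. The dictionaries below are stand-ins for a real vector database and its metadata-update API; all identifiers are invented for illustration:

```python
# Mapping from chunk id to vector id, maintained by the pipeline outside
# the vector store. It is what lets the app find a vector later, because
# the chunk text is not colocated with the vector in the database.
chunk_to_vector = {"q3-report-0": "vec-001", "q3-report-1": "vec-002"}

# Stand-in for the vector database.
vector_store = {
    "vec-001": {"embedding": [0.1, 0.2], "metadata": {"views": 1842}},
    "vec-002": {"embedding": [0.3, 0.1], "metadata": {"views": 1842}},
}

def update_chunk_metric(chunk_id, views):
    """Post-load update: refresh a metric on the vector's metadata."""
    vector_id = chunk_to_vector[chunk_id]  # resolve chunk -> vector
    vector_store[vector_id]["metadata"]["views"] = views

# The source document's view count changed after load.
update_chunk_metric("q3-report-0", 2010)
print(vector_store["vec-001"]["metadata"]["views"])  # → 2010
```

Without the mapping, there is no way to know which vector corresponds to a given chunk once the pipeline has finished loading.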

This sequencing means that the classical ETL pattern will come back to the fore in the case of multimodal data and the processing steps that need to occur to make such data valuable for AI app development. Point-to-point, extract-and-load oriented tooling will be of limited value in this setting, as data engineers will need rich capabilities within the transform step of the pipeline, inclusive of enrichment via integrations with other enterprise data systems. The absence of SQL-based transformation, and of any DSL or API for pushing transformations down to the data stores, will directly challenge the consensus approaches that have emerged in the structured setting.

This is a primary reason that we’re incredibly excited to bring Datavolo to the market to help data engineers solve these challenges with purpose-built tooling. Our solution was built from the ground up to handle in-flow transformations for multimodal data, with a secure, scalable, and codeless developer experience. Please reach out with your feedback; we’d love your insights on the challenges your organization has faced with multimodal data!
