Datavolo Architecture Viewpoint

The Evolving AI Stack

Datavolo will operate in three layers of the evolving AI stack: data pipelines, orchestration, and observability & governance.

The value of any stack is determined by the app layer, as we saw with Windows, iOS, and countless other examples. While prototyping AI apps can be surprisingly easy, the path to production and automation is proving challenging for many. A major reason is that open questions remain at every layer of the emerging stack. The design patterns are nascent and evolving, especially those for connecting data and documents with foundational models. One major cross-cutting challenge is evaluation: How well is retrieval working? What is the quality of the model output? Has that quality recently changed? In a future blog post, we will go into greater depth on how Datavolo is incorporating evaluation into our pipeline health and observability features.
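To make the question "how well is retrieval working?" concrete, here is a minimal sketch of two common retrieval metrics, recall@k and reciprocal rank. This is an illustration under our own assumptions, not a description of Datavolo's evaluation features; the function names and the document IDs are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant document IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def reciprocal_rank(retrieved: Sequence[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical example: IDs returned by a retriever vs. hand-labeled relevant IDs.
retrieved_ids = ["doc-7", "doc-2", "doc-9"]
relevant_ids = {"doc-2", "doc-4"}

print(recall_at_k(retrieved_ids, relevant_ids, k=2))
print(reciprocal_rank(retrieved_ids, relevant_ids))
```

Tracking metrics like these over time is one way to notice when retrieval quality has changed.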

Engineers’ initial instinct, in the absence of a fully fledged AI stack, was to build apps directly on LLMs. But hallucinations and the need to add contextual data drove them down the rabbit hole of fine-tuning, as well as toward in-context learning patterns like retrieval-augmented generation (RAG), which anchors model output on contextual documents and data.
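As a rough illustration of the RAG pattern, the sketch below retrieves the most similar documents for a question and places them in the prompt. It is a toy under stated assumptions: word-overlap (Jaccard) similarity stands in for a real embedding model and vector database, and `retrieve` and `build_prompt` are hypothetical names, not part of any library.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return the k documents with the highest Jaccard overlap with the query."""
    query_tokens = tokenize(query)

    def score(doc: str) -> float:
        doc_tokens = tokenize(doc)
        union = query_tokens | doc_tokens
        return len(query_tokens & doc_tokens) / len(union) if union else 0.0

    return sorted(documents, key=score, reverse=True)[:k]


def build_prompt(query: str, context: List[str]) -> str:
    """Anchor the model on retrieved context by placing it ahead of the question."""
    joined = "\n".join(f"- {passage}" for passage in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {query}"
    )


docs = [
    "Pipelines move data between systems",
    "Bananas are yellow",
    "Vector databases store embeddings",
]
question = "What do vector databases store?"
print(build_prompt(question, retrieve(question, docs, k=1)))
```

In a production setting the prompt would then be sent to an LLM; that call is omitted here to keep the sketch self-contained.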

We’ve also started to see the agentic pattern of LLM + Tools emerge for many AI apps, whether the tool is a retrieval tool to augment model knowledge, a math tool, or a coding tool. This is a compelling design pattern in which the model itself decides which systems to interface with and when. We’re seeing a lot of focus in this direction, both with the launch of OpenAI’s Assistants API and among startups.
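The LLM + Tools pattern can be sketched as a dispatch loop: the model returns a decision naming a tool, and the application calls it. The example below is a minimal sketch, not the OpenAI API or any framework's interface. The model is replaced by a hypothetical rule-based stub (`stub_model`), and the tool names and knowledge base are invented for illustration.

```python
import re
from typing import Callable, Dict

# Hypothetical knowledge base backing the retrieval tool.
KNOWLEDGE_BASE = {"rag": "RAG anchors model output on retrieved documents."}


def retrieval_tool(query: str) -> str:
    """Return the first knowledge-base entry whose key appears in the query."""
    for key, value in KNOWLEDGE_BASE.items():
        if key in query.lower():
            return value
    return "no match"


def math_tool(expression: str) -> str:
    """Sum every number found in the text (a deliberately tiny 'math tool')."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", expression)
    return str(sum(float(n) for n in numbers))


TOOLS: Dict[str, Callable[[str], str]] = {
    "retrieval": retrieval_tool,
    "math": math_tool,
}


def stub_model(message: str) -> Dict[str, str]:
    """Stand-in for an LLM's tool choice: digits route to math, else retrieval."""
    tool = "math" if re.search(r"\d", message) else "retrieval"
    return {"tool": tool, "input": message}


def run_step(message: str, model: Callable[[str], Dict[str, str]] = stub_model) -> str:
    """Ask the model which tool to use, then dispatch to that tool."""
    decision = model(message)
    return TOOLS[decision["tool"]](decision["input"])


print(run_step("add 2 and 3"))
print(run_step("What is RAG?"))
```

The design point is that routing lives in the model's decision rather than in application code, so adding a tool means registering it in `TOOLS` rather than rewriting control flow.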

These design patterns, especially the focus on in-context learning, have led to several new solutions in different layers of the stack: app frameworks like LangChain and LlamaIndex, embedding APIs and vector databases like Pinecone, and new capabilities in data pipelines, orchestration, and observability & governance.

Two important pieces of prior work we’d like to note here are “The Generative AI Stack: Making the Future Happen Faster” by Madrona and “Emerging Architectures for LLM Applications” by Andreessen Horowitz. These emerging points of view on the AI stack are aligned with our thesis on where Datavolo will be most impactful for enterprises and their need for multimodal data pipelines.

Where does Datavolo fit in?

Datavolo is a tool for data engineers supporting AI teams. It provides a framework, a feature set, and a catalog of repeatable patterns for building multimodal data pipelines that are secure, simple, and scalable.

Data engineers need tools to build multimodal pipelines across the continuum of foundational model integration and customization. The use cases these tools need to support span the range of in-context learning, from prompt engineering to advanced RAG to agent API integration. Datavolo is designed from the ground up to support these use cases and, importantly, provides features for pipeline observability and governance, including evaluation of the retrieval system.

ELT as a pattern is far less natural in the multimodal setting. The systems that produce multimodal data are not relational. The systems in which multimodal data land (embeddings and vector DBs, indexes and search platforms) are not intended to be data engineering platforms, where SQL-based code is pushed down to the in-database compute layer.

This puts the emphasis back on the transformation step of the traditional ETL architecture, and on multimodal data frameworks that can bridge these non-relational data-producing systems with the emerging consuming systems.

These ideas, and the open research on design patterns for AI apps, point to a few key things:

  1. Flexibility will be critical for AI engineers, as the ground will continue to shift under their feet while the stack evolves and open questions are answered.
  2. Data pipelines and orchestration capabilities will be crucial to building valuable AI apps. The key data engineering tasks remain extraction, cleansing, enrichment, and loading (into vector databases and other data stores).
  3. App frameworks will be important, and their focus will be on specific patterns like prompt chaining, memory, and chat UX. They will be most useful to developers in absorbing complexity as the different design patterns evolve.
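The data engineering tasks named in item 2 (extraction, cleansing, enrichment, and loading) can be sketched as a chain of small stages. This is a simplified illustration under our own assumptions, not Datavolo's implementation: the `Record` type, the stage functions, and the in-memory list standing in for a vector database are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List


@dataclass
class Record:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def extract(raw_docs: Iterable[str]) -> Iterator[Record]:
    """Wrap each raw document in a Record tagged with its source position."""
    for index, raw in enumerate(raw_docs):
        yield Record(text=raw, metadata={"source_id": str(index)})


def cleanse(records: Iterable[Record]) -> Iterator[Record]:
    """Collapse runs of whitespace and drop records that end up empty."""
    for record in records:
        text = " ".join(record.text.split())
        if text:
            yield Record(text=text, metadata=record.metadata)


def enrich(records: Iterable[Record]) -> Iterator[Record]:
    """Attach derived metadata; here, just a word count."""
    for record in records:
        word_count = str(len(record.text.split()))
        yield Record(record.text, {**record.metadata, "word_count": word_count})


def load(records: Iterable[Record], store: List[Record]) -> int:
    """Append records to the store (a stand-in for a vector DB) and count them."""
    count = 0
    for record in records:
        store.append(record)
        count += 1
    return count


store: List[Record] = []
raw = ["  Hello   world ", "", "Multimodal data\npipelines"]
loaded = load(enrich(cleanse(extract(raw))), store)
print(loaded, [r.text for r in store])
```

Because each stage is a generator, records stream through one at a time, and a stage can be swapped or inserted without touching the others.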

Datavolo was built to be extremely flexible, to provide the tooling data engineers need to handle multimodal data, and to integrate out of the box with the full AI stack to provide a secure, scalable, and codeless developer experience. Please reach out with your feedback; we’d love your insights on the challenges your organization has faced with multimodal data!
