At the enterprise level today, every company is a data company and the only constant is change itself; Generative AI is only the latest compelling reminder. Legacy systems keep the business of today running while organizations continuously work to deliver more value to their customers through better services and technologies, an effort often referred to as ‘modernization’. Most enterprises have critical data systems at the edge, in one or more datacenters, and across one or more cloud service providers, often spread across the globe. The hybrid cloud enterprise is here and growing. To maximize the value of this data, these systems need to be connected, and the data itself must flow from its various points of origin to the systems that can extract and enhance its value, and on to the systems that store it and support query, dashboards, reporting, and more. This ongoing interplay between modernization efforts and legacy systems drives the constancy of change, and powerful data pipelines that are up to this task are a critical enabler of enterprise success.
Born for this
In late 2014, the National Security Agency released the now open source software known as Apache NiFi. For the NSA, NiFi is central to enabling this kind of change: a powerful data pipeline capability that ensures the agency's often unstructured and multimodal data gets to the right systems in time to maximize its value. Before being open sourced, NiFi benefited from nearly 8 years of development in a high-scale, high-stakes environment, and it was originally designed around three main motivations:
- Build data pipelines that are highly adaptable
- Make them modifiable in a self-service fashion by the largest possible base of users
- Provide a full chain of custody for every piece of data, end to end
Let’s discuss each of these motivations in a bit more detail.
The case for dynamic and flexible data pipelines in the enterprise
If every company is a data company and change is constant, then data pipelines across the enterprise must adapt to rapid changes in requirements, operational models, and failure conditions. NiFi was designed from the ground up for data of any type, both structured data and what we today call unstructured or multimodal data. The data NiFi needed to handle could be small in terms of bytes or large, as in single objects many gigabytes in size, and could arrive at rates ranging from hundreds of objects per day to millions per second. That data might be an audio or video stream, a raw signal captured by a sensor, deeply nested hierarchical JSON or XML, text-based log entries, or highly structured database rows and records.
However, handling this diversity isn’t enough. At the simplest level we tend to think of data pipelines as they’re initially designed and built. But the reality is that enterprises are continuously coping with a changing world and changing variables. The volume, frequency, quality, protocols, and systems involved in a data pipeline constantly evolve and NiFi enables users to cope with this change along many different dimensions, dynamically at runtime.
User experience tailored to the unique skill sets of various personas
At any given enterprise there is often a wide range of personnel who can understand and articulate the requirements for a data pipeline: business analysts, data engineers, data scientists, system administrators, IT specialists, and more. Yet often only a small subset of these people are proficient programmers, and smaller still is the group skilled in a particular data pipeline system. NiFi’s entire model is designed to democratize the ability to create, understand, and manipulate live data flows through its interactive command and control model. Users with programming skills can write code or work with the API programmatically, while users who don’t know how to code, or simply prefer not to, can use NiFi’s codeless interface. These users can articulate their requirements in a simple drag-and-drop interface, building powerful directed graphs of data flow much as they would sketch them on a whiteboard. This dramatically lowers the barrier to entry and increases the user base that can build, maintain, and evolve these vital data pipelines.
Chain of custody is critical for observability
The data-driven enterprise must understand the origin and attribution of its data. People usually come to this need through security and compliance, and those are certainly important. But even more fundamental are things like tracing the source of data quality issues, understanding where the highest- and lowest-value data comes from, or simply knowing that a particular set of data already exists and how to tap into it rather than building yet another pipeline to the same source. NiFi naturally captures and creates rich metadata about the data it handles in its pipelines. NiFi’s data provenance, or chain of custody, tracks the where, when, and how of every piece of data flowing through its pipelines.
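To make the idea of chain of custody concrete, here is a minimal Python sketch of what a provenance event and a lineage query might look like. The field names and event types are illustrative assumptions, not NiFi's actual provenance schema or API; NiFi records and exposes this metadata through its own facilities.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List

@dataclass
class ProvenanceEvent:
    """One link in a chain of custody: what happened to a piece of data, where, and when."""
    event_type: str                 # e.g. RECEIVE, TRANSFORM, SEND (illustrative values)
    component: str                  # the pipeline step that acted on the data
    data_id: str                    # identifier of the data object being tracked
    timestamp: datetime
    details: Dict[str, str] = field(default_factory=dict)

def lineage(events: List[ProvenanceEvent], data_id: str) -> List[ProvenanceEvent]:
    """Reconstruct the end-to-end history of a single piece of data, oldest first."""
    return sorted((e for e in events if e.data_id == data_id), key=lambda e: e.timestamp)

# Example: trace one hypothetical document from ingest to delivery.
now = datetime.now(timezone.utc)
events = [
    ProvenanceEvent("RECEIVE", "IngestFromObjectStore", "doc-42", now, {"source": "report.pdf"}),
    ProvenanceEvent("TRANSFORM", "ExtractText", "doc-42", now, {"pages": "12"}),
    ProvenanceEvent("SEND", "PublishToKafka", "doc-42", now, {"topic": "documents"}),
]
for e in lineage(events, "doc-42"):
    print(e.timestamp.isoformat(), e.event_type, e.component, e.details)
```

Even this toy version shows why provenance matters: given only the data identifier, you can answer where a record came from, which steps touched it, and where it was delivered.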
Support for structured data and ETL use cases as well
While the early years of NiFi were about unstructured and multimodal data, its following 8 years in the open source world were largely about structured data as we’d see in and around the Big Data space. During that time NiFi was extended and configured to cover common enterprise cases like Change Data Capture and ETL or ELT to and from relational databases. These NiFi pipelines tend to interact heavily with messaging systems like Kafka, structured data in cloud data lakes such as Databricks Delta and Apache Iceberg, data warehouses such as Snowflake, and various databases such as MS SQL Server, Postgres, Oracle, and others. And of course many of the pipelines involve getting data to and from Cloud Service Provider services across Google Cloud, Microsoft Azure, and Amazon Web Services. In this time the open source NiFi community has received code contributions from more than 400 people, and thousands of companies worldwide, across every industry, use NiFi for their data pipelines. NiFi powers pipelines for cybersecurity and observability as well as for structured and semi-structured data in industries such as financial services, telecommunications, retail, automotive, healthcare, and others.
Helping the enterprise realize its massive opportunity with unstructured data
We’ve all heard for a long time that unstructured data is the largest and fastest growing share of data at many enterprises, yet extracting its value at scale has remained elusive. In 2022 a sea change took place as Generative AI powered by Large Language Models moved from a primarily research phase to a direct business value phase, and the promise is exciting and enormous. The primary fuel for Generative AI is unstructured and multimodal data, and data pipelines will be essential to moving these use cases into production. Examples of unstructured data for such use cases include PowerPoint, Excel, PDF, and text files, along with images, audio, and video; the multimodal aspect comes into play when you consider how all of these disparate types of data might be necessary to fully understand the essence of a particular thing.
Data pipelines for Generative AI are concerned with a variety of things, starting with how to capture unstructured data from sources that are often hard to reach, and that reference large objects which don’t necessarily fit on an event stream or in a database table and which choke existing ETL-style systems. Once you’ve automated the capture of unstructured data, you need to identify the best way to chunk and split it, extract structured data from it, and create embeddings for search and retrieval. The techniques and models for doing so are evolving rapidly and can greatly impact the quality of LLM responses; finding the optimal approach requires careful evaluation and depends on the use case.
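As a rough illustration of the chunk-and-embed step, here is a minimal Python sketch. Both the fixed-size chunking strategy and the `embed` function are placeholders of our own (the embedding here is a deterministic hash-based stand-in, not a real model); in a production pipeline these steps would run inside the data flow and call an actual embedding model chosen for the use case.

```python
from typing import List
import hashlib
import math

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split a document into fixed-size, overlapping character windows.
    Real pipelines often chunk on sentence or semantic boundaries instead."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

def embed(chunk: str, dims: int = 8) -> List[float]:
    """Placeholder embedding: a normalized, hash-derived vector.
    In practice this would call an embedding model (hosted API or local)."""
    digest = hashlib.sha256(chunk.encode("utf-8")).digest()
    vec = [b / 255.0 for b in digest[:dims]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

document = "..."  # text previously extracted from a PDF, slide deck, transcript, etc.
records = [{"chunk": c, "embedding": embed(c)} for c in chunk_text(document)]
# Each record would then be written to a vector store for later retrieval.
```

Swapping in a different chunking strategy or embedding model changes only these two functions, which is exactly why this stage benefits from a pipeline that can be reconfigured without rewriting everything downstream.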
Today, one of the more important patterns evolving within in-context learning for LLMs is Retrieval Augmented Generation (RAG). So far we’ve described only the very first step of RAG: getting your enterprise documents and data into the context of the LLM. Often this context is retrieved from a vector store and then used to augment the user’s prompt. Many additional aspects need to be managed in a pipeline and orchestrated across source documents, associated vector stores, and LLMs, including single- and multi-agent approaches and more. The entire space is evolving rapidly, but it is safe to say the same concerns of the past will apply, such as how to ensure the security of proprietary data and how to avoid vendor lock-in.
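Continuing the sketch above, the retrieval and prompt-augmentation half of RAG might look like the following. This is a simplified illustration under the same assumptions as before (it reuses the hypothetical `embed` function and `records` list); a real deployment would delegate similarity search to a vector store and then call the chosen LLM with the augmented prompt.

```python
from typing import Dict, List

def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_vec: List[float], records: List[Dict], top_k: int = 3) -> List[str]:
    """Rank stored chunks by similarity to the query embedding and keep the best few.
    In practice a vector store's search API performs this step."""
    ranked = sorted(records, key=lambda r: cosine(query_vec, r["embedding"]), reverse=True)
    return [r["chunk"] for r in ranked[:top_k]]

def augment_prompt(question: str, context_chunks: List[str]) -> str:
    """Prepend the retrieved enterprise context to the user's question before calling the LLM."""
    context = "\n\n".join(context_chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# Example (building on the earlier sketch):
# prompt = augment_prompt("What caused the Q3 delays?", retrieve(embed("Q3 delays"), records))
# The augmented prompt is then sent to the LLM of choice.
```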
Datavolo to automate production-quality pipelines for unstructured and multimodal data
Given this context, Datavolo has launched with a very clear goal: to help organizations get Generative AI to production by automating their unstructured and multimodal data pipelines, with a solution that is highly adaptable, self-service for a wide range of users, and equipped with the necessary security and governance capabilities. Datavolo will provide cloud-native solutions powered by open source Apache NiFi, which is purpose-built for this moment that Generative AI presents. We look forward to taking this journey with you.