Putting the Data in Databricks

Datavolo solves a common problem for Databricks users: how to securely and continuously ingest enterprise unstructured data.

Apache NiFi User Interface

Datavolo provides the industry’s only enterprise-proven platform for generative AI data pipelines. Generative AI applications are uniquely dependent on unstructured data, whether for model training, RAG applications, or agentic architectures. Datavolo solves for every scenario requiring the secure, continuous ingestion of large-scale unstructured data.

Databricks provides industry-leading software that helps enterprises build, scale, and govern data and AI. Thanks to its flexible and open data layer, Databricks customers can easily adopt and deploy the latest open source AI models, including ones created by Databricks itself. Furthermore, Databricks has invested in components, such as a vector store, that help customers rapidly build the latest GenAI applications.

However, while Databricks has a long history of partners (and an acquisition) providing tools for the secure, scalable ingestion of structured data, a significant challenge remains: the continuous, automated, and secure ingestion of unstructured data. Datavolo solves that challenge.

Based on Databricks users’ feedback, we have architected for two common scenarios. First, many organizations simply need an enterprise-wide ingestion framework for the secure, governed flow of unstructured data into Databricks’ Unity Catalog and Delta Lake. Second, some organizations want to take it a step further and need the capability to extract, parse, chunk, and vectorize that data into Databricks Vector Search for consumption by DBRX or other open source LLMs.

Scenario #1 – Continuous ingestion of unstructured data into Databricks

Many organizations prefer the ease and simplicity that Databricks offers for preparing data for consumption by LLMs. These organizations use and trust Databricks for all of their data processing, and a one-stop shop for preparation and consumption by AI models is very appealing. In this scenario, the gap to be closed is the acquisition of unstructured data across the enterprise and its continuous, secure, and traceable ingestion into Databricks. Datavolo handles this with ease; this use case has been a core competency of our engineering teams since the very genesis of our underlying open source platform, Apache NiFi.
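As a minimal sketch of what the delivery end of such a flow does (not Datavolo’s actual implementation), the snippet below lands documents in a Unity Catalog volume through the Databricks Files API. It assumes the databricks-sdk Python package and a configured workspace profile; the volume path and source directory are illustrative.

```python
# Minimal sketch: land unstructured documents in a Unity Catalog volume.
# Assumes the databricks-sdk package and a configured workspace profile;
# the volume path and source directory are illustrative, not Datavolo
# defaults.
import io
from pathlib import Path

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()  # picks up auth from env vars or ~/.databrickscfg

VOLUME_ROOT = "/Volumes/main/raw/unstructured"  # hypothetical UC volume


def ingest_file(local_path: Path) -> None:
    """Upload one document, keeping its file name for traceability."""
    dest = f"{VOLUME_ROOT}/{local_path.name}"
    w.files.upload(dest, io.BytesIO(local_path.read_bytes()), overwrite=True)


for pdf in Path("/data/contracts").glob("*.pdf"):  # example source folder
    ingest_file(pdf)
```

A production flow adds the pieces a snippet like this omits: scheduled or event-driven pickup, provenance tracking, retries, and governed credentials.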

Scenario #2 – Ingestion as well as AI-specific transformations in the unstructured data pipeline

For many of our customers, another degree of power and flexibility is required in their unstructured data pipelines. The range of transformations needed to go from an unstructured document, such as a PDF, to a structured representation of that document involves many steps, each of which must be performed in a manner complementary to the optimal consumption of that data. From a simple PDF you need to extract the structure of the document, the narrative text, the tabular information from tables, and the semantics and intent of charts and embedded images. That refinement involves parsing with computer vision, LLMs for synthesis, named entity recognition for sanitization and enrichment, structural and semantic chunking, and of course embedding generation; a sketch of the core sequence follows below.

Our customers consistently tell us that composability is key: they want to leverage our out-of-the-box solutions but also plug in their own steps or alternatives, including Cloud Service Provider offerings that may be best of breed for a certain type of data. Datavolo pipelines make it easy to compose and swap components at any time.
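As an illustrative sketch only, the core extract, chunk, and embed sequence for a single PDF might look like the following. Here pypdf and the embed placeholder stand in for whichever parser and embedding model a given pipeline composes; they are not Datavolo internals.

```python
# Illustrative extract -> chunk -> embed sequence for one PDF.
# pypdf and embed() are stand-ins for whatever parser and embedding
# model a pipeline composes; they are not Datavolo internals.
from pypdf import PdfReader


def extract_text(path: str) -> str:
    """Pull the narrative text, page by page, out of a PDF."""
    reader = PdfReader(path)
    return "\n\n".join(page.extract_text() or "" for page in reader.pages)


def chunk(text: str, max_chars: int = 1000) -> list[str]:
    """Naive structural chunking on paragraph boundaries."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks


def embed(chunks: list[str]) -> list[list[float]]:
    """Placeholder: call the embedding endpoint of your choice here."""
    raise NotImplementedError("plug in an embedding model")


vectors = embed(chunk(extract_text("contract.pdf")))
```

Composability in this picture means each function is a swappable stage: a computer vision parser can replace extract_text, or a semantic chunker can replace the paragraph splitter, without touching the rest of the pipeline.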

Finally, the current state of AI systems requires quite a bit of experimentation to evaluate the right strategies for an application’s desired outcome. Datavolo allows data engineers to rapidly iterate, even in parallel, on different strategies for steps such as parsing or chunking, and to automate the evaluation of those processing strategies. Datavolo also supports a wide continuum of deployment scenarios: another segment of customers leverages Datavolo for data acquisition and extraction in lower-cost, secured on-premises environments, and Datavolo then orchestrates the flow of data and metadata from those environments into Databricks, which powers the AI pipelines.
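A hypothetical sketch of that kind of side-by-side experiment: two chunking strategies run in parallel over the same documents and are scored with a toy metric. Both the strategies and the metric are placeholders, not Datavolo APIs; a real evaluation would score downstream retrieval quality.

```python
# Hypothetical sketch: compare two chunking strategies in parallel.
# The strategies and the scoring metric are illustrative placeholders.
from concurrent.futures import ThreadPoolExecutor


def chunk_fixed(text: str) -> list[str]:
    """Fixed-size chunks of ~500 characters."""
    return [text[i:i + 500] for i in range(0, len(text), 500)]


def chunk_paragraph(text: str) -> list[str]:
    """Chunks split on paragraph boundaries."""
    return [p for p in text.split("\n\n") if p.strip()]


def score(fn) -> float:
    """Toy metric (average chunk length); swap in retrieval quality."""
    chunks = [c for d in DOCS for c in fn(d)]
    return sum(map(len, chunks)) / max(len(chunks), 1)


DOCS = ["First paragraph.\n\nSecond paragraph with more detail.\n\n" * 20]

STRATEGIES = {"fixed": chunk_fixed, "paragraph": chunk_paragraph}
with ThreadPoolExecutor() as pool:
    results = dict(zip(STRATEGIES, pool.map(score, STRATEGIES.values())))
print(results)  # strategy name -> average chunk length
```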

Conclusion

Whether you need simply the continuous ingestion of unstructured data or the full processing required by modern AI systems, Datavolo puts the data into Databricks.