Data Ingestion Concerns for GenAI Pipelines

You did it! You finally led the charge and persuaded your boss to let your team start working on a new generative AI application at work and you’re psyched to get started.

You get your data and start the ingestion process, but right when you think you’ve nailed it, your first results come back and, let’s just say, they’re less than ideal. (You know it’s bad when you ask your company’s brand new chatbot to create a pitch for itself and it ends up talking about your competitors’ features.)

Many people think it’s easy to get data and push embeddings into a vector database, but when you go to production, there are several things you need to consider if you want your project to be a success.

Today, we’re going to discuss how to solve this problem and what data ingestion looks like for generative AI use cases.

Efficient parsing and chunking

If you are going to succeed at ingesting files into your new GenAI system, you want efficient parsing and chunking in place to prevent AI hallucinations and to generate high-quality answers. That said, there’s a lot more that goes into parsing and chunking than you might think.

Variables to think about include:

  • The type of document (PDF, slides, docs, etc.)
  • The verbosity of the information (a detailed white paper with charts versus a simple one-pager)
  • The semantic category (a formally approved document, a Slack discussion, etc.)
  • The data representation (image, table, paragraph, etc.)

Each of these requires different parsing, chunking, and post-processing strategies if you want optimal answers. Here at Datavolo, we provide specific services to ensure optimal results when extracting data from unstructured documents, using different ML services for layout detection, tabular data extraction, etc.

Remember, bigger (as in more tokens) does not always mean better when it comes to working with LLMs. Making sure your data is chunked into relevant pieces of information with optimal sizes is crucial (and often cheaper!).
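To make that concrete, here is a minimal sketch of size-aware chunking with overlap. It uses a naive whitespace word count as a stand-in for a real tokenizer, and fixed sizes where a production pipeline would choose a strategy per document type and layout:

    def chunk_text(text: str, max_tokens: int = 256, overlap: int = 32) -> list[str]:
        """Split text into overlapping chunks of roughly max_tokens words."""
        # Naive whitespace split; a real pipeline would use the embedding
        # model's tokenizer and respect semantic boundaries (headings,
        # paragraphs, table cells, etc.).
        words = text.split()
        step = max_tokens - overlap
        chunks = []
        for start in range(0, len(words), step):
            piece = " ".join(words[start:start + max_tokens])
            if piece:
                chunks.append(piece)
        return chunks

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side.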

PII / anonymization capabilities

To comply with specific regulatory requirements, you may need to automatically detect personally identifiable information (PII) and anonymize the data, especially when sending it to cloud-based services such as OpenAI.

Datavolo provides these capabilities with unique processors that can easily be added into the data ingestion flow. More broadly, entity recognition that drives enrichment and other forms of post-processing is often important for increasing retrieval efficiency.
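As an illustration, here is a minimal sketch of redaction using regular expressions for emails and US-style phone numbers. Treat it as a stand-in for the ML-based entity recognition a real flow would use:

    import re

    # Hand-written patterns for illustration only; production flows
    # typically rely on an ML-based entity recognizer instead.
    PII_PATTERNS = {
        "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    }

    def anonymize(text: str) -> str:
        # Replace each detected entity with a typed placeholder so the
        # text stays useful for embedding without leaking PII.
        for label, pattern in PII_PATTERNS.items():
            text = pattern.sub(f"[{label}]", text)
        return text

    anonymize("Reach Jane at jane.doe@example.com or 555-123-4567.")
    # -> 'Reach Jane at [EMAIL] or [PHONE].'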

CDC on source file systems to deal with permissions and updates

One of the biggest concerns we have heard from many of our clients is all about permissions. How do we make sure that we don’t give access to too much data to the wrong people?

If two different individuals are asking the same question, but they don’t have the same level of access to specific documents, they shouldn’t get the same answer. 

In other words, the answer to my question should be based only on data coming from documents that I have access to (as well as auxiliary data used for enrichment).

This means permissions need to be preserved alongside the vectors and chunks so that retrieval in the vector database takes permissions into account. On top of that, any change to a document’s permissions should be propagated to its associated chunks and vectors.

Datavolo provides change data capture (CDC) capabilities on source systems (including files from sources like SharePoint, Google Drive, etc.) to help with this requirement.
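Here is a minimal sketch of what permission-aware storage and retrieval can look like. The allowed_groups field, the Mongo-style $in filter, and the index client are all hypothetical stand-ins for whatever your vector database actually supports:

    def permission_filter(user_groups: set[str]) -> dict:
        # Restrict retrieval to chunks whose ACL intersects the
        # querying user's groups.
        return {"allowed_groups": {"$in": sorted(user_groups)}}

    def on_permission_change(index, doc_id: str, new_groups: set[str]) -> None:
        # CDC hook: when a source document's ACL changes, propagate the
        # update to every chunk/vector derived from that document.
        index.update_metadata(
            filter={"doc_id": doc_id},
            values={"allowed_groups": sorted(new_groups)},
        )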

Don’t duplicate data everywhere (and don’t overload your vector database)

When building a quick demo, your first instinct will likely be to store each vector and its associated chunk together in the same vector database.

But not only does that mean you’re replicating your entire dataset into the vector database (storage cost), it will also cause huge issues for the database when scaling to a large number of documents.

Here at Datavolo, some techniques we offer include storing the chunks in another location and keeping a pointer to that location in the vector’s metadata, or associating the chunk’s offset range with the vector so that no data is duplicated at all.

This helps save costs, makes it much easier to scale, and is much more efficient.
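A minimal sketch of the pointer approach, assuming chunks live in object storage and a hypothetical read_bytes helper fetches raw content by URI:

    def to_vector_record(chunk_id: str, embedding: list[float],
                         doc_uri: str, start: int, end: int) -> dict:
        # Store only the pointer (source URI plus offset range) in the
        # vector's metadata, not the chunk text itself.
        return {
            "id": chunk_id,
            "vector": embedding,
            "metadata": {"source_uri": doc_uri,
                         "char_start": start, "char_end": end},
        }

    def fetch_chunk_text(metadata: dict, read_bytes) -> str:
        # At retrieval time, resolve the pointer back to the original
        # text instead of duplicating it in the vector database.
        text = read_bytes(metadata["source_uri"]).decode("utf-8")
        return text[metadata["char_start"]:metadata["char_end"]]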

Lineage to track answers to original documents

One of the biggest challenges with generative AI platforms is finding out where the platform got the data used in its answers. It seems like a small thing, but it can make a big difference when trying to confirm whether information generated by an AI is a hallucination.

Plus, it’s really frustrating if you’re using generative AI to navigate thousands of pages of documents and you can’t cite your sources.

Fortunately, a key capability of NiFi is its lineage, based on the provenance data generated for any data going through NiFi at every single point of the ingestion pipeline.

Put simply, NiFi was designed to make it easy to cite the sources for all unstructured data at each stage of the ingestion process.

This makes it super easy to debug hallucinated answers and improve the pipeline. 

From an answer, we can know which chunks were used to generate it. We can then go back in NiFi to the original document each chunk came from, visualize all of the processing applied to that file (data extraction, annotation, anonymization, chunking, etc.), and pinpoint the places in the pipeline where things should be fine-tuned.
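To illustrate the idea outside of NiFi (which records this provenance automatically), here is a hypothetical sketch of chunk-level lineage metadata and how it maps an answer back to source documents:

    # Illustrative records only; NiFi generates equivalent provenance
    # events automatically for every piece of data in the flow.
    lineage = {
        "doc-42#chunk-7": {
            "source_doc": "s3://corpus/doc-42.pdf",
            "stages": ["parse", "anonymize", "chunk", "embed"],
        },
    }

    def cite_sources(answer_chunk_ids: list[str], lineage: dict) -> list[str]:
        # Map the chunks used in an answer back to their original documents.
        return sorted({lineage[c]["source_doc"] for c in answer_chunk_ids})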

Flexibility to switch or combine different embedding models, LLMs, and vector databases

When it comes to generative AI, you want to be able to make changes and updates quickly and easily, especially since new LLMs and features are released every week. It’s absolutely essential to have a solution where it’s easy to swap any dependency in the ingestion pipeline for another.

With NiFi, it’s super easy to change the source systems data is collected from, to go from one embedding model to another, or to test multiple vector databases in parallel to compare performance.

And with Datavolo’s flow designer, it takes a few seconds to switch from one solution to another, or to run multiple solutions in the same pipeline for a while, comparing new models without disrupting what is already in production.
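The same idea in code: a minimal sketch of keeping the pipeline embedder-agnostic behind a small interface, so providers can be swapped or run side by side. The Embedder protocol and the index clients’ upsert method are assumptions, not any specific vendor API:

    from typing import Protocol

    class Embedder(Protocol):
        def embed(self, texts: list[str]) -> list[list[float]]: ...

    def ingest(chunks: list[str], embedders: dict[str, Embedder],
               indexes: dict) -> None:
        # Fan the same chunks out to several models/databases in
        # parallel to compare retrieval quality before committing.
        for name, embedder in embedders.items():
            vectors = embedder.embed(chunks)
            indexes[name].upsert(list(zip(chunks, vectors)))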

Moreover, Datavolo will maintain integrations throughout the AI ecosystem, leading to less maintenance work for users like you.

Conclusion

Mastering data ingestion is the key to unlocking the full potential of your generative AI pipeline. From efficient parsing and chunking to ensuring privacy with PII anonymization, maintaining permissions, and preventing data duplication, every step plays a crucial role in delivering high-quality, scalable results.

Here at Datavolo, we are committed to providing flexible, cutting-edge solutions that empower your team to overcome these challenges so you can focus on building innovative AI applications that drive real value for your organization.

Interested in learning more about how Datavolo can make you a 10x data engineer for AI? Sign up for a demo here.
