You did it! You finally led the charge and persuaded your boss to let your team start working on a new generative AI application at work, and you’re psyched to get started.
You get your data and start the ingestion process, but right when you think you’ve nailed it, the first results come back and, let’s just say, they’re less than ideal. (You know it’s bad when you ask your company’s brand new chatbot to create a pitch for itself and it ends up talking about your competitors’ features.)
Many people think it’s easy to get data and push embeddings into a vector database, but when you go to production there are several things you need to consider if you want your project to be a success.
Today, we’re going to discuss how to solve this problem and what data ingestion for generative AI use cases actually requires.
Efficient parsing and chunking
If you are going to succeed at ingesting files into your new GenAI system, you want to make sure you have efficient parsing and chunking in place to prevent AI hallucinations and to generate high quality answers. That said, there’s a lot more that goes into parsing and chunking than you might think.
Variables to think about include:
- The type of document (PDF, slides, docs, etc.)
- The verbosity of the information (a detailed white paper with charts versus a simple one-pager)
- The semantic category (a formally approved document, a Slack discussion, etc.)
- The data representation (image, table, paragraph, etc.)
Each of these requires different parsing, chunking, and post-processing strategies if you want optimal answers. Here at Datavolo, we provide specific services to ensure optimal results when extracting data from unstructured documents, using different ML services for layout detection, tabular data extraction, and more.
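To make that concrete, here’s a minimal sketch of routing each file to a format-specific parsing strategy. The parser functions are hypothetical placeholders, not Datavolo components; in a real pipeline they would call layout-detection or table-extraction services suited to each format.

```python
from pathlib import Path
from typing import Callable

# Hypothetical format-specific parsers -- each would call a layout-detection
# or table-extraction service appropriate for the format.
def parse_pdf(path: str) -> str:
    raise NotImplementedError("plug in a PDF layout/table extraction service here")

def parse_slides(path: str) -> str:
    raise NotImplementedError("plug in a slide-deck extraction service here")

PARSERS: dict[str, Callable[[str], str]] = {
    ".pdf": parse_pdf,
    ".pptx": parse_slides,
}

def parse_document(path: str) -> str:
    """Route a file to the parsing strategy that matches its format."""
    suffix = Path(path).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"no parsing strategy registered for {suffix!r}")
    return PARSERS[suffix](path)
```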
Remember, bigger (as in a larger number of tokens) does not always mean better when it comes to working with LLMs. Making sure your data is chunked into relevant pieces of information of an optimal size is crucial (and often cheaper!).
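Here’s a minimal sketch of size-aware chunking, assuming a rough four-characters-per-token heuristic; a production pipeline would use the embedding model’s real tokenizer and smarter boundaries (sections, sentences, tables kept intact).

```python
def chunk_text(text: str, max_tokens: int = 512, chars_per_token: int = 4) -> list[str]:
    """Greedily pack paragraphs into chunks under an approximate token budget."""
    max_chars = max_tokens * chars_per_token
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = (current + "\n\n" + paragraph).strip() if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # Paragraphs larger than the budget get hard-split as a fallback.
        while len(paragraph) > max_chars:
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        current = paragraph
    if current:
        chunks.append(current)
    return chunks
```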
PII / anonymization capabilities
To comply with specific regulatory requirements, you may need to automatically detect PII and anonymize the data, especially when interacting with cloud-based solutions such as OpenAI.
Datavolo provides these capabilities with unique processors that can easily be added to the data ingestion flow. More broadly, entity recognition that drives enrichment and other forms of post-processing is often important for increasing retrieval efficiency.
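As a simplified illustration of the anonymization step (a few intentionally naive regex patterns stand in for a real PII-detection model), detected entities can be replaced with typed placeholders before anything leaves your environment:

```python
import re

# Naive regex patterns standing in for a real PII-detection model.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def anonymize(text: str) -> str:
    """Replace detected PII with typed placeholders before embedding the text
    or sending it to an external LLM provider."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

print(anonymize("Contact Jane at jane.doe@example.com or 555-867-5309."))
# -> "Contact Jane at <EMAIL> or <PHONE>."
```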
CDC on source file systems to deal with permissions and updates
One of the biggest concerns we have heard from many of our clients is all about permissions. How do we make sure that we don’t give the wrong people access to too much data?
If two individuals ask the same question but don’t have the same level of access to specific documents, they shouldn’t get the same answer.
In other words, the answer to my question should be based only on documents that I have access to (as well as auxiliary data used for enrichment).
That means permissions need to be preserved alongside the vectors and chunks so that retrieval in the vector database takes them into account. On top of that, any change to a document’s permissions should be propagated to its associated chunks and vectors.
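Here’s a minimal sketch of what permission-aware retrieval can look like, assuming each chunk was stored with an allowed_groups metadata field at ingestion time (the search call and filter syntax below are illustrative, not any specific database’s API):

```python
def retrieve_for_user(query_vector: list[float], user_groups: list[str], store, top_k: int = 5):
    """Retrieve only chunks whose ACL metadata intersects the caller's groups.

    `store.search` is an illustrative stand-in: most vector databases expose
    some form of metadata filter that can express "allowed_groups contains
    any of the user's groups".
    """
    return store.search(
        vector=query_vector,
        top_k=top_k,
        filter={"allowed_groups": {"$in": user_groups}},
    )
```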
Datavolo provides change data capture (CDC) capabilities on source systems (including files from sources like SharePoint, Google Drive, etc.) to help with this requirement.
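And here’s the propagation side of the same idea: when a CDC event reports that a document’s permissions changed, the ACL metadata on every derived chunk gets updated (update_metadata is an illustrative stand-in for whatever bulk metadata update your vector database provides):

```python
def on_permission_change(doc_id: str, new_allowed_groups: list[str], store) -> None:
    """React to a CDC event by updating the ACL metadata on every chunk
    derived from the document, so retrieval filters stay correct."""
    store.update_metadata(
        filter={"doc_id": doc_id},
        metadata={"allowed_groups": new_allowed_groups},
    )
```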
Don’t duplicate data everywhere | Don’t overload your vector database
When building a quick demo, your first instinct is probably to store the vector and the associated chunk together in the same vector database.
But not only does that mean you’re replicating your entire dataset into the vector database (storage cost), it also causes serious issues on the database when scaling to a large number of documents.
Here at Datavolo, some techniques we offer include storing the chunks in another location and keeping a pointer to that location in the vector’s metadata, or associating the chunk’s offset range with the vector so that no data is duplicated at all.
This saves on storage costs, makes it much easier to scale, and keeps retrieval efficient.
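As a rough sketch of the pointer approach (the record shape is illustrative, not a specific database’s schema), each vector only carries a reference to where its chunk lives:

```python
import hashlib

def make_vector_record(embedding: list[float], doc_uri: str, start: int, end: int) -> dict:
    """Build a vector record that references its chunk instead of storing the text.

    Only the source URI and the chunk's character offsets go into the metadata;
    the chunk text stays in its original store (object storage, document store,
    etc.) and is fetched at answer time.
    """
    chunk_id = hashlib.sha256(f"{doc_uri}:{start}-{end}".encode()).hexdigest()[:16]
    return {
        "id": chunk_id,
        "values": embedding,
        "metadata": {"doc_uri": doc_uri, "start_offset": start, "end_offset": end},
    }
```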
Lineage to track answers to original documents
One of the biggest challenges with generative AI platforms is finding out where the platform got the data it used in its answers. It seems like a small thing, but it can make a big difference when trying to confirm whether information generated by an AI is a hallucination or not.
Plus, it’s really frustrating if you’re using generative AI to navigate thousands of pages of documents and you can’t cite your sources.
Fortunately, a key capability of NiFi is its lineage, built on the provenance data that is generated for anything going through NiFi at every single point of the ingestion pipeline.
In other words, NiFi was designed to make it easy to cite the sources of all unstructured data at each stage of the ingestion process.
This makes it super easy to debug hallucinated answers and improve the pipeline.
From an answer, we know which chunks were used to generate it. We can then trace back in NiFi to the original document each chunk came from, visualize all of the processing applied to that file (data extraction, annotation, anonymization, chunking, etc.), and pinpoint the places in the pipeline where things should be fine-tuned.
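Outside of NiFi’s provenance repository itself, the same idea can be sketched at the application layer: as long as every chunk carries lineage metadata written at ingestion time, an answer can point straight back to its source documents (the field names below are illustrative):

```python
def cite_sources(retrieved_chunks: list[dict]) -> list[str]:
    """Turn retrieved chunks into human-readable citations, assuming each chunk
    carries lineage metadata (source URI, page, character offsets)."""
    citations = []
    for chunk in retrieved_chunks:
        meta = chunk["metadata"]
        citations.append(
            f"{meta['doc_uri']} (page {meta.get('page', '?')}, "
            f"chars {meta['start_offset']}-{meta['end_offset']})"
        )
    return sorted(set(citations))
```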
Flexibility to switch or combine different embedding models, LLMs, and vector databases
When it comes to generative AI, you want to be able to make changes and updates quickly and easily, especially since new LLMs and features are released every week. It’s absolutely essential to have a solution where it’s easy to swap out any dependency in the ingestion pipeline.
With NiFi, it’s super easy to change the source systems data is collected from, to go from one embedding model to another, or to test multiple vector databases in parallel to compare performance.
And with Datavolo’s flow designer, it takes a few seconds to switch from one solution to another, or to support multiple solutions in the same pipeline for a period of time to compare new models without disrupting what is already in production.
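Conceptually, that flexibility comes from writing the ingestion step against interfaces rather than concrete services. A minimal sketch (the Protocol definitions and the example variables in the comments are hypothetical):

```python
from typing import Protocol

class Embedder(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class VectorStore(Protocol):
    def upsert(self, vectors: list[list[float]], metadata: list[dict]) -> None: ...

def ingest(chunks: list[str], metadata: list[dict], embedder: Embedder, store: VectorStore) -> None:
    """Ingestion written against interfaces, so the embedding model or vector
    database can be swapped (or run side by side) without touching pipeline logic."""
    store.upsert(embedder.embed(chunks), metadata)

# Comparing two vector databases in parallel is just two calls, e.g.:
# ingest(chunks, metadata, embedder=openai_embedder, store=pinecone_store)
# ingest(chunks, metadata, embedder=openai_embedder, store=weaviate_store)
```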
Moreover, Datavolo maintains integrations throughout the AI ecosystem, leading to less maintenance work for users like you.
Conclusion
Mastering data ingestion is the key to unlocking the full potential of your generative AI pipeline. From efficient parsing and chunking to ensuring privacy with PII anonymization, maintaining permissions, and preventing data duplication, every step plays a crucial role in delivering high-quality, scalable results.
Here at Datavolo, we are committed to providing flexible, cutting edge solutions that empower your team to overcome these challenges so you can focus on building innovative AI applications that drive real value for your organization.
Interested in learning more about how Datavolo can help you become a 10x data engineer for AI? Sign up for a demo here.