High-performance vector retrieval with Pinecone

Datavolo helps Pinecone customers improve search performance and augment their vector store

Datavolo provides the industry’s only enterprise-proven platform for Generative AI data pipelines. Generative AI applications depend heavily on unstructured data, whether for model training, RAG applications, or agentic architectures. Datavolo addresses all scenarios that require secure, continuous ingestion of large-scale unstructured data.

Pinecone’s managed, serverless vector database was purpose-built for GenAI applications. Rich out-of-the-box search and filtering are paired with impressive performance and scalability, enabling teams to rapidly build out RAG solutions without needing to manage infrastructure.

Vector stores are only as good as their source data: Datavolo’s approach to processing unstructured data enables organizations to get the most out of their business data. With Datavolo, customers can exercise the full breadth of Pinecone’s capabilities by enriching their data with custom metadata and pre-computed scores. Datavolo automates everything between the extraction of data at rest and embeddings being stored in Pinecone, allowing customers to build truly operational GenAI tools.

Read about how we build RAG applications using Pinecone in our blog post, Data Engineering for Advanced RAG: Small-to-Big with Pinecone, LangChain, and Datavolo.

Example scenarios

Improve search and filtering with pre-computed values

Many agentic systems used in a business context will pull from data sourced across many different systems and in many formats. Subject matter experts within the organization often know ahead of time what sources are of the highest quality for specific applications and want to ensure these are preferred when an agent is called. Vector search can often be enhanced by providing additional pre-computed values for improved ranking or filtering. Some examples of pre-computed values are:

  • Providing a score representing the text author’s competency level, which can be measured by their post count
  • Providing the document source, allowing users to filter their searches based on the channel from which data was collected
  • Providing a score that increases with how recently the document was created. During re-ranking, this lets more recent, and therefore likely more relevant, documents rank first.
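The three kinds of pre-computed values listed above can be sketched in a few lines of Python. This is only an illustration, not Datavolo's implementation: the function names, the log-scaled author score, the `saturation` cutoff, and the exponential-decay recency score with a 30-day half-life are all assumptions chosen for the example.

```python
import math
from datetime import datetime, timezone


def author_score(post_count: int, saturation: int = 1000) -> float:
    """Map an author's post count to [0, 1] on a log scale (hypothetical heuristic)."""
    if post_count <= 0:
        return 0.0
    return min(1.0, math.log1p(post_count) / math.log1p(saturation))


def recency_score(created_at: datetime, now: datetime,
                  half_life_days: float = 30.0) -> float:
    """Exponential decay: 1.0 for a brand-new document, 0.5 after one half-life."""
    age_days = max(0.0, (now - created_at).total_seconds() / 86400.0)
    return 0.5 ** (age_days / half_life_days)


def build_metadata(source: str, post_count: int,
                   created_at: datetime, now: datetime) -> dict:
    """Bundle the pre-computed values into a flat metadata dictionary."""
    return {
        "source": source,  # channel the document was collected from
        "author_score": round(author_score(post_count), 4),
        "recency_score": round(recency_score(created_at, now), 4),
    }


if __name__ == "__main__":
    now = datetime(2024, 1, 31, tzinfo=timezone.utc)
    created = datetime(2024, 1, 1, tzinfo=timezone.utc)
    print(build_metadata("slack", 250, created, now))
```

Keeping every score in the range 0 to 1 makes it easier to blend these values with a similarity score later; the exact scaling is a design choice, not a requirement.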

Datavolo specializes in collecting, computing, and loading these types of pre-computed values in RAG flows. These are passed to Pinecone alongside vector embeddings and used to enhance a variety of usage patterns.
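As a rough sketch of how such values might travel with an embedding and be used at query time, the example below builds a record in the `id` / `values` / `metadata` shape that Pinecone upserts accept, a metadata filter using Pinecone's `$in` and `$gte` operators, and a simple client-side re-ranker. The helper names, the metadata field names, and the blending weight are hypothetical; no Pinecone client calls are made here, and matches are modeled as plain dictionaries.

```python
def make_record(chunk_id: str, embedding: list, metadata: dict) -> dict:
    """Build one record: an ID, the vector values, and the pre-computed metadata."""
    return {"id": chunk_id, "values": list(embedding), "metadata": dict(metadata)}


def source_filter(sources: list, min_author_score: float = 0.0) -> dict:
    """Metadata filter: restrict to given sources and a minimum author score.

    Multiple top-level fields are combined with an implicit AND.
    """
    return {
        "source": {"$in": list(sources)},
        "author_score": {"$gte": min_author_score},
    }


def rerank(matches: list, recency_weight: float = 0.3) -> list:
    """Re-rank matches by blending similarity with the stored recency score."""
    def blended(match: dict) -> float:
        recency = match.get("metadata", {}).get("recency_score", 0.0)
        return (1.0 - recency_weight) * match["score"] + recency_weight * recency

    return sorted(matches, key=blended, reverse=True)


if __name__ == "__main__":
    record = make_record("doc-1#0", [0.1, 0.2, 0.3],
                         {"source": "slack", "author_score": 0.8, "recency_score": 0.5})
    print(record)
    print(source_filter(["slack", "gdrive"], min_author_score=0.5))
```

The filter narrows the candidate set inside the vector store, while the re-ranker adjusts ordering after results come back; which of the two a given value feeds depends on the application.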

Multimodal CDC and document parsing

For a GenAI application to provide the best results in day-to-day usage, it needs to operate on the latest business data. It is therefore imperative that changes to document content, metadata, and permissions are captured and propagated to downstream systems quickly and automatically. Datavolo’s connectors, such as those for Google Drive and SharePoint, enable vectors in Pinecone to be updated as soon as source data changes, avoiding expensive re-indexing operations and nightly batch updates. Coupled with Datavolo’s document parsing and chunking capabilities, this amounts to a uniquely powerful Change Data Capture (CDC) stack.
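One way to picture how a change to a single document can be applied without re-indexing everything is the sketch below. It is an assumption-laden illustration rather than a description of Datavolo's connectors: chunk IDs are derived from a document ID plus a chunk position, a content hash per chunk is assumed to be tracked somewhere, and `plan_changes` is a hypothetical helper that decides which chunks to upsert and which to delete.

```python
import hashlib


def chunk_id(doc_id: str, index: int) -> str:
    """Deterministic ID so a re-parsed chunk overwrites its previous version."""
    return f"{doc_id}#{index}"


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_changes(doc_id: str, new_chunks: list, stored_hashes: dict) -> dict:
    """Compare freshly parsed chunks with what is already indexed for one document.

    stored_hashes maps chunk ID -> content hash for the chunks currently indexed.
    Returns the chunk IDs to (re-)embed and upsert, and the stale IDs to delete.
    """
    upserts = []
    seen = set()
    for index, text in enumerate(new_chunks):
        cid = chunk_id(doc_id, index)
        seen.add(cid)
        if stored_hashes.get(cid) != content_hash(text):
            upserts.append(cid)  # new or modified chunk
    deletes = sorted(cid for cid in stored_hashes if cid not in seen)
    return {"upsert": upserts, "delete": deletes}


if __name__ == "__main__":
    stored = {chunk_id("doc-1", i): content_hash(t)
              for i, t in enumerate(["intro", "pricing", "appendix"])}
    print(plan_changes("doc-1", ["intro", "new pricing"], stored))
```

Only changed chunks need new embeddings, and a deleted document is simply the case where `new_chunks` is empty. Because IDs here are position-based, inserting a chunk near the start of a document shifts every later chunk; a content-addressed ID scheme would avoid that at the cost of more bookkeeping.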

Next steps

To connect Datavolo and Pinecone, read our guide: Getting Started with Pinecone.

Then, start using Pinecone within your Runtimes with these supported processors:

  • UpsertPinecone: Publishes vectors, including metadata, and optionally text, to a Pinecone index
  • DeletePinecone: Deletes vectors from a Pinecone index
  • QueryPinecone: Queries Pinecone for vectors that are similar to the input vector, or retrieves a vector by ID