
Field CTO Perspectives: Why Datavolo and Why Now?

Setting the Stage

There are a few times in our lives when we feel the ground shifting under our feet due to seismic shifts in technology. You know these paradigm shifts are truly seismic when they lead to broader changes in society: the web, search engines, mobile, and social media all come to mind. Within tech, we’ve known about the promise and potential of machine learning for over a decade, but I think it’s quite clear that for broader audiences the age of AI began in late 2022. Bill Gates’s piece captures the scale of this potential impact, describing the magic of his first experience with ChatGPT as reminiscent of his first encounter with the graphical user interface decades prior.

Four key breakthroughs converged at this moment: 1) the transformer architecture, invented by Google researchers in 2017, which showed impressive success in language translation tasks; 2) the GPU innovations led by Nvidia that supported the training of 100B+ parameter models; 3) the availability of massive amounts of language data on the web and in various knowledge repositories; and 4) the first killer app that anyone could interact with, OpenAI’s ChatGPT. While multidisciplinary innovation led to this moment, I believe the availability of large-scale data and the ability to harness it was the most vital ingredient. This fact has been key to the defensibility of ad businesses built on prior waves of AI, like Facebook and Google, and to the moats that AI companies will seek to build in this next wave.

Why I joined Datavolo

Of course it’s natural for me to say I genuinely believe the crux of the breakthrough is tied to data, as my entire career has revolved around tackling different data challenges! After starting my career as a data engineer and working in solution architecture and sales engineering at Hortonworks, Databricks, and Unravel Data, I was most recently leading data and ML sales engineering teams at Google Cloud focused on serving enterprise software companies. Google is a special place with incredibly talented individuals and an impressive culture of creativity, and I’m so grateful for the time I spent there.

One of the biggest takeaways from my time working at cloud companies is just how important the move to cloud has been for standardizing and simplifying enterprise architecture, thereby accelerating the innovation of software companies and their customers. Just as iOS unlocked velocity in consumer apps due to its standard form factor, the cloud has unlocked enterprise software velocity and put software companies in a better position to delight their customers.

In addition to being eager to get back to my early-stage roots, building something new, and deepening my technical focus, I knew this was a fantastic time to be at a small, agile software company amid such seminal change. The burgeoning AI space has moved, and will continue to move, with tremendous speed, and success requires nimbleness and close proximity to customers and the market. This moment aligns perfectly with customers’ embrace of cloud and managed offerings and the value software companies can deliver there.

Where Datavolo is headed

As an industry, we’ve been talking about the transformative potential of unstructured documents and data for over a decade, but this latest wave of breakthroughs, namely the ability of deep neural networks to empirically discover a semantic understanding of language through pre-training, is putting us in the strongest position yet to succeed.

However, AI apps that deliver against this transformative potential are proving challenging to build and to deploy. Developers’ initial instinct, in the absence of a fully fledged AI stack, was to build apps directly on LLMs, but hallucinations and the need to add contextual data drove them down the rabbit hole of fine-tuning, as well as toward in-context learning patterns like RAG that anchor model output on contextual documents and data.

A great example is Andrej Karpathy’s framing of hallucinations, which distinguishes the LLM from the AI app built around it: “what people actually mean is they don’t want an [AI app] to hallucinate. An AI app is a lot more complex system than just the LLM itself, even if one is at the heart of it . . . using Retrieval Augmented Generation (RAG) to more strongly anchor [the model’s output] in real data through in-context learning is maybe the most common [way to mitigate hallucinations]”.
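To make the in-context learning pattern in that quote concrete, here is a minimal sketch of a RAG flow: retrieve the document chunks most similar to the question, then anchor the model’s answer on them. The embed and generate callables are hypothetical placeholders for an embedding model and an LLM client, not any specific API referenced in this post.

# Minimal RAG sketch (illustrative only): retrieve relevant chunks,
# then anchor the LLM's answer on them via in-context learning.
# `embed` and `generate` are hypothetical stand-ins for an embedding
# model and an LLM API; a real implementation would swap them out.
from typing import Callable, List, Tuple
import math

def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(question: str,
             corpus: List[Tuple[str, List[float]]],
             embed: Callable[[str], List[float]],
             k: int = 3) -> List[str]:
    """Rank pre-embedded chunks by similarity to the question."""
    q_vec = embed(question)
    ranked = sorted(corpus, key=lambda item: cosine(q_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def answer(question: str,
           corpus: List[Tuple[str, List[float]]],
           embed: Callable[[str], List[float]],
           generate: Callable[[str], str]) -> str:
    """Anchor the model's output on retrieved context (in-context learning)."""
    context = "\n\n".join(retrieve(question, corpus, embed))
    prompt = (
        "Answer using only the context below. If the answer is not "
        f"in the context, say so.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)

In practice the retrieval step would be backed by a vector store rather than a brute-force scan, but the shape of the pattern is the same: ground the prompt in retrieved data before generation.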

Even consensus around the best ways to implement recently touted design patterns like RAG is evolving extremely quickly, meaning agility for AI teams is paramount. In recent months, I saw many customers get stuck in the purgatory between idea and production AI app as they navigated these nascent design patterns. A big factor is that key layers of the stack are not production-ready and still need to be refined as the problem space becomes better understood. In particular, working with large, disparate, unstructured documents requires new data engineering capabilities. It seems clear that the data engineering needs of AI app developers, and especially of the data teams supporting them, are not being fully met.

Datavolo is going to focus on adding value in three layers of the evolving AI stack: data pipelines, orchestration, and observability & governance. We believe that users need a framework, a feature set, and repeatable patterns to build production-ready multimodal data pipelines. Our core team deeply understands the challenges of data within the enterprise, given our experience in prior waves of AI and big data innovation. Datavolo is powered by Apache NiFi, which is designed from the ground up for multimodal data flows and is simple, scalable, and secure. I can’t wait to share more as our journey continues. Please stay tuned, and please reach out with feedback on what we’re building!
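As a rough illustration of what a multimodal data pipeline for unstructured documents involves, here is a sketch of the extract, chunk, embed, and load stages, with simple per-chunk provenance to hint at the observability layer. The extract, embed, and load callables are hypothetical placeholders; this is not Datavolo or Apache NiFi code.

# Illustrative pipeline sketch: extract text from unstructured documents,
# chunk it, embed it, and load it into a vector store, carrying simple
# provenance metadata with each chunk. The extract/embed/load callables
# are hypothetical placeholders, not Datavolo or Apache NiFi APIs.
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List

@dataclass
class Record:
    doc_id: str
    text: str
    provenance: Dict[str, str] = field(default_factory=dict)

def chunk(record: Record, size: int = 800) -> List[Record]:
    """Split one document into fixed-size chunks, preserving provenance."""
    return [
        Record(f"{record.doc_id}#{i}", record.text[start:start + size],
               {**record.provenance, "parent": record.doc_id})
        for i, start in enumerate(range(0, len(record.text), size))
    ]

def run_pipeline(paths: Iterable[str],
                 extract: Callable[[str], str],
                 embed: Callable[[str], List[float]],
                 load: Callable[[Record, List[float]], None]) -> int:
    """Extract -> chunk -> embed -> load; returns the number of chunks written."""
    written = 0
    for path in paths:
        doc = Record(path, extract(path), {"source": path})
        for piece in chunk(doc):
            load(piece, embed(piece.text))
            written += 1
    return written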
