Setting the Stage
There are a few times in our lives when we feel the ground shifting under our feet because of seismic shifts in technology. You know a paradigm shift is truly seismic when it leads to broader changes in society: the web, search engines, mobile, and social media all come to mind. Within tech, we’ve known about the promise and potential of machine learning for over a decade, but for broader audiences it is quite clear that the age of AI began in late 2022. Bill Gates’s piece captures the scale of this potential impact, likening the magic of his first experience with ChatGPT to his first experience with the graphical user interface decades prior.
Four key breakthroughs converged at this moment: 1) the transformer architecture, invented by Google researchers in 2017, which showed impressive success in language translation tasks; 2) the GPU innovations led by Nvidia that made training 100B+ parameter models feasible; 3) the availability of massive amounts of language data on the web and in various knowledge repositories; and 4) the first killer app that anyone could interact with, OpenAI’s ChatGPT. While multidisciplinary innovation led to this moment, I believe the availability of large-scale data, and the ability to harness it, was the most vital ingredient. This fact has been key to the defensibility of ad businesses built on prior waves of AI, like Facebook’s and Google’s, and to the moats that AI companies will seek to build in this next wave.
Why I joined Datavolo
Of course it’s natural for me to say I genuinely believe the crux of the breakthrough is tied to data, as my entire career has revolved around tackling different data challenges! After starting my career as a data engineer and working in solution architecture and sales engineering at Hortonworks, Databricks, and Unravel Data, I was most recently leading data and ML sales engineering teams at Google Cloud focused on serving enterprise software companies. Google is a special place with incredibly talented individuals and an impressive culture of creativity, and I’m so grateful for the time I spent there.
One of the biggest takeaways from my time working at cloud companies is just how important the move to the cloud has been for standardizing and simplifying enterprise architecture, thereby accelerating the innovation of software companies and their customers. Just as iOS unlocked velocity in consumer apps due to its standard form factor, the cloud has unlocked enterprise software velocity and put software companies in a better position to delight their customers.
In addition to being eager to get back to my early-stage roots, building something new, and deepening my technical focus, I knew this was a fantastic time to be at a small and agile software company during a period of such seminal change. The burgeoning AI space has moved, and will continue to move, with tremendous speed, and success requires nimbleness and close proximity to customers and to the market. This moment aligns perfectly with customers’ embrace of the cloud and managed offerings, and with the value software companies can deliver there.
Where Datavolo is headed
As an industry, we’ve been talking about the transformative potential of unstructured documents and data for over a decade, but this latest wave of breakthroughs, namely the ability of deep neural networks to empirically discover a semantic understanding of language through pre-training, is putting us in the strongest position to succeed yet.
However, AI apps that deliver against this transformative potential are proving challenging to build and to deploy. Developers’ initial instinct was to build apps directly on LLMs, in the absence of a fully fledged AI stack, but hallucinations and the need to add contextual data drove them down the rabbit hole of fine-tuning, as well as toward in-context learning patterns like RAG that anchor model output on contextual documents and data.
A great example of the distinction is Andrej Karpathy’s framing of hallucinations: “what people actually mean is they don’t want an [AI app] to hallucinate. An AI app is a lot more complex system than just the LLM itself, even if one is at the heart of it . . . using Retrieval Augmented Generation (RAG) to more strongly anchor [the model’s output] in real data through in-context learning is maybe the most common [way to mitigate hallucinations]”.
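To make the RAG pattern Karpathy describes concrete, here is a minimal sketch of its two core steps: retrieve the documents most relevant to a query, then inject them into the prompt so the model is anchored on real data. This is an illustrative toy, not Datavolo's implementation; it uses a simple bag-of-words similarity as a stand-in for the learned embedding models real systems use.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; production systems use learned
    # embedding models and a vector database instead.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, docs: list[str]) -> str:
    # In-context learning: ground the model's answer in retrieved text.
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "Apache NiFi automates data flows between systems.",
    "The transformer architecture was introduced in 2017.",
]
prompt = build_prompt("What does Apache NiFi do?", docs)
print(prompt)
```

The prompt built here would then be sent to an LLM; because the answer must come from the retrieved context rather than the model's parametric memory alone, hallucinations are mitigated in exactly the sense of the quote above.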
Even consensus around the best ways to implement recently touted design patterns like RAG is evolving extremely quickly, meaning agility for AI teams is paramount. In recent months, I saw many customers get stuck in the purgatory between idea and production AI app as they navigated these nascent design patterns. A big factor is that key layers of the stack are not production-ready and still need to be refined as the problem space becomes better understood. In particular, working with large, disparate, unstructured documents requires new data engineering capabilities. It seems clear that the data engineering needs of AI app developers, and especially of the data teams supporting them, are not being fully met.
Datavolo is going to focus on adding value in three layers of the evolving AI stack: data pipelines, orchestration, and observability & governance. We believe that users need a framework, a feature set, and repeatable patterns to build production-ready multimodal data pipelines. Our core team deeply understands the challenges of data within the enterprise, given our experience in prior waves of AI and big data innovation. Datavolo is powered by Apache NiFi, which is designed from the ground up for multimodal data flows and is simple, scalable, and secure. I can’t wait to share more as our journey continues. Please stay tuned, and please reach out with feedback on what we’re building!