Field CTO Perspectives: Why Datavolo and Why Now?

Setting the Stage

There are a few times in our lives when we feel the ground shifting under our feet because of seismic shifts in technology. You know a paradigm shift is truly seismic when it leads to broader changes in society: the web, search engines, mobile, and social media all come to mind. Within tech, we’ve known about the promise and potential of machine learning for over a decade, but I think it’s quite clear that for broader audiences the age of AI began in late 2022. Bill Gates’s piece captures the scale of this potential impact, describing the magic of his first experience with ChatGPT as reminiscent of his first experience with the graphical user interface decades prior.

Four key breakthroughs converged at this moment: 1) the transformer architecture, invented by Google researchers in 2017, which showed impressive success in language translation tasks; 2) the GPU innovations led by Nvidia that supported the training of 100B+ parameter models; 3) the availability of massive amounts of language data on the web and in various knowledge repositories; and 4) the first killer app that anyone could interact with, OpenAI’s ChatGPT. While multidisciplinary innovation led to this moment, I believe the availability of large-scale data, and the ability to harness it, was the most vital ingredient. This fact has been key to the defensibility of ad businesses built on prior waves of AI, like those of Facebook and Google, and to the moats that AI companies will seek to build in this next wave.

Why I joined Datavolo

Of course it’s natural for me to say I genuinely believe the crux of the breakthrough is tied to data, as my entire career has revolved around tackling different data challenges! After starting my career as a data engineer and working in solution architecture and sales engineering at Hortonworks, Databricks, and Unravel Data, I was most recently leading data and ML sales engineering teams at Google Cloud focused on serving enterprise software companies. Google is a special place with incredibly talented individuals and an impressive culture of creativity, and I’m so grateful for the time I spent there.

One of the biggest takeaways from my time working at cloud companies is just how important the move to the cloud has been for standardizing and simplifying enterprise architecture, thereby accelerating innovation for software companies and their customers. Just as iOS unlocked velocity in consumer apps due to its standard form factor, the cloud has unlocked enterprise software velocity and put software companies in a better position to delight their customers.

In addition to being eager to get back to my early-stage roots, build something new, and deepen my technical focus, I knew this was a fantastic time to be at a small, agile software company during a period of such seminal change. The burgeoning AI space has moved, and will continue to move, with tremendous speed, and success requires nimbleness and close proximity to customers and the market. This moment aligns perfectly with customers’ embrace of cloud and managed offerings and the value software companies can deliver there.

Where Datavolo is headed

As an industry, we’ve been talking about the transformative potential of unstructured documents and data for over a decade, but this latest wave of breakthroughs, namely the ability of deep neural networks to empirically discover a semantic understanding of language through pre-training, is putting us in the strongest position yet to succeed.

However, AI apps that deliver against this transformative potential are proving challenging to build and to deploy. Developers’ initial instinct, in the absence of a fully fledged AI stack, was to build apps directly on LLMs, but hallucinations and the need to add contextual data drove them down the rabbit hole of fine-tuning, as well as toward in-context learning patterns like RAG that anchor model output on contextual documents and data.

Andrej Karpathy’s framing of hallucinations captures this distinction between the LLM and the app built around it: “what people actually mean is they don’t want an [AI app] to hallucinate. An AI app is a lot more complex system than just the LLM itself, even if one is at the heart of it . . . using Retrieval Augmented Generation (RAG) to more strongly anchor [the model’s output] in real data through in-context learning is maybe the most common [way to mitigate hallucinations]”.
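To make that in-context learning idea concrete, here is a minimal, self-contained sketch of the RAG pattern: retrieve the most relevant snippets from a small in-memory corpus and prepend them to the prompt so the model is asked to answer from supplied context rather than from its weights alone. The toy bag-of-words retriever, the sample documents, and the prompt wording are illustrative assumptions, not Datavolo’s implementation.

```python
import re
from collections import Counter
from math import sqrt

# Toy corpus standing in for an enterprise document store (hypothetical content).
DOCUMENTS = [
    "Invoices must be approved by the finance team within five business days.",
    "The travel policy reimburses economy airfare and standard hotel rates.",
    "Security incidents should be reported to the on-call engineer immediately.",
]

def bag_of_words(text: str) -> Counter:
    """Rough term-frequency vector; a real system would use learned embeddings."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, k: int = 2) -> list[str]:
    """Rank documents by similarity to the question and keep the top k."""
    q = bag_of_words(question)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_prompt(question: str) -> str:
    """In-context learning: anchor the model on retrieved context, not just its weights."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question))
    return (
        "Answer using only the context below. If the answer is not there, say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

if __name__ == "__main__":
    # The assembled prompt would be sent to an LLM of your choice;
    # printing it here simply shows how the answer gets grounded.
    print(build_prompt("How quickly do invoices need to be approved?"))
```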

Even consensus around the best ways to implement recently touted design patterns like RAG is evolving extremely quickly, which makes agility paramount for AI teams. In recent months, I saw many customers get stuck in the purgatory between a promising idea and a production AI app as they navigated these nascent design patterns. A big factor is that key layers of the stack are not production-ready and still need to be refined as the problem space becomes better understood. In particular, working with large, disparate, unstructured documents requires new data engineering capabilities. It seems clear that the data engineering needs of AI app developers, and especially of the data teams supporting them, are not being fully met.
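As one small example of what those new data engineering capabilities look like in practice, the sketch below splits a large extracted document into overlapping chunks and attaches lineage metadata so each chunk can be traced back to its source. The field names, chunk sizes, and sample path are illustrative assumptions rather than a prescribed schema.

```python
from dataclasses import dataclass, field
from hashlib import sha256

@dataclass
class Chunk:
    """A retrievable unit of an unstructured document, with lineage metadata."""
    text: str
    source: str      # where the chunk came from, e.g. a file path or URI
    position: int    # ordinal position of the chunk within the document
    checksum: str = field(init=False)

    def __post_init__(self) -> None:
        # Checksums let downstream systems detect stale or duplicated chunks.
        self.checksum = sha256(self.text.encode("utf-8")).hexdigest()[:12]

def chunk_document(text: str, source: str, size: int = 500, overlap: int = 50) -> list[Chunk]:
    """Split a document into overlapping windows so context isn't cut mid-thought."""
    chunks, start, position = [], 0, 0
    while start < len(text):
        piece = text[start : start + size]
        chunks.append(Chunk(text=piece, source=source, position=position))
        start += size - overlap
        position += 1
    return chunks

if __name__ == "__main__":
    # Stand-in for text extracted from a long contract, PDF, or wiki export.
    sample = "A long contract or PDF extraction would go here. " * 40
    for c in chunk_document(sample, source="contracts/acme-msa.pdf")[:3]:
        print(c.position, c.checksum, c.text[:40] + "...")
```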

Datavolo is going to focus on adding value in three layers of the evolving AI stack: data pipelines, orchestration, and observability & governance. We believe users need a framework, a feature set, and repeatable patterns to build production-ready multimodal data pipelines. Our core team deeply understands the challenges of data within the enterprise, given our experience in prior waves of AI and big data innovation. Datavolo is powered by Apache NiFi, which is designed from the ground up for multimodal data flows and is simple, scalable, and secure. I can’t wait to share more as our journey continues. Please stay tuned, and please reach out with feedback on what we’re building!
