NiFi FlowGen Improvements at Datavolo (already!)

In the week since Datavolo released its Flow Generation capability, we’ve seen fantastic adoption as users eagerly request flows from the Flow Generation bot. We’re excited to share that we have recently upgraded our models, enhancing both the power and accuracy of flow generation, and introduced several key new features.

One of the most notable improvements is the enhanced accuracy of the model. Specifically, we have refined the process of selecting the most relevant Processors for sources and sinks. Furthermore, we have fine-tuned the logic for identifying the appropriate Controller Service for your specific use case.

The significant improvements in accuracy alone have justified the release of a new model version. The flow requests we’ve received through Slack have been incredibly insightful and have led us to enable additional capabilities. For instance, you can now request the Flow Generator to create a flow that utilizes NiFi’s Stateless Execution Engine. This engine offers various runtime trade-offs, most notably shifting the “transactional boundary” from the Processor level to the Process Group level. This allows for the consumption of messages from durable stores like Apache Kafka, JMS, or Amazon Kinesis without acknowledging the messages until processing is complete. Consequently, messages will be redelivered in case of processing failures.
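For illustration, here is a minimal sketch of how a flow might be switched to the Stateless engine programmatically. This is not the bot’s output, just one way to picture the setting it configures: it assumes a NiFi 2.x instance whose Process Group configuration exposes an executionEngine field, and the base URL and group id are hypothetical (authentication and TLS handling are omitted).

```python
# A minimal sketch, assuming NiFi 2.x where a Process Group's configuration
# includes an "executionEngine" field. URL and id below are hypothetical.
import requests

NIFI = "https://localhost:8443/nifi-api"  # hypothetical instance; auth omitted
group_id = "abcd-1234"                    # hypothetical Process Group id

# Fetch the current entity so the update can carry the correct revision.
entity = requests.get(f"{NIFI}/process-groups/{group_id}").json()

# With the Stateless engine, the whole group becomes the transactional
# boundary: a Kafka/JMS/Kinesis source is not acknowledged until every
# Processor in the group has completed successfully.
entity["component"]["executionEngine"] = "STATELESS"

requests.put(
    f"{NIFI}/process-groups/{group_id}",
    json={"revision": entity["revision"], "component": entity["component"]},
)
```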

Having NiFi process a single FlowFile at a time is another capability that can be achieved through several approaches, and the Flow Generation bot can now handle it for you on request. For example, you might ask it to “Create a flow that <insert processing logic>… Only process one file at a time.”
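One common way to achieve this in NiFi, sketched here under the same hypothetical setup as above, is to restrict the Process Group’s FlowFile Concurrency so that only one FlowFile is admitted at a time:

```python
# A sketch of the "one file at a time" setting; ids and URL are hypothetical.
import requests

NIFI = "https://localhost:8443/nifi-api"  # hypothetical instance; auth omitted
group_id = "abcd-1234"                    # hypothetical Process Group id

entity = requests.get(f"{NIFI}/process-groups/{group_id}").json()

# Admit a single FlowFile into the group (per node) and let it fully exit
# before the next one enters, so the contained flow handles one file at a time.
entity["component"]["flowfileConcurrency"] = "SINGLE_FLOWFILE_PER_NODE"
entity["component"]["flowfileOutboundPolicy"] = "STREAM_WHEN_AVAILABLE"

requests.put(
    f"{NIFI}/process-groups/{group_id}",
    json={"revision": entity["revision"], "component": entity["component"]},
)
```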

At Datavolo, we are dedicated to generating the best and most accurate flows possible. However, we acknowledge that anything produced with Generative AI may contain inaccuracies, and that accuracy can vary under different circumstances. We have enhanced our bot to make those circumstances more transparent.

If the bot cannot find a suitable Processor for a specific task, we now convey this information, along with helpful insights, in the Slack message. In other cases, the model will select a Processor and indicate in the NiFi flow that this particular Processor should be carefully reviewed. For instance, consider a scenario where there is a typo in your message, and you ask the bot to “Generate a flow that fetches data from S3, flumps the data, and then sends it to GCS.”

The model may omit the step mentioning “flumping the data” and include a warning in the message, such as “The term ‘flump’ is not a standard data processing term, and it is not clear what specific transformation it refers to. Assuming it means a generic transformation or processing, we can use a processor like JoltTransformRecord or ScriptedTransformRecord to apply the required transformation.” Alternatively, it may insert a JoltTransformRecord Processor and prominently label that Processor with the same warning.

Even without typos, there may be situations where the model’s confidence is low: for instance, if you ask it to send data to an endpoint it is not familiar with, or to perform a transformation it is uncertain about.

While flow generation is a powerful capability on its own, the ability to swiftly identify and highlight areas that require particular attention translates into even faster time to production!

We are thrilled not only to offer this capability but also to see so many users embrace it eagerly and to watch improvements emerge rapidly. If you haven’t already, we invite you to join our Slack Community and experience this capability for yourself!
