How to Package and Deploy Python Processors for Apache NiFi

Introduction

Support for Processors in native Python is one of the most notable new features in Apache NiFi 2. Each milestone version of NiFi 2.0.0 has enhanced Python integration, with milestone 3 introducing support for loading Python Processors from NiFi Archive files. NiFi 2.0.0-M3 aligns Python Processor loading with Java component loading, which provides a solid foundation for scalable extensibility.

Announcing the Hatch Datavolo NAR Plugin

To accompany the release of Apache NiFi 2.0.0-M3, Datavolo published the Hatch Datavolo NAR project, which provides a builder plugin to the Hatch project management tool. As a project of the Python Packaging Authority, Hatch supports building, managing, and publishing Python components. With a few additions to a Python project configuration, the Hatch Datavolo NAR plugin not only builds NiFi Archives, but also packages Python dependencies for subsequent deployment. Hatch supports continuous integration and delivery, making the Datavolo NAR plugin a natural fit for building maintainable solutions with Apache NiFi.

Hatch is not alone in the world of Python project management solutions, but with support for major operating systems, best practices for development lifecycle operations, and extensibility for additional features, it provides a straightforward solution for developing libraries and applications. Getting started with Hatch is easy with its template-based project creation command.

Packaging Custom Python Processors

Configuring a project with the Hatch Datavolo NAR plugin for packaging Python Processors involves straightforward updates to the project configuration.

A single Hatch command creates a new project for Python Processors.

hatch new processors

The command creates a project directory and prints the structure as follows.

processors
├── src
│   └── processors
│       ├── __about__.py
│       └── __init__.py
├── tests
│   └── __init__.py
├── LICENSE.txt
├── README.md
└── pyproject.toml

The pyproject.toml configuration includes default values and placeholders for project metadata.

Adding hatch-datavolo-nar to the list of required libraries for the project build system enables the nar target argument for the hatch build command.

[build-system]
requires = ["hatchling", "hatch-datavolo-nar"]

The project configuration also requires a target section listing the package directory containing Python Processor classes.

[tool.hatch.build.targets.nar]
packages = ["src/processors"]

The last step required before creating the Python Processor class itself is defining dependencies. The following configuration enables packaging and using the Python HTTP requests library.

[project]
dependencies = ["requests"]

The Apache NiFi Python Developer’s Guide provides examples of custom Processor classes to get started.
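For orientation, a Processor module placed under src/processors might look like the sketch below, patterned on the FlowFileTransform examples in that guide. The class name UppercaseContent, its version, and its description are made up for illustration. The fallback classes in the except branch are stand-ins, added only so the sketch can be imported and tried outside NiFi, and can be dropped from a real module.

```python
try:
    # Provided by the NiFi Python runtime when the Processor is loaded.
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Stand-ins (assumption) so the sketch can run outside NiFi.
    class FlowFileTransform:
        pass

    class FlowFileTransformResult:
        def __init__(self, relationship, attributes=None, contents=None):
            self.relationship = relationship
            self.attributes = attributes
            self.contents = contents


class UppercaseContent(FlowFileTransform):
    """Hypothetical Processor that upper-cases FlowFile text content."""

    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1"
        description = "Converts FlowFile text content to upper case."

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        # Read the incoming content, transform it, and route to success.
        text = flowfile.getContentsAsBytes().decode("utf-8")
        return FlowFileTransformResult(
            relationship="success",
            contents=text.upper(),
            attributes={"transformed": "true"},
        )
```

The sketch avoids network calls so that it stays easy to exercise locally. A Processor that uses the requests library would import it in the same module and rely on the packaged dependency at runtime.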

Running the hatch build command with the nar target downloads declared project dependencies for packaging together with custom Python Processors.

hatch build --target nar

The build command creates a versioned NAR in the dist directory. Copy the NAR file to the extensions directory of an Apache NiFi installation to start building a custom data pipeline.
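Because a NAR is a ZIP-format archive, the copy step can be scripted along with a quick look at what was packaged. The following is a minimal sketch using only the Python standard library. The function name deploy_nar is hypothetical, and the paths passed to it depend on the project version and the local NiFi installation.

```python
import shutil
import sys
import zipfile
from pathlib import Path


def deploy_nar(nar_path, extensions_dir):
    """Copy a NAR into an extensions directory and return its entry names.

    Hypothetical helper: reading the entry list first also confirms that
    the file is a valid ZIP-format archive before it is copied.
    """
    nar_path = Path(nar_path)
    extensions_dir = Path(extensions_dir)

    with zipfile.ZipFile(nar_path) as archive:
        names = archive.namelist()

    extensions_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(nar_path, extensions_dir / nar_path.name)
    return names


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python deploy_nar.py <path to NAR in dist> <NiFi extensions directory>
    for entry in deploy_nar(sys.argv[1], sys.argv[2]):
        print(entry)
```

Listing the entries before copying makes it easy to confirm that both the Processor package and the declared dependencies ended up in the archive.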

Conclusion

The Hatch Datavolo NAR project is open sourced under the Apache License Version 2.0. The source code is available in the hatch-datavolo-nar project on GitHub. Datavolo publishes project releases to the Python Package Index, using Hatch for build automation.

The NAR plugin for Hatch enables shift-left security for Python Processors. The project highlights Datavolo’s commitment to enterprise security for data pipeline development and deployment. Packaging code together with its dependencies allows both to be scanned before deployment, which reduces the risks surrounding custom code and supports repeatable deployments based on software development best practices.
