How to Package and Deploy Python Processors for Apache NiFi

Introduction

Support for Processors in native Python is one of the most notable new features in Apache NiFi 2. Each milestone version of NiFi 2.0.0 has enhanced Python integration, with milestone 3 introducing support for loading Python Processors from NiFi Archive files. NiFi 2.0.0-M3 aligns Python Processor loading with Java component loading, which provides a solid foundation for scalable extensibility.

Announcing the Hatch Datavolo NAR Plugin

To accompany the release of Apache NiFi 2.0.0-M3, Datavolo published the Hatch Datavolo NAR project, which provides a builder plugin for the Hatch project management tool. As a project of the Python Packaging Authority, Hatch supports building, managing, and publishing Python components. With a few additions to a Python project configuration, the Hatch Datavolo NAR plugin not only builds NiFi Archives, but also packages Python dependencies for subsequent deployment. Hatch supports continuous integration and delivery, making the Datavolo NAR plugin a natural fit for building maintainable solutions with Apache NiFi.

Hatch is not alone in the world of Python project management solutions, but with support for major operating systems, best practices for development lifecycle operations, and extensibility for additional features, it provides a straightforward solution for developing libraries and applications. Getting started with Hatch is easy with its template-based project creation command.

Packaging Custom Python Processors

Configuring a project with the Hatch Datavolo NAR plugin for packaging Python Processors involves straightforward updates to the project configuration.

A single Hatch command creates a new project for Python Processors.

hatch new processors

The command creates a project directory and prints the structure as follows.

processors
├── src
│   └── processors
│       ├── __about__.py
│       └── __init__.py
├── tests
│   └── __init__.py
├── LICENSE.txt
├── README.md
└── pyproject.toml

The pyproject.toml configuration includes default values and placeholders for project metadata.

Adding hatch-datavolo-nar to the list of required libraries for the project build system enables the nar target argument for the hatch build command.

[build-system]
requires = ["hatchling", "hatch-datavolo-nar"]

The project configuration also requires a target section listing the package directory containing Python Processor classes.

[tool.hatch.build.targets.nar]
packages = ["src/processors"]

The last step required before creating the Python Processor class itself is defining dependencies. The following configuration enables packaging and using the Python HTTP requests library.

[project]
dependencies = ["requests"]

The Apache NiFi Python Developer’s Guide provides examples of custom Processor classes to get started.
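As a rough illustration of what such a class might look like, here is a minimal sketch in the style of the Developer's Guide examples. The UppercaseContent class, its description, and the content.case attribute are hypothetical. The try/except fallback is not part of NiFi: it only defines local stand-ins so the module can be imported and unit tested outside the NiFi runtime, where the nifiapi package is not available.

```python
try:
    # Provided by the NiFi Python runtime.
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Local stand-ins (an assumption for testing, not part of NiFi).
    class FlowFileTransform:
        pass

    class FlowFileTransformResult:
        def __init__(self, relationship, attributes=None, contents=None):
            self.relationship = relationship
            self.attributes = attributes
            self.contents = contents


class UppercaseContent(FlowFileTransform):
    """Hypothetical example Processor that upper-cases FlowFile content."""

    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1"
        description = "Converts FlowFile content to upper case."

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        # Read the incoming content, transform it, and route to success.
        text = flowfile.getContentsAsBytes().decode("utf-8")
        return FlowFileTransformResult(
            relationship="success",
            contents=text.upper().encode("utf-8"),
            attributes={"content.case": "upper"},
        )
```

A file such as src/processors/uppercase_content.py (the name is an assumption) would place the class inside the package directory listed in the nar target section.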

Running the hatch build command with the nar target downloads declared project dependencies for packaging together with custom Python Processors.

hatch build --target nar

The build command creates a versioned NAR in the dist directory. Copy the NAR file to the extensions directory of an Apache NiFi installation to start building a custom data pipeline.
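The copy step can be scripted for repeatable local deployments. The sketch below is one possible approach, not part of Hatch or the plugin: deploy_latest_nar is a hypothetical helper that picks the most recently modified .nar file in the build output directory and copies it into the target directory.

```python
import shutil
from pathlib import Path


def deploy_latest_nar(dist_dir: Path, extensions_dir: Path) -> Path:
    """Copy the most recently modified .nar file from dist_dir into extensions_dir."""
    # Sort by modification time so the newest build is last.
    nars = sorted(dist_dir.glob("*.nar"), key=lambda p: p.stat().st_mtime)
    if not nars:
        raise FileNotFoundError(f"no .nar files found in {dist_dir}")

    # Create the target directory if needed, then copy the file with its metadata.
    extensions_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(nars[-1], extensions_dir))
```

For example, deploy_latest_nar(Path("dist"), Path("/opt/nifi/extensions")) would copy the newest build, where /opt/nifi is an assumed installation path that should be adjusted to the local environment.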

Conclusion

The Hatch Datavolo NAR project is open sourced under the Apache License Version 2.0. The source code is available in the hatch-datavolo-nar project on GitHub. Datavolo publishes project releases to the Python Package Index, using Hatch for build automation.

The NAR plugin for Hatch enables shift-left security for Python Processors. The project highlights Datavolo’s commitment to enterprise security for data pipeline development and deployment. Packaging code and dependencies enables scanning to reduce the risks surrounding custom code, and supports repeatable deployments based on software development best practices.
