How custom code can add security risk to enterprise AI projects and LLMs

Data teams are delivering new architectures to propel AI innovation at a rapid pace. In this blog, we’ll explore how Datavolo empowers these teams to accelerate while addressing security, observability, and maintenance for their data pipelines. We’ll discuss the risks that custom code carries inside enterprises and the alternative of using a low-code platform like Datavolo, which mitigates certain risks, such as securing the supply chain of dependencies, by transferring them to the software vendor. This post outlines Datavolo’s emphasis on pipeline maintainability and security, along with how our low-code platform delivers rapid time-to-value through an extensive array of out-of-the-box processors and blueprints for multimodal data pipelines for AI.

Deleting Code

Experienced software engineers often champion code deletion, for good reason. Less code means a smaller surface area that requires maintenance, security measures, reliability checks, documentation, and testing. These tasks are crucial for software engineering, Site Reliability Engineering (SRE), and data teams to effectively manage a code base and its associated data pipelines. Additionally, as businesses evolve, the software and data abstractions that reflect the business must evolve accordingly, which presents an ongoing challenge.

Of course, deleting code is only a good thing if the business can still achieve whatever the code was intended to enable in the first place! Let’s draw a distinction between the business’s own custom code and its vendors’ code and services, from which the business can derive value. In finance, there is an axiom that risk cannot be destroyed, only transferred. In a sense, businesses pay software vendors to transfer certain risks and burdens to them: the risk of insecure software, the risk of low-quality code, the risk of unmaintained code, and more.

When it comes to ensuring software security, contemporary application security practices play a pivotal role. At Datavolo, we offer comprehensive Software Bills of Materials (SBOMs) for all our deployments, covering the dependencies and extensions integrated into our platform. Utilizing Oxeye, Datavolo runtimes can even identify and alert users about insecure dependencies in running code, including extensions.
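
To make the idea concrete, here is a minimal sketch, not Datavolo’s implementation, of a Python script that enumerates the dependencies installed in an environment and emits a bare-bones SBOM that a vulnerability scanner could consume. The field names simply follow the public CycloneDX JSON layout, and the script itself is illustrative:

import json
from importlib.metadata import distributions

def build_minimal_sbom() -> dict:
    # Collect every installed Python distribution as a CycloneDX-style component.
    components = []
    for dist in distributions():
        name = dist.metadata["Name"]
        version = dist.version
        components.append({
            "type": "library",
            "name": name,
            "version": version,
            # Package URLs (purls) let scanners match components against advisory databases.
            "purl": f"pkg:pypi/{name.lower()}@{version}",
        })
    return {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "version": 1,
        "components": components,
    }

if __name__ == "__main__":
    print(json.dumps(build_minimal_sbom(), indent=2))

A document like this, generated at build time and shipped with every release, is what allows a scanner to flag a vulnerable dependency without ever touching the source tree.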

Non-functional aspects like security, scalability, flexibility, and observability are often overlooked when teams prioritize delivering new code for urgent business needs. As the code base expands, the repercussions of not adhering to best practices become more significant. In our experience, a large number of engineering challenges and escalations stem directly from poorly written custom code. Ideally, the code you don’t write is the code where your vendor has found a best practice and served it up to you in their service or library!

Technical Debt

Even well-written code will deteriorate in quality over time when not maintained. This is akin to a second law of thermodynamics for code: software must evolve alongside the business and surrounding systems, or it will degrade. Unmaintained code and legacy architectures often contribute to technical debt, a long-term maintenance burden that consumes engineering resources, thereby impeding team velocity.

The key takeaway is that enterprises must balance time-to-value with technical debt. Most software tools sold to the enterprise promise higher velocity and faster time-to-value, but what often hangs in the balance is massive technical debt and shadow IT projects spawned by frustration from the business. The result can be substantial spending to maintain legacy code and services, stifling innovation.

The majority of a codebase’s lifespan occurs after its initial creation. In large organizations, many engineers spend a significant portion of their time grappling with legacy codebases, reviewing and rectifying low-quality code written years ago. While time-to-value is paramount during initial delivery, ongoing maintenance and reliable service operation dominate the rest of the lifespan.

The Alternative

Instead of crafting custom code for new data engineering applications, teams can opt for established data engineering platforms and collaborate with vendors that can inventory the important risks. At Datavolo, our team has been helping data engineers solve complex problems within the Apache NiFi community for almost a decade. We’ve curated a set of patterns and best practices, offering them as processors and templates to help data teams achieve their goals. For building multimodal data pipelines, we provide engineers with over 300 processors for extracting, chunking, transforming, and loading multimodal data for AI use cases. Alongside being secure, scalable, and user-friendly, Datavolo offers the flexibility to seamlessly swap APIs and modify transformations, sources, destinations, and models. Datavolo users can efficiently reuse modular code, fostering collaboration and preventing redundant effort.
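
For a sense of what one of these building blocks involves when a team writes it by hand, here is a minimal sketch of a text-chunking processor using the Apache NiFi 2 Python processor API. The class name, chunk size, and splitting logic are purely illustrative and this is not one of Datavolo’s shipped processors:

import json
from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult

class NaiveTextChunker(FlowFileTransform):
    class Java:
        implements = ['org.apache.nifi.python.processor.FlowFileTransform']

    class ProcessorDetails:
        version = '0.0.1'
        description = 'Splits incoming text into fixed-size chunks for downstream embedding.'

    def __init__(self, **kwargs):
        super().__init__()

    def transform(self, context, flowfile):
        # Read the FlowFile content and split it into fixed-size character chunks.
        text = flowfile.getContentsAsBytes().decode('utf-8')
        chunk_size = 1000
        chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        # Emit newline-delimited JSON so downstream processors (embedding, vector
        # store loading) can consume the chunks one record at a time.
        contents = '\n'.join(json.dumps({'chunk_index': i, 'text': c}) for i, c in enumerate(chunks))
        return FlowFileTransformResult(relationship='success',
                                       contents=contents,
                                       attributes={'chunk.count': str(len(chunks))})

Even a processor this small carries the maintenance obligations discussed above, including tests, dependency updates, and error handling for malformed input, which is exactly the burden the out-of-the-box processors are meant to absorb.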

Datavolo is a platform equipped with a wide range of out-of-the-box processors and patterns for implementing data engineering pipelines for AI use cases. Our aim at Datavolo is to become the trusted partner capable of assuming risks associated with insecure software, low-quality code, and unmaintained code. We welcome the opportunity to establish that trust with your organization. Please don’t hesitate to reach out if you’d like to discuss further!
