You know how this conversation goes. There’s a new file workflow to automate.
Maybe you need to drop vendor invoices into an S3 bucket.
Or maybe you are moving processed reports from a staging server to a partner SFTP endpoint every night at 2 a.m.
Or maybe customers are uploading receipts for covered services.
Someone on the team says, “We can just write a script for that.” And they’re right. You can.
It’ll take a few hours, maybe a day. The script will work.
Until it doesn’t…
This is an age-old story of how organizations accumulate technical debt. Specifically, script debt. The slow, invisible accumulation of operational risk that builds up every time someone automates a file workflow with a shell script and a cron job.
Individually, each script looks like a good simple solution. Collectively, they become a brittle, undocumented, unmonitored infrastructure layer that sits underneath your most critical business operations and waits for the worst possible moment to fail.
And when it does fail, nobody knows why. Often, the person who wrote it isn’t around anymore. And if they are, they don’t remember what it does or how it does it.
Sometimes nobody even notices that it failed.
Scripts Are Infrastructure. Treat Them That Way.
The core problem isn’t that scripts are written. The problem is that they are written as simple, tactical kludges but deployed as part of your application’s core infrastructure.
A script that runs every night at 2 a.m. moving files to a partner endpoint isn’t a utility. It’s a business-critical workflow. The distinction matters enormously when something goes wrong.
Infrastructure requires observability. It requires failure handling. It requires documentation, ownership, and a recovery path. Scripts, as typically written in most business environments, have none of these things.
They have a happy path, and they have hope. Nothing more.
And, of course, the happy path will break someday. The remote endpoint will become temporarily unreachable. The file will arrive in an unexpected format. The server will run out of disk space mid-transfer.
When the happy path isn’t so happy anymore, most scripts do one of two things:
1. They silently do nothing, or
2. They crash and send an error message to a log that nobody is watching
Either outcome is a problem. Neither is acceptable for business-critical workflows.
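To make those two failure modes concrete, here is a minimal Python sketch (the same pattern applies to shell). `move_files_defensively` and its `alert` callback are hypothetical names of my own; the point is only that failures get counted and surfaced instead of swallowed:

```python
import shutil
from pathlib import Path

def move_files_happy_path(src: Path, dst: Path) -> None:
    # The typical script: no checks, no alerts. If dst is unreachable
    # or a file is locked, this crashes to a log nobody reads; if src
    # is unexpectedly empty, it silently does nothing.
    for f in list(src.glob("*.csv")):
        shutil.move(str(f), str(dst / f.name))

def move_files_defensively(src: Path, dst: Path, alert) -> int:
    # Same job, but per-file failures are collected and a suspicious
    # run (errors, or nothing moved at all) triggers the alert callback.
    moved = 0
    errors = []
    for f in list(src.glob("*.csv")):
        try:
            shutil.move(str(f), str(dst / f.name))
            moved += 1
        except OSError as exc:
            errors.append((f.name, exc))
    if errors or moved == 0:
        alert(f"transfer problem: moved={moved}, errors={len(errors)}")
    return moved
```

Note that the defensive version still has to decide what “suspicious” means; an empty source directory may be normal for some workflows, which is exactly the kind of judgment that belongs in a documented definition rather than in one engineer’s memory.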
But that’s the infrastructure you’ve built.
The Hidden Cost: What You Can’t See
There are very real hidden costs associated with these scripts. The cost isn’t just the time spent debugging when things go wrong, though that cost is real. It’s also the costs that never show up anywhere, because they’re the absence of something rather than the presence of something that failed.
Consider these examples:
- A file transfer fails silently at 2:47 a.m. The downstream system that expected that file spends the next six hours processing with missing data. No alert fires. No human notices until a business analyst asks why yesterday’s reports look wrong.
- A script that moves sensitive financial data runs successfully every night for two years. Then an auditor asks you to demonstrate who had access to those files, which systems touched them, and whether the transfer was encrypted in transit. You have no answer, because the script didn’t record any of it.
- The engineer who wrote the critical SFTP automation leaves the company. The script keeps running. Nobody quite knows what it does, what it connects to, or what happens if you need to change it. And then it stops working. Everyone scratches their heads trying to figure out what to do. This is classic critical infrastructure that is neither documented nor owned. It’s a catastrophe waiting to happen.
These aren’t hypothetical scenarios. They’re descriptions of what actually happens in organizations that have been running script-driven file automation for more than a year or two. The scripts accumulate. The knowledge about them doesn’t.
The Knowledge Silo Problem Is Worse Than You Think
Every organization I’ve worked with that relies heavily on custom file automation has a version of the same person: the engineer who knows how it all works.
Maybe they wrote most of the scripts. Maybe they’ve just been around long enough to understand the interdependencies. Either way, they’re the single point of failure for a non-trivial portion of the company’s at-risk operational infrastructure.
What happens when that person leaves, or is sick, or on vacation, or just isn’t available at 3 a.m. when the workflow breaks? You have a problem. Not a theoretical problem. A real, operational, “the business isn’t working right now” problem.
This is a predictable consequence of treating infrastructure as individual craftsmanship rather than as an engineered system. Scripts live in someone’s head. If you’re lucky, they might be in a Git repo. They don’t have runbooks. They don’t have documented dependencies. They don’t surface their current state or operational health in a way that anyone other than their author can interpret quickly under pressure.
The organizational cost of that dependency is enormous and chronically underestimated. Most teams don’t account for it at all until the knowledge walks out the door.
Compliance and Audit: The Reckoning
If you operate in a regulated industry, such as financial services, healthcare, or government contracting, the script problem takes on an additional dimension. You simply can’t audit what your scripts didn’t record.
SOC 2, HIPAA, PCI-DSS, and similar compliance frameworks require demonstrable control over data in transit. That means knowing, with specificity and on demand, what files moved, when, to where, with what encryption, and accessed by which credentials. A cron job that runs a bash script to push files over SFTP gives you none of that unless you explicitly built it in. And given the “just write a script” attitude under which most of these scripts are written, most don’t have the required auditing built in.
The result is a compliance gap that often isn’t discovered until an audit. And at that point, the answer of “we should have instrumented this better” isn’t an acceptable answer to an auditor who wants chain-of-custody evidence for two years of sensitive file transfers.
You can’t audit what your scripts didn’t record. And most scripts weren’t written to record anything.
The architecture tax here is steep. You either build audit-level instrumentation into every script you write, which is impractical and burdensome in its own way, or you accept the risk of a compliance gap and a security vulnerability. Neither is a good option. Both have predictable consequences.
But more often than not, you don’t actively choose between the two options; you simply don’t think about the problem at all until it’s too late.
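As a sketch of what “building it in” might look like, here is a minimal, hypothetical audit-record helper in Python. The field names are illustrative and not tied to any specific compliance framework, and real chain-of-custody evidence needs an append-only store behind it; the point is only that who, what, when, where, and how get recorded at the moment of transfer:

```python
import hashlib
import json
import os
from datetime import datetime, timezone
from pathlib import Path

def audit_record(path: Path, destination: str, encrypted: bool) -> str:
    # One JSON line per transfer: what moved (name, size, content hash),
    # to where, whether it was encrypted in transit, under which
    # identity, and when. Appended to an append-only log, these lines
    # are the raw material an auditor asks for.
    record = {
        "file": path.name,
        "bytes": path.stat().st_size,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "destination": destination,
        "encrypted_in_transit": encrypted,
        "credential": os.environ.get("USER", "unknown"),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)
```

Ten lines of bookkeeping, but multiplied across every script in the environment it becomes the instrumentation burden described above, which is why it almost never happens ad hoc.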
Failure Propagation: One Script, Many Downstream Problems
File workflows are rarely isolated. The output of one is almost always the input of another. A file arrives, triggers a process, produces an output, which feeds a downstream system. This is normal pipeline design.
But as the old saying goes: garbage in, garbage out. Once a workflow starts to fail, the entire downstream process is corrupted. That’s exactly why a silent failure in a script-driven workflow is so dangerous. When a step fails silently, the failure propagates. The downstream system runs on stale, incomplete, or missing data. Reports are wrong. Reconciliation fails. Partner integrations produce bad output that nobody notices until the partner calls.
The blast radius of a single script failure is rarely proportional to how small and simple the script appeared to be.
This is a well-understood problem in software systems design. Resilience requires explicit failure handling, retry logic, alerting, and dead-letter patterns for workflows that can’t complete. You can build all of that into shell scripts. But, just like compliance and auditing, almost nobody does, because the initial framing was “we’ll just script it,” which implies the job is simple and easy. That framing leaves out fundamental elements like failure handling and alerting.
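To show how little code the missing resilience actually requires, here is a hedged Python sketch of retry-with-backoff plus a dead-letter directory. `send` stands in for whatever pushes the file to the remote endpoint and is assumed to raise `OSError` on failure; the names are mine, not from any particular library:

```python
import shutil
import time
from pathlib import Path

def transfer_with_retry(file: Path, send, dead_letter: Path,
                        attempts: int = 3, backoff_s: float = 1.0) -> bool:
    # Try the transfer up to `attempts` times with linear backoff.
    # If every attempt fails, move the file to a dead-letter directory
    # so it is preserved for inspection and replay rather than lost.
    for attempt in range(1, attempts + 1):
        try:
            send(file)
            return True
        except OSError:
            if attempt < attempts:
                time.sleep(backoff_s * attempt)
    shutil.move(str(file), str(dead_letter / file.name))
    return False
```

The dead-letter move is the piece most scripts omit: it turns “the file vanished” into “the file is sitting in a known place, waiting for a human,” which is the difference between a silent failure and a recoverable one.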
This Is an Architecture Problem, Not a Tooling Problem
I want to be clear about the actual diagnosis here. The problem isn’t that engineers write scripts. Scripts are useful, appropriate, and sometimes exactly right for the job. The problem is that organizations use scripts to solve an infrastructure problem that requires an infrastructure solution.
When your file workflows grow to the point where you have more than a handful of them…
- where they touch regulated data…
- where they’re part of business-critical processes…
- where partners and external systems depend on their reliability…
…you have a file infrastructure problem. And file infrastructure problems require file infrastructure solutions.
What does a real file infrastructure solution mean? It means:
- Centralized visibility into what’s running, what’s succeeded, and what’s failed
- Explicit failure handling with alerting and retry semantics
- Immutable audit trails that capture who, what, when, where, and how for every file movement
- Workflow definitions that are documented, version-controlled, and transferable, not simply locked in someone’s head
- Access controls that are granular, auditable, and consistently enforced
These are engineering requirements. They’re not optional extras. If your current file automation doesn’t provide them, you have technical debt. And like all technical debt, the interest compounds. The longer you run critical workflows on an infrastructure foundation that can’t support them, the higher the eventual cost of the reckoning.
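One hypothetical way to make the “documented, version-controlled, transferable” requirement concrete is to pull a workflow’s description out of the script and into a machine-readable definition that lives in the repo. Here is a sketch as a Python dataclass; every name, path, and URL below is illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FileWorkflow:
    # A version-controlled description of one workflow: everything a
    # teammate needs at 3 a.m. that would otherwise live only in the
    # author's head. frozen=True makes the record immutable once built.
    name: str
    owner: str           # a team, not an individual
    schedule: str        # cron expression
    source: str
    destination: str
    encrypted: bool
    max_retries: int
    alert_channel: str
    runbook_url: str

# Illustrative example for the nightly invoice transfer described earlier.
invoices = FileWorkflow(
    name="vendor-invoices-nightly",
    owner="payments-platform",
    schedule="0 2 * * *",
    source="/staging/invoices",
    destination="sftp://partner.example.com/inbound",
    encrypted=True,
    max_retries=3,
    alert_channel="#file-ops-alerts",
    runbook_url="https://wiki.example.com/runbooks/vendor-invoices",
)
```

Whether the definition is a dataclass, YAML, or a managed platform’s configuration matters less than the property it creates: the workflow can be read, reviewed, diffed, and handed to a new owner without archaeology.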
The Script That’s Running Right Now
Think about these questions for a moment:
- How many scripts are running in your environment right now that nobody on your current team fully understands?
- How many of those scripts move data that’s subject to compliance requirements?
- How many of them have failure handling that amounts to “nothing,” and alerting that amounts to “the downstream team notices something’s wrong”?
For most organizations, the honest answer to those questions is uncomfortable.
The good news is that this is a solvable problem. Not by simply rewriting all the scripts (though some of that work is probably inevitable), but by making a deliberate architectural decision about how file workflows should be managed in your environment.
The “we’ll just script it” instinct isn’t wrong. It’s just the wrong answer for the wrong problem.
– Lee Atchison is Field CTO at Files.com and the author of Architecting for Scale (O’Reilly Media). He writes on cloud architecture, enterprise infrastructure, security, and software scalability.