Why AI Coding Agents Struggle to Help Infrastructure Engineers (Yet)
How AI will reason about, operate, and align the world’s compute fabric.
Disclaimer: This article is the first in a blog post series, written entirely by ChatGPT without any human modifications. Still, I think it’s pretty well written, and I would love to hear how everyone thinks about the idea. If there are a sufficient number of readers / likes, I will continue to publish the follow-up blog posts (there are 6 of them). I will also share the prompt at the end of the article.
AI coding agents have become incredibly capable partners for product engineers. They can scaffold a React frontend, generate Django models, or refactor a Python API in seconds.
But for infrastructure engineers, especially those building and maintaining distributed data systems, the same agents often fall short.
It’s not that AI models can’t generate infrastructure code. Many can write valid Terraform, Dockerfiles, or Spark jobs.
The real limitation is that they can’t run what they write — because they lack access to the kind of distributed runtime environment where real infra changes live and breathe.
Product engineers enjoy tight, local feedback loops
A product engineer’s development environment is self-contained and deterministic.
An AI agent can:
- Write a TypeScript function
- Run `npm test` locally
- Observe failures or outputs
- Iterate toward correctness
That short feedback loop is everything. It lets the AI verify and refine its work autonomously, closing the loop between code generation and validation.
When your world is a single process on a single machine, the AI assistant can be near-perfect.
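That generate-run-observe loop can be sketched in a few lines. This is a toy stand-in, not a real agent: the hypothetical `generate_fix` callback and an exit-code check play the roles of code generation and `npm test`.

```python
import subprocess
import sys

def run_tests(cmd):
    """Run a test command; the exit code is the agent's only feedback signal."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode == 0

def agent_loop(generate_fix, test_cmd_for, max_iters=5):
    """Generate a candidate, run the tests, and iterate until they pass."""
    for attempt in range(1, max_iters + 1):
        candidate = generate_fix(attempt)
        if run_tests(test_cmd_for(candidate)):
            return attempt
    return None

# Toy stand-in: the "fix" only passes on the third attempt.
converged_at = agent_loop(
    generate_fix=lambda i: i,
    test_cmd_for=lambda c: [sys.executable, "-c",
                            f"import sys; sys.exit(0 if {c} >= 3 else 1)"],
)
print(converged_at)  # 3
```

Everything the loop needs — the code, the tests, the runtime — fits in one process on one machine, which is exactly why it works so well.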
Infrastructure engineers live in distributed, permissioned worlds
Now contrast that with infrastructure work.
Imagine you’re optimizing a Spark job that performs a massive join over terabytes of Parquet data in S3.
Or debugging a Flink streaming pipeline that consumes events from Kafka and writes to LanceDB for analytical queries.
Or tuning Ray or Daft clusters for parallel model training across GPUs.
An AI agent can read your YAML and Python scripts, maybe even spot inefficiencies.
But to validate a fix, it would need to:
- Launch a multi-node Spark cluster
- Submit jobs and observe shuffle behavior
- Check memory spill metrics and task retries
- Analyze distributed traces across worker nodes
- Tune serialization formats between Parquet, Arrow, and LanceDB layers
- Benchmark throughput under load
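Even the "check memory spill" step alone presupposes a live cluster: Spark's monitoring REST API exposes per-stage `memoryBytesSpilled` and `diskBytesSpilled` fields at `/api/v1/applications/<app-id>/stages`, and an agent would have to fetch and interpret them. A minimal sketch of the interpretation half, with the JSON payload invented here rather than fetched from a real driver:

```python
# Sketch: deciding whether a Spark stage spilled, from monitoring REST API data.
# The dict shape follows Spark's /api/v1/applications/<app-id>/stages endpoint;
# the stage names and byte counts below are invented for illustration.
sample_stages = [
    {"stageId": 3, "name": "Exchange hashpartitioning",
     "memoryBytesSpilled": 512 * 1024**2, "diskBytesSpilled": 2 * 1024**3},
    {"stageId": 4, "name": "ResultStage",
     "memoryBytesSpilled": 0, "diskBytesSpilled": 0},
]

def stages_with_spill(stages, threshold_bytes=0):
    """Return the IDs of stages whose shuffle spilled past the threshold."""
    return [s["stageId"] for s in stages
            if s["memoryBytesSpilled"] + s["diskBytesSpilled"] > threshold_bytes]

print(stages_with_spill(sample_stages))  # [3]
```

The parsing is trivial; producing real numbers to parse is the part that requires a running cluster.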
That’s not something it can do in a stateless chat environment.
Without cluster access, the AI has no way to test, measure, or observe the emergent performance of distributed systems — which is precisely where infra engineers spend most of their time.
The missing execution environment
AI infra assistants today operate in a read-only mode.
They can reason about configurations, but not execute them in real-world contexts.
They can tell you that your Spark shuffle partitions are too high or your Flink checkpointing interval seems off, but they can’t:
- Deploy a Ray cluster and run microbenchmarks
- Replay Kafka streams to validate event-time semantics
- Load data into Daft or LanceDB to inspect latency vs. memory trade-offs
In short: they can’t experiment.
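To make "validate event-time semantics" concrete, here is a toy, Kafka-free replay with a bounded-out-of-orderness watermark. The event list and the 2-unit lateness bound are invented for illustration; Flink's actual watermark machinery is considerably richer.

```python
# Toy replay: out-of-order events carrying event timestamps, checked against a
# watermark that trails the max event time seen by a fixed allowed lateness
# (conceptually how Flink's bounded-out-of-orderness watermarks behave).
events = [("a", 1), ("b", 5), ("c", 3), ("d", 9), ("e", 2)]  # (key, event_time)
ALLOWED_LATENESS = 2

watermark = float("-inf")
late = []
for key, ts in events:
    if ts < watermark:
        late.append(key)  # a real pipeline would drop or side-output these
    watermark = max(watermark, ts - ALLOWED_LATENESS)

print(late)  # ['e']
```

An agent that can only read the pipeline's YAML has no way to discover that event "e" would be dropped; it has to replay the stream to find out.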
Until AI agents have access to safe, sandboxed distributed runtimes, their effectiveness for infra engineers will remain limited.
What would change the game
For AI to truly assist infrastructure engineers, it needs a distributed playground where it can execute, measure, and iterate safely — without risking production systems.
That means several layers of evolution:
- Ephemeral sandbox clusters – lightweight, disposable Spark or Ray clusters spun up via tools like k3d or Firecracker, where the AI can deploy and test.
- Policy-aware access – the agent must respect RBAC, IAM, and quota limits, understanding what it can and cannot touch.
- Real observability hooks – streaming metrics, traces, and logs from Prometheus, OpenTelemetry, and Grafana that the AI can parse and reason over.
- Closed-loop feedback – the ability to run experiments, measure job completion times, inspect spill ratios, and refine configurations automatically.
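Put together, the closed loop is just measure-and-refine. A deterministic toy sketch, where `run_job` stands in for submitting to a sandbox cluster and its cost model is invented for illustration:

```python
def run_job(shuffle_partitions):
    """Stand-in for submitting a Spark job to a sandbox cluster and timing it.
    Toy cost model: too few partitions means skewed, oversized tasks; too many
    means scheduling overhead. A real agent would read the completion time
    from the Spark history server instead."""
    return abs(shuffle_partitions - 200) / 50 + 1.0

def tune(candidates):
    """Closed-loop feedback: run each candidate config, keep the fastest."""
    timings = {p: run_job(p) for p in candidates}
    return min(timings, key=timings.get)

best = tune([50, 100, 200, 400, 800])
print(best)  # 200
```

The loop itself is ten lines; the hard part is that every call to `run_job` needs an ephemeral cluster, policy-aware credentials, and observability hooks behind it.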
When those exist, an AI agent could do things like:
- Autonomously profile a Spark SQL join plan
- Suggest repartitioning to balance data skew
- Deploy a Flink job with adjusted watermarking
- Benchmark Parquet vs. LanceDB read latency under different compression codecs
That’s not sci-fi — it’s just missing the runtime sandbox.
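The shape of that last benchmark is easy to mimic locally. Here, stdlib `zlib` and `lzma` stand in for the snappy/zstd codecs Parquet and LanceDB actually use, and the CSV-ish payload is synthetic:

```python
import time
import zlib
import lzma

# Synthetic, highly repetitive payload standing in for columnar file bytes.
data = b"user_id,event,ts\n" * 50_000

def bench(name, compress, decompress):
    """Compress once, then time decompression (a proxy for read latency)."""
    blob = compress(data)
    t0 = time.perf_counter()
    out = decompress(blob)
    elapsed = time.perf_counter() - t0
    assert out == data  # sanity-check the round trip
    return name, len(blob), elapsed

results = [bench("zlib", zlib.compress, zlib.decompress),
           bench("lzma", lzma.compress, lzma.decompress)]
for name, size, sec in results:
    print(f"{name}: {size} compressed bytes, {sec * 1e3:.2f} ms to decode")
```

The difference is that this toy runs in one process; the real trade-off only shows up when codecs meet object-store throughput, worker memory, and shuffle pressure at once.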
Early signs of progress
A few research and open-source efforts hint at this direction:
- Ray’s Actors already make it possible to spin up ephemeral computation graphs programmatically — an ideal substrate for AI-driven experiments.
- Flink’s MiniCluster provides a local mode where AI agents could test streaming topologies in isolation.
- Daft and LanceDB offer APIs lightweight enough for in-memory simulation of data access patterns.
When combined with containerized, reproducible environments, these systems could become the “gym environments” for AI infra agents to safely learn and optimize distributed systems.
Closing thought
AI coding agents didn’t fail infrastructure engineers — they just hit the boundary between code generation and system execution.
Without access to the distributed clusters that define our world — the Spark DAGs, Kafka topics, Flink jobs, Ray actors, and Parquet datasets — they can’t close the loop between thinking and doing.
Once they can safely run their own experiments, AI infra agents will no longer be just smart autocomplete engines.
They’ll become true autonomous operators — capable of running, measuring, and improving the complex data systems that power everything else.
Appendix
Prompt for ChatGPT:
Please write a blog post for my substack centered around the idea: AI coding agents couldn’t help infra engineers as much as product engineers yet, simply because AI coding agents don’t have access to a distributed cluster where it can run the distributed infrastructure to test out any code changes it makes.
Please make this slightly more technical by using example open-source big data technology like Spark, Kafka, Flink, Ray, Daft, Parquet, LanceDB.



Thanks for sharing.
A couple of thoughts:
- Cost might be a concern if AI agents can launch large clusters for experimentation.
- If cost is a concern, then how to conduct experiments with smaller samples will be important.