Why Every Infrastructure Engineer Should Learn Python

I have heard the same sentence from good engineers for fifteen years. “I’m not a programmer.” It is usually said with a small note of pride, as if writing code were a separate caste of work that proper infrastructure people are above. I used to half-believe it myself.

I now think it is the single most expensive belief in the industry.

This is not an argument that you should retrain as a software developer. It is an argument that the job has quietly changed underneath us, and that the thing we used to do by hand — clicking through consoles, RDP-ing into boxes, filling in wizards — is increasingly the slow, error-prone, unauditable way to do anything. The fast way is to drive systems by their API. And the most practical language for an infrastructure person to drive APIs with is Python.

I want to be precise about the claim, because the “learn to code” crowd has done real damage by overselling it. I am not telling you to learn data structures, design patterns, or how to build a distributed web service. I am telling you to learn enough Python to turn a repetitive, manual, three-hour job into a fifteen-line script you can run again next quarter. That is a much smaller thing, and it is worth more than people expect.

The console era is ending

For most of my career, the interface to infrastructure was a human sitting in front of a graphical console. You provisioned a VM by clicking through vCenter. You changed a firewall rule in a web UI. You onboarded a user by working down a checklist in three different admin portals. The control plane assumed a person.

That assumption is dead; it just has not finished falling over yet. Every serious platform now leads with its API and treats the GUI as a convenience layer on top. Microsoft 365 is Graph with a portal bolted on. AWS is a set of APIs with a console bolted on. VMware, Citrix, your backup product, your monitoring stack: all of them expose the real machinery as an HTTP API or an SDK, and the GUI is just one client of that API.

The implication is uncomfortable. If you only know the GUI, you are using the slowest, least repeatable client available, and you are the only client that cannot be version-controlled, scheduled, or handed to a colleague. The engineer who can talk to the API directly is not doing something clever or exotic. They are using the platform the way it was actually designed to be used.

If a task is worth doing twice, it is worth not doing by hand the second time.

That is the whole thesis. Everything below is detail.

Why Python specifically

I am not religious about languages. PowerShell is excellent, and if you live entirely inside Microsoft it may be the right first tool. Go is wonderful for shipping a single static binary. But for an infrastructure engineer who wants the broadest possible reach for the least learning, Python wins on a few concrete grounds.

The first is ubiquity of SDKs. Almost every vendor ships, or blesses, a Python library. AWS gives you boto3. VMware gives you pyVmomi. Microsoft gives you msgraph-sdk for Graph. And underneath all of them, when no SDK exists, there is requests, which makes any HTTP API reachable in about four lines. You will very rarely hit a system you genuinely cannot talk to from Python.

The second is readability. Python code reads close to pseudo-code, which matters enormously when the author is not a full-time developer and the next reader is a tired version of yourself in eight months. You can come back to a Python script after a year and understand it. That is not true of every language.

The third, and the one people underrate, is the REPL. You can open a Python prompt, paste in a few lines, and poke at a live API interactively. Print the response. Look at the shape of the JSON. Try one more call. This interactive loop is exactly how an infrastructure person already works — try a thing, look at the result, try the next thing — and it makes learning fast because you are never more than a few seconds from feedback.

>>> import requests
>>> r = requests.get("https://api.github.com/zen")
>>> r.status_code
200
>>> r.text
'Keep it logically awesome.'

That is the entire on-ramp. If you can do that, you can read an API.

Scripts are not software, and that is fine

Here is the distinction that lets infrastructure people relax, and the one the “you must learn to code properly” crowd always blurs.

Writing software is building something that other people depend on, that must handle inputs you have not imagined, that needs tests, packaging, versioning, a support model, and a plan for the day you leave. It is a discipline with real overhead, and that overhead exists for good reasons.

Writing a script is automating a task you understand, with inputs you control, that you will run yourself, and that fails safely and visibly when its assumptions break. The overhead of “proper” software engineering mostly does not apply, because the blast radius is small and the operator is you.

Infrastructure engineers get enormous value from the second thing without ever crossing into the first. A script that pulls a licensing report, a script that checks two hundred mailboxes for a misconfiguration, a script that bulk-updates DNS records from a CSV — none of these need a test suite or a CI pipeline. They need to be correct, readable, re-runnable, and in Git. That is a far lower bar than “be a developer”, and it captures most of the value.

The trap, which I will come back to, is the script that quietly graduates into software without anyone deciding it should. But the line is real, and you are allowed to stay on the infrastructure side of it.

The patterns that actually matter

When I write automation for real work I am not reaching for clever language features. I am reaching for a small set of patterns that turn a fragile one-off into something I trust against a production tenant. These are the things worth learning properly, because they are what separates a script that works in a demo from one that works at three in the afternoon on a customer’s live system.

Idempotency

The most important habit. An idempotent operation produces the same end state whether you run it once or five times. “Create this user” is not idempotent — run it twice and you get an error or a duplicate. “Ensure this user exists with these properties” is idempotent — run it as many times as you like and the system converges on the same state. Write your automation as ensure, not as do, and you can re-run it after a failure without fear.

def ensure_group_member(client, group_id: str, user_id: str) -> bool:
    """Add user to group only if absent. Returns True if a change was made."""
    members = client.get_group_members(group_id)
    if user_id in {m["id"] for m in members}:
        return False  # already correct, do nothing
    client.add_group_member(group_id, user_id)
    return True

The function checks before it acts, makes no change when none is needed, and tells the caller whether it changed anything. That last detail matters more than it looks — it is the difference between a report that says “added 4 users” and one that says “added 4 users, 196 already correct”, which is exactly the kind of evidence you want when you hand the run to someone else.
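A minimal sketch of how that return value turns into a run summary. `FakeClient` and `ensure_members` are hypothetical names introduced for illustration; the in-memory client only stands in for whatever SDK client you really use, and `ensure_group_member` is repeated so the example runs on its own.

```python
from collections import Counter


class FakeClient:
    """Hypothetical in-memory stand-in for a real directory client."""

    def __init__(self, members: dict[str, set[str]]):
        self._members = members

    def get_group_members(self, group_id: str) -> list[dict]:
        return [{"id": uid} for uid in sorted(self._members.get(group_id, set()))]

    def add_group_member(self, group_id: str, user_id: str) -> None:
        self._members.setdefault(group_id, set()).add(user_id)


def ensure_group_member(client, group_id: str, user_id: str) -> bool:
    """Add user to group only if absent. Returns True if a change was made."""
    members = client.get_group_members(group_id)
    if user_id in {m["id"] for m in members}:
        return False
    client.add_group_member(group_id, user_id)
    return True


def ensure_members(client, group_id: str, user_ids: list[str]) -> Counter:
    """Apply ensure_group_member to each user and tally the outcomes."""
    tally = Counter(added=0, already_correct=0)
    for user_id in user_ids:
        changed = ensure_group_member(client, group_id, user_id)
        tally["added" if changed else "already_correct"] += 1
    return tally


client = FakeClient({"g1": {"alice", "bob"}})
wanted = ["alice", "bob", "carol"]
first = ensure_members(client, "g1", wanted)
second = ensure_members(client, "g1", wanted)  # re-running changes nothing
print(f"first run:  added {first['added']}, {first['already_correct']} already correct")
print(f"second run: added {second['added']}, {second['already_correct']} already correct")
```

The second run is the point: it makes no changes and says so. Against a real API you would probably fetch the membership once per group rather than once per user, but the shape of the report stays the same.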

Retries and backoff

Real APIs throttle you, time out, and occasionally return a 500 for no reason you will ever discover. A script that dies on the first hiccup is useless against anything at scale. The fix is to retry transient failures with exponential backoff, and crucially to honour the API’s own Retry-After header when it gives you one.

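import time
import requests

def get_with_retry(url: str, headers: dict, max_attempts: int = 5) -> dict:
    """GET a JSON resource, retrying throttling, 5xx and network errors with backoff."""
    for attempt in range(max_attempts):
        wait = 2 ** attempt  # exponential backoff: 1, 2, 4, 8, 16 seconds
        try:
            resp = requests.get(url, headers=headers, timeout=30)
        except (requests.ConnectionError, requests.Timeout):
            resp = None  # network failure or timeout: transient, so retry
        if resp is not None and resp.status_code != 429 and resp.status_code < 500:
            resp.raise_for_status()  # other 4xx errors will not fix themselves
            return resp.json()
        if resp is not None:
            retry_after = resp.headers.get("Retry-After", "")
            if retry_after.isdigit():  # the header can also be an HTTP date; keep the backoff then
                wait = int(retry_after)
        if attempt < max_attempts - 1:  # do not sleep before giving up
            time.sleep(wait)
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")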

Pagination

This is the one that bites everybody the first time. You call an API, get fifty results, write your report, and ship it — and three weeks later someone notices the report only ever shows the first fifty of four hundred mailboxes. Almost every API pages its results, and Microsoft Graph signals more data with an @odata.nextLink. You have to follow it until it stops. Here is a paginated Graph call with the retry logic folded in, which is close to the shape I actually use when feeding a Microsoft 365 health check:

def graph_get_all(path: str, token: str) -> list[dict]:
    """Follow @odata.nextLink and collect every page from a Graph query."""
    headers = {"Authorization": f"Bearer {token}"}
    url = f"https://graph.microsoft.com/v1.0/{path}"
    items: list[dict] = []
    while url:
        page = get_with_retry(url, headers)
        items.extend(page.get("value", []))
        url = page.get("@odata.nextLink")  # None when we run out of pages
    return items

Three patterns — idempotency, retry, pagination — cover most of the bugs I have ever seen in infrastructure scripts. They are not advanced. They are just easy to forget when you are excited that the first call worked.

Secrets, logging, and Git

Three more habits, briefly, because they are the difference between a script you can show a colleague and one you have to apologise for.

Never put a secret in the code. Read it from the environment, or from a .env file that is listed in .gitignore and loaded into the environment before the script runs, the same discipline I use for everything in the Docker homelab. If a credential ever lands in a Git history, treat it as compromised and rotate it.

import os
client_secret = os.environ["GRAPH_CLIENT_SECRET"]  # not a literal in the file
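One way to make a missing secret fail early and legibly, sketched under a few assumptions: `require_env` is a hypothetical helper, and `GRAPH_TENANT_ID` is an illustrative variable name alongside the one used above. Python does not read a .env file by itself, so something (your shell, Docker Compose, a scheduler, or a loader such as python-dotenv) has to put those values into the environment first.

```python
import os
from collections.abc import Mapping


def require_env(*names: str, env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Return the named settings, or stop with a message listing every missing one."""
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}


# A plain dict stands in for the real environment so the sketch runs anywhere.
# In real use, leave out the env argument and the values come from os.environ.
example_env = {"GRAPH_TENANT_ID": "example-tenant", "GRAPH_CLIENT_SECRET": "example-secret"}
settings = require_env("GRAPH_TENANT_ID", "GRAPH_CLIENT_SECRET", env=example_env)
print(sorted(settings))  # log the names, never the values
```

Checking everything up front means the script refuses to start rather than failing halfway through a bulk change with a bare KeyError.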

Log what you did, to standard output, with timestamps. When a run goes wrong at scale you need to know which of the four hundred operations failed and why, and “it errored” is not an answer. Python’s logging module does this with almost no ceremony and is worth the ten minutes it takes to wire up.
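The wiring is roughly this. It is a sketch, not a prescription: the logger name, `process_all`, and `check_mailbox` are hypothetical, and the failing address exists only to show what a bad item looks like in the log. Note that `logging` writes to standard error unless you point it at standard output.

```python
import logging
import sys

logging.basicConfig(
    stream=sys.stdout,  # the default is stderr
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",  # timestamp on every line
)
log = logging.getLogger("mailbox-check")  # hypothetical script name


def process_all(items: list[str], action) -> list[str]:
    """Run action on every item, log each outcome, and return the items that failed."""
    failed = []
    for item in items:
        try:
            action(item)
            log.info("ok %s", item)
        except Exception:
            log.exception("failed %s", item)  # records the traceback too
            failed.append(item)
    log.info("done: %d ok, %d failed", len(items) - len(failed), len(failed))
    return failed


def check_mailbox(address: str) -> None:
    """Hypothetical per-item operation that fails for one address."""
    if address.startswith("broken"):
        raise ValueError("mailbox has no retention policy")


failed = process_all(["a@example.com", "broken@example.com", "c@example.com"], check_mailbox)
```

One failure does not stop the run, and the returned list tells you exactly which items to look at or retry.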

And put the script in Git. Not because you need branching strategies, but because a script that changes a production system and lives only on your laptop is an outage waiting to happen. Version control gives you history, a diff when something breaks, and a way to hand the work to the next person. The same argument I make for building knowledge instead of documents applies to automation: if it is not in version control, it does not really exist.

Where this fits together

Once you have a handful of these scripts, they stop being isolated chores and start being a toolkit. Most of my automation falls into a few repeating shapes.

flowchart TD
    A[Manual console task] --> B{Worth automating}
    B -- No --> C[Just do it once]
    B -- Yes --> D[Read the API docs]
    D --> E[Python script in Git]
    E --> F[Health checks]
    E --> G[Bulk changes]
    E --> H[Reporting]
    E --> I[Glue between systems]
    F --> J[Feed data to AI]
    G --> J
    H --> J
    I --> J
    J --> K[Readable prioritised output]

Health checks are the obvious one: pull the current state of a system, compare it against what good looks like, flag the gaps. Bulk changes are next: apply the same change across hundreds of objects, idempotently, with a log of exactly what moved. Reporting turns raw API responses into something a human or a customer can read. And glue is the quiet workhorse — a dozen lines that take the output of one system and feed it into another that was never designed to talk to it.
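The health-check shape, reduced to its core: compare what the API returned against a baseline and describe each gap. Everything here is illustrative. The setting names, tenant names, and `find_gaps` are hypothetical, and in real use the `tenants` data would come from API calls rather than a literal.

```python
def find_gaps(current: dict, baseline: dict) -> list[str]:
    """Compare settings against a baseline and describe each mismatch."""
    gaps = []
    for key, expected in baseline.items():
        actual = current.get(key, "<missing>")
        if actual != expected:
            gaps.append(f"{key}: expected {expected!r}, found {actual!r}")
    return gaps


# Hypothetical baseline and sample data.
BASELINE = {"mfa_enabled": True, "audit_logging": True, "legacy_auth": False}
tenants = {
    "contoso": {"mfa_enabled": True, "audit_logging": True, "legacy_auth": False},
    "fabrikam": {"mfa_enabled": False, "audit_logging": True},
}

report = {name: find_gaps(settings, BASELINE) for name, settings in tenants.items()}
for name, gaps in report.items():
    print(name, "OK" if not gaps else gaps)
```

Keeping the baseline as data rather than as a chain of if statements means "what good looks like" can be reviewed and versioned separately from the code that checks it.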

The newest and most interesting category is feeding data to AI. A Python script is the perfect shim between a messy infrastructure API and a language model. It pulls the raw configuration, normalises it into clean JSON or Markdown, and hands that to an LLM to summarise, prioritise, or explain. That pattern is the entire engine behind a repeatable customer health check: Python does the gathering and the model does the narrative. The script is doing the boring, deterministic part, which is exactly where you want determinism, and the model does the part that genuinely benefits from language.
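A sketch of the deterministic half of that pattern, assuming records shaped like Graph user objects (`userPrincipalName`, `accountEnabled`, `assignedLicenses`). The function names, sample data, and prompt wording are hypothetical, and the call to the model is deliberately left out because it depends on whichever service you use.

```python
def normalise(raw_users: list[dict]) -> list[dict]:
    """Keep only the fields the summary needs, with stable names and order."""
    rows = [
        {
            "user": u.get("userPrincipalName", "unknown"),
            "enabled": bool(u.get("accountEnabled", False)),
            "licences": len(u.get("assignedLicenses", [])),
        }
        for u in raw_users
    ]
    return sorted(rows, key=lambda r: r["user"])


def to_markdown(rows: list[dict]) -> str:
    """Render the rows as a small Markdown table a model can read."""
    lines = ["| user | enabled | licences |", "| --- | --- | --- |"]
    lines += [f"| {r['user']} | {r['enabled']} | {r['licences']} |" for r in rows]
    return "\n".join(lines)


def build_prompt(rows: list[dict]) -> str:
    """Combine a fixed instruction with the normalised data."""
    instruction = "Summarise the account findings below and list them by priority."
    return instruction + "\n\n" + to_markdown(rows)


# Sample input only; real records would come from the API.
raw = [
    {"userPrincipalName": "zoe@example.com", "accountEnabled": True,
     "assignedLicenses": [{"skuId": "sku-1"}], "extra": "noise"},
    {"userPrincipalName": "adam@example.com", "accountEnabled": False,
     "assignedLicenses": []},
]
rows = normalise(raw)
prompt = build_prompt(rows)
print(prompt)
```

Sorting and trimming before the hand-off is what makes the run repeatable: the same tenant state always produces the same input to the model, so any difference in the narrative comes from the model and not from the gathering.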

This is also where the consultancy work connects to the homelab. The same skill that lets me build an AI consultancy toolkit for the day job is the skill that lets me wire up services at home. It is one capability, not two.

What I got wrong

I would be a hypocrite if I made this sound clean. I have made every mistake in the catalogue, and a few are worth naming because they are the ones you will make too.

The worst is the load-bearing script. You write a quick fifteen-line thing to solve a problem on a Tuesday. It works. So you run it again. Then someone else runs it. Then it is in a scheduled job. Then a quarterly process depends on it. At no point did anyone decide this thing was production software, but it now is, and it has no tests, no error handling beyond a stack trace, and exactly one person who understands it. This is how a helpful script becomes a liability. The honest fix is to notice the graduation and either harden the thing deliberately or replace it with a proper tool. The dishonest fix, which I have used more than once, is to hope. Hope is not a maintenance strategy.

The second mistake is reinventing configuration management. The first time you write a script that loops over a hundred servers ensuring a setting is correct, it feels like genius. By the third such script you have built a worse, untested version of Ansible. There is a real boundary here. If you are managing the desired state of a fleet, use Ansible. If you are provisioning and tracking the lifecycle of cloud or virtual infrastructure, use Terraform. Those tools exist because enough people wrote enough one-off scripts to learn that the problem deserves a dedicated, declarative engine with state tracking. Python is the right tool for orchestration, one-off operations, glue, and anything genuinely bespoke. It is the wrong tool for the jobs that already have a mature declarative answer.

The way I now decide is rough but it serves. Is this about the ongoing desired state of many similar things? Reach for Ansible or Terraform. Is this a bespoke flow, a piece of glue, a report, or a thing that has to happen once or on demand? Reach for Python. When in doubt I ask whether I am about to write a state machine, because if I am, someone has already written a better one.

The third mistake is over-engineering in the other direction — adding classes, config frameworks, and abstraction layers to a script that runs once a month. The overhead of “proper” software is overhead you are choosing to carry. For most infrastructure automation, a flat script with good functions and clear logging is the correct level of engineering, and reaching for more is just a different way of avoiding the actual work.

Where this goes next

For my own work the direction is clear, and it is less about writing more scripts than about treating the ones I have as a real asset. I want the recurring ones in a small internal repository with a sane structure, shared helper functions for the patterns above, and a thin layer of tests around the handful that have genuinely become load-bearing — admitting the graduation rather than hoping it away.
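What a thin layer of tests might look like for one of those helpers, using only the standard library. This is a sketch under assumptions: it repeats the `ensure_group_member` function from earlier so it runs on its own, and it uses `unittest.mock.Mock` in place of a real client, so it checks the script's logic and not the API's behaviour.

```python
import unittest
from unittest.mock import Mock


def ensure_group_member(client, group_id: str, user_id: str) -> bool:
    """Add user to group only if absent. Returns True if a change was made."""
    members = client.get_group_members(group_id)
    if user_id in {m["id"] for m in members}:
        return False
    client.add_group_member(group_id, user_id)
    return True


class EnsureGroupMemberTests(unittest.TestCase):
    def test_adds_missing_user(self):
        client = Mock()
        client.get_group_members.return_value = [{"id": "alice"}]
        self.assertTrue(ensure_group_member(client, "g1", "bob"))
        client.add_group_member.assert_called_once_with("g1", "bob")

    def test_leaves_existing_user_alone(self):
        client = Mock()
        client.get_group_members.return_value = [{"id": "alice"}]
        self.assertFalse(ensure_group_member(client, "g1", "alice"))
        client.add_group_member.assert_not_called()


# Run the two tests in-process; in a repository you would more likely
# put them in their own file and run `python -m unittest`.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(EnsureGroupMemberTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Two tests per helper, one for "change needed" and one for "already correct", is about the level I mean by thin: enough to catch a broken refactor, not enough to become a project.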

The bigger shift is using Python less as the thing that does the work and more as the thing that prepares the work for a model. The collect-normalise-summarise pattern is becoming the default shape of everything I build, because the deterministic gathering belongs in code and the judgement belongs in the model. I expect, over the next year, that more of my scripts shrink to “fetch the data, clean it, hand it over” and that the interesting logic moves up into the prompt and the workflow. That is not Python becoming less important. It is Python finding its proper place as the reliable plumbing under everything else.

The actual point

You do not need to become a software developer. You need to stop doing by hand the things a computer should be doing, and the gateway to that is a few hundred lines of unglamorous, readable Python that you keep in Git and are not embarrassed to show a colleague.

The “I’m not a programmer” line was never really about programming. It was a way of declaring a job done at the edge of the GUI, when the job had quietly extended past it. The platforms moved to APIs. The work moved with them. The engineers who noticed picked up just enough Python to follow, and the gap between them and the ones who did not is widening every year.

It is a small skill. It pays for itself the first afternoon you save. And once you have it, you will wonder how much of your career you spent clicking the same buttons in the same order, getting them subtly wrong, with nothing in version control to show for it.

Learn enough to drive the API. That is the whole of it.