My Homelab Needed a Map, Not a Dashboard

My homelab did what homelabs do: it grew sideways.

There was a router. Then there was a Proxmox host. Then a NAS. Then a lab box. Then a few containers that were "temporary" until my family started relying on them. Then a dev machine. Then backups. Then a reverse proxy. Then some domains. Then an old switch that was fine until it wasn't. Then a new switch that could do VLANs, even if I wasn't quite ready to light them up yet.

For a long time, I still knew where everything lived.

Mostly.

In the way you know where everything is in a garage because you're the one who stacked the boxes.

That works until it doesn't.

So I built netdocs-lab: a Markdown-based documentation site for my home infrastructure. Not a dashboard. Not a monitoring system. Not another half-finished internal app. Just a living map of what exists, why it exists, where it runs, and what I should do when something breaks.

Dashboards answer the wrong question

I like dashboards. I run monitoring. I have Uptime Kuma checks. I like little green boxes as much as the next person.

But a dashboard mostly answers one question:

Is it up?

That is useful, but it is not the same as understanding the system.

When my internet is down, I do not just need to know that the router is unreachable. I need to know what the router is, what hardware it runs on, how the VM is configured, what the rollback path is, where the old config might be, and what order I should check things in before I make the outage worse.

When a service stops working, I need more than a red status badge. I need to know which host owns it, whether it depends on storage, whether it sits behind HAProxy, whether Cloudflare is involved, whether it has a local backup, and whether my family is about to ask why something is broken.

A dashboard tells me what is happening right now.

Documentation tells me what I thought I built.

I needed both, but I needed the second one first.

The shape of the thing

The basic idea is intentionally boring:

flowchart TD
    A[Hosts, services, disks, notes] --> B[Markdown source]
    B --> C[Mermaid diagrams]
    B --> D[MkDocs navigation]
    C --> E[Static documentation site]
    D --> E
    E --> F[Nginx]
    F --> G[Browser, terminal, future me]

    H[Git repo] --> B
    B --> H

That is it.

The docs are not trapped in a wiki database. They are not notes scattered across a dozen places. They are not screenshots of diagrams that will be wrong the next time I move a service.

They are text. They can be read over SSH. They can be searched with grep. They can be edited by hand. They can be reviewed in git. They can be generated, cleaned up, or reorganized with help from coding agents. And if the rendered site disappears, the source still makes sense.

The rendered version is nicer to browse, but the Markdown is the actual thing.

What netdocs-lab documents

The site is organized around how I actually think about the lab:

flowchart LR
    A[netdocs-lab] --> B[Hosts]
    A --> C[Network]
    A --> D[Services]
    A --> E[Storage]
    A --> F[Backups]
    A --> G[Runbooks]
    A --> H[Migration Notes]

    B --> B1[Odo]
    B --> B2[Dax]
    B --> B3[Rom]
    B --> B4[Quarks]

    C --> C1[DNS]
    C --> C2[Ingress]
    C --> C3[Switching]
    C --> C4[VLAN Plans]

    D --> D1[Vaultwarden]
    D --> D2[HAProxy]
    D --> D3[Jellyfin]
    D --> D4[Gitea]

The exact pages will change over time, but the categories matter. They match the way I actually troubleshoot the lab.

The host pages describe the physical and virtual machines: what they are, what role they play, what hardware they have, and what they are allowed to be responsible for.

The service pages describe things like Vaultwarden, HAProxy, Gitea, mail, Jellyfin, monitoring, and backup infrastructure.

The network pages cover the current physical topology, DNS/domain strategy, IP plan, core switch, and the still-deferred VLAN work.

The storage pages are where the disks, pools, mounts, backup philosophy, and offsite strategy live.

The migration pages track the ongoing shape of the rebuild: what changed, what is done, what is deferred, and what was intentionally not made more complicated yet.

The runbooks are the practical part. Internet down. Router recovery. Restart a service. Get console access to the switch. The stuff I do not want to rediscover while something is already broken.

The DS9 problem

The network has names because I need the help.

Odo is the router. Dax is the boring production-ish box. Rom is the lab/NAS/tinkering machine. Quarks is the lived-in shell and dev cockpit. The names are cute, but they are also useful. A good name carries intent.

router-01 tells me a machine routes packets.

Odo reminds me that this box is the boundary, the changeling, the thing that keeps order at the edge.

That sounds silly until you are tired, something is broken, and the name instantly tells you which mental model to load.

The documentation turns that private mental model into something a future version of me can still understand.

The migration made the docs necessary

This all became more important during a home infrastructure refresh.

I was moving from a more accidental layout toward something cleaner:

a dedicated router host
a clearer production/lab split
a newer storage layout
a better backup strategy
managed switching
VLAN capability, even if deferred
critical services moved to more reliable hardware
less "I know where that container is" energy

The migration was not just technical. It was operational.

This is a home network. People use it. I work from it. My family streams media through it. Passwords, mail, DNS, DHCP, reverse proxying, backups, and local services all have different blast radiuses.

So the docs had to capture more than commands.

They had to capture intent.

Why did I defer VLANs? Because stability mattered more than segmentation during the first pass.

Why did I separate critical services from the older lab hardware? Because a lab box should be allowed to be weird without taking down authentication or mail.

Why does storage live where it lives? Because direct disk access, backup flow, and recovery paths matter more than making the diagram pretty.

Those are the decisions that disappear first if they are not written down.

Markdown is the right amount of technology

I did not want a heavyweight documentation platform for this.

I wanted something that felt close to the system.

Markdown is boring in the best possible way. It is plain text. It survives tool changes. It works in git. It is easy to restructure. It is easy to render. It is also easy for AI tools to work with without turning the project into a pile of proprietary state.

That matters because part of the goal is lowering the activation energy.

If documenting the network requires opening a special app, logging into a wiki, dragging boxes around, exporting diagrams, and manually keeping everything in sync, I will not do it consistently.

If documenting the network means editing a Markdown file and committing it, I might.

That is the whole game.

Diagrams should be source, not screenshots

One of the things I wanted from this setup was diagramming that would not rot instantly.

I have made plenty of diagrams that only existed as a PNG, a draw.io file, a screenshot, or a half-remembered whiteboard. Those are fine for a moment in time, but they are not great as living documentation.

Mermaid is a good compromise for this kind of work. It lets the diagram live as text next to the rest of the docs.

That means the diagram can be versioned. It can be reviewed. It can be copied into an issue. It can be changed without opening a separate design tool. It is not always the prettiest possible representation, but it is good enough and close enough to the source.

For infrastructure docs, good enough and easy to update beats beautiful and abandoned.
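Text-based diagrams also open the door to generating some of them instead of drawing them. The sketch below is a minimal, hypothetical example: it assumes dependencies are kept in a plain Python dict (the `DEPENDS_ON` entries are illustrative placeholders, not the lab's real layout) and turns them into Mermaid flowchart source that can be pasted into a page.

```python
# Hypothetical dependency map: component -> things it directly needs.
# These entries are placeholders for illustration, not the real lab.
DEPENDS_ON: dict[str, list[str]] = {
    "Vaultwarden": ["HAProxy", "Dax"],
    "Jellyfin": ["Rom"],
    "HAProxy": ["Odo"],
}


def to_mermaid(depends_on: dict[str, list[str]]) -> str:
    """Render a dependency map as Mermaid flowchart source text."""
    lines = ["flowchart LR"]
    # Sort components so the output is stable and diffs cleanly in git.
    for component in sorted(depends_on):
        for dep in depends_on[component]:
            # Each edge reads "component depends on dep".
            lines.append(f"    {component} --> {dep}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_mermaid(DEPENDS_ON))
```

Sorting the keys is a small choice that matters: regenerating the diagram only produces a git diff when the data actually changed.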

The docs are not the source of truth

This is the uncomfortable part.

The docs are not reality.

The running systems are reality. The configs are reality. The disks, services, firewall rules, DNS records, and backup jobs are reality.

Documentation is a model.

That means the docs can lie.

A polished documentation site can be worse than no docs at all if it looks authoritative but describes something that stopped being true six months ago. So part of the discipline here is being honest about status.

Some pages are current.

Some pages are aspirational.

Some sections say "planned."

Some things are deliberately marked as deferred.

That is important. I do not want docs that pretend the lab is more mature than it is. I want docs that help me make the next safe change.
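One way to make that honesty checkable is to give every page an explicit status. This is a sketch under assumptions: it supposes each Markdown page starts with a small front matter block containing a line like `status: deferred`, and that the allowed values are the four labels above. The `page_status` and `status_report` helpers are hypothetical, and the parser is deliberately naive rather than a full YAML reader.

```python
from collections import defaultdict
from pathlib import Path

# Assumed vocabulary, taken from the status labels described above.
ALLOWED = {"current", "aspirational", "planned", "deferred"}


def page_status(text: str) -> str:
    """Return the status from a page's front matter, or 'unknown'."""
    lines = text.splitlines()
    # Front matter must open on the very first line with '---'.
    if not lines or lines[0].strip() != "---":
        return "unknown"
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of front matter; ignore the page body
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            value = value.strip().lower()
            return value if value in ALLOWED else "unknown"
    return "unknown"


def status_report(docs_dir: Path) -> dict[str, list[str]]:
    """Group every .md file under docs_dir by its declared status."""
    report: dict[str, list[str]] = defaultdict(list)
    for path in sorted(docs_dir.rglob("*.md")):
        status = page_status(path.read_text(encoding="utf-8"))
        report[status].append(str(path.relative_to(docs_dir)))
    return dict(report)


if __name__ == "__main__":
    for status, pages in status_report(Path("docs")).items():
        print(f"{status}: {len(pages)} page(s)")
```

Pages that land in `unknown` are the interesting ones: they are the pages nobody has been honest about yet.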

Where AI fits

This project also turned out to be a good use case for coding agents.

Not because I want AI to invent my infrastructure. I absolutely do not.

But agents are useful at taking messy raw material and turning it into a first draft:

command output
lsblk
df -h
Proxmox notes
service lists
migration plans
backup jobs
old chat notes
half-written runbooks

The useful workflow is not "let AI document my infrastructure."

It is more like this:

sequenceDiagram
    participant Me as Me
    participant Lab as Running Systems
    participant Agent as Coding Agent
    participant Docs as Markdown Docs
    participant Git as Git

    Me->>Lab: Collect facts and command output
    Me->>Agent: Provide notes, constraints, and source material
    Agent->>Docs: Draft Markdown structure
    Me->>Docs: Review, correct, and remove lies
    Me->>Git: Commit only what I am willing to trust later
    Git->>Docs: Publish rendered site

That boundary matters. AI can help with structure and cleanup, but I still own the truth.

The pattern that seems to work is:

Collect the facts.
Let the agent turn them into readable Markdown.
Review the result like it came from a junior admin who is very fast and occasionally wrong.
Commit only what I am willing to trust later.

That makes the documentation process feel less like homework.

AI lowers the activation energy enough that I actually document things.

What I like about it so far

The biggest win is that the lab feels less like a pile.

I can open the docs and see the system as a system.

There is a page for the network. There is a page for storage. There are pages for hosts and services. There are runbooks for failure modes. There is a migration checklist that says what has happened and what has not.

That does not make the lab enterprise-grade. It does not need to.

It makes it legible.

That is enough.

What is still rough

There are still problems.

Some information is manual and will drift.

Some pages are too detailed while others need more detail.

Some generated sections need editing before they feel like something I would want to read during an outage.

The diagrams are useful, but they are not magic.

The migration docs are valuable now, but some of them will eventually become historical notes instead of operational docs.

And the biggest risk is the obvious one: the docs have to stay close to the system. If changing the system does not include changing the docs, then I am just building a nicer-looking archive of lies.
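A cheap guard against that drift is a review date on each page plus a script that flags anything not looked at recently. This is a hypothetical sketch: it assumes each page records a `last_reviewed` date in ISO format (`YYYY-MM-DD`), and the 180-day threshold is an arbitrary placeholder, not a rule the lab actually follows.

```python
from datetime import date, timedelta


def is_stale(last_reviewed: str, today: date, max_age_days: int = 180) -> bool:
    """True if the page was last reviewed more than max_age_days ago."""
    reviewed = date.fromisoformat(last_reviewed)
    return today - reviewed > timedelta(days=max_age_days)


def stale_pages(reviews: dict[str, str], today: date, max_age_days: int = 180) -> list[str]:
    """Return page paths whose review date is too old, sorted for stable output."""
    return sorted(
        page for page, reviewed in reviews.items()
        if is_stale(reviewed, today, max_age_days)
    )


if __name__ == "__main__":
    # Placeholder review dates, for illustration only.
    reviews = {"hosts/odo.md": "2023-01-01", "network/dns.md": "2024-05-01"}
    print(stale_pages(reviews, today=date(2024, 6, 1)))
```

Flagging a page as stale does not mean it is wrong. It means nobody has checked, which is exactly the state a polished site tends to hide.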

What I want next

The next step is to make the docs a little more inventory-driven.

I do not need a full CMDB. I do not need to turn my house into a corporate IT department. But I would like more of the basic host and service facts to come from structured data instead of hand-maintained prose.

Something like:

host inventory
service inventory
ownership / criticality
backup status
public vs internal access
dependency notes
last-reviewed dates

Then the docs could be partly generated and partly written.

That feels like the right split.

Machines are good at facts.

Humans are better at meaning.
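A simple way to get that split is to let a script own only a fenced-off region of a page and leave everything else to hand-written prose. The sketch below is hypothetical: the `Host` fields, the `<!-- inventory:start -->` / `<!-- inventory:end -->` markers, and the criticality values are all assumptions for illustration; only the host roles come from the descriptions earlier in this post.

```python
from dataclasses import dataclass

# Hypothetical markers that fence off the generated part of a page.
START = "<!-- inventory:start -->"
END = "<!-- inventory:end -->"


@dataclass
class Host:
    name: str
    role: str
    criticality: str  # placeholder scale, e.g. "high" / "low"


def render_table(hosts: list[Host]) -> str:
    """Render host facts as a Markdown table, sorted by host name."""
    rows = ["| Host | Role | Criticality |", "| --- | --- | --- |"]
    for host in sorted(hosts, key=lambda h: h.name):
        rows.append(f"| {host.name} | {host.role} | {host.criticality} |")
    return "\n".join(rows) + "\n"


def splice(page: str, generated: str) -> str:
    """Replace only the text between START and END; prose outside is untouched."""
    head, start, rest = page.partition(START)
    _, end, tail = rest.partition(END)
    if not start or not end:
        raise ValueError("page is missing inventory markers")
    return f"{head}{START}\n{generated}{END}{tail}"


if __name__ == "__main__":
    hosts = [
        Host("Odo", "router", "high"),       # criticality is a placeholder
        Host("Rom", "lab / NAS", "low"),     # criticality is a placeholder
    ]
    page = f"# Hosts\n\nWhy each box exists...\n\n{START}\n{END}\n"
    print(splice(page, render_table(hosts)))
```

The facts get regenerated from inventory; the paragraphs explaining why a host exists never get overwritten.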

The actual lesson

The project started as documentation, but the real output was clarity.

Writing down the lab forced me to answer questions I had been carrying around informally:

Which services are actually critical?
Which host is allowed to fail?
What depends on storage?
What depends on DNS?
What has an offsite backup?
What is public?
What is internal-only?
What did I defer on purpose?
What would I do if the router VM did not come back?
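Several of those questions ("What depends on storage?", "What depends on DNS?") are really graph questions. If the dependency notes were kept as data, a small traversal could answer "what breaks if this fails?" directly. This is a hypothetical sketch: the `DEPENDS_ON` map and the `affected_by` helper are illustrative placeholders, not the lab's actual dependency graph.

```python
from collections import deque

# Hypothetical: component -> what it directly depends on.
DEPENDS_ON: dict[str, list[str]] = {
    "Vaultwarden": ["HAProxy", "Dax"],
    "Gitea": ["HAProxy", "storage"],
    "Jellyfin": ["storage"],
    "HAProxy": ["DNS"],
    "DNS": ["Odo"],
    "storage": ["Rom"],
}


def affected_by(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every component that directly or transitively depends on `failed`."""
    # Invert the edges: dependency -> list of components that need it.
    dependents: dict[str, list[str]] = {}
    for component, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(component)

    # Breadth-first walk outward from the failed component.
    seen: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for component in dependents.get(current, []):
            if component not in seen:
                seen.add(component)
                queue.append(component)
    return seen


if __name__ == "__main__":
    print(sorted(affected_by("Odo", DEPENDS_ON)))
```

The `seen` set keeps the walk from looping forever if the notes ever contain a cycle, which in a real homelab they eventually will.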

Those are not dashboard questions.

They are architecture questions.

And for a homelab, the architecture does not have to be fancy. It just has to be understandable enough that when something breaks, I am not relying entirely on memory and vibes.

The goal is not to make the homelab look enterprise. The goal is to make the next change safer.

flowchart TD
    A[Something breaks] --> B{Do I understand the system?}
    B -->|No| C[Guess, SSH around, make it worse]
    B -->|Mostly| D[Open docs]
    D --> E[Find host, service, dependency, runbook]
    E --> F[Make smaller change]
    F --> G[Update docs after the fix]

That loop is the whole point.

netdocs-lab is not finished. It probably never will be.

But now the lab has a map.

That is a lot better than a garage full of boxes.