
Distributed Storage with Longhorn: 2 Replicas Are Enough

Every Longhorn guide says 3 replicas. On a 3-node cluster, 2 give identical effective availability with 50% more usable space. The evidence: 22 volumes and months of real data.
[Figure: three server nodes with NVMe drives connected by replication lines, forming a distributed storage pool]

Every Longhorn tutorial tells you to set 3 replicas. On a 3-node cluster, that stores every byte three times, one copy per node, so a third of your raw capacity goes to the third copy. But Kubernetes itself needs a control-plane quorum to function: if two of three nodes die simultaneously, the cluster is already gone, and the third replica protects against a failure that has already killed you. Two replicas provide the same effective availability on a 3-node cluster, with 50% more usable space.

This cluster has been running Longhorn with 2 replicas for months: 22 volumes across 12 namespaces, 229 Gi allocated, 73.6 GiB of actual data, and 2.7 GB of RAM for the entire storage layer. It also has zero backups and zero recurring snapshots. Both the configuration and its gaps are deliberate, and this post covers both.

Local-path provisioner is the common alternative: your PVC binds to a directory on one node. Fast, simple, and pinned to hardware. The moment that node fails, every StatefulSet depending on it is dead until you physically recover the drive. Rook-Ceph goes the other direction: full distributed storage with 6-10 GB of RAM for the control plane alone on a 3-node cluster, plus dedicated unformatted disks. On three Lenovo M80q mini-PCs with 16 GB each, Ceph would consume a third of the cluster's memory before any workload data exists.

Longhorn occupies the middle ground. No global quorum, no dedicated disks, no operator constellation. It runs as 23 pods and consumes 2.7 GB of RAM total. On NVMe SSDs, three drives become a replicated storage pool that survives node failures.

This is Part 5 of the homelab series. Part 3 deployed Cilium as the CNI. Part 4 set up Gateway API for routing. Now we're adding the stateful layer.

The 2-Replica Math

Longhorn defaults to 3 replicas per volume. On this cluster, that default wastes storage:

Replica Count   Raw Storage (3 nodes)   Usable Capacity   Survives
1               ~1,200 GB               ~1,200 GB         0 node failures
2               ~1,200 GB               ~600 GB           1 node failure
3               ~1,200 GB               ~400 GB           2 node failures

With 2 replicas, every write goes to two different nodes. If one node dies, every volume still has a copy on a surviving node. So 2 replicas provide the same effective availability as 3 on a 3-node cluster, with 50% more usable space.

The key Helm values:

defaultSettings:
  defaultReplicaCount: 2
  defaultDataPath: /var/lib/longhorn
  storageMinimalAvailablePercentage: 10  # 40GB buffer per node
  storageOverProvisioningPercentage: 100 # no overprovisioning
  replicaAutoBalance: best-effort

persistence:
  defaultClass: true
  defaultClassReplicaCount: 2
  reclaimPolicy: Delete

longhornManager:
  priorityClass: system-cluster-critical
longhornDriver:
  priorityClass: system-cluster-critical

Two settings deserve attention. Longhorn installs its own longhorn-critical priority class by default, but the Helm values override it with system-cluster-critical, per upstream best practices. This prevents the kubelet from evicting Longhorn pods under memory pressure; without it, a resource spike could evict an instance manager, instantly detaching every volume on that node. And storageOverProvisioningPercentage: 100 disables thin provisioning at the scheduling level: Longhorn won't promise more space than physically exists, which prevents the cascade where etcd fills its disk because Longhorn overcommitted.

How Longhorn Differs from Ceph

Most distributed storage systems manage a shared pool. Ceph stripes data across all OSDs, maintaining a CRUSH map for placement decisions. A failure in one OSD can affect placement groups serving many volumes.

Longhorn takes a different approach: each volume is an independent microservice. When you create a PVC, Longhorn spins up an engine (controller) that presents a block device via iSCSI, plus replicas on separate nodes that store copies of the data. If a volume's engine crashes, only that volume is affected. The other 21 volumes keep serving I/O. No cluster-wide quorum freezes the data plane when a node goes down.
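The per-volume microservice model is visible directly in the CRDs: each attached volume has an Engine object plus one Replica object per copy. A quick way to see them side by side (the column layout is my own, not a Longhorn-documented view):

```shell
# List engines and replicas with the node each process runs on.
# Both CRDs embed the same instance spec, so .spec.nodeID works for either.
kubectl -n longhorn-system get engines.longhorn.io,replicas.longhorn.io \
  --no-headers \
  -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeID,STATE:.status.currentState' \
  | sort -k2
```

Sorting by the node column makes it easy to eyeball which engines and replicas share a node.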

The trade-off is the iSCSI dependency. The V1 data engine routes all I/O through iscsid on each node. On NVMe drives capable of 3,500 MB/s, the kernel-userspace context switches cap throughput around 300-800 MB/s depending on workload. A V2 engine based on SPDK exists as experimental in v1.10.x (promoted to technical preview in v1.11.0), bypassing the kernel entirely. For a homelab prioritizing stability over throughput, V1 is the right choice today.

Setup and Verification

An Ansible playbook installs prerequisites on each node: the /var/lib/longhorn data directory, the iscsid daemon, and the NFS client. The data directory sits on each node's 512 GB NVMe SSD, sharing the root LVM volume rather than a dedicated partition. Longhorn's disk type detection shows "unknown" in the UI because it can't parse the LVM device path, but this is cosmetic.
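Run by hand instead of via Ansible, the prerequisites amount to something like this (package names assume Debian/Ubuntu; adjust for your distribution):

```shell
# iSCSI initiator (V1 data engine) and NFS client (RWX volumes, NFS backup target)
sudo apt-get install -y open-iscsi nfs-common
sudo systemctl enable --now iscsid
# Longhorn's default data path
sudo mkdir -p /var/lib/longhorn
```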

helm install longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --version 1.10.1 \
  --values helm/longhorn/values.yaml

This cluster runs v1.10.1. The v1.10.x line has a v1.10.2 patch with fixes for replica auto-balance and encrypted volume expansion, worth upgrading to before jumping to v1.11. Twenty-three pods come up: three instance managers, three Longhorn managers, three CSI plugins, eight CSI sidecars, plus the UI and engine image pods.

$ kubectl get storageclass
NAME                 PROVISIONER          RECLAIMPOLICY   VOLUMEBINDINGMODE   AGE
longhorn (default)   driver.longhorn.io   Delete          Immediate           52d

$ kubectl -n longhorn-system get nodes.longhorn.io
NAME      READY   ALLOWSCHEDULING   SCHEDULABLE   AGE
k8s-cp1   True    true              True          52d
k8s-cp2   True    true              True          52d
k8s-cp3   True    true              True          52d

Longhorn registers itself as the default StorageClass. Any PVC without an explicit storageClassName gets Longhorn automatically.
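Concretely, a minimal PVC needs no storageClassName at all (demo-data is a hypothetical name, not one of the cluster's 22 volumes):

```yaml
# Hypothetical PVC: with Longhorn as the default StorageClass, omitting
# storageClassName yields a 2-replica Longhorn volume.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
```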

22 Volumes in Production

Every stateful workload in the cluster uses Longhorn. The live PVC inventory:

$ kubectl get pvc -A --no-headers | wc -l
22

Twenty-two PVCs across 12 namespaces, 229 Gi allocated. But allocation tells only half the story. Longhorn uses thin provisioning at the volume level: a 50 Gi PVC only consumes disk space as data is written. The actual numbers:

Metric                         Value
PVCs allocated                 229 Gi
Actual data written            ~73.6 GiB (32% of allocated)
On-disk with 2x replication    ~147 GiB
Raw disk available (3 nodes)   ~1,035 GiB
Disk utilization               ~14%

Prometheus at 50 Gi and GitLab Gitaly at 50 Gi account for 100 Gi of requested space but only ~46.5 GiB of actual data. Prometheus is the real consumer at 45.2 GiB of time-series data. Gitaly holds 50 Gi allocated but just 1.3 GiB used. With 14% disk utilization cluster-wide, there's no pressure to right-size.
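Both numbers live on the Volume CR, so thin provisioning is easy to see per volume: spec.size is the requested size and status.actualSize is what's actually written, both in bytes. A sketch of how I'd compare them (column names are my own):

```shell
# Requested vs. written bytes per volume, converted to GiB with awk.
kubectl -n longhorn-system get volumes.longhorn.io --no-headers \
  -o custom-columns='NAME:.metadata.name,SIZE:.spec.size,ACTUAL:.status.actualSize' \
  | awk '{printf "%-44s %7.1f Gi %7.1f Gi\n", $1, $2/2^30, $3/2^30}'
```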

All 44 replicas (22 volumes x 2) are distributed across nodes:

$ kubectl -n longhorn-system get volumes.longhorn.io \
    -o custom-columns='NAME:.metadata.name,STATE:.status.state,ROBUSTNESS:.status.robustness,REPLICAS:.spec.numberOfReplicas' \
    | head -8
NAME                                       STATE      ROBUSTNESS   REPLICAS
pvc-0a3e7b1c-...                           attached   healthy      2
pvc-1d4f8a2e-...                           attached   healthy      2
pvc-2c5b9d3f-...                           attached   healthy      2
pvc-3e6c0a4b-...                           attached   healthy      2
pvc-4f7d1b5c-...                           detached   unknown      2
pvc-5a8e2c6d-...                           attached   healthy      2
pvc-6b9f3d7e-...                           attached   healthy      2

21 of 22 volumes show healthy robustness. The one unknown is a detached backup volume with no active consumer pod. Longhorn can't assess robustness when no engine is running, which is expected behavior for idle volumes.

The distribution isn't perfectly even: cp1 holds 19 replicas, cp3 holds 16, and cp2 holds 9. I noticed this in the Longhorn UI weeks after initial deployment. The replicaAutoBalance: best-effort setting actively rebalances replicas when it detects skew, but the observed imbalance suggests it tolerates moderate unevenness rather than chasing perfect distribution. The v1.10.2 patch includes replica auto-balance fixes, which may explain why the imbalance persists on v1.10.1. For 22 volumes this has no practical impact, but it's the kind of operational detail that most Longhorn blog posts skip.
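The skew is easy to quantify from the Replica CRs rather than clicking through the UI (this is my own check, not a documented Longhorn command):

```shell
# Count replicas per node: extract each replica's node, then tally.
kubectl -n longhorn-system get replicas.longhorn.io --no-headers \
  -o custom-columns='NODE:.spec.nodeID' \
  | sort | uniq -c | sort -rn
```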

The Resource Cost

Distributed storage isn't free. Running 22 volumes across 3 nodes costs:

Component                       Memory      Note
Instance managers (3)           1,910 Mi    Heaviest: manages all engines + replicas per node
Longhorn managers (3)           549 Mi      Control plane + webhook
CSI plugins (3)                 119 Mi      DaemonSet, one per node
CSI sidecars (8)                125 Mi      Attacher, provisioner, resizer, snapshotter
UI + deployer + engine images   23 Mi       Minimal
Total                           ~2,726 Mi   ~2.7 GB; 5.7% of 48 GB cluster RAM

CPU sits at ~207m total across all pods. For context, Rook-Ceph on a comparable 3-node cluster would consume 6-10 GB of RAM for the storage control plane alone, plus OSD memory. Longhorn's overhead scales with volume count, not cluster size. The instance managers are the biggest consumers because they host the actual engine and replica processes for every active volume on their node.
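The ~2.7 GB total comes from summing kubectl top over the namespace; something like this (requires metrics-server, and assumes memory is reported in Mi):

```shell
# Sum the MEMORY column across all longhorn-system pods.
kubectl -n longhorn-system top pods --no-headers \
  | awk '{gsub(/Mi/,"",$3); mem+=$3} END {printf "%d Mi total\n", mem}'
```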

Longhorn handles databases, Prometheus TSDB, Loki chunks, and application data where replication and failover matter. Bulk media goes to a separate NFS tier: a Dell 3090 running OpenMediaVault handles Immich photos and future media storage where capacity and shared access matter more than HA. Longhorn supports ReadWriteMany via an NFS sidecar, but routing block storage through an NFS server for shared access adds latency and a single point of failure. The dedicated NAS is simpler.

What's Missing (and Why It Matters)

Two features are conspicuously absent from this cluster, and a production deployment should have both.

No backup target. Longhorn supports incremental backups to S3-compatible storage or NFS. The NAS at 10.10.30.4 already has a /Kubernetes export. Pointing Longhorn's backup-target setting to nfs://10.10.30.4:/Kubernetes/Backups would give every volume off-cluster backup capability with deduplication. Without it, a simultaneous failure of two nodes (fire, power surge, bad firmware update) loses all data. The NVMe drives replicate across nodes, not off-site.
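As a Helm values sketch, the fix is small; note the exact key varies across chart versions, and v1.10+ also exposes backup targets as BackupTarget CRs, so verify against your chart's values before applying:

```yaml
# Sketch only: key name may differ by chart version.
defaultSettings:
  backupTarget: nfs://10.10.30.4:/Kubernetes/Backups
```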

No recurring snapshots. Longhorn's RecurringJob CRD supports cron-scheduled snapshots with configurable retention. A daily snapshot with 7-day retention is table stakes for production. The CRD exists in the cluster but has zero jobs configured. Snapshots are local (stored as differencing disk chains on the same nodes), so they protect against accidental deletion and corruption, not hardware failure. They complement backups, not replace them.
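A minimal daily job with 7-day retention could look like this sketch against the v1beta2 API (the "default" group targets volumes that have no explicit recurring-job labels):

```yaml
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: snapshot-daily
  namespace: longhorn-system
spec:
  task: snapshot
  cron: "0 3 * * *"   # 03:00 daily
  retain: 7           # keep a week of snapshots
  concurrency: 2      # volumes snapshotted in parallel
  groups:
    - default
```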

Both gaps exist because the cluster has been running without incident since Longhorn was deployed. That's not a reason to skip them. The next post-publishing task is configuring the NFS backup target and a daily snapshot job for all volumes.

Honest Trade-offs

Longhorn's microservices model has real constraints beyond the iSCSI throughput ceiling.

Single-node attachment. A ReadWriteOnce volume can only be attached to one node at a time. If that node becomes NotReady, Kubernetes waits ~6 minutes before force-detaching the volume and allowing rescheduling. The Longhorn UI has a "Force Detach" button for manual intervention, but the default timeout means your database is down for 6 minutes during an unplanned node failure.

Snapshot chains affect performance. Each snapshot adds a layer to a differencing disk chain. Long chains force the engine to traverse multiple layers for reads. Longhorn's "Snapshot Purge" operation consolidates chains but consumes CPU and I/O. Without recurring cleanup, snapshot chains grow indefinitely.

No multi-path redundancy. Longhorn uses a single TCP connection between engine and replica. On a network with a single switch (like this homelab's 2.5 GbE LIANGUO switch), a switch failure takes down all replication. Dedicated storage networks with bonded NICs mitigate this in enterprise setups.

Data locality disabled. The cluster runs dataLocality: disabled (the default). Enabling best-effort would place one replica on the same node as the consuming pod, reducing read latency by avoiding the network hop. On a 2.5 GbE flat network with 3 nodes, the impact is minimal. On a larger cluster or with write-heavy databases, it's worth enabling.
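Enabling it wouldn't require touching the default class; a second StorageClass can opt specific workloads in (the class name here is hypothetical):

```yaml
# Hypothetical opt-in class: same provisioner, data locality enabled.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-local
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  dataLocality: "best-effort"
```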

What's Next

This post is the fifth in the "Building a Production-Grade Homelab" series:

  1. Why kubeadm Over k3s, RKE2, and Talos in 2026
  2. HA Control Plane with kube-vip: No Load Balancer Needed
  3. Cilium Deep Dive: What Replacing kube-proxy Actually Means
  4. Gateway API vs Ingress: No Ingress Controller Needed
  5. Distributed Storage with Longhorn: 2 Replicas Are Enough (you are here)
  6. The Modern Logging Stack: Loki + Alloy (Why Not Promtail)
  7. Alerting That Actually Wakes You Up: Discord, Email, and Dead Man's Switches
  8. Self-Hosted GitLab: CI/CD Without Cloud Vendor Lock-in

Part 6 covers observability with Loki and Alloy, replacing the deprecated Promtail agent with Grafana's unified telemetry collector.

The full Longhorn Helm values and all PVC manifests live in the homelab repo.