The Modern Logging Stack: Loki + Alloy (Why Not Promtail)
Most Loki tutorials still start with helm install loki-stack. That Helm chart bundles Promtail, which Grafana deprecated with the Loki v3.4 release and will stop supporting in March 2026. The Grafana Agent reached end-of-life in November 2025. If you deploy either today, you're installing software that no longer receives security patches.
This cluster never ran Promtail. It was built on Loki v3.6.3 and Grafana Alloy v1.12.2 from the start: one Loki pod in SingleBinary mode, three Alloy pods as a DaemonSet, and 4.17 million log lines per day compressed into 107 MB on disk. The entire logging stack is 4 pods and 640 MB of requested RAM. The deployment took 15 minutes. Finding where it breaks took a month.
This is Part 6 of the homelab series. Part 4 deployed Gateway API for traffic routing. Part 5 added Longhorn for distributed storage. Now we're collecting and storing every log line the cluster generates.
The Deprecation Chain
Grafana's collector story has consolidated three times in two years:
| Collector | Status (Feb 2026) | Replaced By |
|---|---|---|
| Promtail | Deprecated (Loki v3.4), EOL March 2026 | Grafana Alloy |
| Grafana Agent (Static/Flow) | EOL November 2025 | Grafana Alloy |
| Grafana Alloy | Active | — |
Alloy isn't a rebrand. It's Grafana's distribution of the OpenTelemetry Collector, with native support for metrics, logs, traces, and profiles in a programmable configuration language called River. For anyone migrating from Promtail, alloy convert --source-format=promtail translates YAML configs to River syntax ("best-effort" — expect manual tuning). This cluster skipped the migration by starting with Alloy.
Loki: Why SingleBinary
The Loki Helm chart defaults to SimpleScalable mode: 3 read pods, 3 write pods, 3 backend pods, a gateway, and a canary. That's 10+ pods before you store a single log line. For a homelab generating 1.8 GB/day of raw logs, it's massive overkill.
SingleBinary runs every Loki component in one process. The official docs recommend it for up to "a few tens of GB/day." This cluster generates roughly 1/10th of that lower bound.
The Helm values that matter:
deploymentMode: SingleBinary
loki:
  auth_enabled: false
  commonConfig:
    replication_factor: 1
  schemaConfig:
    configs:
      - from: "2024-01-01"
        store: tsdb
        object_store: filesystem
        schema: v13
        index:
          prefix: index_
          period: 24h
singleBinary:
  replicas: 1
  persistence:
    storageClass: longhorn
    size: 10Gi
Three decisions stand out. First, replication_factor: 1 because Longhorn already replicates the PVC across two nodes (Part 5). Adding Loki-level replication would triple write I/O for no additional protection. Second, TSDB with the v13 schema. Older guides reference BoltDB, which has been deprecated since Loki v3.5. TSDB is the current standard for new deployments. Third, retention is 90 days with the compactor enforcing it:
limits_config:
  retention_period: 2160h
  reject_old_samples: true
  reject_old_samples_max_age: 168h
compactor:
  retention_enabled: true
  compaction_interval: 10m
The reject_old_samples_max_age of 168 hours rejects any log timestamped more than 7 days ago. Without this, a misconfigured client could backfill months of old data into Loki in a single push.
Everything else is disabled. No gateway (Cilium handles routing). No canary. No caches. No self-monitoring (Prometheus handles that externally via ServiceMonitors). The full values file explicitly sets read, write, and backend replicas to zero, preventing the chart from sneaking in SimpleScalable components:
backend:
  replicas: 0
read:
  replicas: 0
write:
  replicas: 0
chunksCache:
  enabled: false
resultsCache:
  enabled: false
gateway:
  enabled: false
Install with:
helm install loki oci://ghcr.io/grafana/helm-charts/loki \
--namespace monitoring \
--version 6.49.0 \
--values helm/loki/values.yaml
What the Data Shows
That config has been running since mid-January 2026. Live cluster state:
$ kubectl get pods -n monitoring -l app.kubernetes.io/name=loki -o wide
NAME READY STATUS RESTARTS AGE NODE
loki-0 2/2 Running 0 24d k8s-cp2
$ kubectl get pods -n monitoring -l app.kubernetes.io/name=alloy -o wide
NAME READY STATUS RESTARTS AGE NODE
alloy-9pcf7 2/2 Running 0 24d k8s-cp3
alloy-pk7fh 2/2 Running 0 24d k8s-cp2
alloy-rztt4 2/2 Running 0 24d k8s-cp1
Zero restarts across all 4 pods. Prometheus metrics tell the ingestion story:
| Metric | Value |
|---|---|
| Ingestion rate | 48.3 lines/sec |
| Daily log lines | ~4.17 million |
| Raw volume (uncompressed) | ~1.8 GB/day |
| On-disk volume (Snappy compressed) | ~107 MB/day |
| Compression ratio | 17:1 |
| PVC usage (at snapshot) | 2.58 GB of 10 Gi (24.6%) |
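These figures are internally consistent. A quick back-of-the-envelope check (rounding as the table does):

```python
# Sanity-check the ingestion table above.
lines_per_sec = 48.3
raw_gb_per_day = 1.8     # uncompressed volume
disk_mb_per_day = 107    # Snappy-compressed chunks on disk

daily_lines = lines_per_sec * 86_400                  # seconds per day
compression = (raw_gb_per_day * 1000) / disk_mb_per_day

print(f"{daily_lines / 1e6:.2f}M lines/day")          # ~4.17M
print(f"{compression:.0f}:1 compression")             # ~17:1
```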
Resource consumption:
$ kubectl top pods -n monitoring -l app.kubernetes.io/name=loki
NAME CPU(cores) MEMORY(bytes)
loki-0 11m 402Mi
$ kubectl top pods -n monitoring -l app.kubernetes.io/name=alloy
NAME CPU(cores) MEMORY(bytes)
alloy-9pcf7 7m 238Mi
alloy-pk7fh 9m 147Mi
alloy-rztt4 9m 223Mi
Total actual usage: ~36m CPU and ~1,010 Mi memory across all 4 pods. For comparison, the Helm chart's default SimpleScalable mode deploys 9 Loki pods, a gateway, and a canary before collecting a single log line.
Storage Projection
At 107 MB/day on disk, 90 days of retention projects to ~9.6 GB. The 10 Gi PVC will be close to capacity once the compactor starts deleting old data at the 90-day mark. Bumping to 15-20 Gi before that point is on the to-do list.
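The projection arithmetic, including the GB-vs-GiB wrinkle (the PVC is 10 Gi, which is about 10.7 GB):

```python
# Project 90-day retention against the 10 Gi PVC.
disk_mb_per_day = 107
retention_days = 90
pvc_gib = 10

projected_gb = disk_mb_per_day * retention_days / 1000  # ~9.6 GB at steady state
pvc_gb = pvc_gib * 1024**3 / 1e9                        # 10 Gi ~= 10.74 GB

print(f"projected: {projected_gb:.1f} GB of {pvc_gb:.1f} GB "
      f"({projected_gb / pvc_gb:.0%})")                 # ~90% full
```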
Memory Pressure
Alloy's memory varies by node. On k8s-cp3, the pod uses 238 Mi against a 256 Mi limit (93%). On k8s-cp1, 223 Mi (87%). The AlloyHighMemory alert fires at 80%, so these pods have been triggering warnings. The variance likely correlates with the number of pods on each node generating logs. Adding workloads could push at least one Alloy pod into OOM territory.
Loki's working set (402 Mi) exceeds its memory request (256 Mi) while staying within the 512 Mi limit. Since the scheduler uses requests for placement, Loki would be an early eviction candidate under node memory pressure. Both the Alloy limits and the Loki request need adjusting.
Dropped Entries
Since deployment, Alloy has dropped 97,345 entries with ingester_error across all 3 pods, most likely during the Loki Helm upgrade (revision 2). Over 24 days at ~4.17 million lines/day, the cluster has ingested roughly 100 million entries, so that works out to about 0.1% data loss. No drops from rate limiting, line length, or stream limits. This is the single-replica trade-off: when Loki restarts, the pipeline has nowhere to buffer.
The Alloy Pipeline
Alloy deploys as a DaemonSet, one pod per node. Each pod collects logs only from containers on its own node, scoped by env("HOSTNAME"), which resolves to the node name in a DaemonSet. The pipeline discovers all pods in the cluster, filters to the local node, tails their logs via the Kubernetes API, and pushes them to Loki. The full config in River syntax inside the Helm values:
// Discover all pods in the cluster
discovery.kubernetes "pods" {
  role = "pod"
}

// Filter to this node + extract Kubernetes labels
discovery.relabel "pod_logs" {
  targets = discovery.kubernetes.pods.targets

  // Keep only pods on this node (DaemonSet scope)
  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    action        = "keep"
    regex         = env("HOSTNAME")
  }
  rule {
    source_labels = ["__meta_kubernetes_namespace"]
    target_label  = "namespace"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_name"]
    target_label  = "pod"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_container_name"]
    target_label  = "container"
  }
  rule {
    source_labels = ["__meta_kubernetes_pod_node_name"]
    target_label  = "node"
  }
}

// Tail logs via Kubernetes API
loki.source.kubernetes "pod_logs" {
  targets    = discovery.relabel.pod_logs.output
  forward_to = [loki.process.pod_logs.receiver]
}

// Add cluster label, forward to Loki
loki.process "pod_logs" {
  stage.static_labels {
    values = { cluster = "homelab" }
  }
  forward_to = [loki.write.loki.receiver]
}

loki.write "loki" {
  endpoint {
    url = "http://loki.monitoring.svc.cluster.local:3100/loki/api/v1/push"
  }
}
The discovery.relabel stage extracts Kubernetes metadata (namespace, pod, container, node) as labels and filters by node name. The loki.process stage adds a static cluster = "homelab" label before forwarding to Loki.
API Collection vs HostPath
This pipeline uses loki.source.kubernetes, which tails logs through the Kubernetes API. The alternative, loki.source.file, mounts /var/log/pods as a hostPath volume and reads log files directly from disk.
The API method avoids hostPath mounts entirely. No volume privileges, no filesystem access to the node. The trade-off is performance: the API method opens an HTTP stream per container and routes through the API server, which adds latency and network overhead. On clusters with hundreds of containers, this can measurably load the control plane.
At 48 lines per second across this cluster, the API method runs without issues. Each Alloy pod consumes 7-9m CPU. For larger clusters or higher log volumes, loki.source.file with hostPath volumes is the higher-throughput option.
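For reference, the file-based variant is a small change to the pipeline. A sketch, not this cluster's config; it assumes the Alloy Helm chart is set up to mount the host's /var/log/pods into the pod, and the path glob follows the standard kubelet layout:

```river
// Match per-container log files on the node's filesystem
// (requires a hostPath mount of /var/log/pods)
local.file_match "pod_logs" {
  path_targets = [{"__path__" = "/var/log/pods/*/*/*.log"}]
}

// Tail the files directly instead of streaming through the API server
loki.source.file "pod_logs" {
  targets    = local.file_match.pod_logs.targets
  forward_to = [loki.write.loki.receiver]
}
```

With this approach the namespace/pod/container labels would have to be recovered from the file path (or from Kubernetes discovery relabeling) rather than arriving via the API, so the relabel rules above would need reworking too.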
Kubernetes Events Without Triplicates
Kubernetes events (pod scheduling, image pulls, OOM kills) are cluster-scoped, not node-local. Every Alloy pod in the DaemonSet sees the same events. Without deduplication, 3 pods would push 3 copies of every event into Loki.
A stage.match drop rule handles this:
loki.source.kubernetes_events "cluster_events" {
  log_format = "logfmt"
  forward_to = [loki.process.cluster_events.receiver]
}

loki.process "cluster_events" {
  stage.static_labels {
    values = {
      cluster        = "homelab",
      source         = "kubernetes_events",
      collector_node = env("HOSTNAME"),
    }
  }

  // Only k8s-cp1 forwards events; other nodes drop them
  stage.match {
    selector = "{collector_node!=\"k8s-cp1\"}"
    action   = "drop"
  }

  forward_to = [loki.write.loki.receiver]
}
Each pod tags events with its node name via collector_node. Only k8s-cp1's Alloy instance passes the forwarding rule. The other two drop their events at the processing stage before they reach Loki. Query events in Grafana with {source="kubernetes_events"}.
Alloy also supports a clustering mode with a consistent hashing ring that assigns singleton workloads automatically. For three nodes, the explicit drop rule is simpler to reason about and debug.
Two Paths Into Loki
The logging stack accepts data from two sources through two different protocols:
Alloy pushes Kubernetes logs via the Loki API at /loki/api/v1/push. This is the standard path for container logs collected by the DaemonSet.
An OTel Collector (v0.144.0) pushes application events via Loki's native OTLP endpoint at /otlp. The exporter config is minimal:
exporters:
  otlphttp/loki:
    endpoint: http://loki.monitoring.svc.cluster.local:3100/otlp
Loki v3.x accepts OTLP natively with the TSDB schema. In this cluster, the OTel Collector sends Claude Code telemetry events, queryable in Grafana with {service_name="claude-code"}. Any application instrumented with an OpenTelemetry SDK can ship logs through this same path without Alloy in the middle.
The OTLP endpoint is the forward-looking ingestion path. As more workloads adopt OTel, direct-to-Loki becomes the default; Alloy handles the Kubernetes infrastructure logs that predate OTel instrumentation.
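For context, that exporter sits inside a collector pipeline along these lines. This is a minimal sketch, not the cluster's exact config; the receiver protocol and pipeline wiring are assumptions:

```yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318   # standard OTLP/HTTP port
exporters:
  otlphttp/loki:
    endpoint: http://loki.monitoring.svc.cluster.local:3100/otlp
service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [otlphttp/loki]
```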
Monitoring the Monitors
Seven PrometheusRules watch the logging stack. Four for Loki, three for Alloy:
| Alert | Severity | Condition |
|---|---|---|
| LokiDown | critical | Loki unreachable for 5 min |
| LokiIngestionStopped | warning | Zero lines received for 15 min |
| LokiHighErrorRate | warning | >10% HTTP 5xx rate for 10 min |
| LokiStorageLow | warning | PVC <20% free for 30 min |
| AlloyNotOnAllNodes | warning | Fewer pods than nodes for 10 min |
| AlloyNotSendingLogs | warning | Zero bytes to Loki for 15 min |
| AlloyHighMemory | warning | Pod >80% memory limit for 10 min |
All seven are PromQL (metric-based), scraped via ServiceMonitors at 30-second intervals. They answer "is the logging pipeline working?" They don't answer "what's happening in my applications?" That distinction matters.
What's Missing
The logging stack collects and stores logs. It does not analyze them. Five gaps are worth calling out.
The biggest is log-based alerting. Loki supports ruler-based LogQL alerts, but none are configured here. You can detect "Loki is dead" but not "my application is throwing 500 errors." A rule like count_over_time({namespace="production"} |= "ERROR" [5m]) > 10 would bridge that gap. That's Part 7 material.
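Loki's ruler consumes Prometheus-style rule files with LogQL expressions. A sketch of what such a rule could look like (group and alert names here are illustrative, not from this cluster):

```yaml
groups:
  - name: application-logs
    rules:
      - alert: ProductionErrorBurst
        expr: 'count_over_time({namespace="production"} |= "ERROR" [5m]) > 10'
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10 ERROR lines in production within 5 minutes"
```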
The Alloy pipeline doesn't parse log content. It extracts Kubernetes metadata (namespace, pod, container, node) as labels, but JSON fields inside log lines aren't indexed. Finding a specific error requires full-text |= or regex matches in LogQL, not label selectors.
There's no Loki dashboard in Grafana. Dashboards exist for UPS, kube-vip, and the kube-prometheus-stack defaults, but nothing visualizes Loki's ingestion rate, storage growth, or error rate. The ServiceMonitors expose these metrics; building a dashboard from them is straightforward.
Loki is a single replica. The 97K entries dropped during upgrades show the consequence: when the pod restarts, Alloy's delivery retries eventually exhaust and entries are lost. For log completeness, run 3 SingleBinary replicas with replication_factor: 3. For a homelab where 99.9% retention is acceptable, one pod is fine.
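If completeness mattered more, the Helm values change would be small (a sketch; not what this cluster runs):

```yaml
loki:
  commonConfig:
    replication_factor: 3
singleBinary:
  replicas: 3
```

Note that with the filesystem object store, replicas don't share chunks, so a proper HA setup would also mean moving to shared object storage such as S3 or MinIO.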
No external backup exists. If the Longhorn volume suffers a double node failure, logs are gone. No S3 tier, no cross-cluster shipping. Longhorn's 2-replica protection is the only safety net.
What's Next
This post is the sixth in the "Building a Production-Grade Homelab" series:
- Why kubeadm Over k3s, RKE2, and Talos in 2026
- HA Control Plane with kube-vip: No Load Balancer Needed
- Cilium Deep Dive: What Replacing kube-proxy Actually Means
- Gateway API vs Ingress: No Ingress Controller Needed
- Distributed Storage with Longhorn: 2 Replicas Are Enough
- The Modern Logging Stack: Loki + Alloy (Why Not Promtail) (you are here)
- Alerting That Actually Wakes You Up: Discord, Email, and Dead Man's Switches
- Self-Hosted GitLab: CI/CD Without Cloud Vendor Lock-in
Part 7 covers the alerting layer: Discord webhooks, email notifications, and healthchecks.io dead man's switches for detecting silent failures.
The full Loki and Alloy Helm values, ServiceMonitors, and alert rules live in the homelab repo.