drive-health-metrics
Collects per-drive SMART health from every physical drive on a host — direct SATA/SAS, NVMe, and drives hidden behind a RAID controller (MegaCLI / storcli / perccli) — scores each drive, and exports the result as CSV, InfluxDB (line protocol / API push / Kafka), and Prometheus.
Modes
The tool runs one-shot by default, or as a long-lived service when started with --server.
One-shot (default)
Writes CSV or InfluxDB line protocol to stdout once and exits. Run as root (SMART access requires it):
```shell
drive-health-metrics                  # CSV to stdout
drive-health-metrics --format influx  # InfluxDB line protocol to stdout (Telegraf exec input)
drive-health-metrics --version
```
Service (--server)
Runs continuously, exposing a Prometheus /metrics endpoint and (when
configured) pushing to InfluxDB and/or Kafka on a schedule. Each scrape and
each push re-collects fresh SMART data.
```shell
drive-health-metrics --server                    # Prometheus on :9101/metrics
drive-health-metrics --server --http-port 9200   # override the port
drive-health-metrics --server -c /etc/drive-health-metrics.yaml
```
Send SIGHUP to reload the configuration without a full restart.
The InfluxDB measurement and Prometheus metric prefix are both drive_health
(e.g. drive_health_risk_score, drive_health_temp_c). Identity columns
(serial, model, enclosure_slot, …) are attached as tags/labels.
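To make the naming concrete, here is a minimal sketch of how one drive's values could be rendered for both backends. It is not the tool's actual exporter: the helper names `to_line_protocol` and `to_prometheus`, the sample drive values, and the choice of integer versus float fields are all assumptions made for illustration. Only the `drive_health` prefix, the `risk_score` and `temp_c` names, and the identity tags come from the text above.

```python
def _escape_tag(value):
    # InfluxDB line protocol: commas, spaces and equals signs in tag
    # keys/values must be backslash-escaped.
    return str(value).replace(",", "\\,").replace(" ", "\\ ").replace("=", "\\=")


def _format_field(value):
    # Line protocol marks integers with a trailing "i"; floats are written bare.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    return repr(float(value))


def to_line_protocol(tags, fields, measurement="drive_health"):
    """Render one drive as a single line-protocol record (no timestamp)."""
    tag_part = ",".join(f"{_escape_tag(k)}={_escape_tag(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{_escape_tag(k)}={_format_field(v)}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part}"


def to_prometheus(tags, fields, prefix="drive_health"):
    """Render one drive as Prometheus text-format samples, one per field."""
    def escape_label(value):
        return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

    labels = ",".join(f'{k}="{escape_label(v)}"' for k, v in sorted(tags.items()))
    return [f"{prefix}_{name}{{{labels}}} {value}" for name, value in fields.items()]


# Hypothetical sample drive; the values are made up for illustration.
tags = {"serial": "S123", "model": "ACME SSD 1T", "enclosure_slot": "252:3"}
fields = {"risk_score": 20, "temp_c": 41.5}

print(to_line_protocol(tags, fields))
for sample in to_prometheus(tags, fields):
    print(sample)
```

The same identity values end up as tags in one format and as labels in the other, which is why both outputs can be joined on `serial` downstream.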
Configuration
Service mode reads an optional YAML config, searched in this order: the path
given to -c/--config, then ./config.yaml,
~/.config/drive-health-metrics/config.yaml, and
/etc/drive-health-metrics.yaml. Without a file, sensible defaults apply
(Prometheus enabled on :9101/metrics, no Influx push).
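The search order can be sketched as a first-match lookup. This is an illustration, not the tool's code: `resolve_config` is a hypothetical name, and the sketch assumes that a `-c` path which does not exist simply falls through to the next candidate, which the real tool may instead treat as an error.

```python
import os


def resolve_config(cli_path=None, exists=os.path.isfile):
    """Return the first config file found in the documented order, or None.

    `exists` is injectable so the lookup can be exercised without touching disk.
    """
    candidates = []
    if cli_path:
        candidates.append(cli_path)  # -c/--config always wins when given
    candidates += [
        "./config.yaml",
        os.path.expanduser("~/.config/drive-health-metrics/config.yaml"),
        "/etc/drive-health-metrics.yaml",
    ]
    for path in candidates:
        if exists(path):
            return path
    return None  # no file found: the built-in defaults apply


if __name__ == "__main__":
    print(resolve_config() or "no config file found, using defaults")
```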
```yaml
# config.yaml
hostname: ""                # host tag/label; defaults to the system hostname

http_output:
  enabled: true             # Prometheus /metrics endpoint
  bind_addr: ""             # default: all interfaces
  port: 9101
  metrics_path: /metrics

influx_output:
  frequency: 60s            # push interval; 0 (default) disables the push

  # InfluxDB v2 API (all four required to enable)
  influx_server: https://influx.example.com:8086
  token: my-token
  org: my-org
  bucket: drive-health

  # Kafka (brokers + topic required to enable)
  kafka_brokers: ["kafka1:9092", "kafka2:9092"]
  kafka_topic: telegraf
  kafka_username: ""
  kafka_password: ""
  kafka_insecure_skip_verify: false
  kafka_output_format: lineprotocol   # lineprotocol (default) or json
```
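The enablement rules in the comments above (a non-zero frequency, all four InfluxDB API keys, brokers plus topic for Kafka) can be expressed as a small check. This is a sketch under assumptions: `enabled_pushes` and the target names `influx_api` and `kafka` are hypothetical, and the sketch treats `0`, `"0"`, `"0s"`, or a missing `frequency` as disabled without parsing durations.

```python
REQUIRED_API_KEYS = ("influx_server", "token", "org", "bucket")


def enabled_pushes(influx_output):
    """Return the push targets implied by an influx_output mapping."""
    frequency = influx_output.get("frequency", 0)
    # Assumption: any spelling of zero, or no value at all, disables pushing.
    if frequency in (0, "0", "0s", None, ""):
        return []

    targets = []
    # The InfluxDB v2 API push needs all four settings to be non-empty.
    if all(influx_output.get(key) for key in REQUIRED_API_KEYS):
        targets.append("influx_api")
    # Kafka needs at least one broker and a topic.
    if influx_output.get("kafka_brokers") and influx_output.get("kafka_topic"):
        targets.append("kafka")
    return targets


sample = {
    "frequency": "60s",
    "influx_server": "https://influx.example.com:8086",
    "token": "my-token",
    "org": "my-org",
    "bucket": "drive-health",
    "kafka_brokers": ["kafka1:9092", "kafka2:9092"],
    "kafka_topic": "telegraf",
}
print(enabled_pushes(sample))
```

A check like this is handy before deploying a config change, since a single missing key silently turns a push target off rather than producing an error.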
Recommendation scoring
Each drive gets a risk_score and a recommendation:
| Recommendation | Meaning |
|---|---|
| REPLACE_NOW | hard defect: drive failing/failed (score ≥ 100) |
| REPLACE_SOON | serious wear or accumulating defects (score ≥ 50) |
| MONITOR | early warning signs (score ≥ 20) |
| OK | no meaningful defects (score < 20) |
| NO_DATA | SMART unreadable and no controller red flags; re-collect, don't replace |
Only real, drive-attributable defects add meaningful score; missing/unreadable data is never treated as a failure.
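The thresholds in the table map to a simple cascade. The sketch below is illustrative only: `recommend` and its `smart_readable` and `controller_red_flags` parameters are hypothetical names, and how the real tool decides that data is unreadable is not described here.

```python
def recommend(risk_score, smart_readable=True, controller_red_flags=False):
    """Map a risk score to a recommendation using the documented thresholds."""
    # Missing data is never treated as a failure: with no SMART data and no
    # controller red flags, the answer is to re-collect, not replace.
    if not smart_readable and not controller_red_flags:
        return "NO_DATA"
    if risk_score >= 100:
        return "REPLACE_NOW"
    if risk_score >= 50:
        return "REPLACE_SOON"
    if risk_score >= 20:
        return "MONITOR"
    return "OK"


for score in (0, 20, 50, 100):
    print(score, recommend(score))
```

Checking the highest threshold first means each band needs only its lower bound, so a score of 75 lands in REPLACE_SOON without an explicit upper limit.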
Building
```shell
make build      # native static binary -> dist/drive-health-metrics
make test       # unit tests (parsers + scoring + exporters)
make snapshot   # local GoReleaser build, no publish
```