Drive SMART health collector for Prometheus and InfluxDB.
Latest commit: ddafa90a02 by James Coleman, "first commit", 2026-06-22 17:16:34 -05:00
.github/workflows
testdata
.gitignore
.goreleaser.yaml
collect.go
config.go
controller.go
discover.go
drive-health-metrics
drive.go
drive_test.go
exec.go
exporter.go
exporter_test.go
flags.go
go.mod
go.sum
http.go
influx.go
jsonutil.go
LICENSE.txt
main.go
Makefile
output.go
README.md
realdata_test.go
schema.go
score.go
smart_json.go
smart_text.go
VERSION
version.go

drive-health-metrics

Collects per-drive SMART health from every physical drive on a host — direct SATA/SAS, NVMe, and drives hidden behind a RAID controller (MegaCLI / storcli / perccli) — scores each drive, and exports the result as CSV, InfluxDB (line protocol / API push / Kafka), and Prometheus.

Modes

The tool runs one-shot by default and as a long-lived service with --server.

One-shot (default)

Writes CSV or InfluxDB line protocol to stdout once and exits. Run as root (SMART access requires it):

drive-health-metrics                      # CSV to stdout
drive-health-metrics --format influx      # InfluxDB line protocol to stdout (Telegraf exec input)
drive-health-metrics --version

Service (--server)

Runs continuously, exposing a Prometheus /metrics endpoint and (when configured) pushing to InfluxDB and/or Kafka on a schedule. Each scrape and each push re-collects fresh SMART data.

drive-health-metrics --server                       # Prometheus on :9101/metrics
drive-health-metrics --server --http-port 9200      # override the port
drive-health-metrics --server -c /etc/drive-health-metrics.yaml

Send SIGHUP to reload the configuration without a full restart.

The InfluxDB measurement and Prometheus metric prefix are both drive_health (e.g. drive_health_risk_score, drive_health_temp_c). Identity columns (serial, model, enclosure_slot, …) are attached as tags/labels.
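
As a rough illustration of the naming scheme, the sketch below builds a Prometheus metric name and an InfluxDB line-protocol record from the shared `drive_health` prefix. The helper names (`promName`, `lineProtocol`), the sample tag values, and the choice to sort keys and omit a timestamp are assumptions for this example; the tool's actual output may differ in ordering, field types, and which tags it emits.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// prefix is both the InfluxDB measurement and the Prometheus metric prefix.
const prefix = "drive_health"

// promName joins the prefix and a field name, e.g. "drive_health_temp_c".
func promName(field string) string { return prefix + "_" + field }

// Line protocol requires commas, equals signs, and spaces in tag
// values to be backslash-escaped.
var tagEscaper = strings.NewReplacer(",", `\,`, "=", `\=`, " ", `\ `)

// sortedKeys returns the map's keys in ascending order so that the
// generated output is deterministic.
func sortedKeys[V any](m map[string]V) []string {
	keys := make([]string, 0, len(m))
	for k := range m {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

// lineProtocol renders "measurement,tag=value field=value" without a
// timestamp. It assumes at least one field is present.
func lineProtocol(tags map[string]string, fields map[string]float64) string {
	var b strings.Builder
	b.WriteString(prefix)
	for _, k := range sortedKeys(tags) {
		b.WriteString("," + k + "=" + tagEscaper.Replace(tags[k]))
	}
	sep := " "
	for _, k := range sortedKeys(fields) {
		b.WriteString(sep + k + "=" + strconv.FormatFloat(fields[k], 'f', -1, 64))
		sep = ","
	}
	return b.String()
}

func main() {
	// Hypothetical identity tags and readings.
	tags := map[string]string{"serial": "EXAMPLE123", "model": "Example Disk"}
	fields := map[string]float64{"risk_score": 12, "temp_c": 34}

	fmt.Println(promName("risk_score"))
	// prints: drive_health_risk_score
	fmt.Println(lineProtocol(tags, fields))
	// prints: drive_health,model=Example\ Disk,serial=EXAMPLE123 risk_score=12,temp_c=34
}
```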

Configuration

Service mode reads an optional YAML config, searched in this order: the path given to -c/--config, then ./config.yaml, ~/.config/drive-health-metrics/config.yaml, and /etc/drive-health-metrics.yaml. Without a file, sensible defaults apply (Prometheus enabled on :9101/metrics, no Influx push).
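
The search order above can be sketched as a first-match lookup. This is an illustrative reconstruction, not the code in config.go: `candidatePaths`, `findConfig`, and `fileExists` are hypothetical names, and the sketch assumes that a missing `-c` path simply falls through to the next candidate (the real tool may report an error instead).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidatePaths lists config locations in the documented search order.
// explicit is the -c/--config value ("" if not given); home is the
// user's home directory ("" if unknown).
func candidatePaths(explicit, home string) []string {
	var paths []string
	if explicit != "" {
		paths = append(paths, explicit)
	}
	paths = append(paths, "./config.yaml")
	if home != "" {
		paths = append(paths,
			filepath.Join(home, ".config", "drive-health-metrics", "config.yaml"))
	}
	return append(paths, "/etc/drive-health-metrics.yaml")
}

// findConfig returns the first path for which exists reports true.
// ok is false when no candidate exists, meaning defaults apply.
func findConfig(paths []string, exists func(string) bool) (path string, ok bool) {
	for _, p := range paths {
		if exists(p) {
			return p, true
		}
	}
	return "", false
}

// fileExists reports whether p names an existing non-directory.
func fileExists(p string) bool {
	info, err := os.Stat(p)
	return err == nil && !info.IsDir()
}

func main() {
	home, _ := os.UserHomeDir() // empty on error, which skips that candidate
	if p, ok := findConfig(candidatePaths("", home), fileExists); ok {
		fmt.Println("using config:", p)
	} else {
		fmt.Println("no config file found, using defaults")
	}
}
```

Passing the existence check as a function keeps the lookup testable without touching the filesystem.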

# config.yaml
hostname: ""              # host tag/label; defaults to the system hostname

http_output:
  enabled: true           # Prometheus /metrics endpoint
  bind_addr: ""           # default: all interfaces
  port: 9101
  metrics_path: /metrics

influx_output:
  frequency: 60s          # push interval; 0 (default) disables the push

  # InfluxDB v2 API (all four required to enable)
  influx_server: https://influx.example.com:8086
  token: my-token
  org: my-org
  bucket: drive-health

  # Kafka (brokers + topic required to enable)
  kafka_brokers: ["kafka1:9092", "kafka2:9092"]
  kafka_topic: telegraf
  kafka_username: ""
  kafka_password: ""
  kafka_insecure_skip_verify: false
  kafka_output_format: lineprotocol   # lineprotocol (default) or json

Recommendation scoring

Each drive gets a risk_score and a recommendation:

| Recommendation | Meaning |
| --- | --- |
| REPLACE_NOW | hard defect: drive failing/failed (score ≥ 100) |
| REPLACE_SOON | serious wear or accumulating defects (score ≥ 50) |
| MONITOR | early warning signs (score ≥ 20) |
| OK | no meaningful defects (score < 20) |
| NO_DATA | SMART unreadable and no controller red flags; re-collect, don't replace |

Only real, drive-attributable defects add meaningfully to the score; missing or unreadable data is never treated as a failure.
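
The mapping from score to recommendation can be sketched as a threshold ladder. This is an illustration of the thresholds listed above, not the contents of score.go: the function name, the integer score type, and the two boolean inputs are assumptions. How individual SMART attributes contribute to the score is not described here and is left out.

```go
package main

import "fmt"

// recommend maps a risk score to a recommendation using the documented
// thresholds. NO_DATA is returned only when SMART data could not be
// read and the controller reported no red flags, so unreadable data
// alone never counts as a failure.
func recommend(score int, smartReadable, controllerRedFlags bool) string {
	if !smartReadable && !controllerRedFlags {
		return "NO_DATA"
	}
	switch {
	case score >= 100:
		return "REPLACE_NOW"
	case score >= 50:
		return "REPLACE_SOON"
	case score >= 20:
		return "MONITOR"
	default:
		return "OK"
	}
}

func main() {
	fmt.Println(recommend(120, true, false)) // prints: REPLACE_NOW
	fmt.Println(recommend(35, true, false))  // prints: MONITOR
	fmt.Println(recommend(0, false, false))  // prints: NO_DATA
}
```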

Building

make build      # native static binary -> dist/drive-health-metrics
make test       # unit tests (parsers + scoring + exporters)
make snapshot   # local GoReleaser build, no publish