# drive-health-metrics

Collects per-drive SMART health from **every physical drive** on a host: direct SATA/SAS, NVMe, and drives hidden behind a RAID controller (MegaCLI / storcli / perccli). It scores each drive and exports the result as **CSV**, **InfluxDB** (line protocol / API push / Kafka), and **Prometheus**.

## Modes

The tool runs one-shot by default and as a long-lived service with `--server`.

### One-shot (default)

Writes CSV or InfluxDB line protocol to stdout once and exits. Run as root (SMART access requires it):

```
drive-health-metrics                   # CSV to stdout
drive-health-metrics --format influx   # InfluxDB line protocol to stdout (Telegraf exec input)
drive-health-metrics --version
```

### Service (`--server`)

Runs continuously, exposing a Prometheus `/metrics` endpoint and (when configured) pushing to InfluxDB and/or Kafka on a schedule. Each scrape and each push re-collects fresh SMART data.

```
drive-health-metrics --server                    # Prometheus on :9101/metrics
drive-health-metrics --server --http-port 9200   # override the port
drive-health-metrics --server -c /etc/drive-health-metrics.yaml
```

Send `SIGHUP` to reload the configuration without a full restart.

The InfluxDB measurement and Prometheus metric prefix are both `drive_health` (e.g. `drive_health_risk_score`, `drive_health_temp_c`). Identity columns (serial, model, enclosure_slot, …) are attached as tags/labels.

## Configuration

Service mode reads an optional YAML config, searched in this order:

1. the path given to `-c`/`--config`
2. `./config.yaml`
3. `~/.config/drive-health-metrics/config.yaml`
4. `/etc/drive-health-metrics.yaml`

Without a file, sensible defaults apply (Prometheus enabled on `:9101/metrics`, no Influx push).

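The search order above can be illustrated with a small Go sketch. This is not the tool's actual code: `candidatePaths`, `firstExisting`, and `fileExists` are hypothetical helper names, and the sketch assumes that a missing `-c` path simply falls through to the next candidate (the real tool may instead report an error).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// candidatePaths lists config locations in the documented search order.
// explicit is the value of -c/--config ("" when the flag is absent);
// home is the user's home directory ("" when it is unknown).
// Hypothetical helper, not part of drive-health-metrics.
func candidatePaths(explicit, home string) []string {
	var paths []string
	if explicit != "" {
		paths = append(paths, explicit)
	}
	paths = append(paths, "./config.yaml")
	if home != "" {
		paths = append(paths,
			filepath.Join(home, ".config", "drive-health-metrics", "config.yaml"))
	}
	paths = append(paths, "/etc/drive-health-metrics.yaml")
	return paths
}

// firstExisting returns the first path for which exists reports true.
// Taking exists as a parameter keeps the lookup testable without a filesystem.
func firstExisting(paths []string, exists func(string) bool) (string, bool) {
	for _, p := range paths {
		if exists(p) {
			return p, true
		}
	}
	return "", false
}

// fileExists reports whether p names an existing non-directory entry.
func fileExists(p string) bool {
	info, err := os.Stat(p)
	return err == nil && !info.IsDir()
}

func main() {
	home, _ := os.UserHomeDir() // an error leaves home empty, which skips that candidate
	if p, ok := firstExisting(candidatePaths("", home), fileExists); ok {
		fmt.Println("using config:", p)
	} else {
		fmt.Println("no config file found; using built-in defaults")
	}
}
```

Passing the existence check in as a function is only a convenience for the sketch; it lets the ordering be checked against a fake filesystem.
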
```yaml
# config.yaml
hostname: ""            # host tag/label; defaults to the system hostname

http_output:
  enabled: true         # Prometheus /metrics endpoint
  bind_addr: ""         # default: all interfaces
  port: 9101
  metrics_path: /metrics

influx_output:
  frequency: 60s        # push interval; 0 (default) disables the push

  # InfluxDB v2 API (all four required to enable)
  influx_server: https://influx.example.com:8086
  token: my-token
  org: my-org
  bucket: drive-health

  # Kafka (brokers + topic required to enable)
  kafka_brokers: ["kafka1:9092", "kafka2:9092"]
  kafka_topic: telegraf
  kafka_username: ""
  kafka_password: ""
  kafka_insecure_skip_verify: false
  kafka_output_format: lineprotocol   # lineprotocol (default) or json
```

## Recommendation scoring

Each drive gets a `risk_score` and a `recommendation`:

| Recommendation | Meaning |
|----------------|---------|
| `REPLACE_NOW` | hard defect: drive failing/failed (score ≥ 100) |
| `REPLACE_SOON` | serious wear or accumulating defects (≥ 50) |
| `MONITOR` | early warning signs (≥ 20) |
| `OK` | no meaningful defects (< 20) |
| `NO_DATA` | SMART unreadable **and** no controller red flags; re-collect, don't replace |

Only real, drive-attributable defects add meaningful score; missing/unreadable data is never treated as a failure.

## Building

```
make build      # native static binary -> dist/drive-health-metrics
make test       # unit tests (parsers + scoring + exporters)
make snapshot   # local GoReleaser build, no publish
```
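
As a reading aid for the recommendation table above, the thresholds can be written as a small Go function. This is a sketch, not the tool's scoring code: `recommend` and its parameters are hypothetical, the score is assumed to be a plain number, and it assumes that a drive with unreadable SMART data but controller red flags is still classified by its score.

```go
package main

import "fmt"

// recommend maps a risk score to the labels in the recommendation table.
// smartReadable and controllerRedFlags are hypothetical inputs standing in
// for whatever the real collector reports.
func recommend(score float64, smartReadable, controllerRedFlags bool) string {
	// Missing data alone is never treated as a failure.
	if !smartReadable && !controllerRedFlags {
		return "NO_DATA"
	}
	switch {
	case score >= 100:
		return "REPLACE_NOW"
	case score >= 50:
		return "REPLACE_SOON"
	case score >= 20:
		return "MONITOR"
	default:
		return "OK"
	}
}

func main() {
	// Walk a few sample scores through the thresholds.
	for _, s := range []float64{0, 20, 50, 100} {
		fmt.Printf("score %.0f -> %s\n", s, recommend(s, true, false))
	}
	// Unreadable SMART data with no controller red flags.
	fmt.Println(recommend(0, false, false))
}
```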