# drive-health-metrics

Collects per-drive SMART health from **every physical drive** on a host:
direct SATA/SAS, NVMe, and drives hidden behind a RAID controller
(MegaCLI / storcli / perccli). It scores each drive and exports the result as
**CSV**, **InfluxDB** (line protocol / API push / Kafka), and **Prometheus**.

## Modes

The tool runs one-shot by default, or as a long-lived service with `--server`.

### One-shot (default)

Writes CSV or InfluxDB line protocol to stdout once and exits. Run as root
(SMART access requires it):

```
drive-health-metrics                    # CSV to stdout
drive-health-metrics --format influx    # InfluxDB line protocol to stdout (Telegraf exec input)
drive-health-metrics --version
```

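
A one-shot run is convenient for ad-hoc checks from a script. The sketch below filters CSV text for drives whose recommendation calls for attention. The column names `serial` and `recommendation` are assumptions based on the fields named elsewhere in this README, and `drives_needing_attention` is a hypothetical helper; check the actual CSV header before relying on it.

```python
import csv
import io

# Assumed column names; verify them against the real CSV header.
SERIAL_COL = "serial"
RECOMMENDATION_COL = "recommendation"


def drives_needing_attention(csv_text, levels=("REPLACE_NOW", "REPLACE_SOON")):
    """Return the serials of drives whose recommendation is in `levels`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[SERIAL_COL]
        for row in reader
        if row.get(RECOMMENDATION_COL) in levels
    ]
```

Pass it the captured stdout of a one-shot run, for example the `stdout` of `subprocess.run(["drive-health-metrics"], capture_output=True, text=True)`.
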
### Service (`--server`)

Runs continuously, exposing a Prometheus `/metrics` endpoint and (when
configured) pushing to InfluxDB and/or Kafka on a schedule. Each scrape and
each push re-collects fresh SMART data.

```
drive-health-metrics --server                   # Prometheus on :9101/metrics
drive-health-metrics --server --http-port 9200  # override the port
drive-health-metrics --server -c /etc/drive-health-metrics.yaml
```

Send `SIGHUP` to reload the configuration without a full restart.

The InfluxDB measurement and Prometheus metric prefix are both `drive_health`
(e.g. `drive_health_risk_score`, `drive_health_temp_c`). Identity columns
(serial, model, enclosure_slot, …) are attached as tags/labels.

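
To make the naming concrete, the sketch below renders one drive sample in both formats. It follows the general InfluxDB line protocol and Prometheus text exposition rules. The sample values, the tag ordering, and the helper names are illustrative assumptions, not the tool's exact output.

```python
def escape_tag(value):
    """Escape commas, equals signs and spaces in a line-protocol tag value."""
    return str(value).replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def escape_label(value):
    """Escape backslashes, quotes and newlines in a Prometheus label value."""
    return str(value).replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_field(value):
    # Line protocol marks integers with a trailing "i"; floats are written bare.
    return f"{value}i" if isinstance(value, int) else repr(float(value))


def to_line_protocol(tags, fields, measurement="drive_health"):
    """One line: measurement, sorted tags, then fields (timestamp omitted)."""
    tag_part = ",".join(f"{k}={escape_tag(v)}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part}"


def to_prometheus(tags, fields, prefix="drive_health"):
    """One sample line per field, named <prefix>_<field>, tags as labels."""
    labels = ",".join(f'{k}="{escape_label(v)}"' for k, v in sorted(tags.items()))
    return [f"{prefix}_{name}{{{labels}}} {value}" for name, value in fields.items()]


# Hypothetical sample drive.
tags = {"serial": "Z1X2", "model": "ACME 4TB", "enclosure_slot": "32:4"}
fields = {"risk_score": 20, "temp_c": 35}
print(to_line_protocol(tags, fields))
print("\n".join(to_prometheus(tags, fields)))
```

In this sketch the same identity values become tags on the single `drive_health` measurement in InfluxDB and labels on each `drive_health_*` metric in Prometheus.
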
## Configuration

Service mode reads an optional YAML config, searched in this order: the path
given to `-c`/`--config`, then `./config.yaml`,
`~/.config/drive-health-metrics/config.yaml`, and
`/etc/drive-health-metrics.yaml`. Without a file, sensible defaults apply
(Prometheus enabled on `:9101/metrics`, no Influx push).

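
The search order can be sketched as a first-match lookup. `find_config` and its parameters are hypothetical names, and the sketch assumes that a missing `-c` path falls through to the next candidate; the real tool may instead report an error in that case.

```python
from pathlib import Path


def candidate_paths(cli_path=None, cwd=".", home=None,
                    etc_path="/etc/drive-health-metrics.yaml"):
    """List config locations in the documented search order."""
    home = Path(home) if home is not None else Path.home()
    paths = [Path(cli_path)] if cli_path else []
    paths += [
        Path(cwd) / "config.yaml",
        home / ".config" / "drive-health-metrics" / "config.yaml",
        Path(etc_path),
    ]
    return paths


def find_config(cli_path=None, cwd=".", home=None,
                etc_path="/etc/drive-health-metrics.yaml"):
    """Return the first candidate that exists, or None if there is no file."""
    for path in candidate_paths(cli_path, cwd, home, etc_path):
        if path.is_file():
            return path
    return None
```

A `None` result corresponds to the no-file case, where the defaults apply.
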
```yaml
# config.yaml
hostname: ""              # host tag/label; defaults to the system hostname

http_output:
  enabled: true           # Prometheus /metrics endpoint
  bind_addr: ""           # default: all interfaces
  port: 9101
  metrics_path: /metrics

influx_output:
  frequency: 60s          # push interval; 0 (default) disables the push

  # InfluxDB v2 API (all four required to enable)
  influx_server: https://influx.example.com:8086
  token: my-token
  org: my-org
  bucket: drive-health

  # Kafka (brokers + topic required to enable)
  kafka_brokers: ["kafka1:9092", "kafka2:9092"]
  kafka_topic: telegraf
  kafka_username: ""
  kafka_password: ""
  kafka_insecure_skip_verify: false
  kafka_output_format: lineprotocol   # lineprotocol (default) or json
```

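
The comments in the sample state when each push target becomes active. A minimal sketch of those rules follows, assuming the `influx_output` section has been parsed into a plain dictionary. `push_targets` is a hypothetical helper, and treating `0`, `"0"` and `"0s"` as "disabled" is an assumption about how the frequency is written.

```python
def push_targets(influx_output):
    """Return the push targets that the given `influx_output` settings enable."""
    cfg = influx_output or {}

    # A frequency of 0 (the default) disables pushing altogether.
    frequency = cfg.get("frequency", 0)
    if not frequency or str(frequency) in ("0", "0s"):
        return []

    targets = []
    # InfluxDB v2 API: all four settings must be present.
    if all(cfg.get(key) for key in ("influx_server", "token", "org", "bucket")):
        targets.append("influxdb")
    # Kafka: brokers and topic must both be present.
    if cfg.get("kafka_brokers") and cfg.get("kafka_topic"):
        targets.append("kafka")
    return targets
```

With the sample file, this would report both `influxdb` and `kafka`; dropping `bucket`, for instance, would leave only `kafka`.
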
## Recommendation scoring

Each drive gets a `risk_score` and a `recommendation`:

| Recommendation | Meaning |
|----------------|---------|
| `REPLACE_NOW`  | hard defect: drive failing/failed (score ≥ 100) |
| `REPLACE_SOON` | serious wear or accumulating defects (score ≥ 50) |
| `MONITOR`      | early warning signs (score ≥ 20) |
| `OK`           | no meaningful defects (score < 20) |
| `NO_DATA`      | SMART unreadable **and** no controller red flags: re-collect, don't replace |

Only real, drive-attributable defects add meaningful score; missing/unreadable
data is never treated as a failure.

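
The thresholds in the table can be read as a simple cascade. The sketch below is an illustration of that mapping only; the function name and its `smart_readable` and `controller_red_flags` parameters are assumptions, and it says nothing about how the tool computes `risk_score` itself.

```python
def recommendation(risk_score, smart_readable=True, controller_red_flags=False):
    """Map a risk score to a recommendation using the documented thresholds."""
    # Unreadable SMART data with no controller red flags is not a failure.
    if not smart_readable and not controller_red_flags:
        return "NO_DATA"
    if risk_score >= 100:
        return "REPLACE_NOW"
    if risk_score >= 50:
        return "REPLACE_SOON"
    if risk_score >= 20:
        return "MONITOR"
    return "OK"
```

For example, a score of 49 maps to `MONITOR`, while a drive whose SMART data cannot be read and whose controller reports nothing wrong maps to `NO_DATA` regardless of score.
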
## Building

```
make build       # native static binary -> dist/drive-health-metrics
make test        # unit tests (parsers + scoring + exporters)
make snapshot    # local GoReleaser build, no publish
```