
# drive-health-metrics

Collects per-drive SMART health from **every physical drive** on a host,
including direct SATA/SAS drives, NVMe drives, and drives hidden behind a RAID
controller (MegaCLI / storcli / perccli). It scores each drive and exports the
result as **CSV**, **InfluxDB** (line protocol to stdout, API push, or Kafka),
or **Prometheus** metrics.

## Modes

By default the tool runs once and exits (one-shot). With `--server` it runs as
a long-lived service.

### One-shot (default)

Collects once, writes CSV (the default) or InfluxDB line protocol to stdout,
and exits. Run it as root, because reading SMART data requires it:

```
drive-health-metrics                      # CSV to stdout
drive-health-metrics --format influx      # InfluxDB line protocol to stdout (Telegraf exec input)
drive-health-metrics --version
```
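
As a rough sketch of consuming the one-shot CSV from a script, the snippet below filters rows by recommendation. It assumes the CSV header uses the column names `serial`, `model`, `risk_score`, and `recommendation`; the sample rows are made up for illustration, and the helper name `flagged_drives` is hypothetical.

```python
import csv
import io

def flagged_drives(csv_text, ignore=("OK",)):
    """Return the CSV rows whose recommendation is not in `ignore`.

    Assumes a header row with a `recommendation` column (an assumption,
    check the real output of `drive-health-metrics`).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row.get("recommendation") not in ignore]

# Hypothetical sample; real output is expected to carry more columns.
SAMPLE = (
    "serial,model,risk_score,recommendation\n"
    "S1,ExampleDisk A,0,OK\n"
    "S2,ExampleDisk B,55,REPLACE_SOON\n"
)

if __name__ == "__main__":
    for row in flagged_drives(SAMPLE):
        print(row["serial"], row["recommendation"])
```

In practice the text would come from the tool's stdout (for example via `subprocess.run([...], capture_output=True, text=True)`) rather than from a hard-coded sample.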

### Service (`--server`)

Runs continuously, exposing a Prometheus `/metrics` endpoint and (when
configured) pushing to InfluxDB and/or Kafka on a schedule. Each scrape and
each push re-collects fresh SMART data.

```
drive-health-metrics --server                       # Prometheus on :9101/metrics
drive-health-metrics --server --http-port 9200      # override the port
drive-health-metrics --server -c /etc/drive-health-metrics.yaml
```

Send `SIGHUP` to reload the configuration without a full restart.

The InfluxDB measurement name and the Prometheus metric prefix are both
`drive_health` (e.g. the Prometheus metrics `drive_health_risk_score` and
`drive_health_temp_c`). Identity columns (serial, model, enclosure_slot, …)
are attached as InfluxDB tags and Prometheus labels.
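
To illustrate how the metric name and labels fit together, here is a minimal sketch that pulls `drive_health_risk_score` values out of Prometheus text-format output, keyed by the `serial` label. The sample text, the `gauge` type line, and the helper name `risk_scores_by_serial` are assumptions for illustration, not captured output from the tool.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
# One label pair: key="value", allowing backslash escapes inside the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

def risk_scores_by_serial(metrics_text):
    """Map each drive's `serial` label to its drive_health_risk_score value."""
    scores = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):          # skip HELP/TYPE comments
            continue
        m = SAMPLE_LINE_RE.match(line)
        if not m or m.group("name") != "drive_health_risk_score":
            continue
        labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        scores[labels.get("serial", "")] = float(m.group("value"))
    return scores

# Hypothetical scrape body; label set and values are invented.
SAMPLE_METRICS = "\n".join([
    "# TYPE drive_health_risk_score gauge",
    'drive_health_risk_score{serial="S1",model="ExampleDisk A"} 0',
    'drive_health_risk_score{serial="S2",model="ExampleDisk B"} 55',
    'drive_health_temp_c{serial="S1",model="ExampleDisk A"} 34',
])

if __name__ == "__main__":
    print(risk_scores_by_serial(SAMPLE_METRICS))
```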

## Configuration

Service mode reads an optional YAML config, searched in this order: the path
given to `-c`/`--config`, then `./config.yaml`,
`~/.config/drive-health-metrics/config.yaml`, and
`/etc/drive-health-metrics.yaml`. Without a config file, the defaults apply:
Prometheus enabled on `:9101/metrics` and no InfluxDB or Kafka push.
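
The search order can be sketched as a first-match lookup. This is an illustration only: the function names are hypothetical, and it assumes that a missing `-c` path simply falls through to the next candidate, whereas the real tool may report an error instead.

```python
from pathlib import Path

def candidate_config_paths(cli_path=None, home=None):
    """List config locations in the documented search order."""
    home = Path(home) if home is not None else Path.home()
    paths = []
    if cli_path:                                   # value of -c / --config
        paths.append(Path(cli_path))
    paths += [
        Path("config.yaml"),                       # ./config.yaml
        home / ".config" / "drive-health-metrics" / "config.yaml",
        Path("/etc/drive-health-metrics.yaml"),
    ]
    return paths

def find_config(cli_path=None, home=None, exists=Path.is_file):
    """Return the first candidate that exists, or None to use defaults."""
    for path in candidate_config_paths(cli_path, home):
        if exists(path):
            return path
    return None

if __name__ == "__main__":
    print(find_config() or "no config file found; defaults apply")
```

The `exists` parameter is there only so the lookup can be exercised without touching the filesystem.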

```yaml
# config.yaml
hostname: ""              # host tag/label; defaults to the system hostname

http_output:
  enabled: true           # Prometheus /metrics endpoint
  bind_addr: ""           # default: all interfaces
  port: 9101
  metrics_path: /metrics

influx_output:
  frequency: 60s          # push interval; 0 (default) disables the push

  # InfluxDB v2 API (all four required to enable)
  influx_server: https://influx.example.com:8086
  token: my-token
  org: my-org
  bucket: drive-health

  # Kafka (brokers + topic required to enable)
  kafka_brokers: ["kafka1:9092", "kafka2:9092"]
  kafka_topic: telegraf
  kafka_username: ""
  kafka_password: ""
  kafka_insecure_skip_verify: false
  kafka_output_format: lineprotocol   # lineprotocol (default) or json
```
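
The enablement rules in the comments above can be read as the following sketch, which takes the already-parsed config as a plain dict. It is not the tool's actual logic: the function name and output labels are hypothetical, the duration check is simplified to "anything other than zero", and it assumes the Kafka push is gated by the same `frequency` setting as the InfluxDB API push.

```python
INFLUX_API_KEYS = ("influx_server", "token", "org", "bucket")
KAFKA_KEYS = ("kafka_brokers", "kafka_topic")

def enabled_outputs(cfg):
    """Return which outputs a parsed config would turn on (sketch)."""
    http = cfg.get("http_output") or {}
    influx = cfg.get("influx_output") or {}
    outputs = []

    # Prometheus is on unless explicitly disabled.
    if http.get("enabled", True):
        outputs.append("prometheus")

    # A frequency of 0 (the default) disables pushing.
    frequency = str(influx.get("frequency", 0)).strip()
    push_on = frequency not in ("", "0", "0s", "None")

    # InfluxDB v2 API: all four settings must be present.
    if push_on and all(influx.get(k) for k in INFLUX_API_KEYS):
        outputs.append("influx_api")
    # Kafka: brokers and topic must both be present.
    if push_on and all(influx.get(k) for k in KAFKA_KEYS):
        outputs.append("kafka")
    return outputs

if __name__ == "__main__":
    example = {
        "influx_output": {
            "frequency": "60s",
            "kafka_brokers": ["kafka1:9092"],
            "kafka_topic": "telegraf",
        }
    }
    print(enabled_outputs(example))
```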

## Recommendation scoring

Each drive gets a `risk_score` and a `recommendation`:

| Recommendation | Meaning |
|----------------|---------|
| `REPLACE_NOW`  | hard defect: drive failing or failed (score ≥ 100) |
| `REPLACE_SOON` | serious wear or accumulating defects (50 ≤ score < 100) |
| `MONITOR`      | early warning signs (20 ≤ score < 50) |
| `OK`           | no meaningful defects (score < 20) |
| `NO_DATA`      | SMART unreadable **and** no controller red flags; re-collect, don't replace |

Only real, drive-attributable defects add meaningfully to the score; missing
or unreadable data is never treated as a failure.
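
The thresholds in the table map to a recommendation roughly as follows. This is a minimal sketch, not the tool's scoring code: the function and parameter names are hypothetical, and it assumes that `NO_DATA` is decided before the score thresholds are consulted.

```python
def recommendation_for(score, smart_readable=True, controller_red_flags=False):
    """Map a risk score to a recommendation using the documented thresholds."""
    # Unreadable SMART with no controller red flags means "re-collect",
    # never "replace".
    if not smart_readable and not controller_red_flags:
        return "NO_DATA"
    if score >= 100:
        return "REPLACE_NOW"
    if score >= 50:
        return "REPLACE_SOON"
    if score >= 20:
        return "MONITOR"
    return "OK"

if __name__ == "__main__":
    for score in (0, 20, 50, 100):
        print(score, recommendation_for(score))
```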

## Building

```
make build      # native static binary -> dist/drive-health-metrics
make test       # unit tests (parsers + scoring + exporters)
make snapshot   # local GoReleaser build, no publish
```