Prometheus metrics

GET /metrics

Driftstack exposes an in-process counter registry over a single Prometheus-compatible scrape endpoint. The response uses the Prometheus plain-text exposition format (text/plain; version=0.0.4; charset=utf-8), so any Prometheus-compatible scraper (Prometheus itself, VictoriaMetrics, Grafana Agent, OpenTelemetry Collector with the Prometheus receiver) can consume it.

This page is for operators, not API consumers — you only need it if you’re integrating Driftstack into your own observability stack.

Auth

The endpoint is publicly addressable (so external scrapers can reach it without needing an internal-only path) but bearer-token gated:

GET /metrics HTTP/1.1
Host: api.driftstack.dev
Authorization: Bearer <METRICS_SCRAPE_TOKEN>

Missing / wrong token → 401. Token-unset deployments → 503 (the gate is opt-in; the registry isn’t constructed unless the token env var is wired).
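As a rough illustration of the gate described above, a scraper-side check might look like the sketch below. The function and type names are hypothetical, and the sketch assumes a successful scrape returns 200; only the Authorization header shape and the 401 / 503 meanings come from this page.

```typescript
// Outcome of one scrape attempt, following the status codes described above.
type ScrapeOutcome = "ok" | "unauthorized" | "metrics_disabled" | "unexpected";

function classifyScrapeStatus(status: number): ScrapeOutcome {
  if (status === 200) return "ok"; // assumed success status
  if (status === 401) return "unauthorized"; // missing or wrong token
  if (status === 503) return "metrics_disabled"; // token env var not set on the deployment
  return "unexpected";
}

// Minimal fetch-like signature so the sketch can be exercised without a network.
type HttpGet = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ status: number; text(): Promise<string> }>;

async function scrapeMetrics(
  httpGet: HttpGet,
  baseUrl: string,
  token: string,
): Promise<{ outcome: ScrapeOutcome; body: string }> {
  const res = await httpGet(`${baseUrl}/metrics`, {
    headers: { Authorization: `Bearer ${token}` },
  });
  const outcome = classifyScrapeStatus(res.status);
  // Only a successful scrape carries exposition text worth keeping.
  return { outcome, body: outcome === "ok" ? await res.text() : "" };
}
```

On Node 18 or later the global fetch can be passed as httpGet, for example scrapeMetrics(fetch, "https://api.driftstack.dev", token).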

The token rotates on the same cadence as other internal credentials and is provisioned via the deploy bridge (/opt/driftstack/api/.env).

Cardinality

All exposed counters use bounded label sets: every label value comes from a closed enum or a namespace prefix. There are no account-id, api-key-id, or IP-address labels. The total time-series count is dominated by the cross-product of small enums, so the scrape stays small and fits comfortably under a typical sample_limit setting (Prometheus applies no sample_limit by default).
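The cross-product bound can be made concrete with a small calculation. This is a sketch only: maxSeries is a hypothetical helper, the bucket names are invented placeholders, and the outcome values are the ones listed in the catalogue on this page.

```typescript
// Upper bound on series for one counter: the product of its label enum sizes.
function maxSeries(labelValues: Record<string, readonly string[]>): number {
  return Object.keys(labelValues).reduce(
    (product, key) => product * labelValues[key].length,
    1,
  );
}

// Outcome values as listed for driftstack_webhook_delivery_attempt_total.
const deliveryAttemptSeries = maxSeries({
  outcome: ["success", "http_error", "timeout", "transport_error"],
});

// Hypothetical bucket names; only the allowed / exceeded outcomes come from the catalogue.
const rateLimitSeries = maxSeries({
  bucket: ["bucket_a", "bucket_b", "bucket_c"],
  outcome: ["allowed", "exceeded"],
});

console.log(deliveryAttemptSeries, rateLimitSeries);
```

Because every factor is a small closed set, adding a label multiplies the bound by a known constant rather than by the number of accounts or clients.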

Catalogue

The current counter catalogue (all driftstack_* namespaced):

Foundational

| Metric | Labels | What it tracks |
| --- | --- | --- |
| driftstack_http_request_total | method, route, status_class | Every HTTP request. route is the Fastify route template (e.g. /v1/sessions/:id), never the raw URL, which keeps cardinality bounded by the registered-route count. status_class is 1xx to 5xx. |
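A minimal sketch of how the three labels could be derived for one request. The helper names are hypothetical, and throwing on out-of-range codes is an assumption rather than documented Driftstack behaviour.

```typescript
type StatusClass = "1xx" | "2xx" | "3xx" | "4xx" | "5xx";

// Collapses a status code into its class so the label has at most five values.
function statusClass(statusCode: number): StatusClass {
  const hundreds = Math.floor(statusCode / 100);
  if (hundreds < 1 || hundreds > 5) {
    // Assumption: codes outside 100..599 are rejected instead of being labelled.
    throw new RangeError(`not an HTTP status code: ${statusCode}`);
  }
  return `${hundreds}xx` as StatusClass;
}

// Label set for one request. The route argument is the registered template, not the raw URL.
function httpRequestLabels(method: string, routeTemplate: string, statusCode: number) {
  return { method, route: routeTemplate, status_class: statusClass(statusCode) };
}
```

Called as httpRequestLabels("GET", "/v1/sessions/:id", 404), every session id lands in the same series, which is what keeps the route label bounded.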

Auth + rate limiting

| Metric | Labels | What it tracks |
| --- | --- | --- |
| driftstack_auth_total | outcome | requireAuth resolution outcomes (ok / unauthorized / invalid / revoked / expired / forbidden / error) |
| driftstack_rate_limit_total | bucket, outcome | Rate-limit consumes per bucket × outcome (allowed / exceeded) |
| driftstack_oauth_token_total | outcome | OAuth /token exchange outcomes (ok + the OAuthError code set + error) |

Agent + LLM rails

| Metric | Labels | What it tracks |
| --- | --- | --- |
| driftstack_agent_decompose_total | result_kind | Agent decompose() calls by result (plan / clarify / refuse) |
| driftstack_pair_mode_transition_total | from, to | Pair-mode state-machine transitions |
| driftstack_bundled_llm_request_total | outcome | Bundled-LLM decompose requests by outcome |
| driftstack_bundled_llm_error_total | kind | Bundled-LLM decompose errors (consent_missing / budget_exhausted) |
| driftstack_byok_anthropic_test_total | outcome | BYOK Anthropic /test endpoint outcomes (ok / invalid / quota_exceeded / not_set / not_wired / unknown) |

Webhook ingress

| Metric | Labels | What it tracks |
| --- | --- | --- |
| driftstack_stripe_webhook_total | outcome | Stripe inbound webhook outcomes (handled / duplicate / ignored / error / signature_invalid / signature_missing / empty_body / malformed_event) |
| driftstack_nowpayments_webhook_total | outcome | NOWPayments IPN outcomes (ok / signature_invalid / signature_missing / empty_body / malformed_event) |

Webhook delivery (outbound)

| Metric | Labels | What it tracks |
| --- | --- | --- |
| driftstack_webhook_delivery_attempt_total | outcome | Every dispatcher attempt to a customer’s endpoint (success / http_error / timeout / transport_error) |
| driftstack_webhook_delivery_terminal_total | terminal_state | Terminal-state transitions only: delivered on first 2xx, dlq when retries exhaust (DEFAULT_MAX_ATTEMPTS = 6) |
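The relationship between the two delivery counters can be sketched as follows: every attempt increments the attempt counter, and the terminal counter increments once, when a delivery either succeeds or runs out of retries. The function below is a hypothetical illustration, not the dispatcher's code, and it assumes the success outcome corresponds to a 2xx response.

```typescript
const DEFAULT_MAX_ATTEMPTS = 6; // value quoted in the catalogue entry above

type AttemptOutcome = "success" | "http_error" | "timeout" | "transport_error";
type TerminalState = "delivered" | "dlq";

// Returns the terminal state reached after the given attempts, or null while retries remain.
function terminalState(
  attempts: readonly AttemptOutcome[],
  maxAttempts: number = DEFAULT_MAX_ATTEMPTS,
): TerminalState | null {
  // Attempts beyond the cap are ignored: the delivery is already terminal by then.
  const considered = attempts.slice(0, maxAttempts);
  if (considered.some((outcome) => outcome === "success")) return "delivered";
  return considered.length >= maxAttempts ? "dlq" : null;
}
```

One consequence is that the ratio of attempts to terminal transitions gives a rough picture of how many retries a typical delivery needs.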

Audit log

| Metric | Labels | What it tracks |
| --- | --- | --- |
| driftstack_account_audit_emit_total | prefix, actor_type | Customer-facing audit log emissions, namespace-bucketed |
| driftstack_admin_audit_emit_total | prefix | Admin (/v1/admin/*) audit log emissions, namespace-bucketed |

Live-preview (LiveKit)

| Metric | Labels | What it tracks |
| --- | --- | --- |
| driftstack_livekit_token_mint_total | role, outcome | LiveKit token mint requests. Two emission sites share the counter: /v1/sessions/:id/livekit-token (legacy session-livekit; role = publisher) and /v1/agent-sessions/:id/livekit-token (agent-chat; role = subscriber). Outcomes: ok / not_found / validation / forbidden / no_mac / secret_unreadable. role=unknown on early-reject paths. |
| driftstack_mac_node_livekit_register_total | outcome | POST /v1/mac-nodes/register outcomes per call: ok (credentials persisted), validation (Zod parse failed), encryption_error (AES-256-GCM seal failed; ops alert, likely MFA_ENCRYPTION_KEY length wrong), not_found (mac_node_id has no fleet_nodes row; Mac provisioning hasn’t run yet), unknown. |

Transactional email

| Metric | Labels | What it tracks |
| --- | --- | --- |
| driftstack_email_send_total | template, outcome | Outbound transactional-email sends per template × outcome (ok / pending-approval / inactive-recipient / account-inactive / invalid-request / rate-limited / transport / unknown) |

Suggested alerts

Reasonable starting alerts (translate them into your alerting system's rule language):

  • rate(driftstack_auth_total{outcome="invalid"}[5m]) > 0.1 — sustained invalid-key rate suggests credential stuffing.
  • rate(driftstack_auth_total{outcome="revoked"}[15m]) > 0 — a revoked key is being retried; investigate the calling client (it should rotate its credentials).
  • rate(driftstack_stripe_webhook_total{outcome="signature_invalid"}[15m]) > 0 — any failed-signature webhook is a spoofing attempt; investigate.
  • rate(driftstack_nowpayments_webhook_total{outcome="signature_invalid"}[15m]) > 0 — same posture as Stripe; crypto-payment spoofing attempt.
  • rate(driftstack_bundled_llm_error_total{kind="budget_exhausted"}[1h]) > 1 — multiple customers hitting the bundled-LLM cap signals demand outstripping the deployment-fallback budget.
  • rate(driftstack_byok_anthropic_test_total{outcome="quota_exceeded"}[1h]) > 5 — multiple customers’ Anthropic accounts are throttling; an upstream Anthropic-side incident.
  • rate(driftstack_oauth_token_total{outcome="invalid_client"}[15m]) > 0.5 — failed client_id+client_secret exchanges at scale signal a brute-force probe.
  • rate(driftstack_rate_limit_total{outcome="exceeded"}[5m]) > 1 — sustained limit hits across the account base; either ramp the defaults or audit which buckets saturate.
  • rate(driftstack_email_send_total{outcome="pending-approval"}[1h]) > 0 — Postmark approval is STILL blocking transactional sends; chase with their compliance team.
  • rate(driftstack_email_send_total{outcome="transport"}[15m]) > 0.1 — sustained Postmark connectivity failures; check the status page and the egress network from the API host.
  • sum by (prefix) (rate(driftstack_admin_audit_emit_total[1h])) > 10 — unusually high admin-action volume in any one prefix bucket; audit whether the activity is expected.

Set thresholds per your traffic baseline; the rates above are illustrative.

Format

The exposition format is the text-based variant documented at prometheus.io/docs/instrumenting/exposition_formats. Counters emit:

# HELP driftstack_auth_total Auth resolution outcomes (...).
# TYPE driftstack_auth_total counter
driftstack_auth_total{outcome="ok"} 1234
driftstack_auth_total{outcome="invalid"} 7
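For ad-hoc checks outside a full scraper, sample lines like the ones above can be parsed with a few lines of code. This is a simplified sketch: parseSampleLine is a hypothetical helper, and it does not handle escaped quotes inside label values, which the exposition format permits.

```typescript
interface Sample {
  metric: string;
  labels: Record<string, string>;
  value: number;
}

// Parses one sample line such as: driftstack_auth_total{outcome="ok"} 1234
// Returns null for blank lines and for # HELP / # TYPE comment lines.
function parseSampleLine(line: string): Sample | null {
  const trimmed = line.trim();
  if (trimmed === "" || trimmed.charAt(0) === "#") return null;
  // Metric name, optional {label block}, then the value; a trailing timestamp is ignored.
  const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)/.exec(trimmed);
  if (match === null) return null;
  const labels: Record<string, string> = {};
  const labelPattern = /([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g;
  let pair: RegExpExecArray | null;
  while ((pair = labelPattern.exec(match[2] ?? "")) !== null) {
    labels[pair[1]] = pair[2];
  }
  return { metric: match[1], labels, value: Number(match[3]) };
}
```

For anything beyond quick inspection, an existing Prometheus-compatible scraper is the safer choice, since it implements the full format.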

Scraper-side resets: counters reset to 0 on process restart. The standard Prometheus rate(), irate(), and increase() functions handle resets correctly, so for totals over longer dashboard windows, apply increase() over the window rather than reading raw counter values.
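To show why raw counter values are misleading across a restart, here is a minimal sketch of reset-aware accumulation. It mirrors the idea behind the Prometheus functions (a drop between samples is treated as a restart from 0) but omits their window extrapolation; increaseWithResets is a hypothetical name.

```typescript
// Total increase across successive scrapes of one series, tolerating restarts.
// A drop between two samples is treated as a reset to 0, so the new value counts in full.
function increaseWithResets(samples: readonly number[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const previous = samples[i - 1];
    const current = samples[i];
    total += current >= previous ? current - previous : current;
  }
  return total;
}

// The process restarted between the second and third scrape.
console.log(increaseWithResets([100, 150, 20, 60])); // 50 + 20 + 40 = 110
```

Subtracting the first raw value from the last would give -40 for the same samples, which is why dashboards should rely on the reset-aware functions.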

Source of truth

The counter name and label catalogue lives in apps/server/src/services/metrics-registry.ts (METRIC_NAMES constant). The integration-test fixture pre-registers the same set, so contributors adding a new counter must update both. A parity test typically lives alongside the call site.
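A parity check of this kind could be as small as the sketch below. The shape of METRIC_NAMES and of the fixture is not shown on this page, so both are represented here by plain string arrays with hypothetical variable names, and parityDiff is an invented helper.

```typescript
// Hypothetical stand-ins for the registry constant and the test fixture's pre-registered set.
const registryNames: readonly string[] = [
  "driftstack_http_request_total",
  "driftstack_auth_total",
];
const fixtureNames: readonly string[] = [
  "driftstack_auth_total",
  "driftstack_http_request_total",
];

// Names present in one list but not the other; two empty arrays mean parity holds.
function parityDiff(registry: readonly string[], fixture: readonly string[]) {
  const missingFromFixture = registry.filter((n) => fixture.indexOf(n) === -1);
  const missingFromRegistry = fixture.filter((n) => registry.indexOf(n) === -1);
  return { missingFromFixture, missingFromRegistry };
}

const registryParity = parityDiff(registryNames, fixtureNames);
if (registryParity.missingFromFixture.length > 0 || registryParity.missingFromRegistry.length > 0) {
  throw new Error(`metric catalogue drift: ${JSON.stringify(registryParity)}`);
}
```

Reporting both directions of the difference makes the failure message say which side was forgotten.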