Graph Operations Runbook

Operating, troubleshooting, and rolling out hybrid graph retrieval safely.

This runbook is for production operation of graph indexing and hybrid retrieval.

Quick Health Checks

Check graph status:

curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
  "https://memories.sh/api/sdk/v1/graph/status" | jq

Review trace behavior for a representative query:

curl -s -X POST \
  -H "Authorization: Bearer $MEMORIES_API_KEY" \
  -H "Content-Type: application/json" \
  "https://memories.sh/api/sdk/v1/graph/trace" \
  -d '{
    "query": "why did fallback alarms increase",
    "strategy": "hybrid_graph",
    "graphDepth": 1,
    "graphLimit": 8,
    "scope": { "projectId": "github.com/acme/platform" }
  }' | jq

Inspect rollout mode and shadow metrics:

curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
  "https://memories.sh/api/sdk/v1/graph/rollout" | jq

Rollout progression:

  1. Start in off after deployment validation.
  2. Move to shadow and observe at least 24h of representative traffic.
  3. Promote to canary when fallback rate and graph error fallbacks are stable.
  4. Stay in canary while monitoring alarms and trace fallback reasons.
  5. If the quality gate blocks canary (CANARY_ROLLOUT_BLOCKED), remain in shadow until regressions clear.
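
Before promoting, the current mode and any quality-gate reasons can be read back in one call. A sketch, assuming the rollout payload exposes mode and qualityGate.reasons (both referenced later in this runbook; the exact shape may differ):

curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
  "https://memories.sh/api/sdk/v1/graph/rollout" \
  | jq '{mode, gateReasons: (.qualityGate.reasons // [])}'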

Patch mode via SDK endpoint:

curl -s -X PATCH \
  -H "Authorization: Bearer $MEMORIES_API_KEY" \
  -H "Content-Type: application/json" \
  "https://memories.sh/api/sdk/v1/graph/rollout" \
  -d '{"mode":"shadow"}' | jq

Alert Interpretation

Current built-in alarms include:

  • HIGH_FALLBACK_RATE
  • GRAPH_EXPANSION_ERRORS
  • CANARY_QUALITY_GATE_BLOCKED (hard canary block)

Current threshold logic in status payload generation:

  • Critical fallback alarm at fallback rate >= 15% with at least 20 requests in window.
  • Warning fallback alarm at fallback rate >= 5% with at least 10 requests in window.
  • Expansion error alarm when graph expansion errors are observed.
  • Quality gate block when fallback/relevance regressions exceed rollout thresholds.
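
These thresholds can be recomputed locally when validating alarm behavior. A sketch, assuming the status payload exposes fallbackRate as a 0-1 fraction and requestCount for the window (field names are assumptions):

# Hypothetical field names; adjust to the actual status payload.
curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
  "https://memories.sh/api/sdk/v1/graph/status" \
  | jq 'if .requestCount >= 20 and .fallbackRate >= 0.15 then "critical"
        elif .requestCount >= 10 and .fallbackRate >= 0.05 then "warning"
        else "ok" end'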

See implementation in:

  • packages/web/src/lib/memory-service/graph/status.ts

Common Failure Modes

1) schema_missing in graph status

Symptoms:

  • Graph dashboard shows schema missing
  • Graph node/edge counts are zero while memories exist

Actions:

  1. Confirm workspace Turso credentials are valid.
  2. Execute a memory write to trigger best-effort schema/init path.
  3. Recheck /api/sdk/v1/graph/status.
  4. If still missing, inspect server logs for Turso DDL/write errors.
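
The recheck in step 3 can be narrowed to the schema and count fields. A sketch, assuming schemaStatus, nodeCount, and edgeCount as field names (adjust to the actual payload):

# Hypothetical field names; adjust to the actual status payload.
curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
  "https://memories.sh/api/sdk/v1/graph/status" \
  | jq '{schemaStatus, nodeCount, edgeCount}'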

2) Zero nodes/edges despite new memories

Symptoms:

  • Memory count increases, graph counts do not

Actions:

  1. Validate writes are not bypassing standard memory mutation paths.
  2. Confirm extraction/upsert code paths are enabled.
  3. Verify graph tables exist and are writable.
  4. Use trace endpoint to confirm fallback reason and graph candidate counts.
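
For step 4, the trace response can be reduced to the fallback reason and graph candidate count. A sketch, assuming a graphCandidateCount field (fallbackReason is referenced elsewhere in this runbook; both names may differ in the actual payload):

curl -s -X POST \
  -H "Authorization: Bearer $MEMORIES_API_KEY" \
  -H "Content-Type: application/json" \
  "https://memories.sh/api/sdk/v1/graph/trace" \
  -d '{"query":"why did fallback alarms increase","strategy":"hybrid_graph"}' \
  | jq '{fallbackReason, graphCandidateCount}'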

3) High fallback rate in canary

Symptoms:

  • Rising fallback rate or a recurring fallbackReason across trace payloads

Actions:

  1. Inspect the trace.fallbackReason distribution (a tally sketch follows this list).
  2. If graph_expansion_error dominates, roll back to shadow and investigate graph retrieval query errors.
  3. If rollout_guardrail/shadow_mode dominates, verify requested strategy and rollout settings.
  4. If quality_gate_blocked appears, inspect graph/rollout qualityGate.reasons and fix regressions.
  5. Re-promote to canary only after stable shadow metrics and a passing quality gate.
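
For step 1, one way to approximate the distribution is to replay representative queries through the trace endpoint and tally the fallback reasons. A sketch; representative-queries.txt is a placeholder and the fallbackReason location is an assumption:

# Hypothetical sketch: replay one query per line and count fallback reasons.
while read -r q; do
  curl -s -X POST \
    -H "Authorization: Bearer $MEMORIES_API_KEY" \
    -H "Content-Type: application/json" \
    "https://memories.sh/api/sdk/v1/graph/trace" \
    -d "{\"query\": \"$q\", \"strategy\": \"hybrid_graph\"}" \
    | jq -r '.fallbackReason // "none"'
done < representative-queries.txt | sort | uniq -c | sort -rn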

4) Empty graph explorer but status healthy

Symptoms:

  • Dashboard graph explorer panel has no edges for selected nodes

Actions:

  1. Confirm selected node has linked edges via /api/graph/explore.
  2. Clear explorer filters (edge type, node type, min confidence/weight, evidence-only).
  3. Verify the selected workspace contains the expected tenantId data and that any projectId filter matches the intended context.
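
Step 1 can be spot-checked from the command line. A sketch only: the nodeId parameter and edges field are assumptions, and this dashboard endpoint may require dashboard session auth rather than the SDK key:

# Hypothetical parameters and fields; adjust to the actual explore API.
curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
  "https://memories.sh/api/graph/explore?nodeId=<node-id>" | jq '.edges | length'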

Operational Guardrails

  • Keep read paths available even when graph indexing fails.
  • Treat shadow as the default safe state during incidents.
  • Use trace payloads for every rollout decision, not only aggregate metrics.
  • Use projectId-filtered checks for repo-specific incidents (it is a context filter, not an auth boundary).

Incident Checklist

  1. Capture graph/status, graph/rollout, and graph/trace payloads (a snapshot sketch follows this list).
  2. Record rollout mode and active alarms.
  3. Capture top fallback reasons and request volume.
  4. Set temporary mode to shadow if user impact is visible.
  5. Validate recovery before returning to canary.
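
Steps 1-3 can be captured with a small snapshot helper. A sketch; the output directory layout and the representative query are placeholders:

# Hypothetical snapshot: capture the three payloads with a UTC timestamp.
ts=$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "incident-$ts"
curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
  "https://memories.sh/api/sdk/v1/graph/status" > "incident-$ts/status.json"
curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
  "https://memories.sh/api/sdk/v1/graph/rollout" > "incident-$ts/rollout.json"
curl -s -X POST \
  -H "Authorization: Bearer $MEMORIES_API_KEY" \
  -H "Content-Type: application/json" \
  "https://memories.sh/api/sdk/v1/graph/trace" \
  -d '{"query":"why did fallback alarms increase","strategy":"hybrid_graph"}' \
  > "incident-$ts/trace.json"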
