Graph Operations Runbook
Operating, troubleshooting, and rolling out hybrid graph retrieval safely.
This runbook is for production operation of graph indexing and hybrid retrieval.
Quick Health Checks
Check graph status:
curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
"https://memories.sh/api/sdk/v1/graph/status" | jq
Review trace behavior for a representative query:
curl -s -X POST \
-H "Authorization: Bearer $MEMORIES_API_KEY" \
-H "Content-Type: application/json" \
"https://memories.sh/api/sdk/v1/graph/trace" \
-d '{
"query": "why did fallback alarms increase",
"strategy": "hybrid_graph",
"graphDepth": 1,
"graphLimit": 8,
"scope": { "projectId": "github.com/acme/platform" }
}' | jq
Inspect rollout mode and shadow metrics:
curl -s -H "Authorization: Bearer $MEMORIES_API_KEY" \
"https://memories.sh/api/sdk/v1/graph/rollout" | jq
Recommended Rollout Sequence
- Start in off after deployment validation.
- Move to shadow and observe at least 24h of representative traffic.
- Promote to canary when fallback rate and graph error fallbacks are stable.
- Stay in canary while monitoring alarms and trace fallback reasons.
- If the quality gate blocks canary (CANARY_ROLLOUT_BLOCKED), remain in shadow until regressions clear.
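The sequence above can be read as a small decision rule. The following is a minimal sketch only: next_rollout_mode is a hypothetical helper, and the "stable" and "pass" inputs are assumed to be judgments the operator makes from shadow metrics (after at least 24h of traffic) and from the quality gate. Dropping from canary back to shadow when either check fails is an assumption consistent with treating shadow as the safe state, not documented service behavior.

```shell
# Hypothetical helper: suggest the next rollout mode.
#   $1 = current mode: off | shadow | canary
#   $2 = "stable" when fallback rate and graph error fallbacks look stable
#   $3 = "pass" when the quality gate is not blocking canary
next_rollout_mode() {
  case "$1" in
    off)
      # After deployment validation, the first step is always shadow.
      echo "shadow"
      ;;
    shadow|canary)
      # Canary is only suggested when both checks hold; otherwise shadow.
      if [ "$2" = "stable" ] && [ "$3" = "pass" ]; then
        echo "canary"
      else
        echo "shadow"
      fi
      ;;
    *)
      echo "unknown mode: $1" >&2
      return 1
      ;;
  esac
}

# Example: next_rollout_mode shadow stable pass
```

The helper only suggests a mode; applying it still goes through the rollout endpoint.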
Patch mode via SDK endpoint:
curl -s -X PATCH \
-H "Authorization: Bearer $MEMORIES_API_KEY" \
-H "Content-Type: application/json" \
"https://memories.sh/api/sdk/v1/graph/rollout" \
-d '{"mode":"shadow"}' | jq
Alert Interpretation
Current built-in alarms include:
- HIGH_FALLBACK_RATE
- GRAPH_EXPANSION_ERRORS
- CANARY_QUALITY_GATE_BLOCKED (hard canary block)
Current threshold logic in status payload generation:
- Critical fallback alarm at fallback rate >= 15% with at least 20 requests in window.
- Warning fallback alarm at fallback rate >= 5% with at least 10 requests in window.
- Expansion error alarm when graph expansion errors are observed.
- Quality gate block when fallback/relevance regressions exceed rollout thresholds.
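To sanity-check an alarm against raw counts, the fallback thresholds above can be sketched as a local function. This is an illustration under assumptions, not the code in status.ts: classify_fallback is a hypothetical name, and it assumes a window that misses the critical request minimum can still raise a warning.

```shell
# Hypothetical helper: classify one window from raw counts.
#   $1 = number of fallback requests in the window
#   $2 = total number of requests in the window
# Integer math avoids floating point: rate >= N%  <=>  fallbacks*100 >= requests*N
classify_fallback() {
  # Critical: fallback rate >= 15% with at least 20 requests in window.
  if [ "$2" -ge 20 ] && [ $(( $1 * 100 )) -ge $(( $2 * 15 )) ]; then
    echo "critical"
  # Warning: fallback rate >= 5% with at least 10 requests in window.
  elif [ "$2" -ge 10 ] && [ $(( $1 * 100 )) -ge $(( $2 * 5 )) ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

# Example: classify_fallback 3 20   (3 fallbacks out of 20 requests)
```

The request minimums keep low-traffic windows from raising alarms on a handful of fallbacks.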
See implementation in:
packages/web/src/lib/memory-service/graph/status.ts
Common Failure Modes
1) schema_missing in graph status
Symptoms:
- Graph dashboard shows schema missing
- Graph node/edge counts are zero while memories exist
Actions:
- Confirm workspace Turso credentials are valid.
- Execute a memory write to trigger best-effort schema/init path.
- Recheck /api/sdk/v1/graph/status.
- If still missing, inspect server logs for Turso DDL/write errors.
2) Zero nodes/edges despite new memories
Symptoms:
- Memory count increases, graph counts do not
Actions:
- Validate writes are not bypassing standard memory mutation paths.
- Confirm extraction/upsert code paths are enabled.
- Verify graph tables exist and are writable.
- Use trace endpoint to confirm fallback reason and graph candidate counts.
3) High fallback rate in canary
Symptoms:
- Rising fallback rate or repeated fallbackReason
Actions:
- Inspect trace.fallbackReason distribution.
- If graph_expansion_error dominates, roll back to shadow and investigate graph retrieval query errors.
- If rollout_guardrail or shadow_mode dominates, verify requested strategy and rollout settings.
- If quality_gate_blocked appears, inspect graph/rollout qualityGate.reasons and fix regressions.
- Re-promote to canary only after stable shadow metrics and a passing quality gate.
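The triage steps above can be sketched as two small helpers: one that finds the most frequent reason in a list, and one that maps a reason to the action listed above. Both function names are hypothetical, and the jq path in the usage comment assumes each saved trace payload exposes the reason at .trace.fallbackReason.

```shell
# Hypothetical helper: read one fallback reason per line on stdin and
# print the most frequent one.
dominant_fallback_reason() {
  sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}'
}

# Hypothetical helper: map a fallback reason to the runbook action.
action_for_fallback_reason() {
  case "$1" in
    graph_expansion_error)
      echo "roll back to shadow and investigate graph retrieval query errors" ;;
    rollout_guardrail|shadow_mode)
      echo "verify requested strategy and rollout settings" ;;
    quality_gate_blocked)
      echo "inspect graph/rollout qualityGate.reasons and fix regressions" ;;
    *)
      echo "unrecognized reason: review the trace payload manually" ;;
  esac
}

# Example (assumes saved trace payloads in ./traces, a hypothetical directory):
#   jq -r '.trace.fallbackReason // empty' traces/*.json | dominant_fallback_reason
```

Because quality_gate_blocked warrants attention whenever it appears, not only when it dominates, check for it separately from the dominant reason.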
4) Empty graph explorer but status healthy
Symptoms:
- Dashboard graph explorer panel has no edges for selected nodes
Actions:
- Confirm selected node has linked edges via /api/graph/explore.
- Clear explorer filters (edge type, node type, min confidence/weight, evidence-only).
- Verify selected workspace contains the expected tenantId data and optional projectId filter context.
Operational Guardrails
- Keep read paths available even when graph indexing fails.
- Treat shadow as the default safe state during incidents.
- Use trace payloads for every rollout decision, not only aggregate metrics.
- Use projectId-filtered checks for repo-specific incidents (it is a context filter, not an auth boundary).
Incident Checklist
- Capture graph/status, graph/rollout, and graph/trace payloads.
- Record rollout mode and active alarms.
- Capture top fallback reasons and request volume.
- Set temporary mode to shadow if user impact is visible.
- Validate recovery before returning to canary.
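The first checklist item can be scripted so that all three payloads are captured together. This is a rough sketch: capture_graph_payloads and the CURL override are hypothetical conveniences, and the trace body is supplied by the caller in the same shape as the trace request shown under Quick Health Checks.

```shell
# Hypothetical helper: save graph/status, graph/rollout, and graph/trace
# payloads into one directory for the incident record.
#   $1 = output directory
#   $2 = JSON body for the trace request
# Set CURL=echo for a dry run that writes the command arguments instead.
capture_graph_payloads() {
  base="https://memories.sh/api/sdk/v1/graph"
  mkdir -p "$1" || return 1
  "${CURL:-curl}" -s -H "Authorization: Bearer ${MEMORIES_API_KEY:-}" \
    "$base/status" > "$1/status.json"
  "${CURL:-curl}" -s -H "Authorization: Bearer ${MEMORIES_API_KEY:-}" \
    "$base/rollout" > "$1/rollout.json"
  "${CURL:-curl}" -s -X POST \
    -H "Authorization: Bearer ${MEMORIES_API_KEY:-}" \
    -H "Content-Type: application/json" \
    "$base/trace" -d "$2" > "$1/trace.json"
}

# Example:
#   capture_graph_payloads "incident-$(date +%Y%m%dT%H%M%S)" \
#     '{"query":"why did fallback alarms increase","strategy":"hybrid_graph"}'
```

Capturing before any mode change preserves the state that triggered the incident.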