System Design
Roam Architecture
Roam is a local-first analysis system: parse source once, store structural facts in SQLite, expose deterministic query primitives to CLI and MCP clients.
Pipeline at a Glance
    Repository ──> Index Pipeline ──> SQLite Storage
                                            │
                           ┌────────────────┼────────────────┐
                           ▼                ▼                ▼
                   Graph Analytics    Retrieval +       CLI / MCP
                   Rules Engine       Patch Verifier    Interfaces
                   Security           Code Graph        JSON / SARIF
                                      Attestation
The index is built once per repo with roam init. Subsequent runs are incremental: only changed files are re-parsed. All downstream consumers (CLI, MCP, CI gates) read from the same SQLite artefact at .roam/index.db.
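Because every consumer reads the same file, a consumer that only queries can open the index read-only. The following is a minimal sketch of that idea using Python's standard sqlite3 module. The function name open_index_readonly and the symbols table used in the demo are hypothetical; they are not taken from Roam's schema or code.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_index_readonly(db_path: Path) -> sqlite3.Connection:
    """Open an existing SQLite file so that any write attempt fails.

    Hypothetical helper: Roam's own connection code may differ.
    """
    # SQLite URI filenames accept mode=ro, which forbids writes
    # and refuses to create the file if it is missing.
    uri = f"{db_path.resolve().as_uri()}?mode=ro"
    return sqlite3.connect(uri, uri=True)


if __name__ == "__main__":
    # Demo against a throwaway database, not a real .roam/index.db.
    demo_path = Path(tempfile.mkdtemp()) / "index.db"
    writer = sqlite3.connect(str(demo_path))
    writer.execute("CREATE TABLE symbols (name TEXT)")  # assumed table name
    writer.execute("INSERT INTO symbols VALUES ('AuthService')")
    writer.commit()
    writer.close()

    reader = open_index_readonly(demo_path)
    print(reader.execute("SELECT name FROM symbols").fetchall())
    reader.close()
```

Opening read-only is one way to make sure a query-only client cannot change the shared artefact by accident.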
Subsystem Responsibilities
| Subsystem | Main modules | Responsibility |
|---|---|---|
| Index Pipeline | index/indexer.py, index/parser.py, index/symbols.py | Build and refresh the structural index from source + git history. |
| Storage | db/schema.py, db/connection.py | SQLite schema, migrations, batched query helpers. |
| Graph Intelligence | graph/builder.py, graph/layers.py, graph/clusters.py, graph/pagerank.py | Centrality, layering, communities, cycle analysis, AST clone clustering. |
| Retrieval | retrieve/pipeline.py, retrieve/rerank.py | Graph-aware FTS5 + structural reranker (PageRank + co-change + clones + runtime hot). |
| Patch Verifier | critique/checks.py, critique/aggregator.py | Diff parsing + clones-not-edited + blast-radius + intent-alignment for roam critique. |
| Taint & Reachability | security/taint_engine.py | Graph-reach BFS over edges with sanitiser-stop nodes; OpenVEX-correct. |
| Code Graph Attestation | attest/cga.py | in-toto v1 statement builder. Merkle root over symbol fingerprints + edge bundle digest. Cosign-signable. |
| Fleet Planner | fleet/manifest.py | Multi-agent partitioner (Louvain + co-change + PageRank anchors); emits .roam-fleet.json. |
| Rule Engine | rules/builtin.py, rules/engine.py | Built-in rules + YAML rule packs (path / symbol / AST / dataflow patterns). |
| Interfaces | commands/cmd_*.py, mcp_server.py, mcp_extras/ | Deterministic queries for CLI and MCP clients. Sampling-driven compression, watcher-based invalidation, per-session memory. |
| Output Contracts | output/formatter.py, output/sarif.py, output/schema_registry.py | Stable text / JSON / SARIF envelopes for agents and CI; every --json error path returns a parseable envelope. |
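The Taint & Reachability row describes a graph-reach BFS that stops at sanitiser nodes. As a rough illustration of that technique (not Roam's actual taint_engine code), the sketch below walks call edges breadth-first from source nodes and never expands a sanitiser. The function name reachable_sinks and the node names in the demo are assumptions made for the example.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple


def reachable_sinks(
    edges: Iterable[Tuple[str, str]],
    sources: Iterable[str],
    sinks: Iterable[str],
    sanitizers: Iterable[str],
) -> List[str]:
    """Return the sinks reachable from any source without crossing a sanitizer."""
    adjacency: Dict[str, List[str]] = {}
    for caller, callee in edges:
        adjacency.setdefault(caller, []).append(callee)

    blocked: Set[str] = set(sanitizers)
    seen: Set[str] = set()
    queue: deque = deque()

    # Seed the frontier with every source that is not itself a sanitizer.
    for source in sources:
        if source not in blocked and source not in seen:
            seen.add(source)
            queue.append(source)

    while queue:
        node = queue.popleft()
        for successor in adjacency.get(node, ()):
            # Sanitizer nodes are never visited, so paths through them end here.
            if successor in blocked or successor in seen:
                continue
            seen.add(successor)
            queue.append(successor)

    return sorted(seen & set(sinks))


if __name__ == "__main__":
    demo_edges = [
        ("read_input", "escape"),
        ("escape", "run_sql"),
        ("read_input", "log"),
    ]
    # run_sql is only reachable through the sanitizer, so only "log" is reported.
    print(reachable_sinks(demo_edges, ["read_input"], ["run_sql", "log"], ["escape"]))
```

Each node is visited at most once, so the cost is linear in the number of nodes and edges, which fits the table's emphasis on deterministic, index-backed queries.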
Index Pipeline Stages
- Discovery: collect tracked files (via git ls-files + .gitignore) and classify file roles.
- Parsing: tree-sitter parse per file with language routing across 28 supported languages.
- Extraction: symbols (classes, functions, methods, fields), signatures, docstrings, references.
- Resolution: convert references into graph edges (caller→callee, import chains, inheritance).
- Metrics: cognitive complexity, centrality (PageRank, betweenness), churn, co-change, cognitive load.
- Persistence: upsert into SQLite with incremental diffing; only changed files re-parse.
discover -> parse -> extract -> resolve -> metrics -> persist
(incremental path executes only changed files)
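The incremental path depends on knowing which files changed since the last run. The text does not say how Roam detects changes, so the sketch below assumes a content-hash comparison against digests stored from the previous run. The names file_digest and plan_incremental are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple


def file_digest(content: bytes) -> str:
    """SHA-256 of a file's bytes, used as its change fingerprint."""
    return hashlib.sha256(content).hexdigest()


def plan_incremental(
    stored: Dict[str, str], current: Dict[str, bytes]
) -> Tuple[List[str], List[str]]:
    """Decide which files to re-parse and which to drop from the index.

    stored:  path -> digest recorded by the previous index run (assumed layout)
    current: path -> file bytes as they are now
    """
    # New files have no stored digest, so they compare unequal and get parsed.
    to_parse = sorted(
        path for path, data in current.items() if stored.get(path) != file_digest(data)
    )
    # Files that were indexed before but no longer exist must be removed.
    to_remove = sorted(path for path in stored if path not in current)
    return to_parse, to_remove


if __name__ == "__main__":
    previous = {
        "a.py": file_digest(b"x = 1\n"),
        "b.py": file_digest(b"y = 2\n"),
        "gone.py": file_digest(b"z = 3\n"),
    }
    now = {"a.py": b"x = 1\n", "b.py": b"y = 20\n", "new.py": b"n = 0\n"}
    print(plan_incremental(previous, now))
```

Only the paths in the first list would go through parse, extract, resolve, and metrics again; unchanged files keep their existing rows.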
Command-to-Data Flow
Example: roam preflight AuthService
CLI cmd_preflight
  -> ensure_index()
  -> query symbols/edges/metrics
  -> run health/rule checks
  -> aggregate verdict + risk factors
  -> render text or JSON envelope
Example: roam cga emit --include-taint --sign
CLI cmd_cga
  -> ensure_index()
  -> attest.cga.build_statement()
  -> _symbol_fingerprints()          # Merkle root over (qname, kind, sig, path)
  -> _edge_bundle_digest()           # graph snapshot fingerprint
  -> security.taint_engine.run()     # graph-reach BFS, sanitizer stops
  -> _finding_to_vex_claim()         # OpenVEX status + justification
  -> in-toto v1 Statement
     (predicateType: roam-code.dev/CodeGraph/v1)
  -> cosign sign-blob --bundle       # optional, graceful skip if absent
  -> .roam/attestations/<sha>.intoto.json + .sig
Property: the CGA chain is reproducible — same source tree + same git HEAD → same Merkle root → same predicate digest. Signing layers identity onto a deterministic fingerprint.
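To make the reproducibility property concrete, here is a minimal sketch of a deterministic Merkle root over (qname, kind, sig, path) tuples. The tuple fields come from the flow above. The hashing details (SHA-256, NUL-separated fields, leaf and node prefixes, sorting the leaves, promoting an unpaired node) are assumptions for illustration and may not match what attest/cga.py does.

```python
import hashlib
from typing import Iterable, List, Tuple

Symbol = Tuple[str, str, str, str]  # (qname, kind, sig, path)


def leaf_hash(symbol: Symbol) -> bytes:
    """Hash one symbol record; the 0x00 prefix marks it as a leaf."""
    record = "\x00".join(symbol).encode("utf-8")
    return hashlib.sha256(b"\x00" + record).digest()


def merkle_root(symbols: Iterable[Symbol]) -> str:
    """Return a hex Merkle root that does not depend on input order."""
    # Sorting the leaf hashes makes the result independent of traversal order.
    level: List[bytes] = sorted(leaf_hash(s) for s in symbols)
    if not level:
        return hashlib.sha256(b"").hexdigest()

    while len(level) > 1:
        # Combine adjacent pairs; the 0x01 prefix marks an interior node.
        paired = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level) - 1, 2)
        ]
        if len(level) % 2:
            paired.append(level[-1])  # carry the unpaired node up unchanged
        level = paired
    return level[0].hex()


if __name__ == "__main__":
    demo = [
        ("auth.AuthService", "class", "class AuthService", "src/auth.py"),
        ("auth.login", "function", "def login(user)", "src/auth.py"),
        ("db.connect", "function", "def connect()", "src/db.py"),
    ]
    # Same symbols in a different order give the same root.
    print(merkle_root(demo) == merkle_root(list(reversed(demo))))
```

Any change to a signature, kind, qualified name, or path changes a leaf hash and therefore the root, which is what lets a signature over the root stand in for the whole symbol set.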
Tradeoff: static structure gives speed and determinism, but cannot model
fully dynamic runtime behavior without trace ingestion (roam ingest-trace).
Why SQLite
- Zero-dep: ships with Python; no client/server, no infrastructure.
- FTS5: full-text search built-in, used for symbol search and retrieval.
- Determinism: same input source → identical SQLite file (modulo timestamps). Diffable, reproducible.
- Portable: roam index-export emits a tarball with manifest SHA-256 + optional cosign signature; index-import verifies before extracting.
- Local: index lives at .roam/index.db in your repo. No source code leaves the machine for the free CLI.
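The verify-before-extract behaviour described for index-import can be sketched as follows. This is an illustration only: the manifest file name manifest.json, its layout (member name mapped to a SHA-256 hex digest), and the function names are assumptions, not Roam's actual bundle format. Cosign verification is left out.

```python
import hashlib
import io
import json
import tarfile
from typing import Dict

MANIFEST_NAME = "manifest.json"  # assumed name, not taken from Roam


def pack(entries: Dict[str, bytes]) -> bytes:
    """Pack name -> bytes entries into an in-memory gzip tarball."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in sorted(entries.items()):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def make_bundle(files: Dict[str, bytes]) -> bytes:
    """Build a bundle whose manifest records each file's SHA-256."""
    manifest = {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}
    entries = dict(files)
    entries[MANIFEST_NAME] = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return pack(entries)


def verify_bundle(bundle: bytes) -> bool:
    """Check every manifest entry against the archive before extracting anything."""
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:*") as tar:
        try:
            manifest_file = tar.extractfile(MANIFEST_NAME)
        except KeyError:
            return False  # no manifest in the archive
        if manifest_file is None:
            return False
        expected = json.load(manifest_file)
        for name, digest in expected.items():
            try:
                member = tar.extractfile(name)
            except KeyError:
                return False  # manifest lists a file the archive lacks
            if member is None:
                return False
            if hashlib.sha256(member.read()).hexdigest() != digest:
                return False  # content does not match the recorded digest
    return True


if __name__ == "__main__":
    print(verify_bundle(make_bundle({"index.db": b"fake index bytes"})))
```

The check reads members in memory and writes nothing to disk until it passes. A fuller version would also reject archive members that the manifest does not list.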