System Design

Roam Architecture

Roam is a local-first analysis system: it parses source once, stores structural facts in SQLite, and exposes deterministic query primitives to CLI and MCP clients.

Design goal: minimise repeated file scanning for agents by turning codebases into a queryable structural index. Query-time latency stays low because expensive parsing happens during index builds, not on every command call.

Pipeline at a Glance

Repository ──> Index Pipeline ──> SQLite Storage
                                           │
                          ┌────────────────┼────────────────┐
                          ▼                ▼                ▼
                 Graph Analytics    Retrieval +      CLI / MCP
                 Rules Engine       Patch Verifier   Interfaces
                 Security           Code Graph       JSON / SARIF
                                    Attestation

The index is built once per repo with roam init. Subsequent runs are incremental: only changed files are re-parsed. All downstream consumers (CLI, MCP, CI gates) read from the same SQLite artefact at .roam/index.db.
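To make the "index once, query many times" idea concrete, here is a minimal sketch using Python's standard sqlite3 module. The table and column names (symbols, edges, qname, kind) and the sample rows are assumptions for illustration only; they are not taken from Roam's actual schema in db/schema.py, and the sketch uses an in-memory database rather than .roam/index.db.

```python
import sqlite3


def build_demo_index(conn):
    """Create a tiny, hypothetical structural index (not Roam's real schema)."""
    conn.executescript(
        """
        CREATE TABLE symbols (
            id    INTEGER PRIMARY KEY,
            qname TEXT UNIQUE,
            kind  TEXT,
            path  TEXT
        );
        CREATE TABLE edges (src INTEGER, dst INTEGER, kind TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO symbols (id, qname, kind, path) VALUES (?, ?, ?, ?)",
        [
            (1, "auth.AuthService.login", "method", "auth/service.py"),
            (2, "api.routes.post_login", "function", "api/routes.py"),
            (3, "cli.main.login_cmd", "function", "cli/main.py"),
        ],
    )
    conn.executemany(
        "INSERT INTO edges (src, dst, kind) VALUES (?, ?, ?)",
        [(2, 1, "call"), (3, 1, "call")],
    )
    conn.commit()


def callers_of(conn, qname):
    """Answer 'who calls this symbol?' from the index, without touching source files."""
    rows = conn.execute(
        """
        SELECT caller.qname
        FROM edges
        JOIN symbols AS callee ON callee.id = edges.dst
        JOIN symbols AS caller ON caller.id = edges.src
        WHERE callee.qname = ? AND edges.kind = 'call'
        ORDER BY caller.qname
        """,
        (qname,),
    ).fetchall()
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_demo_index(conn)
print(callers_of(conn, "auth.AuthService.login"))
```

The ORDER BY clause is what keeps the result deterministic across runs; without it SQLite is free to return rows in any order.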

Subsystem Responsibilities

| Subsystem | Main modules | Responsibility |
| --- | --- | --- |
| Index Pipeline | index/indexer.py, index/parser.py, index/symbols.py | Build and refresh the structural index from source + git history. |
| Storage | db/schema.py, db/connection.py | SQLite schema, migrations, batched query helpers. |
| Graph Intelligence | graph/builder.py, graph/layers.py, graph/clusters.py, graph/pagerank.py | Centrality, layering, communities, cycle analysis, AST clone clustering. |
| Retrieval | retrieve/pipeline.py, retrieve/rerank.py | Graph-aware FTS5 + structural reranker (PageRank + co-change + clones + runtime hot). |
| Patch Verifier | critique/checks.py, critique/aggregator.py | Diff parsing + clones-not-edited + blast-radius + intent-alignment for roam critique. |
| Taint & Reachability | security/taint_engine.py | Graph-reach BFS over edges with sanitiser-stop nodes; OpenVEX-correct. |
| Code Graph Attestation | attest/cga.py | in-toto v1 statement builder. Merkle root over symbol fingerprints + edge bundle digest. Cosign-signable. |
| Fleet Planner | fleet/manifest.py | Multi-agent partitioner (Louvain + co-change + PageRank anchors); emits .roam-fleet.json. |
| Rule Engine | rules/builtin.py, rules/engine.py | Built-in rules + YAML rule packs (path / symbol / AST / dataflow patterns). |
| Interfaces | commands/cmd_*.py, mcp_server.py, mcp_extras/ | Deterministic queries for CLI and MCP clients. Sampling-driven compression, watcher-based invalidation, per-session memory. |
| Output Contracts | output/formatter.py, output/sarif.py, output/schema_registry.py | Stable text / JSON / SARIF envelopes for agents and CI; every --json error path returns a parseable envelope. |
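The Taint & Reachability row describes a graph-reach BFS that stops at sanitiser nodes. The sketch below shows that general technique on a plain adjacency dict. The function name reachable_sinks, the graph shape, and the node names are hypothetical and are not the API of security/taint_engine.py.

```python
from collections import deque


def reachable_sinks(edges, sources, sinks, sanitizers):
    """Breadth-first reach from sources; traversal never enters a sanitizer node.

    edges maps a node name to an iterable of successor names (hypothetical shape).
    Returns the set of sink nodes reachable without passing through a sanitizer.
    """
    visited = set()
    queue = deque()
    for source in sources:
        if source not in sanitizers and source not in visited:
            visited.add(source)
            queue.append(source)

    hits = set()
    while queue:
        node = queue.popleft()
        if node in sinks:
            hits.add(node)
        for successor in edges.get(node, ()):
            # A sanitizer is a stop node: it is neither visited nor expanded.
            if successor in visited or successor in sanitizers:
                continue
            visited.add(successor)
            queue.append(successor)
    return hits


call_graph = {
    "read_input": ["handler"],
    "handler": ["escape", "log"],
    "escape": ["run_sql"],
    "log": [],
}
print(reachable_sinks(call_graph, {"read_input"}, {"run_sql"}, {"escape"}))
print(reachable_sinks(call_graph, {"read_input"}, {"run_sql"}, set()))
```

In this toy graph the only path to run_sql goes through escape, so the sink is reported only when escape is not treated as a sanitiser.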

Index Pipeline Stages

  1. Discovery — collect tracked files (via git ls-files + .gitignore) and classify file roles.
  2. Parsing — tree-sitter parse per file with language routing across 28 supported languages.
  3. Extraction — symbols (classes, functions, methods, fields), signatures, docstrings, references.
  4. Resolution — convert references into graph edges (caller→callee, import chains, inheritance).
  5. Metrics — cognitive complexity, centrality (PageRank, betweenness), churn, co-change, cognitive load.
  6. Persistence — upsert into SQLite with incremental diffing; only changed files re-parse.
discover -> parse -> extract -> resolve -> metrics -> persist
                 (incremental path executes only changed files)
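One common way to decide which files need the incremental path is to compare content hashes against the previous run. The source does not say how Roam detects changes, so treat the hashing strategy and the names content_hash and plan_incremental as assumptions in this sketch.

```python
import hashlib


def content_hash(data):
    """SHA-256 of the raw file bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def plan_incremental(previous_hashes, current_files):
    """Decide which files to re-parse and which to drop from the index.

    previous_hashes: dict of path -> hash recorded by the last index build.
    current_files:   dict of path -> file bytes found during discovery.
    Returns (to_parse, to_delete, new_hashes).
    """
    new_hashes = {path: content_hash(data) for path, data in current_files.items()}
    # New or modified files go through parse -> extract -> resolve again.
    to_parse = sorted(
        path for path, digest in new_hashes.items()
        if previous_hashes.get(path) != digest
    )
    # Files that disappeared must have their symbols and edges removed.
    to_delete = sorted(path for path in previous_hashes if path not in new_hashes)
    return to_parse, to_delete, new_hashes


previous = {"a.py": content_hash(b"x = 1\n"), "b.py": content_hash(b"y = 2\n")}
current = {"a.py": b"x = 1\n", "c.py": b"z = 3\n"}
to_parse, to_delete, _ = plan_incremental(previous, current)
print(to_parse, to_delete)
```

Here a.py is unchanged and skipped, c.py is new and queued for parsing, and b.py is scheduled for removal.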

Command-to-Data Flow

Example: roam preflight AuthService

CLI cmd_preflight
  -> ensure_index()
  -> query symbols/edges/metrics
  -> run health/rule checks
  -> aggregate verdict + risk factors
  -> render text or JSON envelope
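The last two steps of this flow (aggregate a verdict, render an envelope) can be sketched as follows. The envelope fields (command, ok, data, error), the check names, and the functions below are hypothetical; the actual envelope schema lives in Roam's output modules and may differ.

```python
import json


def aggregate_verdict(checks):
    """Fold individual check results (name -> passed?) into one verdict."""
    risk_factors = sorted(name for name, passed in checks.items() if not passed)
    return {"verdict": "fail" if risk_factors else "pass", "risk_factors": risk_factors}


def preflight(symbol, known_symbols, checks):
    """Hypothetical stand-in for the query + check steps of a preflight run."""
    if symbol not in known_symbols:
        raise LookupError(f"symbol not found: {symbol}")
    result = aggregate_verdict(checks)
    result["symbol"] = symbol
    return result


def render_json(command, run):
    """Always return parseable JSON, including when the command itself fails."""
    try:
        envelope = {"command": command, "ok": True, "data": run()}
    except Exception as exc:
        envelope = {
            "command": command,
            "ok": False,
            "error": {"type": type(exc).__name__, "message": str(exc)},
        }
    # sort_keys keeps the rendered text stable across runs.
    return json.dumps(envelope, sort_keys=True)


checks = {"cycles": True, "blast_radius": False}
print(render_json("preflight", lambda: preflight("AuthService", {"AuthService"}, checks)))
print(render_json("preflight", lambda: preflight("Missing", {"AuthService"}, checks)))
```

Wrapping the whole command in one try/except is what lets a caller parse the output unconditionally and branch on the ok field.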

Example: roam cga emit --include-taint --sign

CLI cmd_cga
  -> ensure_index()
  -> attest.cga.build_statement()
       -> _symbol_fingerprints()       # Merkle root over (qname, kind, sig, path)
       -> _edge_bundle_digest()        # graph snapshot fingerprint
       -> security.taint_engine.run()  # graph-reach BFS, sanitizer stops
            -> _finding_to_vex_claim() # OpenVEX status + justification
  -> in-toto v1 Statement
       (predicateType: roam-code.dev/CodeGraph/v1)
  -> cosign sign-blob --bundle         # optional, graceful skip if absent
  -> .roam/attestations/<sha>.intoto.json + .sig

Property: the CGA chain is reproducible — same source tree + same git HEAD → same Merkle root → same predicate digest. Signing layers identity onto a deterministic fingerprint.
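The reproducibility property follows from hashing canonicalised inputs. Below is a minimal sketch of a Merkle root over (qname, kind, sig, path) tuples, assuming SHA-256, sorted leaves, and duplicate-last-node padding for odd levels. These choices and the names leaf_hash and merkle_root are assumptions for illustration; the actual construction in attest/cga.py may differ.

```python
import hashlib


def leaf_hash(qname, kind, sig, path):
    """Hash one symbol fingerprint; NUL separators keep field boundaries unambiguous."""
    record = "\x00".join((qname, kind, sig, path)).encode("utf-8")
    # The b"L" prefix separates leaf hashes from interior-node hashes.
    return hashlib.sha256(b"L" + record).digest()


def merkle_root(symbols):
    """Merkle root (hex) over symbol tuples, independent of input order."""
    # Sorting the leaves makes the root depend only on the set of symbols.
    level = sorted(leaf_hash(*symbol) for symbol in symbols)
    if not level:
        return hashlib.sha256(b"").hexdigest()
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # pad odd levels by repeating the last node
        level = [
            hashlib.sha256(b"N" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


symbols = [
    ("auth.AuthService", "class", "class AuthService", "auth/service.py"),
    ("auth.AuthService.login", "method", "login(self, user)", "auth/service.py"),
    ("api.post_login", "function", "post_login(request)", "api/routes.py"),
]
# Same symbols in a different order yield the same root.
print(merkle_root(symbols) == merkle_root(list(reversed(symbols))))
```

Any change to a signature, kind, qualified name, or path changes a leaf and therefore the root, which is what makes the root usable as a fingerprint of the indexed structure.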

Tradeoff: static structure gives speed and determinism, but cannot model fully dynamic runtime behavior without trace ingestion (roam ingest-trace).

Why SQLite