Knowledge Graph (Graphify)

clawft-graphify: AST-based knowledge graph extraction, community detection, analysis, and export for code assessment and forensic investigation.

The clawft-graphify crate builds knowledge graphs from source code and investigative documents. It extracts entities and relationships via tree-sitter AST parsing, clusters them into communities, analyzes structural patterns, and exports interactive visualizations.

Crate: clawft-graphify
Source: 35 modules, 88 tests
CLI: weaver graphify

Why Graphify

WeftOS needs to understand the systems it manages. Graphify provides two domain-specific lenses:

  • Code Assessment: Map modules, classes, functions, imports, call graphs, and inferred dependencies into a queryable knowledge graph. Detect god nodes, surprising cross-module coupling, and architectural drift.
  • Forensic Analysis: Extract persons, events, evidence, locations, timelines, and their relationships from investigative documents. Run gap analysis, coherence scoring, and counterfactual predictions.

Both domains share the same underlying graph infrastructure: BLAKE3-hashed entity IDs, confidence-weighted edges, label propagation clustering, and petgraph-backed traversal.

Quick Start

# Scan a project and build the knowledge graph
weaver graphify rebuild /path/to/project

# View the report
cat graphify-out/GRAPH_REPORT.md

# Export to interactive HTML
weaver graphify export html

# Search the graph
weaver graphify query "authentication"

# Ingest a URL (tweet, arXiv paper, webpage)
weaver graphify ingest https://arxiv.org/abs/2301.12345

Example Output

Running weaver graphify rebuild on a project produces output like this:

Rebuilding knowledge graph from: /path/to/project
Scanning files...
Detected 247 files (189 code, 31 doc, 4 paper, 23 image)
Skipped 2 sensitive file(s)
Wrote /path/to/project/graphify-out/graph.json
Wrote /path/to/project/graphify-out/GRAPH_REPORT.md

Graph summary:
  Nodes: 247
  Edges: 1,042
  Communities: 14
  Top god node: lib (38 edges)
  Files processed: 247

The pipeline runs six stages: detect (scan files) -> extract (build entities) -> build (construct graph) -> cluster (community detection) -> analyze (god nodes, surprising connections, questions) -> export (JSON + GRAPH_REPORT.md).

What Gets Extracted

Graphify operates in two modes depending on available features.

Default (no tree-sitter)

Without the ast-extract feature, the pipeline uses file detection and builds a file-level knowledge graph:

  • File detection: Scans the project tree for code, document, paper, and image files (30+ extensions). Sensitive files (.env, credentials) are skipped automatically.
  • File-level entities: Each detected file becomes a node in the graph, typed by category (Module for code files, document/paper/image custom types for others).
  • Co-location edges: Files in the same directory are connected via RelatedTo edges with Inferred confidence, using a star topology (first file in each directory becomes the hub). Directories with fewer than 2 or more than 50 files are excluded to reduce noise.
  • Analysis output: The report includes file counts by type, community clustering via label propagation, and god nodes (the most-connected files that represent coupling hotspots).

This mode requires no tree-sitter dependencies and works on any project.
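
The co-location rule above can be sketched in a few lines of Rust. This is a minimal illustration, not the crate's implementation: the function name co_location_edges is hypothetical, and it assumes "first file" means first in input order.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Hypothetical helper: returns (hub, spoke) pairs for files that share a directory.
fn co_location_edges(files: &[&str]) -> Vec<(String, String)> {
    // Group files by parent directory; BTreeMap keeps the output order deterministic.
    let mut by_dir: BTreeMap<String, Vec<String>> = BTreeMap::new();
    for file in files {
        let dir = Path::new(file)
            .parent()
            .map(|p| p.to_string_lossy().into_owned())
            .unwrap_or_default();
        by_dir.entry(dir).or_default().push((*file).to_string());
    }

    let mut edges = Vec::new();
    for group in by_dir.values() {
        // Directories with fewer than 2 or more than 50 files are skipped.
        if group.len() < 2 || group.len() > 50 {
            continue;
        }
        // Star topology: the first file is the hub, every other file is a spoke.
        let hub = &group[0];
        for spoke in &group[1..] {
            edges.push((hub.clone(), spoke.clone()));
        }
    }
    edges
}
```

A star produces n - 1 edges per directory instead of the n(n - 1)/2 a full clique would need, which keeps the file-level graph sparse.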

With tree-sitter (--features ast-extract)

With the ast-extract feature (and one or more lang-* features), the pipeline performs full AST extraction:

  • Structural AST walk: Extracts classes, structs, functions, interfaces, enums, imports, and constants from source files using tree-sitter grammars.
  • Call graph inference: Second pass connects function calls across files, building Calls, Imports, ImportsFrom, Extends, Implements, and MethodOf relationships.
  • Rationale comments: Extracts TODO, FIXME, and doc comments as metadata on entities.
  • Supported languages: Python, JavaScript/TypeScript, Rust, Go (plus Java, C, C++, Ruby, C# with their respective feature flags).

See the Supported Languages table for the full list of languages and extracted entity types.

CLI Reference

weaver graphify ingest <path|url>

Ingest a local path or URL into the knowledge graph. Local directories delegate to the full extraction pipeline (detect, extract, build, cluster, analyze, export), which is equivalent to running weaver graphify rebuild <dir>. URLs are fetched, classified, and saved as annotated markdown for re-extraction.

| Option | Default | Description |
| --- | --- | --- |
| -o, --output | graphify-out/memory | Output directory for ingested files |
| --contributor | (none) | Contributor name for metadata |

# Ingest a local codebase
weaver graphify ingest ./my-project

# Ingest a tweet
weaver graphify ingest https://x.com/user/status/123 --contributor "analyst"

# Ingest a PDF
weaver graphify ingest https://example.com/report.pdf -o evidence/

weaver graphify query <question>

Search the knowledge graph with a natural-language question. Performs keyword matching against entity labels and source files, returning ranked results with community assignments.

| Option | Default | Description |
| --- | --- | --- |
| -g, --graph | graphify-out/graph.json | Path to graph JSON |
| -m, --mode | bfs | Traversal mode: bfs or dfs |
| -d, --depth | 3 | Traversal depth (1-6) |

weaver graphify query "database connection pool"
weaver graphify query "AuthService" --mode dfs --depth 4

Output:

Matching nodes:
  [2.0] Database (src=src/db.py, community=0)
  [1.5] ConnectionPool (src=src/pool.py, community=0)
  [1.0] DbConfig (src=src/config.py, community=2)

weaver graphify export <format> [output]

Export the knowledge graph to a file or directory. Loads the graph from JSON, deserializes it into a KnowledgeGraph, and writes the output in the requested format. Supported formats: json, html, graphml, obsidian, wiki, cypher, svg.

| Option | Default | Description |
| --- | --- | --- |
| -o, --output | graphify-out/<format> | Output path |
| -g, --graph | graphify-out/graph.json | Source graph JSON |

weaver graphify export json
weaver graphify export html -o report.html
weaver graphify export obsidian -o ~/vault/project
weaver graphify export wiki -o docs/wiki
weaver graphify export graphml -o graph.graphml

weaver graphify diff

Compare the current graph against a previous version to see what changed: new/removed nodes and edges.

| Argument | Default | Description |
| --- | --- | --- |
| old | graphify-out/graph.json.bak | Path to old graph |
| current | graphify-out/graph.json | Path to current graph |

weaver graphify diff

Output:

Graph diff:
  Nodes: 142 -> 148 (+6)
  Edges: 387 -> 401 (+14)

weaver graphify rebuild

Force a full re-extraction of the knowledge graph from the project root.

| Option | Default | Description |
| --- | --- | --- |
| root | . | Root directory to scan |
| --clean | false | Clear cache before rebuilding |

weaver graphify rebuild
weaver graphify rebuild --clean
weaver graphify rebuild /path/to/project

weaver graphify watch

Start a file watcher that monitors the project for changes and triggers automatic re-extraction. Uses polling with configurable debounce.

| Option | Default | Description |
| --- | --- | --- |
| root | . | Root directory to watch |
| -d, --debounce | 2.0 | Debounce window in seconds |

weaver graphify watch
weaver graphify watch --debounce 5.0

The watcher monitors 30+ file extensions (code, documents, images) and ignores .git, node_modules, __pycache__, and graphify-out directories.

weaver graphify hooks install|uninstall|status

Manage git hooks for automatic graph rebuilding after commits and branch switches.

# Install post-commit and post-checkout hooks
weaver graphify hooks install

# Check installation status
weaver graphify hooks status

# Remove graphify hooks (preserves other hook content)
weaver graphify hooks uninstall

post-commit hook: Detects which files changed in the last commit. If any code files changed (.py, .ts, .rs, .go, etc.), runs weaver graphify rebuild automatically.

post-checkout hook: When switching branches, rebuilds the graph if a graphify-out directory exists.

Supported Languages

AST extraction uses tree-sitter grammars, each gated behind a feature flag.

| Language | Feature Flag | Entities Extracted |
| --- | --- | --- |
| Python | lang-python | Module, Class, Function, Import, Constant |
| JavaScript | lang-javascript | Module, Class, Function, Import |
| TypeScript | lang-typescript | Module, Class, Function, Interface, Import, Enum |
| Rust | lang-rust | Module, Struct, Enum, Function, Import, Constant |
| Go | lang-go | Package, Struct, Interface, Function, Import |
| Java | lang-java | Package, Class, Interface, Function, Import |
| C | lang-c | Function, Struct, Constant, Import |
| C++ | lang-cpp | Class, Function, Struct, Constant, Import |
| Ruby | lang-ruby | Module, Class, Function, Import |
| C# | lang-csharp | Class, Interface, Function, Import |

All language features require the ast-extract base feature (pulled in automatically). Use lang-all to enable every grammar.

Extraction is two-pass: (1) structural AST walk to extract declarations and imports, (2) call graph inference to connect function calls across files.

Entity Types

The EntityType enum defines 26 named variants (12 code, 12 forensic, 2 shared) plus Custom(String).

Code Domain (12 types)

| Type | Description | Example |
| --- | --- | --- |
| Module | Source file or logical module | auth.py, src/db |
| Class | Class definition | AuthService |
| Function | Function or method | validate_token() |
| Import | Import statement | import jwt |
| Config | Configuration file or block | settings.toml |
| Service | Service definition | UserService |
| Endpoint | API endpoint | POST /api/login |
| Interface | Interface or trait | Authenticator |
| Struct | Struct definition | UserRecord |
| Enum | Enum definition | UserRole |
| Constant | Constant value | MAX_RETRIES |
| Package | Package or crate | clawft-kernel |

Forensic Domain (12 types)

| Type | Description | Example |
| --- | --- | --- |
| Person | Individual | John Doe |
| Event | Incident or occurrence | Break-in at 123 Main St |
| Evidence | Physical or digital evidence | Bloodstain on doorframe |
| Location | Geographic place | 123 Main Street |
| Timeline | Temporal sequence | Jan 5 - Jan 12 window |
| Document | Report or record | Police report #4521 |
| Hypothesis | Investigative theory | Suspect entered via rear door |
| Organization | Company or group | Acme Corp |
| PhysicalObject | Tangible item | Kitchen knife |
| DigitalArtifact | Digital item | security_cam_01.mp4 |
| FinancialRecord | Transaction or account | Wire transfer $50K |
| Communication | Message or call | Phone call at 11:42 PM |

Shared Types (2 + Custom)

| Type | Description |
| --- | --- |
| File | Source file node |
| Concept | Abstract concept |
| Custom(String) | User-defined type |

Entity Identity

Every entity receives a deterministic 32-byte ID computed as:

EntityId = BLAKE3(domain_byte || entity_type_discriminant || name || source_file)

The DomainTag byte separates namespaces: 0x20 for Code, 0x21 for Forensic, or a custom byte.

Relationship Types

The RelationType enum defines 23 variants plus Custom(String).

Code Domain (10 types)

| Type | Description | Example |
| --- | --- | --- |
| Calls | Function call | auth.validate() calls jwt.decode() |
| Imports | Module import | auth.py imports jwt |
| ImportsFrom | Selective import | from jwt import decode |
| DependsOn | Dependency | auth depends on database |
| Contains | Structural containment | UserModule contains UserService |
| Implements | Interface implementation | AuthService implements Authenticator |
| Configures | Configuration link | settings.toml configures DatabasePool |
| Extends | Inheritance | AdminUser extends User |
| MethodOf | Method membership | validate() is method of AuthService |
| Instantiates | Object creation | main() instantiates AuthService |

Forensic Domain (11 types)

| Type | Description | Example |
| --- | --- | --- |
| WitnessedBy | Witness observation | Break-in witnessed by Jane Smith |
| FoundAt | Evidence location | Bloodstain found at Kitchen |
| Contradicts | Conflicting evidence | Alibi contradicts Surveillance footage |
| Corroborates | Supporting evidence | Phone records corroborate Witness statement |
| AlibiedBy | Alibi relationship | Suspect alibied by Coworker |
| Precedes | Temporal ordering | Phone call precedes Break-in |
| DocumentedIn | Documentation link | Evidence documented in Police report |
| OwnedBy | Ownership | Vehicle owned by Suspect |
| ContactedBy | Communication link | Victim contacted by Unknown caller |
| LocatedAt | Location association | Suspect located at Warehouse |
| SemanticallySimilarTo | Semantic similarity | Witness A statement similar to Witness B statement |

Shared Types (2 + Custom)

| Type | Description |
| --- | --- |
| RelatedTo | General relationship |
| CaseOf | Case association |
| Custom(String) | User-defined type |

Confidence Levels

Every relationship carries a Confidence level that affects edge weight in graph algorithms and scoring in exports.

| Level | Weight | Score | Description |
| --- | --- | --- | --- |
| EXTRACTED | 1.0 | 1.0 | Deterministically extracted from AST or document structure |
| INFERRED | 0.7 | 0.5 | Inferred by LLM or heuristic reasoning |
| AMBIGUOUS | 0.4 | 0.2 | Multiple interpretations possible |

Weight is used as edge weight in graph algorithms (petgraph) and the CausalGraph bridge. Score is used in the JSON export confidence_score field and coherence calculations.
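
As a rough sketch, the weight and score values can be modeled as two methods on an enum. The enum below is a stand-in written for this page; the method names weight and score are assumptions, and the crate's own Confidence type may expose the values differently.

```rust
/// Stand-in for the crate's Confidence type (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Confidence {
    Extracted,
    Inferred,
    Ambiguous,
}

impl Confidence {
    /// Edge weight used by graph algorithms.
    fn weight(self) -> f64 {
        match self {
            Confidence::Extracted => 1.0,
            Confidence::Inferred => 0.7,
            Confidence::Ambiguous => 0.4,
        }
    }

    /// Value written to the confidence_score field in JSON exports.
    fn score(self) -> f64 {
        match self {
            Confidence::Extracted => 1.0,
            Confidence::Inferred => 0.5,
            Confidence::Ambiguous => 0.2,
        }
    }
}
```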

Analysis Features

God Nodes

God nodes are entities with a disproportionately high degree (connection count). They represent coupling risks or core abstractions that many other components depend on.

The analysis filters out file-level hub nodes (where the label matches the source filename), method stubs (.method_name()), and concept nodes (entities without a file extension in their source path).

Results are sorted by degree descending and capped at top_n (default: 10).
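
The filter-then-rank behavior can be sketched as follows. The Node struct and both function names are hypothetical. The three exclusion rules are one plausible reading of the description above (for example, "label matches the source filename" is taken to mean an exact match including the extension).

```rust
use std::path::Path;

/// Hypothetical node record; field names are illustrative only.
#[derive(Debug, Clone)]
struct Node {
    label: String,
    source_file: String,
    degree: usize,
}

/// Returns true for nodes this sketch excludes from the god-node ranking.
fn is_excluded(n: &Node) -> bool {
    let path = Path::new(&n.source_file);
    // File-level hub: the label equals the source file name.
    let is_file_hub = path.file_name().and_then(|f| f.to_str()) == Some(n.label.as_str());
    // Method stub: the label looks like `.method_name()`.
    let is_method_stub = n.label.starts_with('.') && n.label.ends_with("()");
    // Concept node: the source path has no file extension.
    let is_concept = path.extension().is_none();
    is_file_hub || is_method_stub || is_concept
}

/// Keeps eligible nodes, sorts by degree descending, and caps the list at top_n.
fn god_nodes(nodes: &[Node], top_n: usize) -> Vec<Node> {
    let mut kept: Vec<Node> = nodes.iter().filter(|n| !is_excluded(n)).cloned().collect();
    // Ties are broken by label so the output is stable (an assumption of this sketch).
    kept.sort_by(|a, b| b.degree.cmp(&a.degree).then_with(|| a.label.cmp(&b.label)));
    kept.truncate(top_n);
    kept
}
```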

Surprising Connections

Edges that are structurally non-obvious, scored using five factors:

  1. Confidence weight: AMBIGUOUS (+3), INFERRED (+2), EXTRACTED (+1)
  2. Cross file-type: Connection between code and paper/image files (+2)
  3. Cross-directory: Connection across different top-level directories (+2)
  4. Cross-community: Edge bridges separate communities (+1)
  5. Peripheral-to-hub: Low-degree node unexpectedly reaches a high-degree hub (+1)

Semantic similarity edges receive a 1.5x multiplier. Structural relations (imports, imports_from, contains, method) are excluded.

For multi-file corpora, cross-file edges are ranked by composite surprise score. For single-file corpora, cross-community edges are used instead.
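
The five factors combine into a single number per edge. The sketch below assumes the boolean facts are computed elsewhere and that the 1.5x multiplier applies to the summed score. The EdgeFacts struct and surprise_score function are hypothetical names.

```rust
#[derive(Debug, Clone, Copy)]
enum Confidence {
    Extracted,
    Inferred,
    Ambiguous,
}

/// Hypothetical pre-computed facts about one edge.
struct EdgeFacts {
    confidence: Confidence,
    cross_file_type: bool,
    cross_directory: bool,
    cross_community: bool,
    peripheral_to_hub: bool,
    semantically_similar: bool,
    /// True for structural relations, which are excluded from scoring.
    structural: bool,
}

/// Returns None for excluded edges, otherwise the composite surprise score.
fn surprise_score(e: &EdgeFacts) -> Option<f64> {
    if e.structural {
        return None;
    }
    // Factor 1: less certain edges are more surprising.
    let mut score = match e.confidence {
        Confidence::Ambiguous => 3.0,
        Confidence::Inferred => 2.0,
        Confidence::Extracted => 1.0,
    };
    // Factors 2-5: each structural oddity adds a fixed bonus.
    if e.cross_file_type {
        score += 2.0;
    }
    if e.cross_directory {
        score += 2.0;
    }
    if e.cross_community {
        score += 1.0;
    }
    if e.peripheral_to_hub {
        score += 1.0;
    }
    // Assumption: the semantic-similarity multiplier scales the whole sum.
    if e.semantically_similar {
        score *= 1.5;
    }
    Some(score)
}
```

Under these assumptions the maximum score is (3 + 2 + 2 + 1 + 1) x 1.5 = 13.5.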

Question Generation

Five strategies produce investigation questions from graph structure:

| Strategy | Trigger | Example Question |
| --- | --- | --- |
| Ambiguous edges | Edge with AMBIGUOUS confidence | "What is the exact relationship between AuthService and CacheLayer?" |
| Bridge nodes | High betweenness centrality | "Why does ConfigLoader connect auth to database?" |
| Verify inferred | God node with 2+ INFERRED edges | "Are the 5 inferred relationships involving ApiHandler actually correct?" |
| Isolated nodes | Entities with degree 0-1 | "What connects HealthCheck, Metrics to the rest of the system?" |
| Low cohesion | Community with score < 0.15 and 5+ nodes | "Should utils be split into smaller, more focused modules?" |

If none of these strategies produce results, a NoSignal question is returned suggesting more files or deeper extraction.

Graph Diff

Compares two graph snapshots (old vs. new) and reports:

  • New nodes and their labels
  • Removed nodes and their labels
  • New edges with relation type and confidence
  • Removed edges with relation type and confidence
  • Summary string (e.g., "6 new nodes, 14 new edges, 2 edges removed")

Edge comparison uses undirected keys (min_hex, max_hex, relation) to avoid order-dependent duplicates.
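
A minimal sketch of the undirected key, assuming entity IDs are available as hex strings. The names edge_key and diff_edges are illustrative, not the crate's API.

```rust
use std::collections::HashSet;

type EdgeKey = (String, String, String);

/// Builds an order-independent key: the smaller hex ID always comes first.
fn edge_key(source_hex: &str, target_hex: &str, relation: &str) -> EdgeKey {
    let (lo, hi) = if source_hex <= target_hex {
        (source_hex, target_hex)
    } else {
        (target_hex, source_hex)
    };
    (lo.to_string(), hi.to_string(), relation.to_string())
}

/// Returns (added, removed) edge keys between two snapshots, sorted for stable output.
fn diff_edges(
    old: &[(&str, &str, &str)],
    new: &[(&str, &str, &str)],
) -> (Vec<EdgeKey>, Vec<EdgeKey>) {
    let old_keys: HashSet<EdgeKey> = old.iter().map(|(s, t, r)| edge_key(s, t, r)).collect();
    let new_keys: HashSet<EdgeKey> = new.iter().map(|(s, t, r)| edge_key(s, t, r)).collect();
    let mut added: Vec<EdgeKey> = new_keys.difference(&old_keys).cloned().collect();
    let mut removed: Vec<EdgeKey> = old_keys.difference(&new_keys).cloned().collect();
    added.sort();
    removed.sort();
    (added, removed)
}
```

With this key, an edge stored as (a, b) in one snapshot and (b, a) in the other is treated as unchanged rather than as one removal plus one addition.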

Community Detection

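Communities are detected with label propagation, followed by three post-processing steps: oversized splitting, re-indexing, and isolate handling. Each community is then scored for cohesion and auto-labeled.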

Algorithm: Each node starts with a unique label. In each iteration, every node adopts the most frequent label among its neighbors (ties broken by smallest label for determinism). Converges when no labels change, with a 50-iteration safety cap.
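
The core loop can be sketched over a plain adjacency list. This is an illustration under stated assumptions, not the crate's code: nodes are indexed 0..n, labels are updated in place in node order (the text does not say whether updates are synchronous), and label_propagation is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Returns one label per node; nodes sharing a label form a community.
fn label_propagation(adj: &[Vec<usize>]) -> Vec<usize> {
    // Every node starts with its own unique label.
    let mut labels: Vec<usize> = (0..adj.len()).collect();

    // Safety cap of 50 iterations in case labels never settle.
    for _ in 0..50 {
        let mut changed = false;
        for node in 0..adj.len() {
            if adj[node].is_empty() {
                continue; // isolates keep their own label
            }
            // Count how often each label occurs among the neighbors.
            let mut counts: BTreeMap<usize, usize> = BTreeMap::new();
            for &nb in &adj[node] {
                *counts.entry(labels[nb]).or_insert(0) += 1;
            }
            // BTreeMap iterates labels in ascending order, so the strict `>`
            // keeps the smallest label when counts tie.
            let mut best = labels[node];
            let mut best_count = 0;
            for (&label, &count) in &counts {
                if count > best_count {
                    best = label;
                    best_count = count;
                }
            }
            if best != labels[node] {
                labels[node] = best;
                changed = true;
            }
        }
        if !changed {
            break; // converged: no label changed in this pass
        }
    }
    labels
}
```

For two disjoint triangles plus one isolated node, this yields three distinct labels: one per triangle and one for the isolate.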

Oversized splitting: Communities larger than 25% of the graph (and at least 10 nodes) are recursively split by running label propagation on their subgraph.

Re-indexing: Final communities are sorted by size descending so community 0 is always the largest.

Isolates: Nodes with degree 0 each become their own single-node community.

Cohesion scoring: Ratio of actual intra-community edges to maximum possible edges. A complete subgraph scores 1.0, disconnected nodes score 0.0, and a single node scores 1.0 by convention.
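
In formula terms, a community with n members and e internal edges scores e / (n(n - 1)/2). A minimal sketch follows. The function name and the choice to ignore self-loops and duplicate edges are assumptions.

```rust
use std::collections::HashSet;

/// Ratio of actual intra-community edges to the maximum possible number.
fn cohesion(members: &[usize], edges: &[(usize, usize)]) -> f64 {
    let n = members.len();
    // A single node scores 1.0 by convention (this sketch treats an empty set the same way).
    if n <= 1 {
        return 1.0;
    }
    let member_set: HashSet<usize> = members.iter().copied().collect();
    // Collect distinct undirected edges with both endpoints inside the community.
    let mut intra = HashSet::new();
    for &(a, b) in edges {
        if a != b && member_set.contains(&a) && member_set.contains(&b) {
            intra.insert((a.min(b), a.max(b)));
        }
    }
    let max_possible = (n * (n - 1) / 2) as f64;
    intra.len() as f64 / max_possible
}
```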

Auto-labeling: Communities are labeled with the most common source file stem among their members, falling back to the highest-degree node's label.

Export Formats

| Format | Extension | Use Case |
| --- | --- | --- |
| JSON | .json | Machine-readable, node_link_data compatible. Import into other tools. |
| HTML | .html | Interactive vis.js visualization. Open in a browser. Requires html-export feature. |
| GraphML | .graphml | XML interchange format. Import into Gephi, yEd, Cytoscape. |
| Obsidian | vault + .canvas | Obsidian vault with one note per entity plus a canvas file for visual layout. |
| Wiki | .md directory | Wikipedia-style markdown wiki with interlinked pages. |
| Cypher | .cypher | Neo4j Cypher statements. Import into Neo4j. Requires neo4j-export feature. |
| SVG | .svg | Static vector graph rendering. |

Caching

Extraction results are cached using BLAKE3 content hashes of the source file. On re-extraction, only files whose content hash has changed are re-processed.

Cache directory: .weftos/graphify-cache/ (configurable via PipelineConfig::cache_dir).

The --clean flag on weaver graphify rebuild clears the cache before rebuilding.

URL Ingestion

The ingest module fetches URLs, classifies them, and saves annotated markdown (or raw files) for extraction into the graph.

Supported URL Types

| Type | Detection | Output |
| --- | --- | --- |
| Tweet | twitter.com or x.com | Markdown with tweet text via oEmbed API |
| arXiv | arxiv.org | Markdown with title, authors, abstract |
| PDF | .pdf extension | Raw PDF file |
| Image | .png, .jpg, .jpeg, .webp, .gif | Raw image file |
| GitHub | github.com | Webpage markdown |
| YouTube | youtube.com or youtu.be | Webpage markdown |
| Webpage | Everything else | HTML stripped to text, truncated to 12K chars |

SSRF Protection

URL validation blocks:

  • Non-HTTP schemes (file://, ftp://, etc.)
  • Localhost (127.0.0.1, localhost, ::1)
  • Private IP ranges (10.x.x.x, 172.16-31.x.x, 192.168.x.x)
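
The three rules above can be approximated with the standard library alone. This is a hedged sketch, not the crate's validator: is_url_allowed is a hypothetical name, URL parsing is deliberately simplistic, and hostnames are not resolved, so a DNS name that points at a private address would pass this check.

```rust
use std::net::IpAddr;

/// Returns true if the URL passes the scheme, localhost, and private-range checks.
fn is_url_allowed(url: &str) -> bool {
    // Rule 1: only http:// and https:// are accepted.
    let rest = if let Some(r) = url.strip_prefix("http://") {
        r
    } else if let Some(r) = url.strip_prefix("https://") {
        r
    } else {
        return false;
    };

    // The authority ends at the first '/', '?', or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    // Drop any userinfo ("user:pass@").
    let host_port = authority.rsplit('@').next().unwrap_or("");
    // Bracketed IPv6 literal, or host with an optional ":port".
    let host = if let Some(stripped) = host_port.strip_prefix('[') {
        stripped.split(']').next().unwrap_or("")
    } else {
        host_port.split(':').next().unwrap_or("")
    };

    // Rule 2: reject empty hosts and the literal name "localhost".
    if host.is_empty() || host.eq_ignore_ascii_case("localhost") {
        return false;
    }

    // Rules 2 and 3: reject loopback and private IP literals.
    match host.parse::<IpAddr>() {
        Ok(IpAddr::V4(ip)) => !(ip.is_loopback() || ip.is_private()),
        Ok(IpAddr::V6(ip)) => !ip.is_loopback(),
        Err(_) => true, // a DNS name; not resolved in this sketch
    }
}
```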

Query Result Feedback

Query answers can be saved as markdown in the memory directory for re-extraction into the graph, creating a feedback loop where questions and answers become part of the knowledge base.

File Watcher

The polling-based file watcher monitors a directory tree for changes and triggers re-extraction.

Watched extensions (30+): .py, .ts, .js, .go, .rs, .java, .cpp, .c, .rb, .swift, .kt, .cs, .scala, .php, .md, .txt, .rst, .pdf, .png, .jpg, and more.

Ignored directories: .git, graphify-out, __pycache__, node_modules, and any directory whose name starts with a dot (.).

Debounce: Changes are batched with a configurable debounce window (default 2 seconds) before triggering a rebuild. Code-only changes trigger AST-only re-extraction; non-code changes (documents, images) trigger full re-extraction.

Git Hooks

Two hooks are installed into .git/hooks/:

post-commit: Checks git diff --name-only HEAD~1 HEAD for code file changes. If any code files changed, runs weaver graphify rebuild. Detects 18 code extensions.

post-checkout: When a branch switch occurs ($3 == 1) and a graphify-out directory exists, runs weaver graphify rebuild.

Hooks are marker-delimited (# graphify-hook-start / # graphify-hook-end) so they can be cleanly appended to existing hook files and removed without affecting other hook content.
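
The marker approach can be illustrated with two small string functions. Only the marker strings come from the text above. The function names and the exact layout of the installed block are assumptions.

```rust
const START: &str = "# graphify-hook-start";
const END: &str = "# graphify-hook-end";

/// Removes the marker-delimited block and leaves all other hook lines intact.
fn remove_hook_block(content: &str) -> String {
    let mut kept: Vec<&str> = Vec::new();
    let mut inside = false;
    for line in content.lines() {
        match line.trim() {
            START => inside = true,
            END => inside = false,
            _ if !inside => kept.push(line),
            _ => {} // a line inside the graphify block: drop it
        }
    }
    let mut result = kept.join("\n");
    if !result.is_empty() {
        result.push('\n');
    }
    result
}

/// Appends a marker-delimited block, replacing any earlier graphify block.
fn append_hook_block(existing: &str, body: &str) -> String {
    let mut result = remove_hook_block(existing);
    result.push_str(START);
    result.push('\n');
    result.push_str(body.trim_end());
    result.push('\n');
    result.push_str(END);
    result.push('\n');
    result
}
```

Because uninstall only strips the lines between the two markers, a hook file shared with other tools round-trips unchanged.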

Feature Flags

| Feature | Dependencies | Description |
| --- | --- | --- |
| code-domain | (default) | Code analysis entity and relationship types |
| forensic-domain | (none) | Forensic analysis entity and relationship types, gap analysis, coherence scoring |
| ast-extract | tree-sitter | AST-based entity extraction (required by all lang-* features) |
| lang-python | tree-sitter-python | Python grammar |
| lang-javascript | tree-sitter-javascript | JavaScript grammar |
| lang-typescript | tree-sitter-typescript | TypeScript grammar |
| lang-rust | tree-sitter-rust | Rust grammar |
| lang-go | tree-sitter-go | Go grammar |
| lang-java | tree-sitter-java | Java grammar |
| lang-c | tree-sitter-c | C grammar |
| lang-cpp | tree-sitter-cpp | C++ grammar |
| lang-ruby | tree-sitter-ruby | Ruby grammar |
| lang-csharp | tree-sitter-c-sharp | C# grammar |
| lang-all | All lang-* features | Enable every language grammar |
| semantic-extract | (none) | LLM-based semantic extraction for documents |
| vision-extract | (none) | Vision model extraction for images |
| html-export | html-escape | Interactive HTML/vis.js export |
| neo4j-export | (none) | Neo4j Cypher export |
| kernel-bridge | clawft-kernel, async-trait | CausalGraph, HNSW, CrossRef bridge |
| full | All of the above | Enable everything |

Sprint 17: Knowledge Graph Intelligence

Sprint 17 adds advanced graph analysis, exploration, and maintenance capabilities to Graphify.

Community Summaries (KG-002)

GraphRAG-style summary generation produces human-readable descriptions of each detected community. The summarizer analyzes member entities, their relationships, and structural patterns to generate a narrative summary suitable for reports and conversational answers.

weaver graphify rebuild   # summaries generated automatically
cat graphify-out/GRAPH_REPORT.md  # includes community summaries

Data Flow Tracing (KG-006)

BFS dependency flow tracing follows data through call chains from a source entity to all reachable sinks. Useful for understanding how a configuration value propagates through the system or how a security-sensitive input reaches storage.

let flows = trace_data_flow(&graph, source_id, max_depth);
// Returns ordered list of paths with edge types

Entity Deduplication (KG-008)

Detects and merges duplicate entities using a combination of Levenshtein string distance on labels and structural similarity (shared neighbors, same community). Configurable similarity threshold (default 0.85) controls aggressiveness.
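
The label half of the comparison can be sketched as a normalized Levenshtein similarity. This covers the string distance only and leaves out the structural signals. The normalization (1 - distance / longer length) and both function names are assumptions.

```rust
/// Classic dynamic-programming edit distance over Unicode scalar values.
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the processed prefix of `a` and b[..j].
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        let mut cur = Vec::with_capacity(b.len() + 1);
        cur.push(i + 1);
        for (j, cb) in b.iter().enumerate() {
            let substitute = prev[j] + if ca == cb { 0 } else { 1 };
            let delete = prev[j + 1] + 1;
            let insert = cur[j] + 1;
            cur.push(substitute.min(delete).min(insert));
        }
        prev = cur;
    }
    prev[b.len()]
}

/// Similarity in [0, 1]: 1.0 for identical labels, 0.0 for entirely different ones.
fn label_similarity(a: &str, b: &str) -> f64 {
    let max_len = a.chars().count().max(b.chars().count());
    if max_len == 0 {
        return 1.0;
    }
    1.0 - levenshtein(a, b) as f64 / max_len as f64
}
```

With the default threshold of 0.85, "AuthService" and "AuthServices" (similarity about 0.92) would be merge candidates under this measure, while "AuthService" and "CacheLayer" would not.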

Entity Alignment (KG-015)

Cross-graph entity matching for merging knowledge graphs from different sources. Uses label similarity (normalized Levenshtein) combined with structural similarity (Jaccard index on neighbor sets) to find corresponding entities across graphs.

let alignments = align_entities(&graph_a, &graph_b, threshold);
// Returns Vec<(EntityId, EntityId, f64)> — matched pairs with confidence
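
The structural half of the match is a Jaccard index over neighbor sets. The sketch below assumes neighbors are identified by label so that sets from two different graphs are comparable. The function name jaccard is illustrative, and how this value is combined with label similarity is not specified here.

```rust
use std::collections::HashSet;

/// Jaccard index: |A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty.
fn jaccard(a: &HashSet<&str>, b: &HashSet<&str>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}
```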

Conversational Exploration (KG-016)

Stateful multi-turn dialogue interface for exploring knowledge graphs interactively. Maintains conversation context (visited nodes, active filters, current community focus) across turns. Supports natural-language queries that are translated into graph traversals.

MCTS Graph Exploration (KG-007)

Monte Carlo Tree Search for knowledge graph exploration. Uses UCB1 for node selection and random rollouts to discover non-obvious paths and structural patterns. Useful for hypothesis generation in forensic and research domains.
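
UCB1 balances exploitation (mean reward) against exploration (rarely visited children). The sketch below shows the standard formula, mean + c * sqrt(ln(N) / n). The function names, the (reward, visits) tuple layout, and the handling of unvisited children are assumptions rather than the crate's API.

```rust
use std::cmp::Ordering;

/// UCB1 value of a child with `visits` visits under a parent with `parent_visits`.
fn ucb1(total_reward: f64, visits: u32, parent_visits: u32, c: f64) -> f64 {
    // Unvisited children are always tried first.
    if visits == 0 {
        return f64::INFINITY;
    }
    let mean = total_reward / visits as f64;
    let exploration = ((parent_visits.max(1) as f64).ln() / visits as f64).sqrt();
    mean + c * exploration
}

/// Picks the index of the child with the highest UCB1 value.
/// Each child is a (total_reward, visits) pair.
fn select_child(children: &[(f64, u32)], parent_visits: u32, c: f64) -> Option<usize> {
    children
        .iter()
        .enumerate()
        .map(|(i, &(reward, visits))| (i, ucb1(reward, visits, parent_visits, c)))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap_or(Ordering::Equal))
        .map(|(i, _)| i)
}
```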

Multi-hop Beam Search (KG-010)

Prioritized multi-hop traversal with edge-type priors. Unlike BFS which explores uniformly, beam search maintains a priority queue weighted by relationship types and confidence levels, finding the most relevant paths first.

Incremental Updates

Efficient delta-based graph rebuilds that only re-extract changed files and update affected edges. Combined with the file watcher and git hooks, this enables near-real-time graph maintenance on large codebases without full rebuilds.

Newman Modularity (KG-018)

Global partition quality metric (Q score) measuring how well the detected communities reflect the actual graph structure. Ranges from -0.5 to 1.0, where values above 0.3 indicate meaningful community structure.
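
For an undirected, unweighted graph with m edges, Q can be computed per community as L_c / m - (d_c / 2m)^2, where L_c is the number of edges inside community c and d_c is the sum of its members' degrees. A minimal sketch follows. The function signature is hypothetical and edge weights are ignored.

```rust
/// Newman modularity for an undirected, unweighted graph.
/// `community[v]` is the community index of node v.
fn modularity(edges: &[(usize, usize)], community: &[usize]) -> f64 {
    let m = edges.len() as f64;
    if m == 0.0 {
        return 0.0;
    }
    let k = community.iter().copied().max().map_or(0, |c| c + 1);
    let mut intra = vec![0.0_f64; k]; // L_c: edges inside community c
    let mut degree = vec![0.0_f64; k]; // d_c: total degree of community c
    for &(a, b) in edges {
        degree[community[a]] += 1.0;
        degree[community[b]] += 1.0;
        if community[a] == community[b] {
            intra[community[a]] += 1.0;
        }
    }
    (0..k)
        .map(|c| intra[c] / m - (degree[c] / (2.0 * m)).powi(2))
        .sum()
}
```

Two triangles joined by a single bridge edge, partitioned one triangle per community, give Q = 5/14 (about 0.357), just above the 0.3 mark. Putting all six nodes in one community gives Q = 0.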

EML Score Fusion (KG-001)

Hybrid query scoring that combines four signals: keyword match (TF-IDF on entity labels), graph proximity (shortest path distance), community membership (same-community bonus), and entity type matching. Produces a unified relevance score for search results.

Sonobuoy Sensor Graph (KG-013)

Specialized graph construction for time-series sensor data using GraphSAGE-style neighbor aggregation and temporal features. Designed for IoT and edge deployments where sensor readings flow through the knowledge graph with time-decay weighting.

Integration with WeftOS

CausalGraph Bridge

With the kernel-bridge feature, the GraphifyBridge maps the KnowledgeGraph into the ECC subsystems:

  • CausalGraph: Each entity becomes a causal node. Relationships are mapped to CausalEdgeType variants (Calls/Imports to Causes, Contains to Enables, Contradicts to Contradicts, Corroborates to EvidenceFor, Precedes to Follows, AlibiedBy to Inhibits, remainder to Correlates).
  • HNSW: Entity labels and types are embedded into the HNSW vector index for semantic similarity search.
  • CrossRefStore: Each entity and relationship gets a CrossRef entry with a Graphify-namespaced UniversalNodeId (structure tag 0x20). Relationship types are preserved as custom discriminants in the 0x20..0x3F range.

The bridge supports bidirectional operation: ingest() pushes a KnowledgeGraph into ECC, and export_from_causal() reconstructs a KnowledgeGraph from the CausalGraph.
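
The relationship mapping described above amounts to a single match expression. The two enums below are trimmed stand-ins written for this sketch. The real RelationType and CausalEdgeType live in clawft-graphify and clawft-kernel and have more variants, and to_causal is a hypothetical function name.

```rust
/// Trimmed stand-in for the crate's RelationType (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq)]
enum RelationType {
    Calls,
    Imports,
    Contains,
    Contradicts,
    Corroborates,
    Precedes,
    AlibiedBy,
    RelatedTo,
}

/// Stand-in for the kernel's CausalEdgeType, using the variant names listed above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum CausalEdgeType {
    Causes,
    Enables,
    Contradicts,
    EvidenceFor,
    Follows,
    Inhibits,
    Correlates,
}

/// Maps a knowledge-graph relation to its causal edge type.
fn to_causal(rel: RelationType) -> CausalEdgeType {
    match rel {
        RelationType::Calls | RelationType::Imports => CausalEdgeType::Causes,
        RelationType::Contains => CausalEdgeType::Enables,
        RelationType::Contradicts => CausalEdgeType::Contradicts,
        RelationType::Corroborates => CausalEdgeType::EvidenceFor,
        RelationType::Precedes => CausalEdgeType::Follows,
        RelationType::AlibiedBy => CausalEdgeType::Inhibits,
        // Every other relation falls back to Correlates.
        _ => CausalEdgeType::Correlates,
    }
}
```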

9th Assessment Analyzer

The GraphifyAnalyzer is the 9th analyzer in the WeftOS assessment pipeline. It produces findings in three categories:

  • Complexity: God nodes with high connection counts flagged as coupling risks
  • Dependencies: Surprising cross-community connections flagged as unexpected coupling
  • Architecture: Singleton communities (isolated entities) flagged as architectural concerns

HNSW Indexing

Entity labels are embedded and indexed in the HNSW service, enabling semantic search queries like "find entities similar to authentication handler" without exact keyword matching.
