Knowledge Pipeline
The knowledge pipeline is MARC27’s ontology engine. It transforms raw data (papers, datasets, ontologies) into a structured knowledge graph with semantic search. This is the core product — customers can build their own ontologies on their own infrastructure.
Architecture
    Input sources (R2, HF, URLs, local files)
      ↓
    Text extraction (PyMuPDF for PDFs, direct read for text)
      ↓
    Chunking (~4K chars, respecting page boundaries)
      ↓
    Embedding (Gemini text-embedding-004 or local BGE-M3)
      → stored in pgvector with corpus_id, doc_id, metadata
      ↓
    Entity extraction (LLM: GLM-4.5-Air free, or any model)
      → entities + relationships stored in Neo4j
      ↓
    Knowledge graph ready for search

Data Sources

Current Graph

| Source | Nodes | Edges | Description |
|---|---|---|---|
| MatKG | 69,618 | 5.38M | Curated from 5M materials science papers (MatScholar) |
| Materials Project | 210K+ | merged | Crystal structures with typed properties |
| NASA Propulsion | — | — | 444 papers, text embedded in pgvector |
Entity Types
| Code | Label | Examples |
|---|---|---|
| CHM | Chemical | Nickel, Silicon Carbide, Ti-6Al-4V |
| MAT | Material | With typed properties: band_gap, density, crystal_system |
| PRO | Property | Creep Resistance, Tensile Strength, Band Gap |
| CMT | Characterization Method | XRD, SEM, DFT |
| APL | Application | Turbine Blade, Heat Shield |
| PHS | Phase | Austenite, Martensite |
| DSC | Descriptor | Nanostructured, Amorphous |
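The codes in the table map naturally onto an enumeration. The following is a minimal sketch, assuming a client wants to validate an entity_type value before sending it to the graph; `EntityType` and `parse_entity_type` are hypothetical names for illustration, not part of the MARC27 API.

```python
from enum import Enum


class EntityType(Enum):
    """Entity codes from the table above, mapped to their labels."""
    CHM = "Chemical"
    MAT = "Material"
    PRO = "Property"
    CMT = "Characterization Method"
    APL = "Application"
    PHS = "Phase"
    DSC = "Descriptor"


def parse_entity_type(code: str) -> EntityType:
    """Look up an entity type by its three-letter code (case-insensitive)."""
    try:
        return EntityType[code.strip().upper()]
    except KeyError:
        # Hypothetical error handling: reject codes not listed in the table.
        raise ValueError(f"unknown entity type code: {code!r}") from None
```

Rejecting unknown codes on the client side keeps malformed types from reaching the graph, though the server may apply its own validation.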
How Ontologies Are Built
1. Backbone Import (batch)

Pre-curated datasets imported directly into Neo4j:

    # MatKG: curated graph from 5M papers
    POST /graph/seed {"nodes_url": "<presigned R2 URL>", "edges_url": "<presigned R2 URL>"}

    # Materials Project: structured properties
    POST /graph/ingest {"entities": [{"name": "Si", "entity_type": "MAT", "properties": {"band_gap": 1.12}}]}

2. Paper Ingestion (batch or on-the-fly)

    prism ingest ./papers/ --corpus my-project

For each file:
- Extract text: PyMuPDF for PDFs (free, local)
- Chunk: ~4K chars per chunk, page boundaries preserved
- Embed: Gemini text-embedding-004 (cheap) or BGE-M3 (local, free)
- Extract entities: Free LLM (GLM-4.5-Air) identifies chemicals, materials, properties
- Store: entities → Neo4j, embeddings → pgvector
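The chunking step can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual implementation: it assumes text has already been extracted one string per page, that a chunk never crosses a page boundary, and that a long page is cut into consecutive slices. `Chunk` and `chunk_pages` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    page: int  # 1-based page number the text came from
    text: str


def chunk_pages(pages: list[str], max_chars: int = 4000) -> list[Chunk]:
    """Split per-page text into chunks of at most max_chars characters.

    Assumption: a chunk never spans two pages; empty pages are skipped.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: list[Chunk] = []
    for page_number, text in enumerate(pages, start=1):
        text = text.strip()
        # range() is empty for an empty page, so nothing is emitted for it.
        for start in range(0, len(text), max_chars):
            chunks.append(Chunk(page=page_number, text=text[start:start + max_chars]))
    return chunks
```

Keeping the page number on each chunk lets search results point back to the page they came from. A production chunker would likely also avoid cutting mid-sentence, which this sketch does not attempt.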
3. On-the-fly (RLM Research)
During a research query, the LLM can call:

    # In REPL: web search finds a paper
    papers = web_search("ablative thermal protection", limit=5)

    # Ingest it permanently into the knowledge graph
    result = ingest_paper("https://arxiv.org/pdf/2401.12345")
    # → downloads, extracts, embeds, returns {chunks: 15, pages: 8}

    # Now search finds it
    docs = vector_search("ablative heat shield PICA")
    # → returns the paper content we just ingested

Every research query can expand the graph. The flywheel: more queries → more papers → richer graph → better answers.
Tenant Isolation
Every entity and embedding is scoped by tenant:
- public: shared knowledge (MatKG, Materials Project)
- org:{org_id}: organization-private (e.g., ITER’s data)
- user:{user_id}: user-private
Queries are filtered by tenant. A user sees: their private data + their org’s data + public data.
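The visibility rule can be sketched as a small filter. This is a minimal illustration, assuming each stored record carries a `tenant` field holding one of the scope strings above; the field name and both function names are hypothetical, and the real filtering presumably happens inside the Neo4j and pgvector queries.

```python
from typing import Optional


def visible_tenants(user_id: str, org_id: Optional[str] = None) -> list[str]:
    """Return the tenant scopes a user may read, most private first."""
    scopes = [f"user:{user_id}"]
    if org_id is not None:
        scopes.append(f"org:{org_id}")
    scopes.append("public")
    return scopes


def filter_by_tenant(records: list[dict], scopes: list[str]) -> list[dict]:
    """Keep only records whose 'tenant' field is in the allowed scopes."""
    allowed = set(scopes)
    # Records without a tenant field are dropped rather than treated as public.
    return [r for r in records if r.get("tenant") in allowed]
```

Dropping records that lack a tenant is a deliberate choice in this sketch: an unlabeled record fails closed instead of leaking.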
Corpus System
Each data source is a corpus with a UUID:
- 00000000-0000-4000-c000-000000000001: default/public
- 00000000-0000-4000-c000-000000000002: NASA propulsion
- Custom UUIDs for customer corpora
Embeddings carry corpus_id for filtering. Searches can be corpus-scoped.
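To illustrate what corpus scoping means, here is an in-memory stand-in for a corpus-filtered similarity search. It is a sketch only: in the real system pgvector performs this inside the database, and the row layout (`doc_id`, `corpus_id`, `embedding` keys) and the `search` function are assumptions made for the example.

```python
import math

NASA_PROPULSION = "00000000-0000-4000-c000-000000000002"


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(rows, query_vec, corpus_id=None, top_k=5):
    """Rank rows by similarity to query_vec, optionally within one corpus."""
    candidates = [
        r for r in rows if corpus_id is None or r["corpus_id"] == corpus_id
    ]
    candidates.sort(key=lambda r: cosine(r["embedding"], query_vec), reverse=True)
    return candidates[:top_k]
```

Passing `corpus_id=NASA_PROPULSION` restricts the ranking to that corpus; leaving it as `None` searches everything the caller can see.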
API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /graph/search | GET | Search entities by name |
| /graph/entity/{name} | GET | Get entity neighbors |
| /graph/stats | GET | Node/edge/type counts |
| /graph/ingest | POST | Add entities + relationships |
| /graph/seed | POST | Bulk import from CSV URLs |
| /search | POST | Semantic vector search |
| /embed | POST | Embed single document |
| /embed/bulk | POST | Embed batch of documents |
| /research/query | POST | RLM research (SSE stream) |
| /ingest | POST | Full pipeline: extract + embed + graph |
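As an example of calling one of these endpoints, the sketch below builds a POST request for /graph/ingest using the payload shape shown in the Backbone Import section. The base URL is a placeholder and `build_ingest_request` is a hypothetical helper; authentication is omitted because this page does not describe it.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # placeholder; substitute your node's address


def build_ingest_request(
    entities: list[dict], base_url: str = BASE_URL
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for /graph/ingest."""
    body = json.dumps({"entities": entities}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/graph/ingest",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_ingest_request(
    [{"name": "Si", "entity_type": "MAT", "properties": {"band_gap": 1.12}}]
)
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

Separating request construction from sending makes the payload easy to inspect or log before anything touches the network.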
For Enterprise Customers
- prism node up: starts the runtime on their infrastructure
- Upload data to their node’s local storage
- prism ingest ./data/ --corpus my-project: builds their ontology
- Data never leaves their network (air-gapped mode uses BGE-M3 locally)
- Their Neo4j + pgvector are theirs; full export available