
Knowledge Pipeline

The knowledge pipeline is MARC27’s ontology engine. It transforms raw data (papers, datasets, ontologies) into a structured knowledge graph with semantic search. This is the core product — customers can build their own ontologies on their own infrastructure.

  1. Input sources (R2, HF, URLs, local files)
  2. Text extraction (PyMuPDF for PDFs, direct read for text)
  3. Chunking (~4K chars, respecting page boundaries)
  4. Embedding (Gemini text-embedding-004 or local BGE-M3) → stored in pgvector with corpus_id, doc_id, metadata
  5. Entity extraction (LLM: GLM-4.5-Air free, or any model) → entities + relationships stored in Neo4j
  6. Knowledge graph ready for search

| Source | Nodes | Edges | Description |
|---|---|---|---|
| MatKG | 69,618 | 5.38M | Curated from 5M materials science papers (MatScholar) |
| Materials Project | 210K+ | merged | Crystal structures with typed properties |
| NASA Propulsion | | | 444 papers, text embedded in pgvector |

| Code | Label | Examples |
|---|---|---|
| CHM | Chemical | Nickel, Silicon Carbide, Ti-6Al-4V |
| MAT | Material | With typed properties: band_gap, density, crystal_system |
| PRO | Property | Creep Resistance, Tensile Strength, Band Gap |
| CMT | Characterization Method | XRD, SEM, DFT |
| APL | Application | Turbine Blade, Heat Shield |
| PHS | Phase | Austenite, Martensite |
| DSC | Descriptor | Nanostructured, Amorphous |
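
The entity type codes above can be checked on the client side before anything is sent to the graph. The following is a minimal sketch, not part of the product: `EntityType` and `validate_entity` are hypothetical names, and the dict layout (`name`, `entity_type`, `properties`) is borrowed from the `/graph/ingest` example on this page.

```python
from enum import Enum


class EntityType(Enum):
    """Entity type codes from the table above, mapped to their labels."""
    CHM = "Chemical"
    MAT = "Material"
    PRO = "Property"
    CMT = "Characterization Method"
    APL = "Application"
    PHS = "Phase"
    DSC = "Descriptor"


def validate_entity(entity: dict) -> dict:
    """Hypothetical helper: reject entities with no name or an unknown type code."""
    name = entity.get("name", "")
    if not isinstance(name, str) or not name.strip():
        raise ValueError("entity needs a non-empty 'name'")
    code = entity.get("entity_type")
    if code not in EntityType.__members__:
        raise ValueError(f"unknown entity_type: {code!r}")
    return entity


silicon = validate_entity(
    {"name": "Si", "entity_type": "MAT", "properties": {"band_gap": 1.12}}
)
print(EntityType[silicon["entity_type"]].value)  # Material
```

Whether the server enforces the same set of codes is not stated here, so treat this as a convenience check only.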

Pre-curated datasets imported directly into Neo4j:

# MatKG — curated graph from 5M papers
POST /graph/seed
{"nodes_url": "<presigned R2 URL>", "edges_url": "<presigned R2 URL>"}
# Materials Project — structured properties
POST /graph/ingest
{"entities": [{"name": "Si", "entity_type": "MAT", "properties": {"band_gap": 1.12}}]}
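
The same requests can be prepared from Python with the standard library alone. This is a sketch under assumptions: `BASE_URL` and `build_post` are hypothetical, and any authentication headers the API may require are not documented here, so they are left out. The request is only constructed, not sent.

```python
import json
import urllib.request

# Hypothetical address; replace with wherever your node is reachable.
BASE_URL = "http://localhost:8080"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for the given API path."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + path,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Payload copied from the Materials Project example above.
ingest_req = build_post(
    "/graph/ingest",
    {"entities": [{"name": "Si", "entity_type": "MAT", "properties": {"band_gap": 1.12}}]},
)
print(ingest_req.get_method(), ingest_req.full_url)

# To actually send it against a running node:
#   with urllib.request.urlopen(ingest_req) as resp:
#       print(resp.status)
```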

To ingest local files from the command line:

prism ingest ./papers/ --corpus my-project

For each file:

  1. Extract text: PyMuPDF for PDFs (free, local)
  2. Chunk: ~4K chars per chunk, page boundaries preserved
  3. Embed: Gemini text-embedding-004 (cheap) or BGE-M3 (local, free)
  4. Extract entities: Free LLM (GLM-4.5-Air) identifies chemicals, materials, properties
  5. Store: entities → Neo4j, embeddings → pgvector
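
The chunking step can be sketched as follows. The pipeline's actual splitting logic is not shown on this page; this is one plausible reading of "~4K chars, page boundaries preserved". `chunk_pages` is a hypothetical name, and the input is assumed to be a list of per-page strings such as a PDF extractor would produce.

```python
def chunk_pages(pages: list, max_chars: int = 4000) -> list:
    """Split per-page text into chunks of at most max_chars characters.

    A chunk never spans two pages, so each chunk keeps a single page number.
    """
    chunks = []
    for page_no, text in enumerate(pages, start=1):
        text = text.strip()
        start = 0
        while start < len(text):
            end = min(start + max_chars, len(text))
            if end < len(text):
                # Prefer to cut at the last space inside the window
                # so words are not split in the middle.
                cut = text.rfind(" ", start, end)
                if cut > start:
                    end = cut
            piece = text[start:end].strip()
            if piece:
                chunks.append({"page": page_no, "text": piece})
            start = end
    return chunks


pages = ["alpha " * 1000, "short page"]
chunks = chunk_pages(pages)
for c in chunks:
    print(c["page"], len(c["text"]))
```

Each resulting chunk would then be embedded on its own, with the page number kept as metadata alongside corpus_id and doc_id.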

During a research query, the LLM can call:

# In REPL: web search finds a paper
papers = web_search("ablative thermal protection", limit=5)
# Ingest it permanently into the knowledge graph
result = ingest_paper("https://arxiv.org/pdf/2401.12345")
# → downloads, extracts, embeds, returns {chunks: 15, pages: 8}
# Now search finds it
docs = vector_search("ablative heat shield PICA")
# → returns the paper content we just ingested

Every research query can expand the graph. The flywheel: more queries → more papers → richer graph → better answers.

Every entity and embedding is scoped by tenant:

  • public — shared knowledge (MatKG, Materials Project)
  • org:{org_id} — organization-private (e.g., ITER’s data)
  • user:{user_id} — user-private

Queries are filtered by tenant. A user sees: their private data + their org’s data + public data.
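
The visibility rule can be written as a small filter. This is an illustrative sketch only: the function names and the `tenant` field are assumptions, and a real deployment would presumably apply the filter inside the Neo4j and pgvector queries rather than in application code.

```python
from typing import Optional


def visible_tenants(user_id: str, org_id: Optional[str] = None) -> list:
    """Tenant scopes a user may read: own data, own org's data, public data."""
    scopes = [f"user:{user_id}"]
    if org_id is not None:
        scopes.append(f"org:{org_id}")
    scopes.append("public")
    return scopes


def filter_by_tenant(records: list, user_id: str, org_id: Optional[str] = None) -> list:
    """Keep only records whose (hypothetical) 'tenant' field is visible to the user."""
    allowed = set(visible_tenants(user_id, org_id))
    return [r for r in records if r.get("tenant") in allowed]


records = [
    {"name": "Ti-6Al-4V", "tenant": "public"},
    {"name": "internal alloy", "tenant": "org:iter"},
    {"name": "draft notes", "tenant": "user:alice"},
    {"name": "someone else's notes", "tenant": "user:bob"},
]
for r in filter_by_tenant(records, user_id="alice", org_id="iter"):
    print(r["name"])
```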

Each data source is a corpus with a UUID:

  • 00000000-0000-4000-c000-000000000001 — default/public
  • 00000000-0000-4000-c000-000000000002 — NASA propulsion
  • Custom UUIDs for customer corpora

Embeddings carry corpus_id for filtering. Searches can be corpus-scoped.
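
To show what corpus scoping means in practice, here is a small in-memory sketch of a similarity search restricted to one corpus. It stands in for the pgvector query, which is not shown on this page; `corpus_search`, the row layout, and the two-dimensional toy vectors are all assumptions made for illustration.

```python
import math

PUBLIC_CORPUS = "00000000-0000-4000-c000-000000000001"
NASA_CORPUS = "00000000-0000-4000-c000-000000000002"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def corpus_search(rows, query_vec, corpus_id=None, limit=5):
    """Rank rows by similarity to query_vec, optionally restricted to one corpus."""
    candidates = [r for r in rows if corpus_id is None or r["corpus_id"] == corpus_id]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r["embedding"], query_vec),
        reverse=True,
    )
    return ranked[:limit]


rows = [
    {"doc_id": "a", "corpus_id": PUBLIC_CORPUS, "embedding": [1.0, 0.0]},
    {"doc_id": "b", "corpus_id": NASA_CORPUS, "embedding": [0.9, 0.1]},
    {"doc_id": "c", "corpus_id": NASA_CORPUS, "embedding": [0.0, 1.0]},
]
hits = corpus_search(rows, [1.0, 0.0], corpus_id=NASA_CORPUS)
print([h["doc_id"] for h in hits])  # ['b', 'c']
```

Document "a" is the closest match overall but is excluded because it belongs to a different corpus.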

| Endpoint | Method | Description |
|---|---|---|
| /graph/search | GET | Search entities by name |
| /graph/entity/{name} | GET | Get entity neighbors |
| /graph/stats | GET | Node/edge/type counts |
| /graph/ingest | POST | Add entities + relationships |
| /graph/seed | POST | Bulk import from CSV URLs |
| /search | POST | Semantic vector search |
| /embed | POST | Embed single document |
| /embed/bulk | POST | Embed batch of documents |
| /research/query | POST | RLM research (SSE stream) |
| /ingest | POST | Full pipeline: extract + embed + graph |

Customers run the pipeline on their own infrastructure:

  1. prism node up: starts the runtime on their infrastructure
  2. Upload data to their node’s local storage
  3. prism ingest ./data/ --corpus my-project: builds their ontology
  4. Data never leaves their network (air-gapped mode uses BGE-M3 locally)
  5. Their Neo4j + pgvector are theirs; full export is available