Concept Graphs is a Flask-based API for building, storing, inspecting, and extending concept graphs from document corpora.
The application processes text documents through a pipeline:
- Preprocessing: extracts noun chunks / phrases from documents.
- Embedding: embeds extracted phrases into a vector space.
- Clustering: groups related phrases into concept clusters.
- Graph creation: creates one graph per concept cluster.
- Optional integration: stores phrase/document information in an external vector store.
- Optional RAG: initializes a retrieval-augmented generation component over processed documents.
The API also exposes endpoints to inspect process status, retrieve graph data, add documents to existing graphs, and ask questions via RAG.
The implementation is based on the Concept Graphs approach described in the references below [1].
The project uses Python 3.11 and uv.
Install dependencies with:
uv sync
Start the API directly:
uv run python main.py
By default, the Flask application listens on:
http://localhost:9010
The OpenAPI UI is available at:
http://localhost:9010/
http://localhost:9010/openapi
Build and start the services:
docker compose build
docker compose up -d
Depending on the compose configuration, the API is usually exposed at:
http://localhost:9007
Generated results are written to the configured storage directory inside the container, typically mounted from the results Docker volume.
Most API operations are associated with a process name. A process represents one corpus and its generated pipeline artifacts.
If no process is supplied, the API uses:
default
Use the process query parameter to select a process:
curl "http://localhost:9010/status?process=my_corpus"
Process names are normalized by the server.
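The same selection can be made from a script. The following is a minimal sketch that only builds request URLs. The helper name `endpoint_url` and the `BASE_URL` constant are illustrative assumptions and not part of the project.

```python
from urllib.parse import urlencode

# Local default address from this README; adjust for Docker deployments.
BASE_URL = "http://localhost:9010"


def endpoint_url(path, process="default", **params):
    """Build an API URL carrying the process query parameter.

    Extra keyword arguments are appended as further query parameters.
    """
    query = {"process": process, **params}
    return f"{BASE_URL}{path}?{urlencode(query)}"


if __name__ == "__main__":
    print(endpoint_url("/status", process="my_corpus"))
    print(endpoint_url("/clustering/concepts", process="my_corpus", top_k=15))
```

Omitting `process` falls back to `default`, mirroring the server-side behavior described above.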
The OpenAPI specification is served by the application UI and is defined in:
api/concept-graphs-api.yml
Configured server URLs include:
http://top-prod:9007
http://localhost:9007
http://localhost:9010
Starts a complete concept-graph pipeline.
The endpoint accepts either:
- multipart/form-data
- application/json
A pipeline can read documents from:
- an uploaded zip file, or
- an external document server.
If a vector store is configured and reachable, the API can additionally run an integration step. If the vector store is not reachable, the API falls back to pickle-based storage where possible.
| Name | Type | Default | Description |
|---|---|---|---|
| process | string | default | Name of the corpus/process. |
| language | string | en | Language of the documents. Common values are en and de. |
| skip_present | boolean | true | Skip already completed serialized steps. |
| skip_steps | string | | Comma-separated list of steps to skip. Supported values include data, embedding, clustering, graph, and integration. |
| return_statistics | boolean | false | If true, waits for the pipeline to finish and returns graph statistics. This may take a long time. |
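As an illustration of how these query parameters combine, here is a small sketch that assembles the query string for /pipeline and rejects unknown step names before a request is sent. The function name `pipeline_query` is hypothetical, and the step list is copied from the table above, which says "include", so the server may accept further values.

```python
from urllib.parse import urlencode

# Step names listed in the table above (possibly not exhaustive).
SUPPORTED_STEPS = ("data", "embedding", "clustering", "graph", "integration")


def pipeline_query(process="default", language="en", skip_present=True,
                   skip_steps=(), return_statistics=False):
    """Return the query string for a /pipeline request."""
    unknown = [step for step in skip_steps if step not in SUPPORTED_STEPS]
    if unknown:
        raise ValueError(f"unsupported steps: {', '.join(unknown)}")
    params = {
        "process": process,
        "language": language,
        # Booleans are written in lowercase, as in the curl examples.
        "skip_present": str(skip_present).lower(),
        "return_statistics": str(return_statistics).lower(),
    }
    if skip_steps:
        params["skip_steps"] = ",".join(skip_steps)
    # Keep commas readable instead of percent-encoding them.
    return urlencode(params, safe=",")


if __name__ == "__main__":
    print(pipeline_query("my_corpus", skip_steps=["embedding", "clustering"]))
```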
curl -X POST "http://localhost:9010/pipeline?process=my_corpus&language=en&skip_present=true" \
  -F data=@"./documents.zip" \
  -F data_config=@"./data-config.yaml" \
  -F embedding_config=@"./embedding-config.yaml" \
  -F clustering_config=@"./clustering-config.yaml" \
  -F graph_config=@"./graph-config.yaml"
Supported multipart fields:
| Field | Required | Description |
|---|---|---|
| data | conditionally | Zip file containing input text documents. Required unless document_server_config is provided. |
| document_server_config | conditionally | YAML config for loading documents from an external document server. Required unless data is provided. |
| vectorstore_server_config | no | YAML config for an external vector store. |
| labels | no | YAML file mapping document names/ids to labels. |
| data_config | no | Preprocessing configuration. |
| embedding_config | no | Embedding configuration. |
| clustering_config | no | Clustering configuration. |
| graph_config | no | Graph creation configuration. |
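The "conditionally required" rule in the table can be checked on the client before uploading anything. The sketch below is an assumption about how such a pre-flight check might look. The function `check_multipart_fields` is hypothetical, and the server performs its own validation regardless.

```python
# Field names taken from the table above.
KNOWN_FIELDS = {
    "data", "document_server_config", "vectorstore_server_config", "labels",
    "data_config", "embedding_config", "clustering_config", "graph_config",
}


def check_multipart_fields(fields):
    """Return a list of problems found in a set of multipart field names."""
    fields = set(fields)
    problems = [f"unknown field: {name}" for name in sorted(fields - KNOWN_FIELDS)]
    # At least one document source is needed: a zip upload or a document server.
    if not fields & {"data", "document_server_config"}:
        problems.append("either data or document_server_config is required")
    return problems


if __name__ == "__main__":
    print(check_multipart_fields(["data", "graph_config"]))
    print(check_multipart_fields(["labels"]))
```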
curl -X POST "http://localhost:9010/pipeline?process=my_corpus&language=en" \
  -H "Content-Type: application/json" \
  -d @pipeline-config.json
A JSON pipeline configuration may contain:
{
  "name": "my_corpus",
  "language": "en",
  "document_server": {
    "url": "http://localhost",
    "port": 9008,
    "index": "documents",
    "size": 30,
    "other_id": "id",
    "label_key": "label",
    "replace_keys": {
      "text": "content"
    }
  },
  "vectorstore_server": {
    "url": "http://localhost",
    "port": 8882
  },
  "config": {
    "data": {},
    "embedding": {},
    "clustering": {},
    "graph": {}
  }
}
If return_statistics=false, the endpoint starts the pipeline asynchronously and returns 202 Accepted with the current process status.
If return_statistics=true, the endpoint waits for the pipeline thread to finish and returns graph statistics.
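For scripted use, the JSON body can be assembled programmatically and written to a file for `curl -d @pipeline-config.json`. This is a minimal sketch: the helper `build_pipeline_config` is hypothetical, the key names follow the JSON example above, and the empty per-step objects simply mirror that example.

```python
import json

# Pipeline steps that appear under "config" in the example above.
STEPS = ("data", "embedding", "clustering", "graph")


def build_pipeline_config(name, language="en", document_server=None,
                          vectorstore_server=None, **step_configs):
    """Assemble a JSON-serializable pipeline configuration."""
    payload = {
        "name": name,
        "language": language,
        # Steps without an explicit configuration get an empty object.
        "config": {step: step_configs.get(step, {}) for step in STEPS},
    }
    if document_server is not None:
        payload["document_server"] = document_server
    if vectorstore_server is not None:
        payload["vectorstore_server"] = vectorstore_server
    return payload


if __name__ == "__main__":
    config = build_pipeline_config(
        "my_corpus",
        document_server={"url": "http://localhost", "port": 9008,
                         "index": "documents", "size": 30},
        graph={"cluster": {"distance": 0.7, "min_size": 4}},
    )
    print(json.dumps(config, indent=2))
```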
Returns either a default pipeline configuration or a stored configuration for a process.
curl "http://localhost:9010/pipeline/configuration?default=true&language=en"
Query parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| default | boolean | true | If true, returns the default configuration for the selected language. |
| process | string | default | Process name, used when default=false. |
| language | string | en | Language for the default configuration. |
Examples:
curl "http://localhost:9010/pipeline/configuration?default=true&language=en"
curl "http://localhost:9010/pipeline/configuration?default=false&process=my_corpus"
The standalone preprocessing creation endpoint is no longer exposed. Preprocessing is run through /pipeline.
The following endpoints inspect stored preprocessing results.
Returns basic statistics for a processed corpus.
curl "http://localhost:9010/preprocessing/statistics?process=my_corpus"
Returns extracted noun chunks / phrase chunks.
curl "http://localhost:9010/preprocessing/noun_chunks?process=my_corpus"
Embedding is run through /pipeline.
Returns statistics for the stored embedding object.
curl "http://localhost:9010/embedding/statistics?process=my_corpus"
Clustering is run through /pipeline.
Returns the concepts found during clustering.
curl "http://localhost:9010/clustering/concepts?process=my_corpus&top_k=15&distance=0.6"
Query parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| process | string | default | Process name. |
| top_k | integer | 15 | Number of representative phrases to return for each concept. |
| distance | number | 0.6 | Cosine distance threshold for representatives. |
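To make the distance parameter concrete: cosine distance is one minus cosine similarity, so 0 means identical direction and 1 means orthogonal vectors. The sketch below shows one plausible way a threshold and top_k could interact when picking representatives. It is an assumption for illustration only, with hypothetical function names and toy vectors, and does not reproduce the server's actual selection logic.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def representatives(center, phrases, top_k=15, distance=0.6):
    """Return up to top_k phrases within `distance` of the center, closest first."""
    scored = [(cosine_distance(center, vec), text) for text, vec in phrases.items()]
    return [text for d, text in sorted(scored) if d <= distance][:top_k]


if __name__ == "__main__":
    # Toy two-dimensional embeddings, not real model output.
    toy = {"heart attack": [1.0, 0.0], "cardiac event": [1.0, 1.0], "invoice": [0.0, 1.0]}
    print(representatives([1.0, 0.0], toy))
```

A lower threshold keeps only phrases very close to the concept center, while top_k caps the length of the returned list.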
Graph creation is run through /pipeline.
Returns basic graph statistics for a process.
curl "http://localhost:9010/graph/statistics?process=my_corpus"
Returns nodes and adjacency information for a specific graph.
curl "http://localhost:9010/graph/0?process=my_corpus"
To request a rendered graph where supported:
curl "http://localhost:9010/graph/0?process=my_corpus&draw=true"
Query parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| process | string | default | Process name. |
| draw | boolean | false | If true, returns a rendered graph instead of JSON where supported. |
Adds one or more documents to an existing process and integrates their phrases into the graphs built for that corpus.
The request body must be JSON.
curl -X POST "http://localhost:9010/graph/document/add?process=my_corpus" \
  -H "Content-Type: application/json" \
  -d '{
    "language": "en",
    "documents": [
      {
        "id": "doc-001",
        "name": "example.txt",
        "content": "The document text goes here.",
        "label": "optional-label"
      }
    ],
    "vectorstore_server": {
      "url": "http://localhost",
      "port": 8882
    }
  }'
Request fields:
| Field | Type | Required | Description |
|---|---|---|---|
| language | string | yes | Document language. |
| documents | array | yes | Documents to add. |
| documents[].id | string | no | External document id. |
| documents[].name | string | yes | Document name. |
| documents[].content | string | yes | Document text. |
| documents[].label | string | no | Optional document label. |
| vectorstore_server | object | no | Vector store connection settings. |
| document_server | object | no | Reserved for document-server based additions. |
The endpoint starts an asynchronous document-addition thread.
Returns the status or result of a document-addition task.
curl "http://localhost:9010/graph/document/add/status?process=my_corpus"
Possible responses include:
- 200 OK: task finished and returned a result
- 202 Accepted: task is still running
- 404 Not Found: no document-addition task exists for the process
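Because the task runs asynchronously, a client typically polls this endpoint until the status code changes. The following sketch maps the three documented status codes onto a polling loop. The function names, the polling interval, and the attempt limit are assumptions, and the response body is treated as opaque text since its shape is not described here.

```python
import time
import urllib.error
import urllib.request


def http_fetch(url):
    """Return (status_code, body_text) for a GET request."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status, resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; report the code instead.
        return err.code, None


def wait_for_document_addition(fetch, interval=2.0, max_attempts=30):
    """Poll `fetch` until the task finishes; `fetch` returns (status, body)."""
    for _ in range(max_attempts):
        status, body = fetch()
        if status == 200:  # task finished and returned a result
            return body
        if status == 404:  # no task exists for this process
            raise LookupError("no document-addition task for this process")
        if status != 202:  # anything other than "still running"
            raise RuntimeError(f"unexpected status code: {status}")
        time.sleep(interval)
    raise TimeoutError("document addition did not finish in time")
```

A call could look like `wait_for_document_addition(lambda: http_fetch("http://localhost:9010/graph/document/add/status?process=my_corpus"))`. Passing the fetch function in as an argument keeps the loop testable without a running server.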
The path exists internally but document deletion is not implemented.
Returns all known stored processes.
curl "http://localhost:9010/processes"
Returns the status of a specific process.
curl "http://localhost:9010/status?process=my_corpus"
Requests that a running process be stopped.
curl "http://localhost:9010/processes/my_corpus/stop"
Optional query parameter:
| Name | Type | Default | Description |
|---|---|---|---|
| hard_stop | boolean | false | If false, attempts a graceful stop. |
Deletes a process from the in-memory cache and removes serialized artifacts for finished steps.
curl -X DELETE "http://localhost:9010/processes/my_corpus/delete"
Optional query parameter:
| Name | Type | Default | Description |
|---|---|---|---|
| hard_stop | boolean | false | If the process is running, stop it before deletion. |
Checks whether a configured document server is reachable and contains data.
JSON example:
curl -X POST "http://localhost:9010/status/document-server" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "http://localhost",
    "port": 9008,
    "index": "documents",
    "size": 30
  }'
Multipart example:
curl -X POST "http://localhost:9010/status/document-server" \
  -F document_server_config=@"./document-server-config.yaml"
A typical document server configuration contains:
url: "http://localhost"
port: 9008
index: "documents"
size: 30
other_id: "id"
label_key: "label"
replace_keys:
  text: content
The API can initialize one active RAG component for a process and answer questions over retrieved document chunks.
Initializes the RAG component.
curl -X POST "http://localhost:9010/rag/init?process=my_corpus&force=false" \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "",
    "language": "en",
    "vectorstore_server": {
      "url": "http://localhost",
      "port": 8882
    },
    "chatter": {
      "chatter": "src.rag.chatters.BlabladorChatter.BlabladorChatter"
    },
    "prompt_template": {
      "templates": {
        "en": "Answer the question using the context: {context}\nQuestion: {question}"
      },
      "input_variables": ["context", "question"]
    }
  }'
Query parameters:
| Name | Type | Default | Description |
|---|---|---|---|
| process | string | default | Process name. |
| force | boolean | false | Reinitialize/refill the vector-store index even if it already exists. |
If the backing chunk vector store is empty, initialization starts a background task to fill it. The RAG component becomes ready after initialization completes.
Checks whether the active RAG component is ready for a process.
curl "http://localhost:9010/status/rag?process=my_corpus"
Asks a question using the active RAG component.
curl "http://localhost:9010/rag/question?process=my_corpus&q=What%20is%20this%20corpus%20about%3F"
Asks a question and optionally restricts retrieval to selected document ids.
curl -X POST "http://localhost:9010/rag/question?process=my_corpus&q=What%20does%20document%20A%20say%3F" \
  -H "Content-Type: application/json" \
  -d '{
    "doc_ids": ["doc-001", "doc-002"],
    "limit": 15
  }'
Request body fields:
| Field | Type | Default | Description |
|---|---|---|---|
| doc_ids | array of strings | [] | Optional list of document ids to restrict retrieval. |
| limit | integer | 15 | Maximum number of document chunks to retrieve. |
A successful response contains:
{
  "answer": "...",
  "info": "..."
}
info contains serialized metadata for the retrieved reference documents.
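The POST variant can also be issued from Python. The sketch below only constructs the request object with the standard library, so it can be inspected without a running server. The function name `rag_question_request` is hypothetical, and the field defaults follow the request body table above.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request


def rag_question_request(question, process="default", doc_ids=None, limit=15,
                         base_url="http://localhost:9010"):
    """Build a POST request for /rag/question without sending it."""
    # quote_via=quote encodes spaces as %20, matching the curl examples.
    query = urlencode({"process": process, "q": question}, quote_via=quote)
    body = json.dumps({"doc_ids": doc_ids or [], "limit": limit}).encode("utf-8")
    return Request(
        f"{base_url}/rag/question?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = rag_question_request("What is this corpus about?", process="my_corpus",
                               doc_ids=["doc-001"])
    print(req.get_method(), req.full_url)
```

Sending the request with `urllib.request.urlopen(req)` and decoding the body with `json.loads` should yield the answer and info fields shown above, provided the RAG component is ready.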
The recommended way to retrieve a complete current configuration is:
curl "http://localhost:9010/pipeline/configuration?default=true&language=en"
The sections below show the main configuration concepts.
{
  "name": "default",
  "language": "en",
  "document_server": {
    "url": "http://localhost",
    "port": 9008,
    "index": "documents",
    "size": 30,
    "other_id": "id",
    "label_key": "label",
    "replace_keys": {
      "text": "content"
    }
  },
  "vectorstore_server": {
    "url": "http://localhost",
    "port": 8882
  },
  "config": {
    "data": {
      "spacy_model": "en_core_web_trf",
      "n_process": 1,
      "file_extension": "txt",
      "file_encoding": "utf-8",
      "use_lemma": false,
      "prepend_head": false,
      "head_only": false,
      "case_sensitive": false,
      "disable": null,
      "tfidf_filter": {
        "enabled": false,
        "min_df": 1,
        "max_df": 1,
        "stop": null
      },
      "negspacy": {
        "enabled": true,
        "configuration": {
          "scope": 1,
          "language": "en",
          "feat_of_interest": "NC"
        }
      }
    },
    "embedding": {
      "model": "sentence-transformers/paraphrase-albert-small-v2",
      "n_process": 1,
      "storage": {
        "method": "vectorstore",
        "config": {
          "normalizeEmbeddings": false,
          "annParameters": {
            "spaceType": "dotproduct",
            "parameters": {
              "efConstruction": 1024,
              "m": 16
            }
          }
        }
      }
    },
    "clustering": {
      "algorithm": "kmeans",
      "downscale": "umap",
      "missing_as_recommended": true,
      "deduction": {
        "enabled": true,
        "k_min": 2,
        "k_max": 100,
        "n_samples": 15,
        "sample_fraction": 25,
        "regression_poly_degree": 5
      },
      "scaling": {
        "n_neighbors": 10,
        "n_components": 100,
        "min_dist": 0.1
      },
      "clustering": {}
    },
    "graph": {
      "cluster": {
        "distance": 0.7,
        "min_size": 4
      },
      "graph": {
        "cosine_weight": 0.6,
        "merge_threshold": 0.9,
        "graph_weight_cut_off": 0.6,
        "unroll": false,
        "simplify": 0.5,
        "simplify_alg": "significance",
        "sub_clustering": false
      },
      "restrict_to_cluster": true
    }
  }
}

url: "http://localhost"
port: 9008
index: "documents"
size: 30
other_id: "id"
label_key: "label"
replace_keys:
  text: content

url: "http://localhost"
port: 8882

Pipeline artifacts are stored below the configured file storage directory, which defaults to:
tmp/
Each process has its own subdirectory.
Depending on configuration and reachable external services, embeddings and integration data may be stored either through the configured vector store or serialized locally.
Start a full pipeline from an uploaded zip file:
curl -X POST "http://localhost:9010/pipeline?process=my_corpus&language=en&return_statistics=false" \
  -F data=@"./documents.zip"
Check progress:
curl "http://localhost:9010/status?process=my_corpus"
Inspect concepts:
curl "http://localhost:9010/clustering/concepts?process=my_corpus&top_k=10"
Inspect graph statistics:
curl "http://localhost:9010/graph/statistics?process=my_corpus"
Fetch a graph:
curl "http://localhost:9010/graph/0?process=my_corpus"
Initialize RAG:
curl -X POST "http://localhost:9010/rag/init?process=my_corpus" \
  -H "Content-Type: application/json" \
  -d @rag-config.json
Ask a question:
curl "http://localhost:9010/rag/question?process=my_corpus&q=Summarize%20the%20main%20topics."
- Pipeline execution can take a long time for large corpora.
- Some operations run asynchronously. Use /status, /processes, /graph/document/add/status, and /status/rag to inspect progress.
- Only one active RAG component is held by the application at a time.
- Document deletion from graphs is not implemented.
- Graph quality depends heavily on corpus size, extracted phrase quality, embeddings, and clustering settings.
- Very small corpora may not produce useful concept clusters or graphs.
[1] Matthies, F. et al. Concept Graphs: A Novel Approach for Textual Analysis of Medical Documents. In: Röhrig, R. et al., editors. Studies in Health Technology and Informatics. IOS Press; 2023. Available from: https://ebooks.iospress.nl/doi/10.3233/SHTI230710
[2] NetworkX: https://networkx.org/
[3] Dianati, N. Unwinding the hairball graph: Pruning algorithms for weighted complex networks. Physical Review E. 2016;93(1). Available from: https://link.aps.org/doi/10.1103/PhysRevE.93.012304