Concept Graphs

Concept Graphs is a Flask-based API for building, storing, inspecting, and extending concept graphs from document corpora.

The application processes text documents through a pipeline:

Preprocessing: extracts noun chunks / phrases from documents.
Embedding: embeds extracted phrases into a vector space.
Clustering: groups related phrases into concept clusters.
Graph creation: creates one graph per concept cluster.
Optional integration: stores phrase/document information in an external vector store.
Optional RAG: initializes a retrieval-augmented generation component over processed documents.

The API also exposes endpoints to inspect process status, retrieve graph data, add documents to existing graphs, and ask questions via RAG.

The implementation is based on the Concept Graphs approach described in the references below [1].

Requirements

The project uses Python 3.11 and uv.

Install dependencies with:

uv sync

Running locally

Start the API directly:

uv run python main.py

By default, the Flask application listens on:

http://localhost:9010

The OpenAPI UI is available at:

http://localhost:9010/
http://localhost:9010/openapi

Docker

Build and start the services:

docker compose build
docker compose up -d

Depending on the compose configuration, the API is usually exposed at:

http://localhost:9007

Generated results are written to the configured storage directory inside the container, typically mounted from the results Docker volume.

Data model and processes

Most API operations are associated with a process name. A process represents one corpus and its generated pipeline artifacts.

If no process is supplied, the API uses:

default

Use the process query parameter to select a process:

curl "http://localhost:9010/status?process=my_corpus"

Process names are normalized by the server.

OpenAPI specification

The OpenAPI specification is served by the application UI and is defined in:

api/concept-graphs-api.yml

Configured server URLs include:

http://top-prod:9007
http://localhost:9007
http://localhost:9010

Pipeline

`POST /pipeline`

Starts a complete concept-graph pipeline.

The endpoint accepts either:

multipart/form-data
application/json

A pipeline can read documents from:

an uploaded zip file, or
an external document server.

If a vector store is configured and reachable, the API can additionally run an integration step. If the vector store is not reachable, the API falls back to pickle-based storage where possible.

Query parameters

Name	Type	Default	Description
`process`	string	`default`	Name of the corpus/process.
`language`	string	`en`	Language of the documents. Common values are `en` and `de`.
`skip_present`	boolean	`true`	Skip already completed serialized steps.
`skip_steps`	string		Comma-separated list of steps to skip. Supported values include `data`, `embedding`, `clustering`, `graph`, and `integration`.
`return_statistics`	boolean	`false`	If `true`, waits for the pipeline to finish and returns graph statistics. This may take a long time.

Multipart request

curl -X POST "http://localhost:9010/pipeline?process=my_corpus&language=en&skip_present=true" \
  -F data=@"./documents.zip" \
  -F data_config=@"./data-config.yaml" \
  -F embedding_config=@"./embedding-config.yaml" \
  -F clustering_config=@"./clustering-config.yaml" \
  -F graph_config=@"./graph-config.yaml"

Supported multipart fields:

Field	Required	Description
`data`	conditionally	Zip file containing input text documents. Required unless `document_server_config` is provided.
`document_server_config`	conditionally	YAML config for loading documents from an external document server. Required unless `data` is provided.
`vectorstore_server_config`	no	YAML config for an external vector store.
`labels`	no	YAML file mapping document names/ids to labels.
`data_config`	no	Preprocessing configuration.
`embedding_config`	no	Embedding configuration.
`clustering_config`	no	Clustering configuration.
`graph_config`	no	Graph creation configuration.

JSON request

curl -X POST "http://localhost:9010/pipeline?process=my_corpus&language=en" \
  -H "Content-Type: application/json" \
  -d @pipeline-config.json

A JSON pipeline configuration may contain:

{
  "name": "my_corpus",
  "language": "en",
  "document_server": {
    "url": "http://localhost",
    "port": 9008,
    "index": "documents",
    "size": 30,
    "other_id": "id",
    "label_key": "label",
    "replace_keys": {
      "text": "content"
    }
  },
  "vectorstore_server": {
    "url": "http://localhost",
    "port": 8882
  },
  "config": {
    "data": {},
    "embedding": {},
    "clustering": {},
    "graph": {}
  }
}

Response

If return_statistics=false, the endpoint starts the pipeline asynchronously and returns 202 Accepted with the current process status.

If return_statistics=true, the endpoint waits for the pipeline thread to finish and returns graph statistics.

Pipeline configuration

`GET /pipeline/configuration`

Returns either a default pipeline configuration or a stored configuration for a process.

curl "http://localhost:9010/pipeline/configuration?default=true&language=en"

Query parameters:

Name	Type	Default	Description
`default`	boolean	`true`	If `true`, returns the default configuration for the selected language.
`process`	string	`default`	Process name, used when `default=false`.
`language`	string	`en`	Language for the default configuration.

Examples:

curl "http://localhost:9010/pipeline/configuration?default=true&language=en"
curl "http://localhost:9010/pipeline/configuration?default=false&process=my_corpus"

Preprocessing inspection

The standalone preprocessing creation endpoint is no longer exposed. Preprocessing is run through /pipeline.

The following endpoints inspect stored preprocessing results.

`GET /preprocessing/statistics`

Returns basic statistics for a processed corpus.

curl "http://localhost:9010/preprocessing/statistics?process=my_corpus"

`GET /preprocessing/noun_chunks`

Returns extracted noun chunks / phrase chunks.

curl "http://localhost:9010/preprocessing/noun_chunks?process=my_corpus"

Embedding inspection

Embedding is run through /pipeline.

`GET /embedding/statistics`

Returns statistics for the stored embedding object.

curl "http://localhost:9010/embedding/statistics?process=my_corpus"

Clustering inspection

Clustering is run through /pipeline.

`GET /clustering/concepts`

Returns the concepts found during clustering.

curl "http://localhost:9010/clustering/concepts?process=my_corpus&top_k=15&distance=0.6"

Query parameters:

Name	Type	Default	Description
`process`	string	`default`	Process name.
`top_k`	integer	`15`	Number of representative phrases to return for each concept.
`distance`	number	`0.6`	Cosine distance threshold for representatives.

Graphs

Graph creation is run through /pipeline.

`GET /graph/statistics`

Returns basic graph statistics for a process.

curl "http://localhost:9010/graph/statistics?process=my_corpus"

`GET /graph/{graph_id}`

Returns nodes and adjacency information for a specific graph.

curl "http://localhost:9010/graph/0?process=my_corpus"

To request a rendered graph where supported:

curl "http://localhost:9010/graph/0?process=my_corpus&draw=true"

Query parameters:

Name	Type	Default	Description
`process`	string	`default`	Process name.
`draw`	boolean	`false`	If `true`, returns a rendered graph instead of JSON where supported.

Adding documents to existing graphs

`POST /graph/document/add`

Adds one or more documents to an existing process and integrates their phrases into the graphs built for that corpus.

The request body must be JSON.

curl -X POST "http://localhost:9010/graph/document/add?process=my_corpus" \
  -H "Content-Type: application/json" \
  -d '{
    "language": "en",
    "documents": [
      {
        "id": "doc-001",
        "name": "example.txt",
        "content": "The document text goes here.",
        "label": "optional-label"
      }
    ],
    "vectorstore_server": {
      "url": "http://localhost",
      "port": 8882
    }
  }'

Request fields:

Field	Type	Required	Description
`language`	string	yes	Document language.
`documents`	array	yes	Documents to add.
`documents[].id`	string	no	External document id.
`documents[].name`	string	yes	Document name.
`documents[].content`	string	yes	Document text.
`documents[].label`	string	no	Optional document label.
`vectorstore_server`	object	no	Vector store connection settings.
`document_server`	object	no	Reserved for document-server based additions.

The endpoint starts an asynchronous document-addition thread.

`GET /graph/document/add/status`

Returns the status or result of a document-addition task.

curl "http://localhost:9010/graph/document/add/status?process=my_corpus"

Possible responses include:

200 OK: task finished and returned a result
202 Accepted: task is still running
404 Not Found: no document-addition task exists for the process

`DELETE /graph/document/<DOCUMENT_ID>`

The path exists internally but document deletion is not implemented.

Process management

`GET /processes`

Returns all known stored processes.

curl "http://localhost:9010/processes"

`GET /status`

Returns the status of a specific process.

curl "http://localhost:9010/status?process=my_corpus"

`GET /processes/{process}/stop`

Requests that a running process be stopped.

curl "http://localhost:9010/processes/my_corpus/stop"

Optional query parameter:

Name	Type	Default	Description
`hard_stop`	boolean	`false`	If `false`, attempts a graceful stop.

`DELETE /processes/{process}/delete`

Deletes a process from the in-memory cache and removes serialized artifacts for finished steps.

curl -X DELETE "http://localhost:9010/processes/my_corpus/delete"

Optional query parameter:

Name	Type	Default	Description
`hard_stop`	boolean	`false`	If the process is running, stop it before deletion.

Document server status

`POST /status/document-server`

Checks whether a configured document server is reachable and contains data.

JSON example:

curl -X POST "http://localhost:9010/status/document-server" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "http://localhost",
    "port": 9008,
    "index": "documents",
    "size": 30
  }'

Multipart example:

curl -X POST "http://localhost:9010/status/document-server" \
  -F document_server_config=@"./document-server-config.yaml"

A typical document server configuration contains:

url: "http://localhost"
port: 9008
index: "documents"
size: 30
other_id: "id"
label_key: "label"
replace_keys:
  text: content

RAG

The API can initialize one active RAG component for a process and answer questions over retrieved document chunks.

`POST /rag/init`

Initializes the RAG component.

curl -X POST "http://localhost:9010/rag/init?process=my_corpus&force=false" \
  -H "Content-Type: application/json" \
  -d '{
    "api_key": "",
    "language": "en",
    "vectorstore_server": {
      "url": "http://localhost",
      "port": 8882
    },
    "chatter": {
      "chatter": "src.rag.chatters.BlabladorChatter.BlabladorChatter"
    },
    "prompt_template": {
      "templates": {
        "en": "Answer the question using the context: {context}\nQuestion: {question}"
      },
      "input_variables": ["context", "question"]
    }
  }'

Query parameters:

Name	Type	Default	Description
`process`	string	`default`	Process name.
`force`	boolean	`false`	Reinitialize/refill the vector-store index even if it already exists.

If the backing chunk vector store is empty, initialization starts a background task to fill it. The RAG component becomes ready after initialization completes.

`GET /status/rag`

Checks whether the active RAG component is ready for a process.

curl "http://localhost:9010/status/rag?process=my_corpus"

`GET /rag/question`

Asks a question using the active RAG component.

curl "http://localhost:9010/rag/question?process=my_corpus&q=What%20is%20this%20corpus%20about%3F"

`POST /rag/question`

Asks a question and optionally restricts retrieval to selected document ids.

curl -X POST "http://localhost:9010/rag/question?process=my_corpus&q=What%20does%20document%20A%20say%3F" \
  -H "Content-Type: application/json" \
  -d '{
    "doc_ids": ["doc-001", "doc-002"],
    "limit": 15
  }'

Request body fields:

Field	Type	Default	Description
`doc_ids`	array of strings	`[]`	Optional list of document ids to restrict retrieval.
`limit`	integer	`15`	Maximum number of document chunks to retrieve.

A successful response contains:

{
  "answer": "...",
  "info": "..."
}

info contains serialized metadata for the retrieved reference documents.

Example configuration files

The recommended way to retrieve a complete current configuration is:

curl "http://localhost:9010/pipeline/configuration?default=true&language=en"

The sections below show the main configuration concepts.

Pipeline JSON configuration

{
  "name": "default",
  "language": "en",
  "document_server": {
    "url": "http://localhost",
    "port": 9008,
    "index": "documents",
    "size": 30,
    "other_id": "id",
    "label_key": "label",
    "replace_keys": {
      "text": "content"
    }
  },
  "vectorstore_server": {
    "url": "http://localhost",
    "port": 8882
  },
  "config": {
    "data": {
      "spacy_model": "en_core_web_trf",
      "n_process": 1,
      "file_extension": "txt",
      "file_encoding": "utf-8",
      "use_lemma": false,
      "prepend_head": false,
      "head_only": false,
      "case_sensitive": false,
      "disable": null,
      "tfidf_filter": {
        "enabled": false,
        "min_df": 1,
        "max_df": 1,
        "stop": null
      },
      "negspacy": {
        "enabled": true,
        "configuration": {
          "scope": 1,
          "language": "en",
          "feat_of_interest": "NC"
        }
      }
    },
    "embedding": {
      "model": "sentence-transformers/paraphrase-albert-small-v2",
      "n_process": 1,
      "storage": {
        "method": "vectorstore",
        "config": {
          "normalizeEmbeddings": false,
          "annParameters": {
            "spaceType": "dotproduct",
            "parameters": {
              "efConstruction": 1024,
              "m": 16
            }
          }
        }
      }
    },
    "clustering": {
      "algorithm": "kmeans",
      "downscale": "umap",
      "missing_as_recommended": true,
      "deduction": {
        "enabled": true,
        "k_min": 2,
        "k_max": 100,
        "n_samples": 15,
        "sample_fraction": 25,
        "regression_poly_degree": 5
      },
      "scaling": {
        "n_neighbors": 10,
        "n_components": 100,
        "min_dist": 0.1
      },
      "clustering": {}
    },
    "graph": {
      "cluster": {
        "distance": 0.7,
        "min_size": 4
      },
      "graph": {
        "cosine_weight": 0.6,
        "merge_threshold": 0.9,
        "graph_weight_cut_off": 0.6,
        "unroll": false,
        "simplify": 0.5,
        "simplify_alg": "significance",
        "sub_clustering": false
      },
      "restrict_to_cluster": true
    }
  }
}

Document server YAML configuration

url: "http://localhost"
port: 9008
index: "documents"
size: 30
other_id: "id"
label_key: "label"
replace_keys:
  text: content

Vector store YAML configuration

url: "http://localhost"
port: 8882

Stored artifacts

Pipeline artifacts are stored below the configured file storage directory, which defaults to:

tmp/

Each process has its own subdirectory.

Depending on configuration and reachable external services, embeddings and integration data may be stored either through the configured vector store or serialized locally.

Common workflow

Start a full pipeline from an uploaded zip file:

curl -X POST "http://localhost:9010/pipeline?process=my_corpus&language=en&return_statistics=false" \
  -F data=@"./documents.zip"

Check progress:

curl "http://localhost:9010/status?process=my_corpus"

Inspect concepts:

curl "http://localhost:9010/clustering/concepts?process=my_corpus&top_k=10"

Inspect graph statistics:

curl "http://localhost:9010/graph/statistics?process=my_corpus"

Fetch a graph:

curl "http://localhost:9010/graph/0?process=my_corpus"

Initialize RAG:

curl -X POST "http://localhost:9010/rag/init?process=my_corpus" \
  -H "Content-Type: application/json" \
  -d @rag-config.json

Ask a question:

curl "http://localhost:9010/rag/question?process=my_corpus&q=Summarize%20the%20main%20topics."

Notes and limitations

Pipeline execution can take a long time for large corpora.
Some operations run asynchronously. Use /status, /processes, /graph/document/add/status, and /status/rag to inspect progress.
Only one active RAG component is held by the application at a time.
Document deletion from graphs is not implemented.
Graph quality depends heavily on corpus size, extracted phrase quality, embeddings, and clustering settings.
Very small corpora may not produce useful concept clusters or graphs.

References

[1] Matthies, F. et al. Concept Graphs: A Novel Approach for Textual Analysis of Medical Documents. In: Röhrig, R. et al., editors. Studies in Health Technology and Informatics. IOS Press; 2023. Available from: https://ebooks.iospress.nl/doi/10.3233/SHTI230710

[2] NetworkX: https://networkx.org/

[3] Dianati, N. Unwinding the hairball graph: Pruning algorithms for weighted complex networks. Physical Review E. 2016;93(1). Available from: https://link.aps.org/doi/10.1103/PhysRevE.93.012304

Name		Name	Last commit message	Last commit date
Latest commit History 121 Commits
api		api
conf		conf
src		src
templates		templates
.dockerignore		.dockerignore
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.bak.md		README.bak.md
README.md		README.md
VERSION		VERSION
clustering_util.py		clustering_util.py
docker-compose-local.yml		docker-compose-local.yml
docker-compose-network.yml		docker-compose-network.yml
docker-compose.yml		docker-compose.yml
document_dump.json		document_dump.json
download_models.py		download_models.py
elasticsearch.ipynb		elasticsearch.ipynb
embedding_util.py		embedding_util.py
graph_creation_util.py		graph_creation_util.py
integration_util.py		integration_util.py
load_utils.py		load_utils.py
main.py		main.py
main_methods.py		main_methods.py
main_utils.py		main_utils.py
marqo_test.ipynb		marqo_test.ipynb
preprocessing_util.py		preprocessing_util.py
pyproject.toml		pyproject.toml
test_document_clustering_on_corpus.py		test_document_clustering_on_corpus.py
test_main_utils.py		test_main_utils.py
top-fw-snomed.ipynb		top-fw-snomed.ipynb
top-prod-testing.ipynb		top-prod-testing.ipynb
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

Concept Graphs

Requirements

Running locally

Docker

Data model and processes

OpenAPI specification

Pipeline

POST /pipeline

Query parameters

Multipart request

JSON request

Response

Pipeline configuration

GET /pipeline/configuration

Preprocessing inspection

GET /preprocessing/statistics

GET /preprocessing/noun_chunks

Embedding inspection

GET /embedding/statistics

Clustering inspection

GET /clustering/concepts

Graphs

GET /graph/statistics

GET /graph/{graph_id}

Adding documents to existing graphs

POST /graph/document/add

GET /graph/document/add/status

DELETE /graph/document/<DOCUMENT_ID>

Process management

GET /processes

GET /status

GET /processes/{process}/stop

DELETE /processes/{process}/delete

Document server status

POST /status/document-server

RAG

POST /rag/init

GET /status/rag

GET /rag/question

POST /rag/question

Example configuration files

Pipeline JSON configuration

Document server YAML configuration

Vector store YAML configuration

Stored artifacts

Common workflow

Notes and limitations

References

About

Resources

License

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

`POST /pipeline`

`GET /pipeline/configuration`

`GET /preprocessing/statistics`

`GET /preprocessing/noun_chunks`

`GET /embedding/statistics`

`GET /clustering/concepts`

`GET /graph/statistics`

`GET /graph/{graph_id}`

`POST /graph/document/add`

`GET /graph/document/add/status`

`DELETE /graph/document/<DOCUMENT_ID>`

`GET /processes`

`GET /status`

`GET /processes/{process}/stop`

`DELETE /processes/{process}/delete`

`POST /status/document-server`

`POST /rag/init`

`GET /status/rag`

`GET /rag/question`

`POST /rag/question`

Packages