PROTEA

Protein annotation platform for large-scale GO term prediction, sequence embedding, and functional analysis.

PROTEA provides a unified backend for ingesting protein data from UniProt, computing protein language model embeddings (ESMC, ProstT5, ESM2), and predicting Gene Ontology terms via KNN transfer plus a learned LightGBM re-ranker — with a full job queue, REST API, and web interface.
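The retrieve-then-rerank idea can be sketched with a toy KNN transfer step. This is illustrative only: the function names and the max-similarity scoring rule are assumptions for the sketch, and PROTEA additionally re-scores these candidates with a learned LightGBM model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def knn_transfer(query_emb, reference, k=2):
    """Transfer GO terms from the k nearest annotated proteins.

    `reference` maps protein id -> (embedding, set of GO terms).
    Each candidate term is scored with the maximum similarity among
    the neighbours that carry it (a common transfer heuristic).
    """
    neighbours = sorted(
        ((cosine(query_emb, emb), terms) for emb, terms in reference.values()),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    scores = {}
    for sim, terms in neighbours:
        for term in terms:
            scores[term] = max(scores.get(term, 0.0), sim)
    return scores

# Tiny 2-d embeddings for illustration; real ones have hundreds of dims.
reference = {
    "P1": ([1.0, 0.0], {"GO:0003824"}),
    "P2": ([0.9, 0.1], {"GO:0003824", "GO:0005515"}),
    "P3": ([0.0, 1.0], {"GO:0016020"}),
}
preds = knn_transfer([1.0, 0.05], reference, k=2)
```

The scores produced here are exactly the kind of retrieval features a re-ranker can then combine with alignment and taxonomy signals.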


Live demo

https://protea.ngrok.app

Currently running on a personal research machine. Availability is best-effort — if it is unreachable, use the Docker setup below to run your own instance.


Why PROTEA?

PROTEA is the successor to PIS and FANTASIA, rebuilt around three goals:

  1. Clean architecture — infrastructure, orchestration, and domain logic are explicitly decoupled. Operations are pure domain logic; workers own sessions and queue state; routers expose HTTP. No more God-classes that mix everything.
  2. Learned re-ranking on top of KNN transfer — beyond classical embedding-KNN annotation, PROTEA trains LightGBM rerankers on temporal GOA splits (LambdaRank + CAFA IA weighting, per-tier NK/LK/PK models). Candidates retrieved by KNN are re-scored with alignment, taxonomy, and retrieval features.
  3. Honest temporal evaluation — benchmarking uses temporal holdout deltas between historical GOA releases (e.g. 220→229), evaluated with the official cafaeval library and information-accretion weighting, avoiding the optimistic leakage of random splits.
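The temporal-holdout idea in goal 3 boils down to evaluating only on annotations that appeared between two releases. A minimal sketch (data shapes are assumptions; real GOA deltas also track evidence codes and aspects):

```python
def temporal_holdout(annotations_t0, annotations_t1):
    """Evaluation targets: (protein, term) pairs newly annotated between
    two GOA releases. Terms already present at t0 are excluded, which is
    what prevents the train/test leakage of a random split.

    Both arguments map protein id -> set of GO terms.
    """
    targets = {}
    for protein, terms_t1 in annotations_t1.items():
        new_terms = terms_t1 - annotations_t0.get(protein, set())
        if new_terms:
            targets[protein] = new_terms
    return targets

goa_220 = {"P1": {"GO:0003824"}, "P2": {"GO:0005515"}}
goa_229 = {"P1": {"GO:0003824", "GO:0016301"}, "P3": {"GO:0016020"}}
delta = temporal_holdout(goa_220, goa_229)
# Only the kinase activity gained by P1 and the newly annotated P3 count.
```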

What PROTEA does

Capability Details
Protein ingestion Paginated UniProt REST API, MD5-deduplicated sequences
GO ontology Load OBO snapshots, full DAG stored per release
GO annotations Bulk import from GOA (GAF) and QuickGO (TSV)
Embeddings ESMC, ProstT5, and ESM2 backends via GPU workers; stored as pgvector VECTOR columns
GO prediction KNN transfer (FAISS IVFFlat / numpy) with optional NW/SW alignment and taxonomic features
Learning-to-rank LightGBM rerankers trained on temporal GOA splits — LambdaRank + IA weighting, per-tier NK/LK/PK models
CAFA evaluation Benchmark pipeline with cafaeval integration, Fmax + IA-weighted scoring, per-aspect (BPO/MFO/CCO) results
Job queue RabbitMQ-backed, 8 queues (ingestion, embeddings, predictions, training), full audit trail per job
REST API FastAPI routers for jobs, proteins, embeddings, query sets, scoring, evaluation, and admin
Web UI Next.js frontend with protein explorer, annotation viewer, prediction browser, and live job widget
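The MD5 deduplication mentioned in the ingestion row can be sketched in a few lines (illustrative; the record shape and field names are assumptions, not PROTEA's schema):

```python
import hashlib

def dedupe_sequences(records):
    """Keep one row per distinct amino-acid sequence, keyed by MD5
    (UniProt itself uses the sequence MD5 to identify identical
    sequences across accessions).

    `records` is an iterable of (accession, sequence) pairs.
    """
    by_md5 = {}
    for accession, seq in records:
        digest = hashlib.md5(seq.encode("ascii")).hexdigest()
        entry = by_md5.setdefault(digest, {"sequence": seq, "accessions": []})
        entry["accessions"].append(accession)
    return by_md5

# Two accessions share a sequence, so only two unique rows survive.
records = [("P0A7G6", "MKT"), ("Q9XYZ1", "MKT"), ("P12345", "MVLS")]
unique = dedupe_sequences(records)
```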

Getting started

Docker

Not yet validated. The Docker configuration exists but has not been tested end-to-end. It will likely need adjustments before it works out of the box — contributions welcome.

git clone https://github.com/frapercan/PROTEA.git
cd PROTEA
docker compose up

Services become available at the ports defined in docker-compose.yml.

From source (recommended)

Requirements: Python 3.12, PostgreSQL 16 + pgvector, RabbitMQ 3.x

git clone https://github.com/frapercan/PROTEA.git
cd PROTEA

poetry install

cp protea/config/system.yaml.example protea/config/system.yaml
# Edit system.yaml: set DB and AMQP URLs

poetry run python scripts/init_db.py
bash scripts/manage.sh start

5 minutes to your first job

With the stack running locally, you can submit a job and watch it move through the queue + worker + DB lifecycle in under 5 minutes.

# 1. Submit a `ping` job (the smoke-test operation).
JOB_ID=$(curl -s -X POST http://localhost:8000/jobs \
  -H 'content-type: application/json' \
  -d '{"operation": "ping", "queue_name": "protea.ping", "payload": {}}' \
  | jq -r '.id')
echo "queued: $JOB_ID"

# 2. Tail the structured-event log until the job reaches a terminal state.
curl -s "http://localhost:8000/jobs/$JOB_ID/events" | jq -c '.[]'
# {"event":"ping.start","fields":null,"level":"info","ts":"..."}
# {"event":"ping.done","fields":{"latency_ms":1.2},"level":"info","ts":"..."}

# 3. Check the final job row + result.
curl -s "http://localhost:8000/jobs/$JOB_ID" | jq '{status, result, error_code}'
# {"status":"succeeded","result":{"echo":"pong"},"error_code":null}

That round-trip exercises the full machinery: HTTP enqueue → AMQP publish → worker claim → operation execute → JobEvent stream → DB commit → REST query. Real operations (insert_proteins, load_goa_annotations, compute_embeddings, predict_go_terms) are submitted the same way; their payloads are documented at /docs (Swagger UI) and in the operation-catalog page of the Sphinx docs.
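That round-trip can be miniaturised in memory to show the moving parts (a deque standing in for the AMQP queue, a dict for the jobs table; every name here is hypothetical, not PROTEA's actual API):

```python
from collections import deque

class MiniJobSystem:
    """Toy stand-in for the enqueue -> claim -> execute -> event -> commit
    loop described above."""

    def __init__(self):
        self.queue = deque()   # plays the role of the AMQP queue
        self.jobs = {}         # plays the role of the jobs table

    def submit(self, operation, payload):
        """HTTP enqueue: persist a job row, then publish to the queue."""
        job_id = len(self.jobs) + 1
        self.jobs[job_id] = {"status": "queued", "events": [], "result": None}
        self.queue.append((job_id, operation, payload))
        return job_id

    def work_once(self, operations):
        """One worker iteration: claim, execute, stream events, commit."""
        job_id, op_name, payload = self.queue.popleft()      # worker claim
        job = self.jobs[job_id]
        job["status"] = "running"
        job["events"].append({"event": f"{op_name}.start"})
        job["result"] = operations[op_name](payload)         # operation execute
        job["events"].append({"event": f"{op_name}.done"})
        job["status"] = "succeeded"                          # commit

system = MiniJobSystem()
jid = system.submit("ping", {})
system.work_once({"ping": lambda payload: {"echo": "pong"}})
```

The separation mirrors the architecture goal above: the operation is a pure function of its payload, while the worker owns queue state and job bookkeeping.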

Discovering the installed plugins:

curl -s http://localhost:8000/backends | jq '.plugins[].name'
# "ankh", "esm", "esm3c", "t5"

curl -s http://localhost:8000/sources | jq '.plugins[].name'
# "goa", "quickgo", "uniprot"

curl -s http://localhost:8000/runners | jq '.plugins[].name'
# "baseline", "knn", "lightgbm"
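Plugin catalogues like these are commonly backed by a name-to-class registry. A generic sketch of the pattern (this is not PROTEA's actual mechanism, just an illustration of the idea):

```python
class Registry:
    """Minimal name -> class registry, the usual pattern behind
    discoverable backends/sources/runners."""

    def __init__(self):
        self._plugins = {}

    def register(self, name):
        """Decorator that records a class under `name`."""
        def decorator(cls):
            self._plugins[name] = cls
            return cls
        return decorator

    def names(self):
        return sorted(self._plugins)

    def create(self, name, **kwargs):
        return self._plugins[name](**kwargs)

runners = Registry()

@runners.register("knn")
class KnnRunner:
    def run(self):
        return "knn prediction"

@runners.register("baseline")
class BaselineRunner:
    def run(self):
        return "naive baseline"
```

An HTTP endpoint can then serve `runners.names()` directly, which is all a listing route like `/runners` needs.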

Documentation

Full documentation at https://protea.readthedocs.io

Topics covered: architecture, data model, operations, job lifecycle, deployment, how-to guides.


Contributing

Contributions from research institutions and individual developers are welcome. See CONTRIBUTING.md for the branching strategy and development workflow.

Requirements: Python 3.12, Docker (for integration tests)

poetry install
poetry run pytest              # unit tests
poetry run pytest --with-postgres  # integration tests
poetry run task lint           # ruff + flake8 + mypy

Stack

Component Technology
API FastAPI + SQLAlchemy 2.x + PostgreSQL 16 + pgvector
Queue RabbitMQ (pika)
Embeddings ESMC (ESM SDK), ProstT5 / prot_t5_xl (T5Encoder), ESM2 (Hugging Face Transformers)
KNN search FAISS IVFFlat / numpy (chunked brute-force)
Re-ranker LightGBM (LambdaRank, IA-weighted samples)
Frontend Next.js 19 + Tailwind v4
Deployment Docker Compose, scripts/manage.sh process supervisor

License

Released into the public domain under the Unlicense. You are free to copy, modify, publish, use, compile, sell, or distribute PROTEA for any purpose, commercial or non-commercial, without attribution.


Acknowledgements

PROTEA is the natural evolution of two prior systems developed at Ana Rojas' Lab (CBBIO), Andalusian Center for Developmental Biology (CSIC), in collaboration with Rosa Fernández's Lab (Metazoa Phylogenomics Lab, Institute of Evolutionary Biology, CSIC-UPF):

  • Protein Information System (PIS) — Large-scale protein data extraction and management from UniProt, PDB, and GOA. PROTEA adopts and extends PIS's data model and ingestion pipelines with a clean architecture designed for scalability and collaborative development.

  • FANTASIA — Functional annotation via protein language model embeddings and KNN transfer. PROTEA consolidates FANTASIA's prediction capabilities into a unified platform with a web interface, job queue, and REST API.

PROTEA was designed to unify and supersede both systems under a single, maintainable codebase — removing the tight coupling between infrastructure, orchestration, and domain logic that accumulated across those projects.

The evaluation pipeline and scoring methodology are directly informed by the CAFA (Critical Assessment of protein Function Annotation) competition series. This benchmarking framework shaped PROTEA's prediction and evaluation architecture, including the integration of cafaeval for standardised GO term prediction assessment.
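Fmax, the headline CAFA metric used here, is the maximum over score thresholds of the protein-averaged F1. A minimal unweighted sketch (the IA-weighted variant scales each term's contribution by its information accretion, and the real pipeline delegates to cafaeval):

```python
def fmax(predictions, truth, thresholds=None):
    """predictions: protein -> {term: score}; truth: protein -> set of terms.

    Following the CAFA convention, precision averages over proteins with
    at least one prediction at the threshold, while recall averages over
    all proteins with ground-truth annotations.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best = 0.0
    for tau in thresholds:
        precisions, recalls = [], []
        for protein, terms in truth.items():
            predicted = {t for t, s in predictions.get(protein, {}).items()
                         if s >= tau}
            if predicted:
                precisions.append(len(predicted & terms) / len(predicted))
            recalls.append(len(predicted & terms) / len(terms))
        if precisions:
            p = sum(precisions) / len(precisions)
            r = sum(recalls) / len(recalls)
            if p + r:
                best = max(best, 2 * p * r / (p + r))
    return best

truth = {"P1": {"GO:A", "GO:B"}}
preds = {"P1": {"GO:A": 0.9, "GO:C": 0.3}}
score = fmax(preds, truth)
# High thresholds drop the wrong low-confidence term GO:C, so the best
# operating point is precision 1.0, recall 0.5.
```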
