Entity Extractor is a web API service that scrapes the text body from a URL, then extracts entities and the accompanying sentences.
Click on the Google Cloud button below to deploy this container on a GCP account via Google Cloud Run.
- Click on Cloud Run button above
- Google Cloud will show a popup, asking you if you trust the repo. Tick Trust and select Confirm
- Google Cloud will then show another popup asking you to authorize GCP API calls with your credentials. Click Authorize
- Select project to deploy application
- Select region to deploy application
- Google Cloud Run will build the container image and push it to your project's registry. The image will then be deployed.
- Note down the web server URL provided by Google Cloud Run
- Clone repo: `git clone https://github.com/vincenttzc/entity-extractor.git`
- Build Docker image from the repo root: `docker build -t entity-extractor .`
- Run Docker container: `docker run --env-file config/.env.dev -p 8080:8080 entity-extractor`
When run on Cloud Run, the web server URL is the one provided by Google Cloud Run.
When run locally, the web server URL is http://127.0.0.1:8080/.
| Endpoint | Request Type | Description | Example request body | Example response body |
|---|---|---|---|---|
| /extract_entities | POST | Extracts the entities and sentences from the provided URL and inserts them into the database. Returns the unique entities extracted from the URL. | {"input_link": "https://en.wikipedia.org/wiki/Betta"} | {"entities": ["Betta", "United Nations"]} |
| /query_all_entities | GET | Queries all unique entities in the database | NA | {"entities": ["Betta", "United Nations"]} |
| /query_sentences | POST | Queries all sentences in the database that contain the specified entity | {"entity": "Betta"} | {"sentences": ["sentence 1 containing Betta", "sentence 2 containing Betta"]} |

Refer to openapi.json for more detailed information about the API.
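
The endpoints in the table can be called from any HTTP client. Below is a minimal Python sketch using only the standard library. It assumes the service is running locally at http://127.0.0.1:8080 and accepts JSON request bodies with a `Content-Type: application/json` header. The helper names `build_request` and `call_api` are illustrative and not part of the project.

```python
import json
from urllib import request

# Local URL from this README; replace with the Cloud Run URL when deployed.
BASE_URL = "http://127.0.0.1:8080"


def build_request(endpoint, payload=None, base_url=BASE_URL):
    """Build a GET request when payload is None, otherwise a JSON POST request."""
    url = base_url.rstrip("/") + endpoint
    if payload is None:
        return request.Request(url, method="GET")
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}  # assumed to be what the API expects
    return request.Request(url, data=body, headers=headers, method="POST")


def call_api(endpoint, payload=None, base_url=BASE_URL):
    """Send the request and decode the JSON response body."""
    with request.urlopen(build_request(endpoint, payload, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `call_api("/extract_entities", {"input_link": "https://en.wikipedia.org/wiki/Betta"})` should return a dictionary with an `entities` key, and `call_api("/query_sentences", {"entity": "Betta"})` one with a `sentences` key, matching the example response bodies in the table.
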
Environment variables can be added to the .env files in the config folder. This allows different environment variables to be used when running the Docker container for different environments.
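
As a rough illustration of how values passed with `--env-file` could be consumed inside the container, the sketch below reads them with fallbacks for local runs. The variable name `DATABASE_URL`, its default value and the function `load_settings` are hypothetical. `PORT` is used because Cloud Run sets it for the container, but whether this application reads it is an assumption.

```python
import os


def load_settings(env=os.environ):
    """Read settings from environment variables, falling back to local defaults."""
    return {
        # DATABASE_URL is a hypothetical variable name used for illustration.
        "database_url": env.get("DATABASE_URL", "sqlite:///local.db"),
        # Environment values are always strings, so convert the port explicitly.
        "port": int(env.get("PORT", "8080")),
    }
```
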
Application settings can be changed in config/config.yaml.
To change the database, data source type (e.g. URL or object store) or text format (e.g. HTML or CSV):
- Create a new class in the respective database/, datasource/ or datapipeline/ folder which inherits from the base class: `DatabaseType`, `SourceType` or `TextFormat`
- Import new class in `__init__.py`
- Add new class in `main.py`, add condition to make it configurable through `config.yaml`. Add new class as input to `Database`, `DataSource` or `DataPipeline`
- Modify `config.yaml` to use new class
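
The steps above can be sketched as follows for a new text format. This is only an illustration of the pattern: the `TextFormat` class here is a stand-in for the project's real base class, and the method `to_text`, the class `CsvFormat`, the `TEXT_FORMATS` registry and the `text_format` config key are all hypothetical names. The actual interface in the repository may differ.

```python
import csv
import io
from abc import ABC, abstractmethod


class TextFormat(ABC):
    """Stand-in for the project's base class; the real interface may differ."""

    @abstractmethod
    def to_text(self, raw: str) -> str:
        """Convert raw source content into plain text."""


class CsvFormat(TextFormat):
    """Hypothetical new format: joins all non-empty CSV cells into one text body."""

    def to_text(self, raw: str) -> str:
        rows = csv.reader(io.StringIO(raw))
        return " ".join(cell.strip() for row in rows for cell in row if cell.strip())


# Hypothetical registry mapping a config.yaml value to a class,
# playing the role of the condition added in main.py.
TEXT_FORMATS = {"csv": CsvFormat}


def make_text_format(config: dict) -> TextFormat:
    """Instantiate the class named under the (assumed) 'text_format' config key."""
    name = config["text_format"]
    try:
        return TEXT_FORMATS[name]()
    except KeyError:
        raise ValueError(f"Unsupported text_format: {name!r}") from None
```

Keeping the selection in one lookup means a further format only needs a new class and one extra registry entry, which mirrors the "add condition in `main.py`, then modify `config.yaml`" steps.
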
To test the files:
- Clone repo: `git clone https://github.com/vincenttzc/entity-extractor.git`
- At root directory, execute: `pytest`

