EuropePMC
Europe PMC is a platform for biomedical articles and associated information, such as entity and relation annotations. These annotations can be supplied by independent text-mining providers, which are required to upload them through a web interface in a specific format. This format is based on JSON lines and works without numerical offsets.
Europe PMC distinguishes two annotation types, each with a variant of the format: sentence-based and entity-based annotations.
bconv supports only entity annotations.
Europe PMC restricts the length of the JSON-lines files to 10k lines (i.e. 10k documents).
For larger collections, multiple jsonl files can be combined in a Zip archive, which the europepmc.zip format produces directly.
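The chunking described above can be sketched with the standard library alone. This is a minimal illustration, not bconv's actual implementation: the helper names `chunked` and `write_zip`, the member naming scheme and the placeholder records are all assumptions made for this example.

```python
import io
import json
import zipfile

MAX_LINES = 10_000  # per-file limit stated above


def chunked(records, size=MAX_LINES):
    """Yield consecutive lists of at most `size` records."""
    chunk = []
    for record in records:
        chunk.append(record)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def write_zip(records, stream, size=MAX_LINES):
    """Write records as JSON-lines members of a Zip archive.

    Returns the number of members written.
    """
    count = 0
    with zipfile.ZipFile(stream, "w", zipfile.ZIP_DEFLATED) as archive:
        for count, chunk in enumerate(chunked(records, size), start=1):
            # One JSON document per line, one line per record.
            lines = "".join(json.dumps(record) + "\n" for record in chunk)
            # Hypothetical member names, chosen for this sketch only.
            archive.writestr(f"part-{count:04d}.jsonl", lines)
    return count


# Placeholder records (empty annotation lists) just to exercise the chunking.
docs = (
    {"provider": "bconv", "src": "MED", "id": str(i), "anns": []}
    for i in range(25_000)
)
buffer = io.BytesIO()
members = write_zip(docs, buffer)  # 25,000 records need three members
```

Writing to an in-memory binary stream mirrors the binary stream type listed for europepmc.zip in the tables below; a file opened in `"wb"` mode would work the same way.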
{
"provider": "bconv",
"src": "MED",
"id": "354896",
"anns": [
{
"position": "1.2",
"prefix": "Lidocaine-induced ",
"exact": "cardiac asystole",
"postfix": ".\nIntravenous admini",
"section": "Title",
"type": "Disease",
"tags": [
{
"name": "Asystole, cardiac",
"uri": "D006323"
}
]
}
]
}

Note: The above example is pretty-printed for readability. However, in JSON-lines format, the entire document needs to be written on a single line.
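As a small illustration of the single-line requirement, the following sketch serializes the example document with Python's standard `json` module. It is not taken from bconv; it only shows that serializing without indentation yields one line, with the newline inside the postfix value escaped rather than written literally.

```python
import json

# The example document from above, as a Python dictionary.
document = {
    "provider": "bconv",
    "src": "MED",
    "id": "354896",
    "anns": [
        {
            "position": "1.2",
            "prefix": "Lidocaine-induced ",
            "exact": "cardiac asystole",
            "postfix": ".\nIntravenous admini",
            "section": "Title",
            "type": "Disease",
            "tags": [{"name": "Asystole, cardiac", "uri": "D006323"}],
        }
    ],
}

# Without `indent`, json.dumps emits no line breaks; the newline inside
# "postfix" is written as the two-character escape sequence \n.
line = json.dumps(document, ensure_ascii=False)

# A JSON-lines file is then just one such line per document.
jsonl = line + "\n"
```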
The official specification of the format is given in the Data Format section of the instructions for submitting annotations by Europe PMC.
- Document structure: The format represents only annotations, but not the text of the underlying document. However, the annotation for each entity references the type and number of the containing section.
- Metadata: Each document is defined through an identifier and a source abbreviation (e.g. "MED" for PubMed/Medline). In addition, Europe PMC requires the registered name of the annotation provider to be specified. The source and provider values can (must) be set through the src and provider options, respectively.
- Entity annotations: Every annotation is identified textually through the exact term as well as its left and right context. The length of the context is unspecified, but bconv uses up to 20 characters on each side, such that the context stays within the same sentence. In addition, a (preferred) name and URI need to be given.
- Whitespace: Whitespace inside annotated terms is retained unchanged.
- Offsets: The format makes no use of offsets. The position of the annotated terms is defined through the given textual context.
- Discontinuous spans: Europe PMC's JSON format does not support discontinuous spans. Entities with multiple spans are split into separate entities that are treated like separate annotations in Europe PMC's representation.
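
To make the offset-free addressing concrete, here is a minimal sketch of how a term's textual context could be built and later used to find the term again. The function names `make_annotation` and `locate` are hypothetical and not part of bconv, and the sketch omits the clipping of the context at sentence boundaries. The sample text is simply the prefix, exact term and postfix of the example above joined together.

```python
CONTEXT = 20  # context width per side, as described above


def make_annotation(text, start, end, width=CONTEXT):
    """Describe text[start:end] through its surrounding context."""
    return {
        "prefix": text[max(0, start - width):start],
        "exact": text[start:end],
        "postfix": text[end:end + width],
    }


def locate(text, ann):
    """Return the start offset of the annotated term, or -1 if not found."""
    needle = ann["prefix"] + ann["exact"] + ann["postfix"]
    hit = text.find(needle)
    return -1 if hit < 0 else hit + len(ann["prefix"])


text = "Lidocaine-induced cardiac asystole.\nIntravenous admini"
term = "cardiac asystole"
start = text.index(term)

ann = make_annotation(text, start, start + len(term))
recovered = locate(text, ann)  # same value as `start`
```

A consumer that only receives the three strings can recover the position in this way, which is why the format gets by without numerical offsets.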
| fmt | europepmc |
|---|---|
| supports text | no |
| supports annotations | yes |
| stream type | text |

| name | type | default | purpose |
|---|---|---|---|
| provider | str | – | registered provider name |
| src | str | 'MED' | source of the article text |
| meta | Tuple[str, str, str] | ('type', 'pref', 'uri') | keys in Entity.metadata |
| avoid_gaps | str | 'split' | suppress discontinuous spans |
| avoid_overlaps | str | None | suppress annotation collisions |

| fmt | europepmc.zip |
|---|---|
| supports text | no |
| supports annotations | yes |
| stream type | binary |

| name | type | default | purpose |
|---|---|---|---|
| provider | str | – | registered provider name |
| src | str | 'MED' | source of the article text |
| meta | Tuple[str, str, str] | ('type', 'pref', 'uri') | keys in Entity.metadata |
| avoid_gaps | str | 'split' | suppress discontinuous spans |
| avoid_overlaps | str | None | suppress annotation collisions |