EuropePMC

Europe PMC

Europe PMC is a platform for biomedical articles and associated information, such as entity and relation annotations. These annotations can be supplied by independent text-mining providers, which must upload them through a web interface in a specific format. This format is based on JSON Lines and locates annotations through textual context rather than numerical offsets.

Europe PMC distinguishes two annotation types, each with a variant of the format: sentence-based and entity-based annotations. bconv supports only entity annotations.

Europe PMC limits each JSON-lines file to 10,000 lines (i.e. 10,000 documents). For larger collections, multiple jsonl files can be combined in a Zip archive; the europepmc.zip format produces such an archive directly.
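The line limit and the Zip bundling can be sketched with the Python standard library alone. This is not bconv's implementation: the helper names `chunked` and `write_zip` and the member naming scheme are hypothetical, chosen only to illustrate how a large collection might be split into files of at most 10,000 lines each.

```python
import io
import json
import zipfile

MAX_LINES = 10_000  # per-file limit described above


def chunked(records, size=MAX_LINES):
    """Yield successive lists of at most `size` records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def write_zip(records, stream, stem="annotations"):
    """Write records as numbered JSON-lines members of a Zip archive.

    The member naming scheme (`<stem>-<n>.jsonl`) is an assumption.
    """
    with zipfile.ZipFile(stream, "w", zipfile.ZIP_DEFLATED) as archive:
        for n, batch in enumerate(chunked(records), start=1):
            # One document per line; json.dumps never emits raw newlines.
            lines = "".join(json.dumps(r) + "\n" for r in batch)
            archive.writestr(f"{stem}-{n}.jsonl", lines)


# Usage: 25,000 dummy documents end up in three member files.
docs = (
    {"provider": "bconv", "src": "MED", "id": str(i), "anns": []}
    for i in range(25_000)
)
buffer = io.BytesIO()
write_zip(docs, buffer)
```

Since `chunked` consumes its input lazily, only one batch is held in memory at a time, which matters for collections well beyond the limit.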

Example

{
  "provider": "bconv",
  "src": "MED",
  "id": "354896",
  "anns": [
    {
      "position": "1.2",
      "prefix": "Lidocaine-induced ",
      "exact": "cardiac asystole",
      "postfix": ".\nIntravenous admini",
      "section": "Title",
      "type": "Disease",
      "tags": [
        {
          "name": "Asystole, cardiac",
          "uri": "D006323"
        }
      ]
    }
  ]
}

Note: The above example is pretty-printed for readability. However, in JSON-lines format, the entire document needs to be written on a single line.
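As a minimal sketch of the single-line requirement, the record from the example can be serialised with Python's standard `json` module. Calling `json.dumps` without `indent` produces no line breaks, and the newline inside the `postfix` string is escaped, so the document occupies exactly one physical line.

```python
import json

record = {
    "provider": "bconv",
    "src": "MED",
    "id": "354896",
    "anns": [
        {
            "position": "1.2",
            "prefix": "Lidocaine-induced ",
            "exact": "cardiac asystole",
            "postfix": ".\nIntravenous admini",
            "section": "Title",
            "type": "Disease",
            "tags": [{"name": "Asystole, cardiac", "uri": "D006323"}],
        }
    ],
}

# Without `indent`, the output contains no raw newline characters;
# the "\n" in the postfix is written as the two characters \ and n.
line = json.dumps(record)

# Reading back: one json.loads call per line restores the document.
restored = json.loads(line)
```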

→ Full example

Sources

The official specification of the format is given in the Data Format section of Europe PMC's instructions for submitting annotations.

Notes

Document structure: The format represents only annotations, but not the text of the underlying document. However, the annotation for each entity references the type and number of the containing section.
Metadata: Each document is defined through an identifier and a source abbreviation (e.g. "MED" for PubMed/Medline). In addition, Europe PMC requires the registered name of the annotation provider to be specified. The source can be set through the src option (default: "MED"); the provider has no default and must be set through the provider option.
Entity annotations: Every annotation is identified textually through the exact term as well as its left and right context. The length of the context is unspecified, but bconv uses up to 20 characters on each side, such that the context stays within the same sentence. In addition, a (preferred) name and URI need to be given.
Whitespace: Whitespace inside annotated terms is retained unchanged.
Offsets: The format makes no use of offsets. The position of the annotated terms is defined through the given textual context.
Discontinuous spans: Europe PMC's JSON format does not support discontinuous spans. bconv splits entities with multiple spans into separate entities, each of which becomes an independent annotation in Europe PMC's representation.
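The offset-free addressing described in these notes can be illustrated as follows. This is a simplified sketch, not bconv's code: `make_context` and `locate` are hypothetical helpers, the 20-character width follows the note above, and no attempt is made to stop the context at sentence boundaries.

```python
def make_context(text, start, end, width=20):
    """Return (prefix, exact, postfix) for the span text[start:end]."""
    prefix = text[max(0, start - width):start]
    exact = text[start:end]
    postfix = text[end:end + width]
    return prefix, exact, postfix


def locate(text, prefix, exact, postfix):
    """Return the start offset of `exact` within its context, or -1."""
    hit = text.find(prefix + exact + postfix)
    return -1 if hit < 0 else hit + len(prefix)


text = (
    "Lidocaine-induced cardiac asystole.\n"
    "Intravenous administration of a single dose."
)
start = text.index("cardiac asystole")
end = start + len("cardiac asystole")

prefix, exact, postfix = make_context(text, start, end)
# prefix  == "Lidocaine-induced "
# exact   == "cardiac asystole"
# postfix == ".\nIntravenous admini"

offset = locate(text, prefix, exact, postfix)  # 18
```

The consumer of the format can recover the position by searching for the concatenated context, which is why the term alone is not enough when it occurs more than once in a document.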

Exporters

`EuPMCFormatter`

Properties

fmt	`europepmc`
supports text	no
supports annotations	yes
stream type	text

Options

name	type	default	purpose
provider	str	–	registered provider name
src	str	`'MED'`	source of the article text
meta	Tuple[str, str, str]	`('type', 'pref', 'uri')`	keys in `Entity.metadata`
avoid_gaps	str	`'split'`	suppress discontinuous spans
avoid_overlaps	str	`None`	suppress annotation collisions
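To illustrate what the meta option plausibly controls, the sketch below assumes that the three keys select, in order, the entity type, the preferred name, and the URI from an entity's metadata dictionary. The mapping is inferred from the example record above and is not taken from bconv's source; `build_annotation` is a hypothetical helper, and the `position` and `section` fields are left out for brevity.

```python
def build_annotation(context, metadata, meta=("type", "pref", "uri")):
    """Combine textual context and metadata into one annotation dict.

    `meta` names the metadata keys holding type, preferred name and URI
    (assumed order, mirroring the default shown in the options table).
    """
    type_key, name_key, uri_key = meta
    prefix, exact, postfix = context
    return {
        "prefix": prefix,
        "exact": exact,
        "postfix": postfix,
        "type": metadata[type_key],
        "tags": [{"name": metadata[name_key], "uri": metadata[uri_key]}],
    }


# Usage with the default keys.
annotation = build_annotation(
    ("Lidocaine-induced ", "cardiac asystole", ".\nIntravenous admini"),
    {"type": "Disease", "pref": "Asystole, cardiac", "uri": "D006323"},
)

# Usage with differently named metadata keys (hypothetical names).
custom = build_annotation(
    ("", "aspirin", " reduces fever"),
    {"category": "Chemical", "label": "Aspirin", "id": "D001241"},
    meta=("category", "label", "id"),
)
```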

`EuPMCZipFormatter`

Properties

fmt	`europepmc.zip`
supports text	no
supports annotations	yes
stream type	binary

Options

name	type	default	purpose
provider	str	–	registered provider name
src	str	`'MED'`	source of the article text
meta	Tuple[str, str, str]	`('type', 'pref', 'uri')`	keys in `Entity.metadata`
avoid_gaps	str	`'split'`	suppress discontinuous spans
avoid_overlaps	str	`None`	suppress annotation collisions
