
EuropePMC

Lenz Furrer edited this page May 24, 2021 · 7 revisions

Europe PMC

Europe PMC is a platform for biomedical articles and associated information, such as entity and relation annotations. Independent text-mining providers can supply these annotations, which they must upload through a web interface in a specific format. This format is based on JSON lines and works without numerical offsets.

Europe PMC distinguishes two annotation types, each with a variant of the format: sentence-based and entity-based annotations. bconv supports only entity annotations.

Europe PMC restricts the length of JSON-lines files to 10,000 lines (i.e. 10,000 documents). For larger collections, multiple JSON-lines files can be combined in a Zip archive, which the europepmc.zip format produces directly.
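
The chunking described above can be sketched with Python's standard library alone. This is a minimal illustration under stated assumptions, not bconv's implementation: the function name zip_jsonl and the member-file naming scheme are hypothetical, and the limit is taken from the paragraph above.

```python
import io
import zipfile

MAX_LINES = 10_000  # per-file limit stated by Europe PMC


def zip_jsonl(lines, max_lines=MAX_LINES):
    """Pack a list of JSON-lines records into a Zip archive (as bytes).

    Each archive member holds at most `max_lines` records.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for n, start in enumerate(range(0, len(lines), max_lines), 1):
            chunk = lines[start:start + max_lines]
            # Hypothetical member names; the real naming scheme may differ.
            archive.writestr(f"annotations-{n:03d}.jsonl", "\n".join(chunk) + "\n")
    return buffer.getvalue()
```

With bconv itself, selecting the europepmc.zip format should make this manual step unnecessary.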

Example

{
  "provider": "bconv",
  "src": "MED",
  "id": "354896",
  "anns": [
    {
      "position": "1.2",
      "prefix": "Lidocaine-induced ",
      "exact": "cardiac asystole",
      "postfix": ".\nIntravenous admini",
      "section": "Title",
      "type": "Disease",
      "tags": [
        {
          "name": "Asystole, cardiac",
          "uri": "D006323"
        }
      ]
    }
  ]
}

Note: The above example is pretty-printed for readability. However, in JSON-lines format, the entire document needs to be written on a single line.
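
As a rough illustration of the single-line requirement, the example document can be serialized with Python's json module. This sketch only shows the serialization step and is not taken from bconv; key order and escaping in bconv's actual output may differ.

```python
import json

document = {
    "provider": "bconv",
    "src": "MED",
    "id": "354896",
    "anns": [
        {
            "position": "1.2",
            "prefix": "Lidocaine-induced ",
            "exact": "cardiac asystole",
            "postfix": ".\nIntravenous admini",
            "section": "Title",
            "type": "Disease",
            "tags": [{"name": "Asystole, cardiac", "uri": "D006323"}],
        }
    ],
}

# Without `indent`, json.dumps emits no line breaks. The newline inside
# "postfix" is escaped as \n, so the record stays on one physical line.
line = json.dumps(document, ensure_ascii=False)
```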

Full example

Sources

The official specification of the format is given in the Data Format section of the instructions for submitting annotations by Europe PMC.

Notes

  • Document structure: The format represents only annotations, but not the text of the underlying document. However, the annotation for each entity references the type and number of the containing section.
  • Metadata: Each document is defined through an identifier and a source abbreviation (e.g. "MED" for PubMed/Medline). In addition, Europe PMC requires the registered name of the annotation provider to be specified. The source can be set through the src option (default: 'MED'); the provider must be set through the provider option.
  • Entity annotations: Every annotation is identified textually through the exact term as well as its left and right context. The length of the context is unspecified, but bconv uses up to 20 characters on each side, such that the context stays within the same sentence. In addition, a (preferred) name and URI need to be given.
  • Whitespace: Whitespace inside annotated terms is retained unchanged.
  • Offsets: The format makes no use of offsets. The position of the annotated terms is defined through the given textual context.
  • Discontinuous spans: Europe PMC's JSON format does not support discontinuous spans. Entities with multiple spans are split into separate entities that are treated like separate annotations in Europe PMC's representation.
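
The offset-free addressing described in the notes above can be sketched as follows. This is a simplified illustration, not bconv's code: the helper names make_annotation and locate are hypothetical, the context is clipped only by length (not by sentence boundaries), and the sample text is reassembled from the fields of the example above.

```python
def make_annotation(text, start, end, context=20):
    """Describe text[start:end] by its exact string and surrounding context."""
    return {
        "prefix": text[max(0, start - context):start],
        "exact": text[start:end],
        "postfix": text[end:end + context],
    }


def locate(text, ann):
    """Recover the (start, end) offsets of an annotation from its context.

    Returns None if the context does not occur in the text.
    """
    needle = ann["prefix"] + ann["exact"] + ann["postfix"]
    hit = text.find(needle)
    if hit < 0:
        return None
    start = hit + len(ann["prefix"])
    return start, start + len(ann["exact"])


# Sample text built from the prefix, exact and postfix of the example.
text = "Lidocaine-induced cardiac asystole.\nIntravenous admini"
ann = make_annotation(text, 18, 34)
offsets = locate(text, ann)
```

A consumer of the format has to perform a search of this kind, which is why the context must be long enough to identify the intended occurrence of a term.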

Exporters

EuPMCFormatter

Properties

fmt                   europepmc
supports text         no
supports annotations  yes
stream type           text

Options

name            type                  default                  purpose
provider        str                                            registered provider name
src             str                   'MED'                    source of the article text
meta            Tuple[str, str, str]  ('type', 'pref', 'uri')  keys in Entity.metadata
avoid_gaps      str                   'split'                  suppress discontinuous spans
avoid_overlaps  str                   None                     suppress annotation collisions

EuPMCZipFormatter

Properties

fmt                   europepmc.zip
supports text         no
supports annotations  yes
stream type           binary

Options

name            type                  default                  purpose
provider        str                                            registered provider name
src             str                   'MED'                    source of the article text
meta            Tuple[str, str, str]  ('type', 'pref', 'uri')  keys in Entity.metadata
avoid_gaps      str                   'split'                  suppress discontinuous spans
avoid_overlaps  str                   None                     suppress annotation collisions
