PubTator

Lenz Furrer edited this page May 24, 2021 · 8 revisions

PubTator is an NCBI-hosted online tool for manual annotation, which comes with its own file format.

The PubTator format is a compact, plain-text-based format for text and text-bound annotations. A limited amount of metadata and text structure is supported (document ID, title/abstract).

An unofficial variant (here called FBK) of the format exists, which uses different fields in the annotations.

Example

354896|t|Lidocaine-induced cardiac asystole.
354896|a|Intravenous administration of a single 50-mg bolus of lidocaine in a 67-year-old man ...
354896	0	9	Lidocaine	Chemical	D008012
354896	18	34	cardiac asystole	Disease	D006323
354896	90	99	lidocaine	Chemical	D008012
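
The example can be read with a few lines of standard-library Python. The following is a minimal sketch for illustration only, not bconv's parser: the function name parse_document and the tuple layout are hypothetical. It assumes that annotation fields are tab-separated and that offsets count across title and abstract, including one line break after the title, which is consistent with the offsets shown above.

```python
import re

# The example above, with tab characters written out explicitly.
EXAMPLE = (
    "354896|t|Lidocaine-induced cardiac asystole.\n"
    "354896|a|Intravenous administration of a single 50-mg bolus of "
    "lidocaine in a 67-year-old man ...\n"
    "354896\t0\t9\tLidocaine\tChemical\tD008012\n"
    "354896\t18\t34\tcardiac asystole\tDisease\tD006323\n"
    "354896\t90\t99\tlidocaine\tChemical\tD008012\n"
)

# A section line looks like "<docid>|t|<text>" or "<docid>|a|<text>".
SECTION = re.compile(r"^([^|]+)\|([ta])\|(.*)$")


def parse_document(block):
    """Split one PubTator document into its text and its annotation rows."""
    sections, annotations = [], []
    for line in block.splitlines():
        match = SECTION.match(line)
        if match:
            # Each section keeps its trailing line break, so that
            # offsets continue to count across title and abstract.
            sections.append(match.group(3) + "\n")
        elif line:
            fields = line.split("\t")
            docid, start, end, mention, type_ = fields[:5]
            cui = fields[5] if len(fields) > 5 else None  # optional field
            annotations.append((int(start), int(end), mention, type_, cui))
    return "".join(sections), annotations


text, annotations = parse_document(EXAMPLE)
for start, end, mention, type_, cui in annotations:
    # Every offset pair should slice the mention out of the text.
    assert text[start:end] == mention
```

Because Python strings are indexed by code point, slicing with the offsets directly matches the way offsets are counted in this format.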

Full example

Sources

The PubTator format is briefly described and illustrated on the PubTator tutorial page. Note: an enhanced version of the PubTator tool, PubTator Central, also supports full-text documents, but only exports them to BioC.

An authoritative source for the FBK variant of the format has yet to be found.

Notes

  • Document structure: PubTator was designed for abstracts, which consist of a title and an abstract body only. In bconv's implementation, longer documents can be serialised (and also parsed), but this is not recommended, as it is an unofficial extension of the format. PubTator supports collections by separating documents with a blank line.
  • Metadata: PubTator supports a document ID and a single-letter symbol indicating section type ("t" and "a" for "title" and "abstract", respectively).
  • Entity annotations: The official PubTator format uses the following six fields for representing entity annotations (the last one is optional): docid, start_offset, end_offset, mention, type, cui. The values for type and cui correspond to entries in the Entity.metadata attribute. The default keys for lookup in Entity.metadata are "type" and "cui", but this can be changed with the meta option.
    In the FBK variant, the entity fields are: docid, id, type, start_offset, end_offset, mention. Besides re-ordering the fields, the FBK variant does not include a concept identifier, but instead includes a mention ID (typically prefixed with "T"). The value of the type field corresponds to an entry in Entity.metadata with the default key "type", or whatever is specified through the meta option.
  • Whitespace: When serialising, line-break characters in the text are replaced with an equal number of space characters. Whitespace at the end of a section is removed, if present, and a single line-break character is added. During parsing, all whitespace at the end of a section (including the line break required by the format) is regarded as part of the document text.
  • Offsets: Character offsets are calculated in Unicode codepoints. During serialisation, if section-final whitespace is removed or added, the offsets are adjusted accordingly, such that they correspond to the accompanying text (but not necessarily to their original value in a different input format).
  • Discontinuous spans: PubTator supports only contiguous spans. When serialising to PubTator format, entities with multiple spans are subject to entity flattening. By default, sub-spans are split into separate entities that are treated like individual annotations in PubTator format.
  • Relations/events: PubTator supports binary relations between concepts at the document level. However, these cannot be trivially converted to and from the relations/events supported by other formats (BioC, Brat, PubAnnotation), which are defined between text-bound entities and other relations. Therefore, PubTator relations are currently not supported by bconv.
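
The whitespace and offset notes above can be illustrated with a small sketch. It is not taken from bconv's source: the function serialise_sections and its (text, start) input pairs are hypothetical, and it only shows the described behaviour under the assumption that each section is written as exactly one line.

```python
def serialise_sections(sections):
    """Normalise section texts and compute the offset shift per section.

    `sections` is a list of (text, start) pairs, where `start` is the
    offset of the section in the original document.
    """
    lines, shifts, position = [], [], 0
    for text, start in sections:
        # Section-final whitespace is dropped ...
        body = text.rstrip()
        # ... inner line breaks become spaces, one for one, so offsets
        # inside the section keep their relative position ...
        body = body.replace("\r", " ").replace("\n", " ")
        # ... and a single line break closes the section.
        line = body + "\n"
        lines.append(line)
        # Entities in this section move by (new start - original start).
        shifts.append(position - start)
        position += len(line)
    return lines, shifts


# Two sections; the first ends in extra whitespace (11 characters in all).
lines, shifts = serialise_sections([
    ("A title  \n\n", 0),
    ("Line one\nline two", 11),
])
output = "".join(lines)

# An entity "line two" at original offsets 20-28 moves by the second shift.
new_start, new_end = 20 + shifts[1], 28 + shifts[1]
assert output[new_start:new_end] == "line two"
```

The shift is constant within a section because line breaks are replaced one for one; only the removal or addition of section-final whitespace changes where later sections start.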

Loaders

PubTatorLoader

Properties

fmt                   pubtator
native type           Collection
lazy loading          yes
supports text         yes
supports annotations  yes
stream type           text

Options

name  type             default          purpose
meta  Tuple[str, str]  ('type', 'cui')  keys in Entity.metadata for the type and cui fields

PubTatorFBKLoader

Properties

fmt                   pubtator_fbk
native type           Collection
lazy loading          yes
supports text         yes
supports annotations  yes
stream type           text

Options

name  type  default  purpose
meta  str   'type'   key in Entity.metadata for the type field
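
As an illustration of the FBK field order and the meta option, the sketch below reads one annotation line into a plain dictionary. Several points are assumptions: the fields are taken to be tab-separated as in the official format, the sample line is made up by re-ordering the first annotation of the example above and adding a "T1" mention ID, and parse_fbk_line with its dictionary layout is hypothetical rather than bconv's Entity class.

```python
def parse_fbk_line(line, meta="type"):
    """Read one FBK-style annotation line (assumed to be tab-separated)."""
    # Field order: docid, id, type, start_offset, end_offset, mention.
    docid, mention_id, type_, start, end, mention = (
        line.rstrip("\n").split("\t")
    )
    return {
        "docid": docid,
        "id": mention_id,
        "start": int(start),
        "end": int(end),
        "text": mention,
        # The type value is stored under the key given by `meta`.
        "metadata": {meta: type_},
    }


# Hypothetical line, derived from the official example above.
sample = "354896\tT1\tChemical\t0\t9\tLidocaine\n"
entity = parse_fbk_line(sample)
renamed = parse_fbk_line(sample, meta="label")
```

With the default, the type ends up under the "type" key; passing a different meta value only changes the key, not the stored value.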

Exporters

PubTatorFormatter

Properties

fmt                   pubtator
supports text         yes
supports annotations  yes
stream type           text

Options

name            type             default          purpose
meta            Tuple[str, str]  ('type', 'cui')  keys in Entity.metadata for the type and cui fields
avoid_gaps      str              'split'          suppress discontinuous spans
avoid_overlaps  str              None             suppress annotation collisions
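
The default avoid_gaps behaviour ('split') can be pictured as follows. This is a rough sketch of the idea only, not bconv's flattening code: the helpers split_gaps and to_rows, the (spans, metadata) entity layout, and the identifiers "D1" and "X1" are all placeholders invented for the illustration.

```python
def split_gaps(entities):
    """Turn every sub-span of a multi-span entity into its own entity."""
    for spans, metadata in entities:
        for start, end in spans:
            yield (start, end), dict(metadata)


def to_rows(docid, text, entities, meta=("type", "cui")):
    """Format entities as annotation rows in the official field order."""
    rows = []
    for (start, end), metadata in split_gaps(entities):
        fields = [docid, str(start), str(end), text[start:end]]
        # The keys given in `meta` select the type and cui values.
        fields.extend(metadata.get(key, "") for key in meta)
        rows.append("\t".join(fields))
    return rows


text = "left and right ventricle"
# One discontinuous entity covering "left" and "ventricle".
entities = [([(0, 4), (15, 24)], {"type": "Anatomy", "cui": "X1"})]
rows = to_rows("D1", text, entities)
```

Each resulting row is a contiguous annotation that carries a copy of the original metadata, so the connection between the sub-spans is no longer visible in the output.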

PubTatorFBKFormatter

Properties

fmt                   pubtator_fbk
supports text         yes
supports annotations  yes
stream type           text

Options

name            type  default  purpose
meta            str   'type'   key in Entity.metadata for the type field
avoid_gaps      str   'split'  suppress discontinuous spans
avoid_overlaps  str   None     suppress annotation collisions
