PubTator
Lenz Furrer edited this page May 24, 2021
PubTator is an NCBI-hosted online tool for manual annotation which comes with its own format.
The PubTator format is a compact, plain-text-based format for text and text-bound annotations. A limited amount of metadata and text structure is supported (document ID, title/abstract).
An unofficial variant (here called FBK) of the format exists, which uses different fields in the annotations.

    354896|t|Lidocaine-induced cardiac asystole.
    354896|a|Intravenous administration of a single 50-mg bolus of lidocaine in a 67-year-old man ...
    354896	0	9	Lidocaine	Chemical	D008012
    354896	18	34	cardiac asystole	Disease	D006323
    354896	90	99	lidocaine	Chemical	D008012

The PubTator format is briefly described and illustrated on the PubTator tutorial page. Note: an enhanced version of the PubTator tool, PubTator Central, also supports full-text documents, but only exports them to BioC.
An authoritative source for the FBK variant of the format has yet to be found.
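
The layout illustrated above can be read with a few lines of plain Python. The following is a minimal sketch, not bconv's implementation: the names `Annotation`, `TEXT_LINE` and `parse_document` are made up for illustration, and the code assumes that annotation fields are separated by tab characters and that the title and abstract are joined by a single line break when resolving offsets.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple


class Annotation(NamedTuple):
    """One entity line of the official format (hypothetical helper type)."""
    docid: str
    start: int
    end: int
    mention: str
    type: str
    cui: Optional[str]  # the sixth field is optional


# Text lines look like "docid|t|..." (title) or "docid|a|..." (abstract).
TEXT_LINE = re.compile(r"([^|\t]+)\|([ta])\|(.*)")


def parse_document(block: str) -> Tuple[Dict[str, str], List[Annotation]]:
    """Parse a single document into its sections and entity annotations."""
    sections: Dict[str, str] = {}
    annotations: List[Annotation] = []
    for line in block.split("\n"):
        if not line.strip():
            continue  # skip empty lines (they separate documents in a collection)
        match = TEXT_LINE.fullmatch(line)
        if match:
            sections[match.group(2)] = match.group(3)
            continue
        fields = line.split("\t")
        if len(fields) not in (5, 6):
            # Relation lines and other content are out of scope for this sketch.
            raise ValueError(f"unexpected line: {line!r}")
        docid, start, end, mention, type_ = fields[:5]
        cui = fields[5] if len(fields) == 6 else None
        annotations.append(
            Annotation(docid, int(start), int(end), mention, type_, cui)
        )
    return sections, annotations


SAMPLE = (
    "354896|t|Lidocaine-induced cardiac asystole.\n"
    "354896|a|Intravenous administration of a single 50-mg bolus of "
    "lidocaine in a 67-year-old man ...\n"
    "354896\t0\t9\tLidocaine\tChemical\tD008012\n"
    "354896\t18\t34\tcardiac asystole\tDisease\tD006323\n"
    "354896\t90\t99\tlidocaine\tChemical\tD008012\n"
)

sections, annotations = parse_document(SAMPLE)
# Offsets run over title + line break + abstract.
text = sections["t"] + "\n" + sections["a"]
for anno in annotations:
    assert text[anno.start:anno.end] == anno.mention
```

The final loop checks that every offset pair selects exactly the mention string, which is a cheap sanity test for any PubTator file.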
- Document structure: PubTator was designed for abstracts, which consist of a title and an abstract body only. In bconv's implementation, longer documents can be serialised (and also parsed), but this is not recommended, as it is an unofficial extension of the format. PubTator supports collections by separating documents with a blank line.
- Metadata: PubTator supports a document ID and a single-letter symbol indicating section type ("t" and "a" for "title" and "abstract", respectively).
- Entity annotations: The official PubTator format uses the following six fields for representing entity annotations (the last one is optional): `docid`, `start_offset`, `end_offset`, `mention`, `type`, `cui`. The values for `type` and `cui` correspond to entries in the `Entity.metadata` attribute. The default keys for lookup in `Entity.metadata` are `"type"` and `"cui"`, but this can be changed with the `meta` option.
  In the FBK variant, the entity fields are: `docid`, `id`, `type`, `start_offset`, `end_offset`, `mention`. Besides re-ordering the fields, the FBK variant does not include a concept identifier, but instead includes a mention ID (typically prefixed with "T"). The value of the `type` field corresponds to an entry in `Entity.metadata` with the default key `"type"`, or whatever is specified through the `meta` option.
- Whitespace: When serialising, line-break characters in the text are replaced with an equal number of space characters. Whitespace at the end of a section is removed, if present, and a single line-break character is added. During parsing, all whitespace at the end of a section (including the line break required by the format) is regarded as part of the document text.
- Offsets: Character offsets are calculated in Unicode codepoints. During serialisation, if section-final whitespace is removed or added, the offsets are adjusted accordingly, such that they correspond to the accompanying text (but not necessarily to their original value in a different input format).
- Discontinuous spans: PubTator supports only contiguous spans. When serialising to PubTator format, entities with multiple spans are subject to entity flattening. By default, sub-spans are split into separate entities that are treated like individual annotations in PubTator format.
- Relations/events: PubTator supports binary relations between concepts at the document level. However, these cannot be trivially converted to and from the relations/events supported by other formats (BioC, Brat, PubAnnotation), which are defined between text-bound entities and other relations. Therefore, PubTator relations are currently not supported by bconv.
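
The whitespace and offset rules for serialisation can be illustrated with a small standalone sketch. This is one reading of the rules above, not bconv's actual code: `normalise_sections` is a hypothetical helper, and it assumes that no annotation covers the section-final whitespace that gets stripped. Python's `len()` counts Unicode codepoints, which matches the unit used for offsets.

```python
import re
from typing import List, Tuple


def normalise_sections(sections: List[str]) -> Tuple[List[str], List[int]]:
    """Normalise section texts and return the offset shift for each section.

    shifts[i] is the value to add to the offsets of any annotation located
    in section i so that it matches the normalised text.
    """
    normalised: List[str] = []
    shifts: List[int] = []
    shift = 0
    for text in sections:
        shifts.append(shift)
        # Replace each line break with one space (length-preserving), drop
        # section-final whitespace, then add exactly one line break.
        body = re.sub(r"[\r\n]", " ", text).rstrip() + "\n"
        normalised.append(body)
        # Later sections move by the accumulated change in length.
        shift += len(body) - len(text)
    return normalised, shifts


sections = ["A title  \n\n", "Body with\nline break"]
normalised, shifts = normalise_sections(sections)
serialised_text = "".join(normalised)

# "Body" sits at 11-15 in the original text and at 8-12 after normalisation.
old_start, old_end = 11, 15
new_start, new_end = old_start + shifts[1], old_end + shifts[1]
assert "".join(sections)[old_start:old_end] == "Body"
assert serialised_text[new_start:new_end] == "Body"
```

Because line breaks inside a section are replaced one-for-one, only changes at section ends move later offsets, so a single shift per section is enough.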
When loading, the two format variants have the following characteristics and options.

| fmt | `pubtator` |
|---|---|
| native type | `Collection` |
| lazy loading | yes |
| supports text | yes |
| supports annotations | yes |
| stream type | text |

| name | type | default | purpose |
|---|---|---|---|
| `meta` | `Tuple[str, str]` | `('type', 'cui')` | keys in `Entity.metadata` for the type and cui fields |

| fmt | `pubtator_fbk` |
|---|---|
| native type | `Collection` |
| lazy loading | yes |
| supports text | yes |
| supports annotations | yes |
| stream type | text |

| name | type | default | purpose |
|---|---|---|---|
| `meta` | `str` | `'type'` | key in `Entity.metadata` for the type field |

When serialising, the two format variants have the following characteristics and options.

| fmt | `pubtator` |
|---|---|
| supports text | yes |
| supports annotations | yes |
| stream type | text |

| name | type | default | purpose |
|---|---|---|---|
| `meta` | `Tuple[str, str]` | `('type', 'cui')` | keys in `Entity.metadata` for the type and cui fields |
| `avoid_gaps` | `str` | `'split'` | suppress discontinuous spans |
| `avoid_overlaps` | `str` | `None` | suppress annotation collisions |

| fmt | `pubtator_fbk` |
|---|---|
| supports text | yes |
| supports annotations | yes |
| stream type | text |

| name | type | default | purpose |
|---|---|---|---|
| `meta` | `str` | `'type'` | key in `Entity.metadata` for the type field |
| `avoid_gaps` | `str` | `'split'` | suppress discontinuous spans |
| `avoid_overlaps` | `str` | `None` | suppress annotation collisions |
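
To make the effect of the `meta` option and the default `'split'` treatment of discontinuous spans more concrete, here is a rough standalone sketch for the official field order. It does not use bconv itself: `SketchEntity` and `to_pubtator_lines` are hypothetical names, the entity data is invented, and writing an empty string for a missing type is an assumption of this sketch rather than documented behaviour.

```python
from typing import Dict, List, NamedTuple, Tuple


class SketchEntity(NamedTuple):
    """Stand-in for an annotation with one or more spans (hypothetical)."""
    spans: List[Tuple[int, int]]
    metadata: Dict[str, str]


def to_pubtator_lines(
    docid: str,
    text: str,
    entity: SketchEntity,
    meta: Tuple[str, str] = ("type", "cui"),
) -> List[str]:
    """Emit one annotation line per sub-span ("split" behaviour)."""
    type_key, cui_key = meta
    lines = []
    for start, end in sorted(entity.spans):
        fields = [
            docid,
            str(start),
            str(end),
            text[start:end],
            entity.metadata.get(type_key, ""),  # assumption: empty if missing
        ]
        cui = entity.metadata.get(cui_key)
        if cui is not None:
            fields.append(cui)  # the sixth field is optional
        lines.append("\t".join(fields))
    return lines


text = "left and right ventricle"
# "left ... ventricle" as a discontinuous entity with non-default metadata keys.
entity = SketchEntity([(0, 4), (15, 24)], {"category": "Anatomy", "id": "X1"})
lines = to_pubtator_lines("D1", text, entity, meta=("category", "id"))
for line in lines:
    print(line)
```

Each sub-span becomes its own contiguous annotation carrying the same type and identifier, so the link between the two parts of the original entity is no longer visible in the output.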