Interchange format

Lenz Furrer edited this page Aug 20, 2018 · 5 revisions

Internal document representation

This format is used for communication between modules.

Spec

The format is best explained by example:

{
 'docid': '8808605',
 'sections': [
   {
     'text': 'Somatic-cell selection is a major determinant...',
     'offset': 0,
     'mentions': [
       {
         'start': 154,
         'end': 171,
         'gaps': [],
         'text': 'enzyme deficiency',
         'type': 'DiseaseClass',
         'id': frozenset({'D008661'}),
       }
     ]
   },
   {
     'text': 'X-chromosome inactivation in mammals is regarded...',
     'offset': 173,
     'mentions': [
       {
         'start': 203,
         'end': 254,
         ...
       },
       {
         'start': 399,
         ...
       }
     ]
   }
 ]
}

This example is taken from the beginning of the development set of the NCBI disease corpus.
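
The structure of the example can also be written down as type annotations. The following is only a sketch inferred from the example above: the class names Mention, Section and Document are hypothetical, and the field types are assumptions based on the values shown.

```python
from typing import Collection, List, Tuple, TypedDict


class Mention(TypedDict):
    start: int                    # relative to the start of the section
    end: int
    gaps: List[Tuple[int, int]]   # empty for contiguous mentions
    text: str
    type: str
    id: Collection[str]           # hashable container, e.g. a frozenset


class Section(TypedDict):
    text: str
    offset: int                   # position of the section in the document
    mentions: List[Mention]


class Document(TypedDict):
    docid: str
    sections: List[Section]


# The first section of the example, annotated with the sketched types.
doc: Document = {
    'docid': '8808605',
    'sections': [
        {
            'text': 'Somatic-cell selection is a major determinant...',
            'offset': 0,
            'mentions': [
                {
                    'start': 154,
                    'end': 171,
                    'gaps': [],
                    'text': 'enzyme deficiency',
                    'type': 'DiseaseClass',
                    'id': frozenset({'D008661'}),
                },
            ],
        },
    ],
}
```

TypedDict annotations are not enforced at runtime; they only help static checkers and readers.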

Offsets

The start and end offsets of a mention are always relative to the start of its section. To obtain document-level offsets, add the section's offset to each of them.
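
As a minimal sketch of this conversion, using the offsets of the second section from the example in the Spec section (the helper name document_span is hypothetical, and the mention fields elided in that example are left out here):

```python
def document_span(section, mention):
    """Return the (start, end) offsets of a mention relative to the document."""
    return (section['offset'] + mention['start'],
            section['offset'] + mention['end'])


section = {
    'text': 'X-chromosome inactivation in mammals is regarded...',
    'offset': 173,
    'mentions': [{'start': 203, 'end': 254}],  # other fields omitted
}
print(document_span(section, section['mentions'][0]))  # (376, 427)
```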

For any given section sec, the following should hold for every mention without gaps:

for m in sec['mentions']:
    if not m['gaps']:  # the text of a mention with gaps is masked
        start, end = m['start'], m['end']
        assert sec['text'][start:end] == m['text']

IDs

The ID is embedded in a container rather than stored as a plain string, because it doesn't always have a single value (e.g. in the NCBI disease corpus, some IDs map to multiple preferred IDs in the MEDIC terminology). The structure must be a hashable container, such as a tuple or frozenset, or any custom hashable type that has a __contains__ method. In evaluation, the correctness of a predicted ID is determined through a membership test, i.e. prediction_id in reference_id.
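
The membership test can be sketched as follows. The function name is_correct and the non-matching ID are hypothetical; only the expression prediction_id in reference_id comes from the description above.

```python
def is_correct(prediction_id, reference_id):
    """Check a predicted ID against a reference ID container."""
    return prediction_id in reference_id


reference = frozenset({'D008661'})
assert is_correct('D008661', reference)
assert not is_correct('not-an-id', reference)  # made-up ID for illustration

# Tuples work the same way, and both are hashable,
# so a reference can also serve as a dictionary key.
assert is_correct('C0221755', ('C0221755',))
seen = {reference: 1, ('C0221755',): 1}
```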

Complex concepts

In the NCBI disease corpus, there are composite and multi-concept mentions, which have multiple IDs separated by "|" and "+", respectively. These IDs are parsed into a custom class that handles both separators.
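
The custom class itself is not shown on this page. The following is a purely hypothetical sketch of what such a class could look like: the name ComplexID, the placeholder IDs, and above all the matching rule (a prediction counts as correct if it equals any one of the atomic IDs) are assumptions, not the actual implementation.

```python
class ComplexID:
    """Hypothetical hashable container for IDs like 'A|B' or 'A+B'."""

    def __init__(self, raw):
        self.raw = raw
        # Split on "|" first, then split each part on "+".
        self.groups = frozenset(
            frozenset(part.split('+')) for part in raw.split('|'))

    def __contains__(self, predicted):
        # Assumed rule: matching any single atomic ID is enough.
        return any(predicted in group for group in self.groups)

    def __hash__(self):
        return hash(self.groups)

    def __eq__(self, other):
        return isinstance(other, ComplexID) and self.groups == other.groups


composite = ComplexID('ID1|ID2')   # placeholder IDs, not real concepts
multi = ComplexID('ID1+ID2')
print('ID1' in composite, 'ID3' in multi)  # True False
```

A stricter evaluation might require all components of a "+" group to be predicted; the sketch deliberately does not attempt that.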

Non-contiguous mentions (gaps)

In the ShARe/CLEF corpus, there are non-contiguous spans, i.e. the tokens of a mention are interleaved with other text. Rather than using multiple spans, this representation uses a single string spanning from the first to the last token, in which each stretch of interleaved text is replaced by the mask " […] " (the characters U+20 U+5B U+2026 U+5D U+20). The offsets of the gaps are given as a list of (start, end) pairs.

For example, the first report of the training corpus (00098-016139) contains the annotation

00098-016139-DISCHARGE_SUMMARY.txt||Disease_Disorder||C0221755||1141||1148||1192||1198

which corresponds to the first and last word of the sentence

Abdomen is soft, nontender, nondistended, negative bruits.

This is represented as follows:

     'mentions': [
       {
         'start': 1141,
         'end': 1198,
         'gaps': [(1148, 1192)],
         'text': 'abdomen […] bruits',
         'type': 'Disease_Disorder',
         'id': frozenset({'C0221755'}),
       }
     ]
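
The masking can be sketched as follows. The names GAP_MASK and masked_text are hypothetical, and the offsets in the usage example are relative to the sentence alone (the original offsets minus 1141), since the full report is not reproduced here.

```python
GAP_MASK = ' […] '  # U+20 U+5B U+2026 U+5D U+20


def masked_text(section_text, start, end, gaps):
    """Return the mention text with every gap replaced by the mask."""
    pieces = []
    pos = start
    for gap_start, gap_end in sorted(gaps):
        pieces.append(section_text[pos:gap_start])
        pos = gap_end
    pieces.append(section_text[pos:end])
    return GAP_MASK.join(pieces)


sentence = 'Abdomen is soft, nontender, nondistended, negative bruits.'
print(masked_text(sentence, 0, 57, [(7, 51)]))  # Abdomen […] bruits
```

The sketch copies characters verbatim, so capitalization follows the input text. With an empty gap list it simply returns the slice from start to end.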
