Interchange format
Used for inter-module communication.
The best explanation is always by example:
{
'docid': '8808605',
'sections': [
{
'text': 'Somatic-cell selection is a major determinant...',
'offset': 0,
'mentions': [
{
'start': 154,
'end': 171,
'gaps': [],
'text': 'enzyme deficiency',
'type': 'DiseaseClass',
'id': frozenset({'D008661'}),
}
]
},
{
'text': 'X-chromosome inactivation in mammals is regarded...',
'offset': 173,
'mentions': [
{
'start': 203,
'end': 254,
...
},
{
'start': 399,
...
}
]
}
]
}
This is the beginning of the dev set of the NCBI disease corpus.
The offsets of a mention are always relative to the start of the section. The document offset can be calculated by adding the section offset.
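As a minimal sketch of that calculation (the helper name `document_offsets` is hypothetical and not part of the format), section-relative mention offsets can be shifted by the section offset like this:

```python
def document_offsets(section):
    """Return (start, end) pairs relative to the start of the document.

    Mention offsets are stored relative to the section, so the section
    offset is added to both ends of every span.
    """
    shift = section['offset']
    return [(m['start'] + shift, m['end'] + shift)
            for m in section['mentions']]


# Offsets taken from the second section of the example above;
# the text and the remaining mention fields are left out for brevity.
section = {
    'offset': 173,
    'mentions': [{'start': 203, 'end': 254}],
}
print(document_offsets(section))  # [(376, 427)]
```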
For any given section sec, the following should hold:
m = sec['mentions'][0]
start, end = m['start'], m['end']
assert sec['text'][start:end] == m['text']
The ID is embedded in a container structure, because it doesn't always have a single value (e.g. in the NCBI disease corpus, some IDs map to multiple preferred IDs in the MEDIC terminology).
The structure must be a hashable container, such as a tuple or frozenset, or any custom hashable type that has a __contains__ method.
In evaluation, the correctness of a predicted ID is determined through a membership test, i.e. prediction_id in reference_id.
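The membership test can be illustrated with a short sketch. The function name `is_correct` and the ID `D000000` are made up for this example; only the `in` check itself comes from the description above.

```python
def is_correct(prediction_id, reference_id):
    """Check a predicted ID against the reference ID container."""
    return prediction_id in reference_id


# Reference ID as stored in the first mention of the example above.
reference = frozenset({'D008661'})

print(is_correct('D008661', reference))  # True
print(is_correct('D000000', reference))  # False

# A tuple works the same way, since it also supports `in`.
print(is_correct('D008661', ('D008661', 'D000000')))  # True
```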
In the NCBI disease corpus, there are composite and multi-concept mentions which have multiple IDs separated by "|" and "+", respectively. These are parsed into a custom class that handles the membership test for such IDs.
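The actual class is not shown here, so the following is only a rough sketch of what such a type could look like. The class name `CompoundID`, its attributes, and in particular the matching rule (a prediction counts as a member if it equals any component) are assumptions made for illustration; the real implementation may apply different rules for "|" and "+".

```python
class CompoundID:
    """Hypothetical hashable ID container for 'A|B' and 'A+B' annotations."""

    def __init__(self, raw):
        self.raw = raw
        if '|' in raw:
            self.kind, sep = 'composite', '|'
        elif '+' in raw:
            self.kind, sep = 'multi-concept', '+'
        else:
            self.kind, sep = 'single', None
        # With sep=None there is nothing to split, so keep the raw ID.
        self.parts = tuple(raw.split(sep)) if sep else (raw,)

    def __contains__(self, item):
        # Assumption: matching any single component is enough.
        return item in self.parts

    def __hash__(self):
        return hash(self.raw)

    def __eq__(self, other):
        return isinstance(other, CompoundID) and self.raw == other.raw


# 'D000001' and 'D000002' are placeholder IDs, not taken from the corpus.
ref = CompoundID('D000001|D000002')
print(ref.kind)              # composite
print('D000002' in ref)      # True
print('D000003' in ref)      # False
```

Because the class defines both `__hash__` and `__contains__`, it can be used wherever a frozenset is expected by the membership test.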
In the ShARe/CLEF corpus, there are non-contiguous spans, i.e. the tokens of a mention are interleaved with other text.
Rather than using multiple spans, this representation uses a single string spanning from the first to the last token, masking the interleaved tokens with […] (the characters U+20 U+5B U+2026 U+5D U+20).
The offsets of the gaps are given as a list of <start, end> pairs.
For example, the first report of the training corpus (00098-16139) contains an annotation
00098-016139-DISCHARGE_SUMMARY.txt||Disease_Disorder||C0221755||1141||1148||1192||1198
which corresponds to the first and last word of the sentence
Abdomen is soft, nontender, nondistended, negative bruits.
This is represented as follows:
'mentions': [
{
'start': 1141,
'end': 1198,
'gaps': [(1148, 1192)],
'text': 'abdomen […] bruits',
'type': 'Disease_Disorder',
'id': frozenset({'C0221755'}),
}
]
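The masking described above can be sketched as follows. The function `masked_text` and its `base` parameter are hypothetical names introduced for this example. The sketch also assumes that the quoted sentence starts at offset 1141, which is consistent with the annotated spans. Note that the example above shows the mention text in lowercase; this sketch leaves the case of the input untouched.

```python
GAP_MASK = ' [\u2026] '  # U+20 U+5B U+2026 U+5D U+20


def masked_text(text, start, end, gaps, base=0):
    """Build the mention text, replacing every gap with GAP_MASK.

    `base` is the offset at which `text` begins, so that the offsets
    from the annotation can be used without converting them first.
    """
    pieces = []
    pos = start
    for gap_start, gap_end in sorted(gaps):
        pieces.append(text[pos - base:gap_start - base])
        pos = gap_end
    pieces.append(text[pos - base:end - base])
    return GAP_MASK.join(pieces)


sentence = 'Abdomen is soft, nontender, nondistended, negative bruits.'
print(masked_text(sentence, 1141, 1198, [(1148, 1192)], base=1141))
# Abdomen […] bruits
```

With an empty `gaps` list the function simply returns the contiguous slice from `start` to `end`.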