exceptions while training with gold conll data #13

Hello,

I encountered two problems while trying to train a model with the gold-standard version of the CoNLL-2012 training set (the `*_gold_conll` files).

The first issue occurs while converting certain parse trees to dependencies: some nodes are erased during the conversion but are still accessed afterwards, when the sentence is renumbered:
```
  File "$HOME/.local/bin/cort-train", line 132, in <module>
    "r", "utf-8"))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 79, in from_file
    document_as_strings]))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 14, in from_string
    return documents.CoNLLDocument(string)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 401, in __init__
    [parse.replace("NOPARSE", "S") for parse in parses]#, include_erased=True
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.py", line 116, in convert_trees
    for ptb_tree in ptb_trees)
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.py", line 116, in <genexpr>
    for ptb_tree in ptb_trees)
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/JPypeBackend.py", line 141, in convert_tree
    sentence.renumber()
  File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/CoNLL.py", line 111, in renumber
    for token in self]
KeyError: 18
```
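This happens for several sentences in the training set (e.g., document bn/cnn/04/cnn_0432, the sentence on lines 272-296). One way to avoid the exception is to pass `include_erased=True` to `convert_trees` (the argument that is commented out at `documents.py` line 401 in the traceback above).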
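
For illustration, the workaround could be sketched as the small wrapper below. The function name `convert_parses` and its `keep_erased` parameter are hypothetical; the only things assumed about `converter` are that it exposes `convert_trees` and accepts the `include_erased` keyword, as the call in the traceback suggests. This is a sketch of the idea, not the actual cort code.

```python
def convert_parses(converter, parses, keep_erased=True):
    """Convert PTB-style parse strings to dependencies.

    `converter` is assumed to be a PyStanfordDependencies-style object
    with a convert_trees(trees, **kwargs) method (an assumption).
    """
    # Same NOPARSE -> S replacement as in the traceback above.
    trees = [parse.replace("NOPARSE", "S") for parse in parses]
    # Ask the converter to keep erased tokens, so that the later
    # renumbering step does not look up an index that was removed.
    return converter.convert_trees(trees, include_erased=keep_erased)
```

With this in place, the call at `documents.py` line 401 would become something like `convert_parses(sd, parses)`, where `sd` stands for whatever converter instance cort already uses there. Whether keeping erased tokens has side effects on the extracted heads is something I have not checked.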

The second issue is caused by a single sentence in the training set (document mz/sinorama/10/ectb_1005, lines 980-1012), for which `attributes["head"]` contains a `ParentedTree` instead of strings:
```
Traceback (most recent call last):
  File "$HOME/.local/bin/cort-train", line 132, in <module>
    "r", "utf-8"))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 79, in from_file
    document_as_strings]))
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 14, in from_string
    return documents.CoNLLDocument(string)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 414, in __init__
    super(CoNLLDocument, self).__init__(identifier, sentences, coref)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 97, in __init__
    self.annotated_mentions = self.__get_annotated_mentions()
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 111, in __get_annotated_mentions
    span, self, first_in_gold_entity=set_id not in seen
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/mentions.py", line 174, in from_document
    mention_property_computer.compute_gender(attributes)
  File "$HOME/.local/lib/python2.7/site-packages/cort/core/mention_property_computer.py", line 89, in compute_gender
    if __wordnet_lookup_gender(" ".join(attributes["head"])):
TypeError: sequence item 0: expected string, ParentedTree found
```
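
A possible defensive workaround for the `TypeError` above is to flatten any tree nodes in the head to their leaf tokens before joining. The helper `head_as_strings` below is hypothetical and not part of cort; it relies only on the `leaves()` method that NLTK trees provide. It hides the symptom rather than fixing whatever produces a tree as head for this sentence, so treat it as a sketch.

```python
def head_as_strings(head):
    """Return the head tokens as a flat list of plain tokens.

    Items that look like tree nodes (they expose a leaves() method, as
    NLTK trees do) are replaced by their leaf tokens; all other items
    are kept unchanged.
    """
    tokens = []
    for item in head:
        if hasattr(item, "leaves"):
            # Tree node: use the words it dominates.
            tokens.extend(item.leaves())
        else:
            # Already a plain token.
            tokens.append(item)
    return tokens
```

The failing line in `compute_gender` could then read `" ".join(head_as_strings(attributes["head"]))`.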

Both problems seem to be data-related, as neither occurs when using the `*_auto_conll` version of the CoNLL-2012 training data.

