Hello,
I encountered a few problems while trying to train a model with the gold-standard version of the CoNLL-2012 training set (*_gold_conll).
The first issue occurs during the conversion of certain trees, when some tree nodes are deleted but later accessed:
Traceback (most recent call last):
File "$HOME/.local/bin/cort-train", line 132, in <module>
"r", "utf-8"))
File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 79, in from_file
document_as_strings]))
File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 14, in from_string
return documents.CoNLLDocument(string)
File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 401, in __init__
[parse.replace("NOPARSE", "S") for parse in parses]#, include_erased=True
File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.py", line 116, in convert_trees
for ptb_tree in ptb_trees)
File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/StanfordDependencies.py", line 116, in <genexpr>
for ptb_tree in ptb_trees)
File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/JPypeBackend.py", line 141, in convert_tree
sentence.renumber()
File "$HOME/.local/lib/python2.7/site-packages/StanfordDependencies/CoNLL.py", line 111, in renumber
for token in self]
KeyError: 18
This happens for several sentences in the training set (e.g., document bn/cnn/04/cnn_0432, sentence on lines 272-296). One way to avoid the exception is to set include_erased=True.
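To illustrate the mechanism behind the KeyError (a minimal, hypothetical simplification, not the actual CoNLL.py code): renumbering assigns new indices only to surviving tokens, so a head reference to an erased node has no entry in the index mapping. Keeping the erased nodes, as include_erased=True does, keeps the mapping total:

```python
# Sketch of the renumbering failure; tokens are modeled as (index, head) pairs.
def renumber(tokens, erased=frozenset(), include_erased=False):
    if not include_erased:
        # Erased nodes are dropped from the token list ...
        tokens = [t for t in tokens if t[0] not in erased]
    # ... but surviving tokens may still name a dropped node as their head.
    mapping = {old: new for new, (old, _) in enumerate(tokens, start=1)}
    mapping[0] = 0  # the artificial root keeps index 0
    return [(mapping[idx], mapping[head]) for idx, head in tokens]

tokens = [(1, 0), (2, 3), (3, 1)]
renumber(tokens, erased={3}, include_erased=True)  # succeeds, tree unchanged
# renumber(tokens, erased={3}) raises KeyError: 3 -- token 2's head was erased
```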
The second issue is caused by one sentence in the training set (document mz/sinorama/10/ectb_1005, lines 980-1012):
Traceback (most recent call last):
File "$HOME/.local/bin/cort-train", line 132, in <module>
"r", "utf-8"))
File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 79, in from_file
document_as_strings]))
File "$HOME/.local/lib/python2.7/site-packages/cort/core/corpora.py", line 14, in from_string
return documents.CoNLLDocument(string)
File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 414, in __init__
super(CoNLLDocument, self).__init__(identifier, sentences, coref)
File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 97, in __init__
self.annotated_mentions = self.__get_annotated_mentions()
File "$HOME/.local/lib/python2.7/site-packages/cort/core/documents.py", line 111, in __get_annotated_mentions
span, self, first_in_gold_entity=set_id not in seen
File "$HOME/.local/lib/python2.7/site-packages/cort/core/mentions.py", line 174, in from_document
mention_property_computer.compute_gender(attributes)
File "$HOME/.local/lib/python2.7/site-packages/cort/core/mention_property_computer.py", line 89, in compute_gender
if __wordnet_lookup_gender(" ".join(attributes["head"])):
TypeError: sequence item 0: expected string, ParentedTree found
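A possible workaround (a sketch of my own, not cort's actual code) would be to coerce the head items to strings before joining, since for this sentence attributes["head"] apparently contains ParentedTree objects rather than plain strings:

```python
# Hypothetical helper: flatten a mixed list of strings and tree objects
# (anything exposing .leaves(), like nltk's ParentedTree) into one string,
# so " ".join() no longer trips over non-string items.
# (On Python 2.7, isinstance(item, basestring) would be the matching check.)
def head_as_string(head):
    return " ".join(
        item if isinstance(item, str) else " ".join(item.leaves())
        for item in head
    )

# Stand-in for a ParentedTree, used only to demonstrate the helper.
class FakeTree:
    def __init__(self, leaves):
        self._leaves = leaves
    def leaves(self):
        return self._leaves

head_as_string(["Taiwan", FakeTree(["Semiconductor"])])  # "Taiwan Semiconductor"
```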
The problems seem to be data-related, as none of them occur when using the *_auto_conll version of the CoNLL-2012 training data.