Release Notes

Version 3.2.0 (7/13/2015)

The semantic role labeler is added. See models and decoding for details about how to add the SRL model and perform semantic role labeling. Starting with this version, we use a slightly modified version of the semantic role label set, which merges some numbered argument labels with their equivalent modifier tags (see semantic role labels for more details).
The dictionary is updated to 3.2, which now includes derivation rules for English (see models).

Word embedding lexicons are removed from the global lexica; they added little accuracy but consumed a large amount of RAM. Furthermore, the gazetteers for named entity recognition are now separated from the global lexica for better modularity (see models for more details).
The core dictionary is updated; some past-tense verbs recognized as base verbs are now fixed.
The named entity recognition model is updated.
See pom.xml for all updated dependencies.

A new component for named entity recognition is added, which shows state-of-the-art accuracy on both CoNLL'03 and OntoNotes data (a paper describing our approach is under submission).
All statistical models are upgraded; the part-of-speech tagger and the dependency parser use features extracted from distributional semantics, which give more robust results on unseen data.
The dependency parser is trained on data from our new dependency conversion, which adapts many concepts from the universal dependency structures and introduces some useful new labels such as dative.

ClearNLP is now developed by the Center for Language and Information Research at Emory University.
Our Maven group ID is changed from com.clearnlp to edu.emory.clir.
All our repositories are moved from github.com/clearnlp to github.com/clir.
Version 3.0.0 is written from scratch. All components in this version are significantly faster than the previous ones (2-3 times), and the statistical models consume less disk and memory space.
Statistical models for general, medical, and bioinformatics domains are provided (see here; the medical and bioinformatics models will be uploaded by March 25th, 2015).
The tokenizer preserves non-UTF8 characters as they are; previously, they were converted to their UTF8 equivalent characters (e.g., smart double quotes to ").
The dependency parser is back to greedy parsing, which makes the model much smaller (about 18 times less disk space) and the parser much faster (about 10K tokens per second on an Intel Xeon CPU) without sacrificing much accuracy (about 0.5% lower).
This version does not include the semantic role labeler. There have been many changes in PropBank, and we decided to spend another month developing a new semantic role labeler. The semantic role labeler will be ready in May 2015.
We are preparing a named entity recognizer and a coreference resolution system. These systems will be ready in August 2015.
Better documentation is provided at our guidelines project, which gives more details about training, decoding, javadoc, etc.
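
The tokenizer change noted above (characters are now preserved as they are, where smart double quotes were previously converted to ") can be illustrated with a minimal sketch. This is not ClearNLP code: the class QuoteNormalizationSketch, its methods, and the four-character mapping are hypothetical and only contrast the two behaviors on a raw string.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names come from ClearNLP.
final class QuoteNormalizationSketch {
    // Assumed mapping from a few typographic quotes to plain equivalents.
    private static final Map<Character, Character> EQUIVALENTS = new HashMap<>();
    static {
        EQUIVALENTS.put('\u201C', '"');   // left double quotation mark
        EQUIVALENTS.put('\u201D', '"');   // right double quotation mark
        EQUIVALENTS.put('\u2018', '\'');  // left single quotation mark
        EQUIVALENTS.put('\u2019', '\'');  // right single quotation mark
    }

    // Earlier behavior as described in the notes: mapped characters are replaced.
    static String convert(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            char mapped = EQUIVALENTS.getOrDefault(c, c);
            sb.append(mapped);
        }
        return sb.toString();
    }

    // Behavior from 3.0.0 on, as described in the notes: text is left untouched.
    static String preserve(String text) {
        return text;
    }

    public static void main(String[] args) {
        String input = "\u201CHello\u201D";
        System.out.println(convert(input));   // quotes replaced by plain double quotes
        System.out.println(preserve(input));  // original characters kept
    }
}
```

One practical consequence of preserving characters is that token text matches the input exactly, so character offsets computed on the tokens can be mapped back to the source string without adjustment.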