The following command trains a model for part-of-speech tagging (pos), dependency parsing (dep), named entity recognition (ner), or semantic role labeling (srl).
java edu.emory.clir.clearnlp.bin.NLPTrain -mode <mode> -c <filename> -f <filename> -t <filepath> -d <filepath> [-m <filename> -te <string> -de <string>]
-mode <mode> : pos|dep|ner|srl
-c <filename> : configuration file (required)
-f <filename> : feature template files (required)
-m <filename> : model filename (optional)
-t <filepath> : training path (required)
-d <filepath> : development path (required)
-te <string> : training file extension (default: *)
-de <string> : development file extension (default: *)
- Sample configuration files can be found here (see below for more details).
- Sample feature template files can be found here (see feature template).
- If the model filename `-m` is specified, the final model is saved.
- The training or development path `-t|d` can point to either a file or a directory. When the path points to a file, only that file is processed. When the path points to a directory, all files under that directory with the file extension `-te|de` are processed.
- The training or development file extension `-te|de` specifies the extension of the training or development files. The default value `*` matches files with any extension. This option is used only when the training or development path `-t|d` points to a directory.
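The path and extension rules above can be sketched as a small shell helper. This is only an illustration of the described behavior, not ClearNLP's own code: the function name `list_inputs` is hypothetical, and two details are assumptions, namely that the extension is matched as a filename suffix and that subdirectories are not searched.

```shell
# Hypothetical helper: print the files that a path/extension pair would select.
# Usage: list_inputs <path> [<extension>]
list_inputs() {
    path=$1
    ext=${2:-*}    # default "*" means any extension

    if [ -f "$path" ]; then
        # The path is a file: only that file is used, the extension is ignored.
        printf '%s\n' "$path"
    elif [ -d "$path" ]; then
        if [ "$ext" = "*" ]; then
            # Any file directly under the directory (assumption: no recursion).
            find "$path" -maxdepth 1 -type f | sort
        else
            # Files whose names end with the given extension (assumption: suffix match).
            find "$path" -maxdepth 1 -type f -name "*$ext" | sort
        fi
    else
        echo "no such file or directory: $path" >&2
        return 1
    fi
}
```

For example, `list_inputs trn dep` would list the files in a (hypothetical) directory `trn` that a run with `-t trn -te dep` is expected to pick up, which can be a quick sanity check before starting a long training job.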
The following command takes the training file (wsj_0001.parse.dep), the development file (clearnlp.txt.cnlp), the configuration file (config_train_sample.xml), and the feature template file (feature_en_dep.xml), and generates a dependency parsing (dep) model, dummy-dep.xz.
$ java -Xmx10g -XX:+UseConcMarkSweepGC edu.emory.clir.clearnlp.bin.NLPTrain -mode dep -c config_train_sample.xml -f feature_en_dep.xml -t wsj_0001.parse.dep -d clearnlp.txt.cnlp -m dummy-dep.xz
- Make sure to use the `-XX:+UseConcMarkSweepGC` option for the JVM, which cuts memory usage in half.
- Add the log4j configuration file (log4j.properties) to your classpath.
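One way to satisfy the log4j note is sketched below. It assumes the toolkit logs through log4j 1.x, which by default looks for a file named `log4j.properties` on the classpath; the directory name `conf` and the logging settings are illustrative choices, not values taken from the ClearNLP distribution.

```shell
# Create a directory to hold the logging configuration (name is hypothetical).
mkdir -p conf

# Write a minimal log4j 1.x configuration: INFO and above to the console,
# printing only the message text.
cat > conf/log4j.properties <<'EOF'
log4j.rootLogger=INFO, stdout
log4j.appender.stdout=org.apache.log4j.ConsoleAppender
log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
log4j.appender.stdout.layout.ConversionPattern=%m%n
EOF

# Put the directory (not the file itself) at the front of the classpath.
CLASSPATH="conf${CLASSPATH:+:$CLASSPATH}"
export CLASSPATH
```

Note that the `CLASSPATH` environment variable is ignored when `java` is started with `-cp` or `-classpath`; in that case the directory has to be added to that option instead.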
The following table describes the elements of the configuration file. See the pages of the individual components for more details.
| Element | Description |
|---|---|
| `<language>` | Specifies the language of the models. |
| `<global>` | Specifies the lexicons used globally across different components. |
| `<reader>` | Specifies the data format of the training files. |
| `<column>` | Specifies the field information: `id` (node ID), `form` (word form), `lemma` (lemma), `pos` (part-of-speech tag), `feats` (extra features), `headId` (head node ID), `deprel` (dependency label), `nament` (named entity tag), `sheads` (semantic heads). |
| `<trainer>` | Specifies the training algorithm and its parameters. |
| `<bootstraps>` | If true, use bootstrap iterations for training sequences. |