Training

Command-Line

The following command trains a model for part-of-speech tagging (pos), dependency parsing (dep), named entity recognition (ner), or semantic role labeling (srl):

java edu.emory.clir.clearnlp.bin.NLPTrain -mode <mode> -c <filename> -f <filename> -t <filepath> -d <filepath> [-m <filename> -te <string> -de <string>]

-mode <mode>  : pos|dep|ner|srl
-c <filename> : configuration file (required)
-f <filename> : feature template files (required)
-m <filename> : model filename (optional)
-t <filepath> : training path (required)
-d <filepath> : development path (required)
-te <string>  : training file extension (default: *)
-de <string>  : development file extension (default: *)

Sample configuration files can be found here (see below for more details).
Sample feature template files can be found here (see feature template).
If the model filename -m is specified, the final model is saved.
The training or development path -t|-d can point to either a file or a directory. When the path points to a file, only that file is processed. When the path points to a directory, all files under that directory with the file extension -te|-de are processed.
The training and development file extensions -te|-de specify the extensions of the training and development files. The default value * matches files with any extension. These options are used only when the corresponding path -t|-d points to a directory.
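
The path and extension rules above can be sketched in Java as follows. This is only an illustration, not ClearNLP's own code: the class `InputFileResolver` and its `resolve` method are hypothetical, and the sketch assumes that the extension is given without a leading dot and that directories are not searched recursively (the notes above do not say either way).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper (not part of ClearNLP) mirroring the documented -t|-d and -te|-de rules. */
final class InputFileResolver {
    /** Returns the files selected by a path and an extension ("*" matches any extension). */
    static List<Path> resolve(Path path, String extension) throws IOException {
        // A path that points to a file selects only that file; the extension is ignored.
        if (Files.isRegularFile(path)) {
            return List.of(path);
        }
        // A directory selects the regular files directly under it (assumption: no recursion)
        // whose names end with "." + extension, or all of them when the extension is "*".
        try (Stream<Path> entries = Files.list(path)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(p -> "*".equals(extension)
                          || p.getFileName().toString().endsWith("." + extension))
                .sorted()
                .collect(Collectors.toList());
        }
    }
}
```

For example, `InputFileResolver.resolve(Path.of("trn"), "dep")` would select `trn/wsj_0001.parse.dep` but skip `trn/notes.txt`, while passing `"*"` would select both.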

The following command takes the training file (wsj_0001.parse.dep), the development file (clearnlp.txt.cnlp), the configuration file (config_train_sample.xml), and the feature template file (feature_en_dep.xml), and generates a dependency parsing (dep) model, dummy-dep.xz.

$ java -Xmx10g -XX:+UseConcMarkSweepGC edu.emory.clir.clearnlp.bin.NLPTrain -mode dep -c config_train_sample.xml -f feature_en_dep.xml -t wsj_0001.parse.dep -d clearnlp.txt.cnlp -m dummy-dep.xz

Make sure to pass the -XX:+UseConcMarkSweepGC option to the JVM, which cuts memory usage roughly in half.
Add the log4j configuration file (log4j.properties) to your classpath.
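
When training is launched from another Java program, the same command line can be assembled programmatically. The sketch below is a hypothetical wrapper, not something ClearNLP provides: `TrainCommand` and `build` are made-up names, and only the flags documented above are used. It builds the argument list without starting the process.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wrapper (not part of ClearNLP) that assembles the documented NLPTrain command. */
final class TrainCommand {
    static List<String> build(String mode, String config, String features,
                              String trainPath, String devPath, String modelFile) {
        // Only the four modes listed for -mode are accepted.
        if (!List.of("pos", "dep", "ner", "srl").contains(mode)) {
            throw new IllegalArgumentException("unknown mode: " + mode);
        }
        List<String> cmd = new ArrayList<>(List.of(
            "java", "-Xmx10g", "-XX:+UseConcMarkSweepGC",
            "edu.emory.clir.clearnlp.bin.NLPTrain",
            "-mode", mode, "-c", config, "-f", features,
            "-t", trainPath, "-d", devPath));
        // -m is optional; per the notes above, the final model is saved when it is given.
        if (modelFile != null) {
            cmd.add("-m");
            cmd.add(modelFile);
        }
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("dep", "config_train_sample.xml", "feature_en_dep.xml",
                                 "wsj_0001.parse.dep", "clearnlp.txt.cnlp", "dummy-dep.xz");
        // Print the command; it could instead be handed to new ProcessBuilder(cmd),
        // assuming ClearNLP and log4j.properties are on the classpath.
        System.out.println(String.join(" ", cmd));
    }
}
```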

Configuration

The following describes the elements of the configuration file. See the page for each component for more details.

| Element | Description |
|---|---|
| `<language>` | Specifies the language of the models. See TLanguage for all supported languages. |
| `<global>` | Specifies the lexicons used globally across different components. `distributional_semantics`: distributional semantics (e.g., Brown clusters, word embeddings). `named_entity_dictionary`: named entity dictionary. |
| `<reader>` | Specifies the data format of the training files. |
| `<column>` | Specifies the field information. `index` specifies the index of the field, starting at 1. `field` specifies the name of the field: `id` (node ID), `form` (word form), `lemma` (lemma), `pos` (part-of-speech tag), `feats` (extra features), `headId` (head node ID), `deprel` (dependency label), `nament` (named entity tag), `sheads` (semantic heads). |
| `<trainer>` | Specifies the training algorithm and its parameters. `algorithm`: `adagrad` for AdaGrad, `liblinear` for Liblinear. `type`: `svm` for hinge loss classification, `lr` for logistic regression. `labelCutoff`: count threshold `N` for filtering out labels that appear fewer than `N` times. `featureCutoff`: count threshold `N` for filtering out features that appear fewer than `N` times. `average`: if `true`, apply averaging to online learning. `alpha`: learning rate (AdaGrad). `rho`: ridge to keep the inverse covariance well-conditioned (AdaGrad). |
| `<bootstraps>` | If `true`, use bootstrap iterations for training sequences. |
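
To make the 1-based `index` convention of `<column>` concrete, here is a minimal sketch of how such a mapping could be applied to one line of training data. It is not ClearNLP's reader: the class `ColumnMapping` and its methods are hypothetical, and the sketch assumes the fields are separated by tabs, which the table above does not state.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical reader sketch (not ClearNLP's implementation) for index/field pairs. */
final class ColumnMapping {
    private final Map<String, Integer> indexByField = new LinkedHashMap<>();

    /** Registers a field name at a 1-based column index, as described for <column>. */
    ColumnMapping column(int index, String field) {
        if (index < 1) {
            throw new IllegalArgumentException("column indices start at 1");
        }
        indexByField.put(field, index);
        return this;
    }

    /** Splits one tab-separated line (assumed format) and returns each registered field. */
    Map<String, String> parse(String line) {
        String[] cells = line.split("\t", -1);
        Map<String, String> node = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : indexByField.entrySet()) {
            // Convert the 1-based configuration index to a 0-based array index.
            node.put(e.getKey(), cells[e.getValue() - 1]);
        }
        return node;
    }
}
```

With `new ColumnMapping().column(1, "id").column(2, "form").column(4, "pos")`, the line `1<TAB>John<TAB>john<TAB>NNP` yields `id=1`, `form=John`, and `pos=NNP`; the third column is skipped because no field is mapped to it.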
