Training

Command-Line

The following command trains a model for part-of-speech tagging (pos), dependency parsing (dep), named entity recognition (ner), or semantic role labeling (srl).

java edu.emory.clir.clearnlp.bin.NLPTrain -mode <mode> -c <filename> -f <filename> -t <filepath> -d <filepath> [-m <filename> -te <string> -de <string>]

-mode <mode>  : pos|dep|ner|srl
-c <filename> : configuration file (required)
-f <filename> : feature template files (required)
-m <filename> : model filename (optional)
-t <filepath> : training path (required)
-d <filepath> : development path (required)
-te <string>  : training file extension (default: *)
-de <string>  : development file extension (default: *)
  • Sample configuration files can be found here (see below for more details).
  • Sample feature template files can be found here (see feature template).
  • If the model filename -m is specified, the final model is saved to that file.
  • The training or development path -t|d can point to either a file or a directory. When the path points to a file, only that file is processed. When the path points to a directory, all files under that directory with the file extension -te|de are processed.
  • The training or development file extension -te|de specifies the extension of the training or development files. The default value * matches files with any extension. This option is used only when the corresponding path -t|d points to a directory.
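
The file-or-directory rule above can be sketched as follows. This is an illustration only, not ClearNLP's implementation: the class InputPathSketch and the method collectFiles are hypothetical names, and the sketch assumes that the directory is not searched recursively and that an extension matches when the filename ends with a dot followed by that extension.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of ClearNLP.
final class InputPathSketch {
    /**
     * Resolves a -t or -d style path to the list of files to process.
     * A file is returned as-is; a directory is filtered by extension ("*" = any).
     */
    static List<Path> collectFiles(Path path, String extension) throws IOException {
        List<Path> files = new ArrayList<>();

        if (Files.isRegularFile(path)) {
            // The path points to a file: only this file is processed, the extension is ignored.
            files.add(path);
            return files;
        }

        // The path points to a directory: keep regular files with a matching extension.
        // Assumption: direct children only, no recursion into subdirectories.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(path)) {
            for (Path p : stream) {
                String name = p.getFileName().toString();
                boolean matches = extension.equals("*") || name.endsWith("." + extension);
                if (Files.isRegularFile(p) && matches) files.add(p);
            }
        }

        Collections.sort(files); // deterministic order for the caller
        return files;
    }

    public static void main(String[] args) throws IOException {
        Path path = Paths.get(args.length > 0 ? args[0] : ".");
        String extension = args.length > 1 ? args[1] : "*";
        for (Path p : collectFiles(path, extension)) System.out.println(p);
    }
}
```

Under these assumptions, calling collectFiles on a directory with the extension dep would return only its *.dep files, while calling it on a single file returns that file regardless of the extension.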

The following command takes the training file (wsj_0001.parse.dep), the development file (clearnlp.txt.cnlp), the configuration file (config_train_sample.xml), and the feature template file (feature_en_dep.xml), and generates a dependency parsing (dep) model, dummy-dep.xz.

$ java -Xmx10g -XX:+UseConcMarkSweepGC edu.emory.clir.clearnlp.bin.NLPTrain -mode dep -c config_train_sample.xml -f feature_en_dep.xml -t wsj_0001.parse.dep -d clearnlp.txt.cnlp -m dummy-dep.xz

Configuration

The following describes the elements of the configuration files. See each component's page for more details.

Element Description
<language> Specifies the language of the models.
<global> Specifies the lexicons used globally across different components.
  • distributional_semantics: distributional semantics (e.g., brown clusters, word embeddings).
  • named_entity_dictionary : named entity dictionary.
<reader> Specifies the data format of the training files.
<column> Specifies the field information.
  • index specifies the index of the field, starting at 1.
  • field specifies the name of the field, one of:
    • id: node ID.
    • form: word form.
    • lemma: lemma.
    • pos: part-of-speech tag.
    • feats: extra features.
    • headId: head node ID.
    • deprel: dependency label.
    • nament: named entity tag.
    • sheads: semantic heads.
<trainer> Specifies the training algorithm and its parameters.
  • algorithm: adagrad for AdaGrad, liblinear for Liblinear.
  • type: svm for hinge loss classification, lr for logistic regression.
  • labelCutoff: count threshold N for filtering out labels that appear fewer than N times.
  • featureCutoff: count threshold N for filtering out features that appear fewer than N times.
  • average: if true, apply averaging to online learning.
  • alpha: learning rate (AdaGrad).
  • rho: ridge to keep the inverse covariance well-conditioned (AdaGrad).
<bootstraps> If true, use bootstrap iterations for training sequences.
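
To make the roles of alpha and rho in the trainer settings concrete, here is a minimal sketch of a per-weight AdaGrad update. It assumes the common diagonal form, in which each weight's step is alpha divided by rho plus the square root of that weight's accumulated squared gradients. ClearNLP's actual update may differ, and the class AdaGradSketch and its methods are hypothetical names used only for illustration.

```java
// Hypothetical illustration of a diagonal AdaGrad update, not ClearNLP's code.
final class AdaGradSketch {
    private final double alpha;                 // learning rate
    private final double rho;                   // ridge term keeping the denominator away from zero
    private final double[] weights;             // model weights, initialized to 0
    private final double[] squaredGradientSums; // running sum of squared gradients per weight

    AdaGradSketch(int dimension, double alpha, double rho) {
        this.alpha = alpha;
        this.rho = rho;
        this.weights = new double[dimension];
        this.squaredGradientSums = new double[dimension];
    }

    /** Applies one step: w_i -= alpha / (rho + sqrt(G_i)) * g_i, where G_i accumulates g_i^2. */
    void update(double[] gradient) {
        for (int i = 0; i < weights.length; i++) {
            double g = gradient[i];
            if (g == 0.0) continue; // untouched features keep their weight and history
            squaredGradientSums[i] += g * g;
            weights[i] -= alpha / (rho + Math.sqrt(squaredGradientSums[i])) * g;
        }
    }

    double weight(int index) {
        return weights[index];
    }

    public static void main(String[] args) {
        AdaGradSketch model = new AdaGradSketch(2, 0.01, 0.1);
        model.update(new double[] {1.0, 0.0});
        System.out.println(model.weight(0));
    }
}
```

In this form, weights that receive large or frequent gradients take progressively smaller steps, while rho bounds the step size for weights whose accumulated gradient is still close to zero.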