-
Notifications
You must be signed in to change notification settings - Fork 170
Machine Learning Models
BIDMach supports supervised and unsupervised models. It includes a general regression class with linear, logistic and SVM subclasses, and a factor model class with LDA, NMF and SFA (often implemented with ALS or Alternating Least Squares). There are also random forest and k-means. We have started a group of causal estimators in a separate package. Most models can be combined with (e.g. L1 and L2 regularization) mixins. There are typically several options for updaters, although several models use multiplicative updates and require matching updaters.
All models work with either sparse or dense input matrices (on CPU or GPU). For regression models which have target matrices (indicating category membership) as well as data matrices, the target matrices can also be either sparse or dense.
Regression models include Generalized Linear Model (GLM) and Factorization Machine. Linear Regression, Logistic Regression and Support Vector Machines are implemented under GLM with different link functions and loss gradients. Factorization Machine models interaction between features.
Factor models includes Latent Dirichlet Allocation (LDA), Alternating Least Square (ALS) and Non-negative Matrix Factorization (NMF). A factor model takes a sparse matrix (m-by-n) as input and decompose it into two low dimension matrices (m-by-k and k-by-n). k is usually much smaller than m and n. Dimension reduction is typically used to improve prediction, but may also use the factors to capture topic or cluster structure in the dataset.
Usage of the models are described below. Import the following package before using the models:
import BIDMach.models._
The Generalized Linear Model (GLM) is a generalization of linear regression. Linear regression, logistic regression and SVM are implemented under GLM with different link functions and likelihood functions. We will be adding other standard GLM models soon. You should be able to deploy your own GLM models with your own link object that extends BIDMach.models.GLM.GLMLink.
The following code creates a learner and the associated options for GLM.
val (nn, opts) = LDA.learner(a, c, i)Input a is a m-by-n data matrix. It can be either dense or sparse. m is the dimension of features and n is the dimension of examples. c is a k-by-n matrix. k is the number of categories. c(i,j)=1 indicates that example j belongs to category i. For details on how to create matrices, please refer to Data Wrangling or Creating Test Matrices. The choice of GLM model is set by the third argument i. Currently:
0 = linear regression 1 = logistic regression 2 = logistic regression predictor with likelihood (not log likelihood) optimization 3 = Support Vector MachineThe learner builds multiple binary predictors for each category, that is, an OAA or One-Against-All multiclass classifier. To see all the parameters of the learner:
opts.what
To train the model:
nn.train
GLM produces the weight vectors for each of the k categories. To get the results of the learner:
nn.modelmat: the feature-weight matrixTo save the model on disk, please see Saving your results. If you have some data to predict, you can construct a predictor for it like this:
val (nn, nopts) = GLM.predictor(model, atest, ctest, i)
Here model is the model produced in the train step, atest is the test data and again i is the choice of GLM model. ctest is a matrix that holds the predictions.(You can go to here to see how to create some test matrices) You can compute predictions with
nn.predict
Please refer to the usage of GLM, setting the third argument i=0.
Please refer to the usage of GLM, setting the third argument i=1 or 2. i=1 produces standard logistic regression which maximizes the total log likelihood across all input instances. A logistic model outputs a probability <math>p</math> for each test instance, and the probability of correctness is <math>p</math> if the label is 1, and <math>1-p</math> if it is zero. Standard logistic regression optimizes the sum of logs of this probability. A weakness of this formulation is that it attributes a lot of weight to incorrectly labeled instances, trying to avoid large negative log likelihoods. In effect it tries hard not to make bad labels worse.
Alternatively, setting i=2 optimizes the average likelihood across instances. This directly optimizes the accuracy of the prediction (i.e. the probability of a correct prediction). It tries to make good labels better. This option often leads to more accurate models (since it is directly optimizing accuracy). It seems to be a little trickier to tune, but you should find the effort worthwhile if you care about absolute accuracy. Whereas standard logistic regression weights instances according to their prediction error (so bad instances have higher weight), this version weights instances "on the fence" more heavily. Thus it resembles an SVM, and is optimizing accuracy on borderline instances.
Support vector machine learning is very similar. Start by defining a learner with data and target matrices:
> val (nn, nopts) = GLM.SVMlearner(datamat, labelmat)then train it with
> nn.trainThe algorithm uses the Pegasos SGD (Stochastic Gradient) algorithm [1]. This algorithm minimizes the SVM objective function using gradient steps of the form:
<math> \nabla = \lambda w - \mathbb{1}[y_i\;] \:y_i x_i </math>
where <math>w</math> is the model weight vector, <math>x_i</math> is the ith instance, and <math>y_i</math> is the (0,1) label of the ith instance, and <math>\mathbb{1}</math> is an indicator function which is 1 when its argument is true, 0 otherwise. <math>\lambda</math> is the regularization parameter of the SVM (see [1]). The regularization is implemented using BIDMach's L2Regularizer mixin. <math>\lambda</math> is equivalent to the L2 regularizer weight, which can be set with:
> nopts.reg2weight
The section below on Tuning explains how to set the other important parameters.
[1] Shalev-Schwartz S., Singer Y., Srebro, N, "Pegasos: Primal Estimated sub-GrAdient SOlver for SVM", IMCL 2007.
The GLM models are homogeneous, and do not include a constant term. To model a constant term, you need to introduce a constant feature, (usually fixed at 1) into the input data source. Make sure you do this after doing any normalization of each input feature.
The standard datasources include this as an option (Options.addConstFeat). When this option is set, a constant feature is added as the last feature on each instance (so the number of features increments by 1).
Regression models share the following options
var targets:FMat = null var targmap:FMat = null var rmask:FMat = null
and in addition, the GLM model has an option:
var links:IMat = null
which specifies the link (0 for linear, 1 for logistic etc) number for each target. The link argument must be set. The others are optional. They are not needed if the input is two datasources (as in GLM.learner(dat,cats,1) since the targets are given explicitly.
A GLM learner will also typically inherit options from a Learner object (e.g. number of passes over the dataset, regularizers (L1 and possibly L2), and options such as learning rate from an updater such as ADAGRAD. Those options are described in detail in those sections of the Docs. Regularization weights, learning rate, number of passes over the data are all settable in the opts object.
Back to the Regression-specific fields, targets is used to define targets for datasets from which they have not been extracted as separate datasources. For instance, from a set of log files for social media data, you might define targets such as emoticons (:-) which exist as features in the dataset. targets is a k x n matrix where k is the number of targets, and n is the number of features. targets will a a zero-one-valued matrix with ones in position (i,j) where j is a feature which defines target i.
targmap supports multimodel inference over the same set of targets. It will typically be a stack of identity matrices replicated p times, where the learner trains p models per target with different parameters. See the BIDMach/Experiments.scala examples of its usage. For simple cases, you can leave it null.
rmask forces certain elements of the prediction matrix to zero. It is a zero-one-valued matrix. This is essential when you use targets to define targets from input features. At a minimum, the target input features need to be forced to zero in rmask.
You can see a full list of a GLM learner options with the what method, like this:
> val (mm,opts) = GLM.learner(dat,cats,1) > opts.what Option Name Type Value ----------- ---- ----- addConstFeat boolean false batchSize int 10000 dim int 256 doubleScore boolean false epsilon float 1.0E-5 evalStep int 11 featType int 1 initsumsq float 1.0E-5 links IMat 1,1,1,1,1,1,1,1,1,1,... lrate FMat 1 mask FMat null npasses int 1 nzPerColumn int 0 pstep float 0.01 putBack int -1 r1nmats int 1 reg1weight FMat 1.0000e-07 resFile String null rmask FMat null sample float 1.0 sizeMargin float 3.0 startBlock int 8000 targets FMat null targmap FMat null texp FMat 0.50000 useGPU boolean true vexp FMat 0.50000 waitsteps int 2
we wont go over all the options here, and they are discussed in the sections on DataSource, Learner, Mixins and Updaters. There are many, but typically tuning a model will focus on a few: lrate which is the learning rate, the regularizer weights (reg1weight and reg2weight), and the updater parameters texp and vexp which are discussed in those respective sections.
Many of the parameters above have FMat type but a single value. That means all models share the same parameters. But you can tailor the parameters for each target (and use multiple models per target using targmap) by passing in a k-element FMat for any parameter, where k is the total number of targets. Thus you can tune parameters in parallel with a single (or few) passes over a large dataset.
Factorization machines are described in [2]. They use dimension reduction to model interactions in prediction tasks, and have been shown to work well on collaborative filtering and click prediction. An output <math>\hat{y}</math> is modeled as
<math>\hat{y} = w_0 \: + \: \sum\limits_{i=1}^n \; w_i x_i \: + \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n \; \langle \mathbf{\text{v}}_i, \mathbf{\text{v}}_j \rangle \; x_i x_j</math>
where each <math>\mathbf{\text{v}}_i</math> is a k < n dimensional vector. <math>\hat{y}</math> can be thought of as an approximation to a second-order function <math>y(x)</math> using a dimension reduced interaction matrix. The second summation in the formula above is effectively symmetric and, except for the missing diagonal terms, positive definite. Its been noted that Factorization machines can approximate any second-order function of the features if k is large enough by varying the diagonal weights in the first sum. But this is not necessarily an efficient approximation. In our implementation we generalize the above formula to this one:
<math> \hat{y} = w_0 \: + \sum\limits_{i=1}^n \; w_i x_i \: + \sum\limits_{i=1}^n \sum\limits_{j=i+1}^n \; \langle \mathbf{\text{v}}_i, \mathbf{\text{v}}_j \rangle \; x_i x_j \; - </math><math>\sum\limits_{i=1}^n \sum\limits_{j=i+1}^n\langle\mathbf{\text{u}}_i,\mathbf{\text{u}}_j \rangle x_i x_j</math>
which adds a second sum of negative definite terms. We assume the <math>\mathbf{\text{v}}_i</math> now have dimension <math>k_1</math> and the <math>\mathbf{\text{u}}_i</math> have dimension <math>k_2</math>. This generally gives better predictions for a given total number <math>k=k_1 + k_2</math> of dimensions than using assuming all positive eigenvalues as in the original formulation. We found that <math>k_1 > k_2</math> by a factor of 4 or more works best. Either this is a property of the full second-order matrix, or the positive subspace is more important for high-value predictions.
Factorization machines are integrated with BIDMach's GLM framework, so you can create factorization machines for either linear, logistic or SVM link functions. To create a Factorization machine model, do:
> val (mm, opts) = FM.learner(datamat, labels, d)
where
dis the link number (0 for linear, 1 or 2 for logistic, 3 for SVM). The options for setting the positive and negative subspace dimensions are:
> opts.dim1 = 224 > opts.dim2 = 32
Otherwise, the options for running and tuning GLM models apply.
Note that factorization machines currently only support singe-target predictions. This will probably change soon.
[2] Rendle, S. "Factorization machines." In Proc. 10th IEEE Int. Conf. on Data Mining, 2010.
We currently have two implementations of LDA: BIDMach.models.LDA uses Variational Bayes method for model inference and parameter estimation, and BIDMach.models.LDAgibbs uses Gibbs sampling. The usage of the two models are the same with a few exceptions in setting the model options. The following code creates a learner and the associated options for LDA.
val (nn, opts) = LDA.learner(a, k) or val (nn, opts) = LDAgibbs.learner(a, k)The required input a is m-by-n sparse matrix (word-doc-counts). m is the dimension of features and n is the dimension of examples. For topic model, m is number of word and n is number of document. Input k is optional and is the dimension of the model. For details on how to create a sparse matrix, please refer to Data Wrangling (Directly create sparse matrix from a structured file) or Creating some Matrices (manually create it from Scala arrays).
To see all the parameters of the learner:
opts.what
To train the model:
nn.train
To get the results of the learner:
nn.modelmats(0): the topic-word matrix, one topic per row. nn.datamats(1): the factorized topic-document matrix, one document per column.Hint: Make sure that opts.putBack is set to 1 if you want to store the topic-document matrix in nn.datamats(1). Since for large scale dataset, this is not always the default option.
You can go to Saving your results to see how to save your results.
Here are the key parameters for LDA and LDAgibbs. Default values are noted in parenthesis.
dim(256): Model dimension uiter(5): Number of iterations on one block of data alpha(0.001f): Dirichlet document-topic prior * Other key parameters inherited from the learner, datasource and updater: blockSize: the number of samples processed in a block power(0.3f): the exponent of the moving average of online learning npasses(10): number of complete passes over the dataset
Specialized parameters for LDA
exppsi(true): Apply exp(psi(X)) if true, otherwise just use X
Specialized parameters for LDAgibbs
beta(0.0001f): Dirichlet word-topic prior m(100): number of samples (per word) taken for each pass through the data
Sparse Factor Analysis is a popular model for collaborative filtering. It is often referred to as ALS (Alternating Least Squares) which is an algorithm rather than a model. ALS is actually a very slow way to implement SFA. We use instead a gradient method with CG solver (should be preconditioned CG soon). Construction of SFA models is similar to LDA. For example, you can create a SFA learner with
val (nn, opts) = SFA.learner(a, k)The results are also in the same format as in
LDA.
The usage of NMF model is the same as LDA, except that you need to replace model name LDA with NMF. For example, you can create a NMF learner with
val (nn, opts) = NMF.learner(a, k)The results are also in the same format as in
LDA.
The usage of KMeans is the similar as LDA. For example, you can create a KMeans learner with
val (nn, opts) = KMeans.learner(a, k)Where k is the number of clusters. And if your have n data points in d dimension, then the data matrix a should be a d*n matrix.
Result matrices are almost always stored as GMat (useGPU=true) or FMat (useGPU=false), and have general (Mat) type. Its easier to work with them as FMats (and this facilitates saving). The result of a training session is stored in the Learner's modelmat or modelmats field (for models with several components). You can extract and convert them to FMat like this:
> val (nn,opts) = LDA.learner(a,256) > nn.train > ... > val m = FMat(nn.modelmat)
After that, you can use saveAs("filename.mat",m,"model") to save your results in a Matlab-compatible HDF5 format mat file, or use saveFMat("filename.fmat.lz4",m) to save in BIDMat's own format using lz4 compression. You will find the latter to be significantly faster. We strongly recommend you use a similar file naming convention (".mat" suffix for matlab-compatible files, and "name.mattype.compresstype" for BIDMat-format files. HDF5 can store multiple arrays in each file, and requires a string name for each (in this case, we named m as "model" in the HDF5 file). BIDMat requires matrices to each be saved in their own file. There is type meta-data in the file but it is easier to track matrix types and contents using filenames. (See here for more details).
Creating random matrices:
val a=rand(n,m) //Creating a random FMat val a=grand(n,m) //Creating a random GMat val a=sprand(n,m,0.2) //Creating a random SMat with 20% non-zero values
Creating FMats:
val a=zeros(n,m) //Filled with zeros val a=ones(n,m) //Filled with ones val d=new Array[Float](n*m) val a=new FMat(n,m,d) //Create FMat from existing Scala array
Creating GMats:
val a=gzeros(n,m) //Filled with zeros val a=gones(n,m) //Filled with ones val d=zeros(n,m) val a=GMat(d) //Creating GMat from FMat
Creating IMats:
val a=izeros(n,m) val a=iones(n,m)
Creating SMats:
val rows = izeros(nnz,1) val cols = izeros(nnz,1) val dat = zeros(nnz,1) // Set row and column indices (zero-based) and fill data matrix. val m=sparse(rows,cols,dat)