Adds Quantile Discretizer by Helw150 · Pull Request #90 · picnicml/doddle-model

Helw150 · 2019-09-21T06:47:56Z

Description of Changes

Adds Quantile Discretization for NumericalFeatures

Includes

Code changes
Tests
Documentation

Helw150 · 2019-09-21T06:50:35Z

Initially made this using IntVector instead of RealVector for the numQuantiles, but for some reason kept getting odd and opaque errors of the following type:

[error]   last tree to typer: Select(Select(Select(Ident(breeze), linalg), DenseVector), fill$mIc$sp)
[error]        tree position: line 44 of /Users/will/oss/doddle-model/src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
[error]             tree tpe: (size: Int, v: () => Int, implicit evidence$5: scala.reflect.ClassTag[Int])breeze.linalg.DenseVector[Int]
[error]               symbol: method fill$mIc$sp in object DenseVector
[error]    symbol definition: def fill$mIc$sp(size: Int, v: () => Int, implicit evidence$5: scala.reflect.ClassTag[Int]): breeze.linalg.DenseVector[Int] (a MethodSymbol)
[error]       symbol package: breeze.linalg
[error]        symbol owners: method fill$mIc$sp -> object DenseVector
[error]            call site: method splitEvenly in object QuantileDiscretizer in package preprocessing

inejc

Hey @Helw150, thanks for opening the PR, this is pretty awesome 🙂. I wrote a couple of comments and suggested a change for the splitEvenly function; let me know what you think about that. Regarding your comment about using IntVector for numQuantiles, I'll let you know once I look at it more thoroughly. My first guess would be to use DenseVector[Int] instead of IntVector, as the latter is just a type alias in doddle-model and it might confuse breeze.

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

Helw150 · 2019-09-21T21:35:10Z

@inejc I think the new version should address all of your comments! Just changing to DenseVector[Int] still had me running into that issue, but I don't have any particular insights that would allow me to debug so lmk if you can figure anything out.

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

inejc

I wrote some suggestions but this is definitely in the right direction! Let me know if you disagree with my comments.

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

src/test/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizerTest.scala

Helw150 · 2019-09-26T03:52:39Z

@inejc Resolved all but one, let me know if it is a sticking point and I will change.

inejc

I believe we are one step away from merging this. Regarding using IntVector for bucketCounts: I simply changed every : DenseVector[Double] to : IntVector, removed all internal DenseVector[Double] types, removed the unnecessary .toInt and .toDouble calls, and compilation was successful.

inejc · 2019-09-26T18:20:03Z

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

+import io.picnicml.doddlemodel.typeclasses.Transformer
+import scala.Double.{MaxValue, MinValue}
+
+case class QuantileDiscretizer(


I like this formatting. Is this scalafmt? We need to add a formatter to the project 😅.

Yeah, I ran Scalafmt (but didn't PR my setup of it since I didn't know if you wanted it). It's a one line addition to the plugins file and a configurable settings file. I can make a PR with the setup and you can tune the config to your personal preferences!

I added this issue: #95.
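For reference, the Scalafmt setup described above usually amounts to two small files. This is a sketch only: the version numbers and the maxColumn value are placeholders chosen for illustration, not settings taken from this PR or from doddle-model.

```scala
// project/plugins.sbt
// One-line plugin registration; the version shown is an assumption.
addSbtPlugin("org.scalameta" % "sbt-scalafmt" % "2.0.5")
```

```
# .scalafmt.conf (HOCON); both values below are illustrative placeholders
version = "2.0.1"
maxColumn = 120
```

With the plugin loaded, `sbt scalafmtAll` formats main and test sources, and `sbt scalafmtCheckAll` can be used in CI to fail on unformatted code.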

inejc · 2019-09-26T18:22:44Z

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

+
+  /** Create a quantile discretizer which splits data into discrete evenly sized buckets.
+    *
+    * @param bucketCount The number of quantiles desired


I'm nitpicking here, but I wouldn't describe bucketCount as "The number of quantiles", because the number of quantiles is always one less than the number of buckets. From Wikipedia: "Quartiles: the three points that divide the data set into four equal groups in descriptive statistics."
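The off-by-one relationship between buckets and quantiles can be sketched as follows. The object and method names are hypothetical and are not part of doddle-model; the helper only lists the interior cut points as fractions of the data.

```scala
// Hypothetical helper for illustration only (not doddle-model code).
object QuantileCounts {
  // Fractions in (0, 1) at which the data is cut to form `bucketCount`
  // equally populated buckets: always bucketCount - 1 cut points.
  def cutPointFractions(bucketCount: Int): Seq[Double] = {
    require(bucketCount >= 1, "need at least one bucket")
    (1 until bucketCount).map(_.toDouble / bucketCount)
  }
}
```

For example, `QuantileCounts.cutPointFractions(4)` yields the three quartile positions 0.25, 0.5 and 0.75 for four buckets.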

inejc · 2019-09-26T18:35:44Z

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

+
+  private def computeQuantiles(target: Seq[Double], bucketCount: Int): Seq[(Double, Double)] = {
+    val binPercentileWidth = 1.0 / bucketCount
+    val targetArray = target.toArray


The reason I wrote the comment about target: Seq[Double] is that as a result, we copy each numerical column twice, instead of just once. The first time it is copied in def fit with x(::, colIndex).toScalaVector and the second time in def computeQuantiles with target.toArray.

The solution would be to change target: Seq[Double] to target: Array[Double] here and then create an array in def fit directly with .toArray which also makes a copy based on this.

Hope this makes sense and I'm not making a mistake reading this.

Got it! I didn't understand the comment but makes sense now
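A minimal sketch of the single-copy idea discussed above, written in plain Scala so it runs without breeze. Everything here is an assumption for illustration: percentileSorted is a simple nearest-rank stand-in for DescriptiveStats.percentileInPlace (which may compute percentiles differently), and the use of MinValue and MaxValue as the outer bounds is only inferred from the import visible in the diff.

```scala
// Illustrative sketch only; names and percentile definition are assumptions.
object QuantileSketch {
  // Nearest-rank percentile on an already sorted array.
  private def percentileSorted(sorted: Array[Double], p: Double): Double = {
    val rank = math.ceil(p * sorted.length).toInt
    sorted(math.min(math.max(rank, 1), sorted.length) - 1)
  }

  // Accepts the column as an Array, so the caller makes the only copy
  // (e.g. with .toArray). The array is sorted in place, so the caller
  // must pass an array it owns.
  def computeQuantiles(target: Array[Double], bucketCount: Int): Seq[(Double, Double)] = {
    require(bucketCount >= 1 && target.nonEmpty)
    java.util.Arrays.sort(target)
    val inner = (1 until bucketCount).map(i => percentileSorted(target, i.toDouble / bucketCount))
    val bounds = Double.MinValue +: inner :+ Double.MaxValue
    // Pair consecutive bounds into (lowerBound, upperBound) ranges.
    bounds.sliding(2).map { case Seq(lower, upper) => (lower, upper) }.toList
  }
}
```

Because the function mutates its argument, passing a freshly created array is what keeps the original column untouched while avoiding the second copy.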

inejc · 2019-09-26T18:52:03Z

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

+        .map(_.toDouble)
+        .map(DescriptiveStats.percentileInPlace(targetArray, _))
+        .sliding(2)
+        .map({case Seq(lowerBound, upperBound) => (lowerBound, upperBound)})


This line can be just .map { case Seq(lowerBound, upperBound) => (lowerBound, upperBound) }.

Oop yeah, I always use parens but it's personal preference

inejc · 2019-09-26T19:14:11Z

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

+
+    override protected def transformSafe(model: QuantileDiscretizer, x: Features): Features = {
+      val xCopy = x.copy
+      model.featureIndex.numerical.columnIndices.zipWithIndex.foreach {


I like this, it's elegant. I personally would format it as (just a subjective preference):

model.featureIndex.numerical.columnIndices.zipWithIndex.foreach { case (colIndex, bucketsIndex) =>
  val buckets = model.quantiles.getOrBreak(bucketsIndex)
  (0 until xCopy.rows).foreach { rowIndex =>
    xCopy(rowIndex, colIndex) = buckets.indexWhere { case (lowerBound, upperBound) =>
      lowerBound <= xCopy(rowIndex, colIndex) && xCopy(rowIndex, colIndex) <= upperBound
    }.toDouble
  }
}
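The lookup at the heart of the snippet above can be isolated as a small function. This is a sketch with hypothetical names (BucketLookup, bucketOf), using plain Scala collections instead of the Features matrix.

```scala
// Hypothetical, standalone version of the per-value bucket lookup.
object BucketLookup {
  // Index of the first bucket whose closed range contains `value`.
  // indexWhere returns -1 when no range matches (e.g. for NaN).
  def bucketOf(buckets: Seq[(Double, Double)], value: Double): Int =
    buckets.indexWhere { case (lowerBound, upperBound) =>
      lowerBound <= value && value <= upperBound
    }
}
```

Since both ends of each range are inclusive and indexWhere stops at the first match, a value that equals a shared boundary is assigned to the lower of the two adjacent buckets.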

inejc · 2019-09-26T19:31:55Z

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

+import io.picnicml.doddlemodel.data.Feature.FeatureIndex
+import io.picnicml.doddlemodel.data.Features
+import io.picnicml.doddlemodel.syntax.OptionSyntax._
+import io.picnicml.doddlemodel.typeclasses.Transformer


Another line between the penultimate import and import scala.Double.{MaxValue, MinValue} (I used Optimize imports in IntelliJ which also reordered some of the imports).

I use Emacs sadly, but I'll open IntelliJ for the import optimization (I have used Scalafix for similar things, but it's a big dependency for something IntelliJ does for free for most folks)

matejklemen · 2019-09-27T14:44:11Z

src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala

+import scala.Double.{MaxValue, MinValue}
+
+case class QuantileDiscretizer(
+  private val bucketCounts: DenseVector[Double],


If I understood correctly, you implemented this with IntVector initially and the tests failed (didn't compile)?

I tried using IntVector here and changing the DenseVector[Double]s to DenseVector[Int]s throughout the code/tests and the tests passed on all three supported versions of Scala for this project (2.11.12, 2.12.9 and 2.13.0). I got some mysterious error once that I can't reproduce anymore, but it went away after deleting the target/ folder and building again.

Could you please check again, deleting target/ if the problem persists? I'm not sure where the problem could be as I'm assuming you are using the dependency versions listed in project/Dependencies

Adds Quantile Discretizer

43d267a

Use Cross Compatability Ordering

c9a7724

inejc requested review from inejc and matejklemen September 21, 2019 09:27

inejc added the awaits review label Sep 21, 2019

inejc requested changes Sep 21, 2019

inejc added the enhancement New feature or request label Sep 21, 2019

Helw150 added 2 commits September 21, 2019 14:18

Use Breeze Descriptive Stats and Ranges instead of clean buckets

d6bfb85

Better Transform Syntax and Efficiency

23a95fd

picnicml deleted a comment Sep 21, 2019

Use Options

c75024c

picnicml deleted a comment Sep 21, 2019

inejc reviewed Sep 23, 2019

Helw150 added 4 commits September 25, 2019 20:36

Fmt

b308ffb

Example Fix

ecbab4c

Misc Review Resolutions

ea3304d

Format Test

4849dba

picnicml deleted a comment Sep 26, 2019

inejc reviewed Sep 26, 2019

matejklemen reviewed Sep 27, 2019
