Conversation
|
Initially made this using |
inejc
left a comment
There was a problem hiding this comment.
Hey @Helw150, thanks for opening the PR, this is pretty awesome 🙂. I wrote a couple of comments and suggested a change for the splitEvenly function, let me know what's your thinking about that. Regarding your comment about using IntVector for numQuantiles, I'll let you know once I look at it more thoroughly. The first guess would be to use DenseVector[Int] instead of IntVector as this is just a type alias in doddle-model and it might confuse breeze.
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
|
@inejc I think the new version should address all of your comments! Just changing to DenseVector[Int] still had me running into that issue, but I don't have any particular insights that would allow me to debug so lmk if you can figure anything out. |
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
inejc
left a comment
There was a problem hiding this comment.
I wrote some suggestions but this is definitely in the right direction! Let me know if you disagree with my comments.
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/main/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizer.scala
Outdated
Show resolved
Hide resolved
src/test/scala/io/picnicml/doddlemodel/preprocessing/QuantileDiscretizerTest.scala
Show resolved
Hide resolved
|
@inejc Resolved all but one, let me know if it is a sticking point and I will change. |
There was a problem hiding this comment.
I believe we are one step away from merging this. Regarding using IntVector for bucketCounts; I simply changed all : DenseVector[Double] to : IntVector, removed all internal DenseVector[Double] types and removed unnecessary .toInt and .toDouble and compilation was successful.
| import io.picnicml.doddlemodel.typeclasses.Transformer | ||
| import scala.Double.{MaxValue, MinValue} | ||
|
|
||
| case class QuantileDiscretizer( |
There was a problem hiding this comment.
I like this formatting. Is this scalafmt? We need to add a formatter to the project 😅.
There was a problem hiding this comment.
Yeah, I ran Scalafmt (but didn't PR my setup of it since I didn't know if you wanted it). It's a one line addition to the plugins file and a configurable settings file. I can make a PR with the setup and you can tune the config to your personal preferences!
|
|
||
| /** Create a quantile discretizer which splits data into discrete evenly sized buckets. | ||
| * | ||
| * @param bucketCount The number of quantiles desired |
There was a problem hiding this comment.
I'm nitpicking here but I wouldn't describe bucketCount as The number of quantiles because the number of quantiles is always one less than the number of buckets. From Wikipedia: Quartiles: the three points that divide the data set into four equal groups in descriptive statistics.
|
|
||
| private def computeQuantiles(target: Seq[Double], bucketCount: Int): Seq[(Double, Double)] = { | ||
| val binPercentileWidth = 1.0 / bucketCount | ||
| val targetArray = target.toArray |
There was a problem hiding this comment.
The reason I wrote the comment about target: Seq[Double] is that as a result, we copy each numerical column twice, instead of just once. The first time it is copied in def fit with x(::, colIndex).toScalaVector and the second time in def computeQuantiles with target.toArray.
The solution would be to change target: Seq[Double] to target: Array[Double] here and then create an array in def fit directly with .toArray which also makes a copy based on this.
Hope this makes sense and I'm not making a mistake reading this.
There was a problem hiding this comment.
Got it! I didn't understand the comment but makes sense now
| .map(_.toDouble) | ||
| .map(DescriptiveStats.percentileInPlace(targetArray, _)) | ||
| .sliding(2) | ||
| .map({case Seq(lowerBound, upperBound) => (lowerBound, upperBound)}) |
There was a problem hiding this comment.
This line can be just .map { case Seq(lowerBound, upperBound) => (lowerBound, upperBound) }.
There was a problem hiding this comment.
Oop yeah, I always use parens but it's personal preference
|
|
||
| override protected def transformSafe(model: QuantileDiscretizer, x: Features): Features = { | ||
| val xCopy = x.copy | ||
| model.featureIndex.numerical.columnIndices.zipWithIndex.foreach { |
There was a problem hiding this comment.
I like this, it's elegant. I personally would format it as (just a subjective preference):
model.featureIndex.numerical.columnIndices.zipWithIndex.foreach { case (colIndex, bucketsIndex) =>
val buckets = model.quantiles.getOrBreak(bucketsIndex)
(0 until xCopy.rows).foreach { rowIndex =>
xCopy(rowIndex, colIndex) = buckets.indexWhere { case (lowerBound, upperBound) =>
lowerBound <= xCopy(rowIndex, colIndex) && xCopy(rowIndex, colIndex) <= upperBound
}.toDouble
}
}| import io.picnicml.doddlemodel.data.Feature.FeatureIndex | ||
| import io.picnicml.doddlemodel.data.Features | ||
| import io.picnicml.doddlemodel.syntax.OptionSyntax._ | ||
| import io.picnicml.doddlemodel.typeclasses.Transformer |
There was a problem hiding this comment.
Another line between the penultimate import and import scala.Double.{MaxValue, MinValue} (I used Optimize imports in IntelliJ which also reordered some of the imports).
There was a problem hiding this comment.
I use Emacs sadly, but I'll open IntelliJ for the import optimization (I have used Scalafix for similar things, but it's a big dependency for something IntelliJ does for free for most folks)
| import scala.Double.{MaxValue, MinValue} | ||
|
|
||
| case class QuantileDiscretizer( | ||
| private val bucketCounts: DenseVector[Double], |
There was a problem hiding this comment.
If I understood correctly, you implemented this with IntVector initially and the tests failed (didn't compile)?
I tried using IntVector here and changing the DenseVector[Double]s to DenseVector[Int]s throughout the code/tests and the tests passed on all three supported versions of Scala for this project (2.11.12, 2.12.9 and 2.13.0). I got some mysterious error once that I can't reproduce anymore, but it went away after deleting the target/ folder and building again.
Could you please check again, deleting target/ if the problem persists? I'm not sure where the problem could be as I'm assuming you are using the dependency versions listed in project/Dependencies
Description of Changes
Adds Quantile Discretization for NumericalFeatures
Includes