Skip to content

Additional preprocessors#68

Merged
inejc merged 10 commits intopicnicml:masterfrom
matejklemen:additional-preprocessors
Sep 6, 2019
Merged

Additional preprocessors#68
inejc merged 10 commits intopicnicml:masterfrom
matejklemen:additional-preprocessors

Conversation

@matejklemen
Copy link
Member

@matejklemen matejklemen commented Jul 11, 2019

Description of Changes

Addresses #7. Implements 3 new preprocessors:

  1. Binarizer: convert numerical columns into binary columns based on a threshold. Works with either a Double scalar or a DenseVector[Double]. In the second case, each value of vector represents a threshold for a separate column, e.g. 1st value is used as threshold for 1st numeric column.
    Motivated by sklearn's Binarizer.

  2. Normalizer: converts rows to unit norm. Currently, there are 3 norms supported (L1, L2, max). These should probably be exposed in some kind of distance module for further code reuse, I am open to ideas on where to put these norm functions.
    Motivated by sklearn's Normalizer.

  3. RangeScaler: converts numerical columns to a certain range, e.g. from 0 to 1.
    Motivated by sklearn's MinMaxScaler (RangeScaler or IntervalScaler seems like a more intuitive name to me, though I can change it if min-max scaling is a "standardized" name for this).

Includes
  • Code changes
  • Tests
  • Documentation

@inejc inejc self-requested a review July 12, 2019 07:16
@inejc inejc added the enhancement New feature or request label Jul 12, 2019
@inejc
Copy link
Member

inejc commented Jul 12, 2019

Thanks @matejklemen 👍, will look at the PR as soon as I find time for it.

* swap argument order in private case class to prevent conflicts between auto-generated apply method and custom apply method that is exposed to the public
@picnicml picnicml deleted a comment Jul 12, 2019
@matejklemen
Copy link
Member Author

Alright, no rush.

I've fixed the compilation error that was previously present and tried to address Codacy's issues with the PR, though it seems that I don't understand what the issue "Found unnamed result or parameter" actually means 😅

Copy link
Member

@inejc inejc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for the PR @matejklemen 🙂. This is definitely in the right direction but requires some small changes. Let me know if my comments aren't clear or you have any further questions.

* no-op instead of exception when no numerical columns in input to Binarizer and RangeScaler
* fix tests accordingly
* expose norms in preprocessing.Norms
* style fixes
@picnicml picnicml deleted a comment Aug 4, 2019
@picnicml picnicml deleted a comment Aug 4, 2019
@matejklemen
Copy link
Member Author

Alright, I think I got all of them. Thanks for the helpful comments 🙂

I was a bit confused about sealed traits, so I did it slightly differently than what you said. It seemed redundant to me to define available Norms twice - once in Normalizer and again in Norms so I just did it in Norms.
Please let me know if that's ok.

Copy link
Member

@inejc inejc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some comments but we are at the doorstep of merging this 😄. Do you want to add docs for the new preprocessors in this PR or in #69 (I can do it too once this one is merged if you don't want to)?

* Avoid unnecessarily copying data in preprocessors (remove toDenseMatrix)
* Make some constructors and values public
* Add docs for RangeScaler, Binarizer, Normalizer
* Add RangeScaler test with different range than [0, 1]
* Optimize imports
@matejklemen
Copy link
Member Author

New batch of changes:

  • added tests for Norms (my bad, completely forgot before 😅 )
  • made minor style fixes
  • added docs for the three preprocessors (I only put the imports that I think are non-obvious in the examples, as I'm doing in Update documentation #69)
  • added a test with different range for RangeScaler
  • removed unnecessary copying that is done by toDenseMatrix

@picnicml picnicml deleted a comment Sep 3, 2019
@inejc
Copy link
Member

inejc commented Sep 4, 2019

@matejklemen everything LGTM except the comment I wrote regarding compilation (a single and simple change). I will merge once this one is pushed 😄.

@picnicml picnicml deleted a comment Sep 5, 2019
@inejc inejc merged commit 65940fe into picnicml:master Sep 6, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants