Roadmap

Srihari edited this page Aug 2, 2016 · 5 revisions

Implemented Features

###Imputation - Fill in missing values

  • Deletion: Remove records with missing values
  • Dummy substitution: Replace missing values with a dummy value, e.g., "unknown" for categorical values or 0 for numerical values.
  • Mean substitution: If the missing data is numerical, replace the missing values with the mean.
  • Frequent substitution: If the missing data is categorical, replace the missing values with the most frequent item.
  • Regression substitution: Use a regression model fitted on the complete records to replace missing numerical values with predicted values.
  • Classification-based substitution: Use a classifier to predict missing values of categorical variables.
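The simpler strategies above can be sketched in plain Python. This is only an illustration, not this project's API: the function names are hypothetical, and missing values are assumed to be represented as `None`.

```python
from collections import Counter
from statistics import mean


def drop_missing(rows):
    """Deletion: keep only records that have no missing (None) field."""
    return [row for row in rows if all(v is not None for v in row)]


def fill_dummy(values, dummy):
    """Dummy substitution: replace each missing value with a fixed placeholder."""
    return [dummy if v is None else v for v in values]


def fill_mean(values):
    """Mean substitution: replace missing numbers with the mean of the observed ones."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]


def fill_most_frequent(values):
    """Frequent substitution: replace missing items with the most common observed item."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]
```

For example, `fill_mean([1.0, None, 3.0])` fills the gap with the mean of the two observed values, and `fill_dummy(["x", None], "unknown")` fills it with the placeholder.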

###Type Analyzer

  • Automatically infer data types from datasets
    • Integer
    • Decimal
      • Fraction
      • Percentage
    • Location
      • Lat/Long based geo coordinates
      • Address
      • Zip code
      • Country
      • City
    • Currency
    • Date
    • Timestamp
    • Email
    • URL
    • Phone number
      • Landline
      • Mobile
    • SSN
    • Name
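One common way to infer a column's type is to test every value against a list of patterns and pick the first pattern that matches them all. The sketch below assumes that approach and covers only a handful of the types listed above; the patterns, the `PATTERNS` table, and the `infer_type` name are illustrative assumptions, not the project's implementation.

```python
import re

# Ordered from most to least specific; deliberately simplified patterns.
PATTERNS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("decimal", re.compile(r"[+-]?\d*\.\d+")),
    ("percentage", re.compile(r"[+-]?\d+(\.\d+)?%")),
    ("email", re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")),
    ("url", re.compile(r"https?://\S+")),
]


def infer_type(values):
    """Return the first type whose pattern fully matches every non-empty value."""
    present = [v.strip() for v in values if v and v.strip()]
    if not present:
        return "unknown"  # nothing to infer from
    for name, pattern in PATTERNS:
        if all(pattern.fullmatch(v) for v in present):
            return name
    return "string"  # fallback when no pattern fits all values
```

Under these patterns a column mixing `"1"` and `"2.5"` falls back to `"string"`; a fuller analyzer would likely widen integers to decimals instead.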

###Data Smoothing

  • Moving average
  • Weighted moving average
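Both smoothers slide a fixed-size window over the series. A minimal sketch, assuming trailing windows with no padding (so the output is shorter than the input) and hypothetical function names:

```python
def moving_average(values, window):
    """Simple moving average: unweighted mean of each run of `window` values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def weighted_moving_average(values, weights):
    """Weighted moving average: weights[k] applies to the k-th value in each window."""
    window = len(weights)
    total = sum(weights)
    if window < 1 or window > len(values) or total == 0:
        raise ValueError("weights must be non-empty, fit the series, and not sum to 0")
    return [
        sum(w * v for w, v in zip(weights, values[i:i + window])) / total
        for i in range(len(values) - window + 1)
    ]
```

Putting larger weights at the end of `weights` makes recent values count more, which is the usual reason to prefer the weighted variant.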

###Data Normalization

Data normalization re-scales numerical values to a specified range. Popular data normalization methods include:

  • Min-Max Normalization: Linearly transform the data to a range, say between 0 and 1, where the min value is scaled to 0 and max value to 1.
  • Z-score Normalization: Scale data based on mean and standard deviation: divide the difference between the data and the mean by the standard deviation.
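  • Decimal scaling: Scale the data by moving the decimal point of the attribute value, i.e., divide each value by 10^j, where j is the smallest integer that brings the largest absolute value below 1.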
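The three methods can be sketched as follows. The function names are hypothetical, the z-score uses the population standard deviation (an assumption; a sample standard deviation is equally common), and decimal scaling divides by the smallest power of 10 that brings every absolute value below 1.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Min-max: map the smallest value to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; the range is zero")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def z_score(values):
    """Z-score: subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed here)
    if sigma == 0:
        raise ValueError("standard deviation is zero")
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Decimal scaling: divide by 10**j for the smallest j giving max |v| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

For instance, `decimal_scaling([-991, 45, 120])` divides by 1000, because 991 is the largest absolute value and three shifts of the decimal point are needed to bring it below 1.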

###Data Transformation

  • Merge columns
  • Split columns
  • De-duplication of rows
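As a rough illustration of these three operations, the sketch below treats each row as a dictionary keyed by column name. The function names, the `sep` parameter, and the row representation are assumptions made for the example; values are assumed to be hashable.

```python
def merge_columns(rows, a, b, into, sep=" "):
    """Merge columns `a` and `b` into a new column `into`, joined by `sep`."""
    merged = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in (a, b)}
        new_row[into] = f"{row[a]}{sep}{row[b]}"
        merged.append(new_row)
    return merged


def split_column(rows, column, into, sep=" "):
    """Split `column` on `sep` into the columns named in `into`."""
    result = []
    for row in rows:
        parts = row[column].split(sep, len(into) - 1)
        parts += [None] * (len(into) - len(parts))  # pad short rows with None
        new_row = {k: v for k, v in row.items() if k != column}
        new_row.update(zip(into, parts))
        result.append(new_row)
    return result


def deduplicate(rows):
    """Drop rows identical to an earlier row, keeping first occurrences in order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

Splitting uses at most `len(into) - 1` separators, so any extra separators stay in the last column rather than being silently dropped.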
