Roadmap
Srihari edited this page Aug 2, 2016 · 5 revisions
###Imputation - Fill in missing values
- Deletion: Remove records with missing values
- Dummy substitution: Replace missing values with a dummy value, e.g., `unknown` for categorical values or 0 for numerical values.
- Mean substitution: If the missing data is numerical, replace the missing values with the mean of the observed values.
- Frequent substitution: If the missing data is categorical, replace the missing values with the most frequent item.
- Regression substitution: Use a regression method to replace missing values with regressed values.
- Classification-based substitution: Use a classification method to replace missing values of categorical variables with predicted values.
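
The simpler strategies above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the function names are hypothetical, a column is assumed to be a plain list, and a missing value is assumed to be represented by `None`.

```python
from collections import Counter
from statistics import mean


def drop_missing(rows):
    """Deletion: keep only the records (tuples) that have no missing field."""
    return [row for row in rows if all(v is not None for v in row)]


def impute_dummy(values, dummy):
    """Dummy substitution: replace every missing value with a fixed placeholder."""
    return [dummy if v is None else v for v in values]


def impute_mean(values):
    """Mean substitution: fill gaps in a numerical column with the observed mean."""
    observed = [v for v in values if v is not None]
    if not observed:  # nothing to compute a mean from; return the column unchanged
        return list(values)
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def impute_most_frequent(values):
    """Frequent substitution: fill gaps in a categorical column with the mode."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]
```

For example, `impute_mean([1, None, 3])` fills the gap with 2, and `impute_dummy(["x", None], "unknown")` yields `["x", "unknown"]`. Regression and classification based substitution need a fitted model and are left out of this sketch.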
###Type Analyzer
- Automatically infer data types from datasets
  - Integer
  - Decimal
  - Fraction
  - Percentage
  - Location
    - Lat/Long based geo coordinates
    - Address
    - Zip code
    - Country
    - City
  - Currency
  - Date
  - Timestamp
  - URL
  - Phone number
    - Landline
    - Mobile
  - SSN
  - Name
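
One possible approach to type inference is pattern matching on the string form of each value, then agreeing on a type per column. The sketch below covers only a handful of the types listed above (integer, decimal, percentage, URL, date). The patterns, the ISO `YYYY-MM-DD` date format, and all function names are assumptions made for illustration.

```python
import re
from datetime import datetime

# Ordered list of (type name, pattern); the first full match wins.
PATTERNS = [
    ("integer", re.compile(r"[+-]?\d+")),
    ("decimal", re.compile(r"[+-]?\d*\.\d+")),
    ("percentage", re.compile(r"[+-]?\d+(\.\d+)?\s*%")),
    ("url", re.compile(r"https?://\S+")),
]


def _is_iso_date(text):
    """Return True if text parses as a YYYY-MM-DD date (assumed format)."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except ValueError:
        return False


def infer_value_type(text):
    """Guess the type of a single string value; fall back to 'string'."""
    text = text.strip()
    for name, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return name
    if _is_iso_date(text):
        return "date"
    return "string"


def infer_column_type(values):
    """Guess one type for a column, ignoring blank cells."""
    types = {infer_value_type(v) for v in values if v.strip()}
    if len(types) == 1:
        return types.pop()
    if types == {"integer", "decimal"}:  # mixed numbers widen to decimal
        return "decimal"
    return "string"
```

With these assumptions, `infer_column_type(["1", "2.5", ""])` returns `"decimal"`. Types such as address, name, or SSN would likely need dictionaries or locale-specific rules rather than a single regular expression.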
###Data Smoothing
- Moving average
- Weighted moving average
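
Both smoothing methods slide a fixed-size window over the series. A minimal sketch follows, assuming the series is a list of numbers, only full windows are emitted, and (for the weighted variant) the last weight applies to the most recent value in each window. The function names are hypothetical.

```python
def moving_average(values, window):
    """Simple moving average over each full window of the given size."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def weighted_moving_average(values, weights):
    """Weighted moving average; the window size equals len(weights)."""
    window = len(weights)
    total = sum(weights)
    if window == 0 or window > len(values) or total == 0:
        raise ValueError("weights must be non-empty, fit the data, and not sum to 0")
    return [
        sum(v * w for v, w in zip(values[i:i + window], weights)) / total
        for i in range(len(values) - window + 1)
    ]
```

For instance, `moving_average([1, 2, 3, 4, 5], 3)` gives `[2.0, 3.0, 4.0]`. Passing weights such as `[1, 2, 3]` to `weighted_moving_average` makes recent values count more than older ones.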
###Data Normalization
Data normalization re-scales numerical values to a specified range. Popular data normalization methods include:
- Min-Max Normalization: Linearly transform the data to a range, say between 0 and 1, where the min value is scaled to 0 and max value to 1.
- Z-score Normalization: Scale data based on mean and standard deviation: subtract the mean from each value and divide the result by the standard deviation.
- Decimal scaling: Scale the data by moving the decimal point of the attribute value, i.e., divide each value by a power of 10 chosen so that the largest absolute value falls below 1.
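
The three methods can be written down directly from their definitions. The sketch below is an illustration under stated assumptions: function names are hypothetical, the z-score uses the population standard deviation (a sample standard deviation would be an equally valid choice), and decimal scaling picks the smallest power of 10 that brings every absolute value below 1.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly map values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for constant data")
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("z-score is undefined when the standard deviation is 0")
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer so that max(|v|) / 10**j < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]
```

As a quick check, `min_max([10, 20, 30])` maps the values to 0, 0.5, and 1, and `decimal_scaling([120, -45, 986])` divides every value by 1000.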
###Data Transformation
- Merge columns
- Split columns
- De-duplication of rows
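
The transformations above could look roughly like this when each row is represented as a dictionary mapping column names to values. That representation, the separator defaults, and the function names are assumptions for illustration only.

```python
def merge_columns(rows, sources, target, sep=" "):
    """Join the source columns into one target column and drop the sources."""
    merged = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k not in sources}
        new_row[target] = sep.join(str(row[s]) for s in sources)
        merged.append(new_row)
    return merged


def split_column(rows, source, targets, sep=" "):
    """Split one column into several; missing parts become empty strings."""
    result = []
    for row in rows:
        parts = str(row[source]).split(sep, len(targets) - 1)
        parts += [""] * (len(targets) - len(parts))
        new_row = {k: v for k, v in row.items() if k != source}
        new_row.update(zip(targets, parts))
        result.append(new_row)
    return result


def deduplicate(rows):
    """Remove exact duplicate rows, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))  # order-independent, hashable row key
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

For example, merging `first` and `last` from `{"first": "Ada", "last": "Lovelace"}` into `name` produces `{"name": "Ada Lovelace"}`, and `split_column` reverses that. `deduplicate` assumes all cell values are hashable.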