Work in progress
###Data Transformation
- Append rows
- Append columns
- Derive new column from two or more other columns
- Extract values from a column and populate a new column with them
###Formula Builder for transformations
- Date Functions
- String Functions
###Blending different data sources
- Set up a canonical taxonomy
- Matching tables/rules to match taxonomy elements to the individual datasets
###Outlier Detection
- Simple Box plot/IQR based outlier detection
- 3 IQR - Extreme outliers
- 1.5 IQR - Normal outliers
- Remove outlier rows
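
The box plot/IQR rules above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tool's implementation; the function names `classify_outliers` and `remove_outlier_rows` and the label strings are assumptions made for this example, and the quartiles use the "inclusive" method of `statistics.quantiles`.

```python
from statistics import quantiles

def classify_outliers(values, normal_factor=1.5, extreme_factor=3.0):
    """Label each value 'ok', 'normal' (beyond 1.5 IQR) or 'extreme' (beyond 3 IQR)."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels = []
    for v in values:
        # Distance outside the [q1, q3] box; 0 for values inside the box.
        distance = max(q1 - v, v - q3, 0)
        if distance > extreme_factor * iqr:
            labels.append("extreme")
        elif distance > normal_factor * iqr:
            labels.append("normal")
        else:
            labels.append("ok")
    return labels

def remove_outlier_rows(rows, column):
    """Drop every row whose value in `column` is a normal or extreme outlier."""
    labels = classify_outliers([row[column] for row in rows])
    return [row for row, label in zip(rows, labels) if label == "ok"]
```

Rows are modelled here as plain dictionaries; a real dataset would likely need null handling before the quartiles are computed.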
###Data Profiling
####Schema validation
In many use cases, large amounts of data come from third-party providers, and that data isn’t always 100% reliable. The first line of defense is a schema that lets us validate whether the data provider is sending us the data in the right format, that is, the right number of columns and the right data types in those columns. The output of the schema validation process must include
####Completeness analyzer
- The completeness analyzer provides a simple way to check that all required fields in your records have been filled. Think of it as a big "not null" check across multiple fields. In combination with the monitoring application, this analyzer makes it easy to track which records need additional information.
- Configuration
- Columns to be analyzed
- Evaluation mode - you can configure whether the analyzer should consider records as "incomplete" if any of the selected values are null/blank, or if all the values need to be null/blank before the record is counted as incomplete.
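
The two evaluation modes can be illustrated with a short sketch. The names `is_blank`, `find_incomplete` and the `mode` values `"any"`/`"all"` are assumptions made for this example, not the analyzer's actual configuration keys.

```python
def is_blank(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())

def find_incomplete(records, columns, mode="any"):
    """Return records that count as incomplete.

    mode="any": incomplete if any selected column is null/blank.
    mode="all": incomplete only if all selected columns are null/blank.
    """
    if mode not in ("any", "all"):
        raise ValueError("mode must be 'any' or 'all'")
    check = any if mode == "any" else all
    return [r for r in records if check(is_blank(r.get(c)) for c in columns)]

records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "", "email": "bob@example.com"},
    {"id": 3, "name": None, "email": "  "},
]
strict = find_incomplete(records, ["name", "email"], mode="any")   # records 2 and 3
lenient = find_incomplete(records, ["name", "email"], mode="all")  # record 3 only
```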
####Boolean Analyzer
- For a single boolean column it is quite simple: It will show the distribution of true/false (and optionally null) values in a column. For several columns it will also show the value combinations and the frequencies of the combinations. The combination matrix makes the Boolean analyzer a handy analyzer for use with combinations of matching transformers and other transformers that yield boolean values.
- Boolean analyzer has no configuration parameters, except for the input columns.
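
As a rough sketch of the two outputs described above (the per-column distribution and the combination matrix), assuming records are dictionaries and using hypothetical function names:

```python
from collections import Counter

def boolean_distribution(records, column):
    """Count True, False and None values in a single boolean column."""
    return Counter(r.get(column) for r in records)

def combination_matrix(records, columns):
    """Count how often each combination of values occurs across several columns."""
    return Counter(tuple(r.get(c) for c in columns) for r in records)

records = [
    {"is_active": True, "is_verified": True},
    {"is_active": True, "is_verified": False},
    {"is_active": True, "is_verified": True},
    {"is_active": False, "is_verified": None},
]
per_column = boolean_distribution(records, "is_active")
combos = combination_matrix(records, ["is_active", "is_verified"])
```

Each key of `combos` is one row of the combination matrix, and its count is the frequency of that combination.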
####Character set distribution
- The Character set distribution analyzer inspects and maps text characters according to character set affinity, such as Latin, Hebrew, Cyrillic, Chinese and more.
- Such analysis is convenient for getting insight into the international aspects of your data.
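
One approximate way to group characters by character set is to take the first word of each character's Unicode name (for example `LATIN`, `CYRILLIC`, `HEBREW`, `CJK`). This is only a heuristic sketch with a hypothetical function name; the analyzer itself may classify characters differently.

```python
import unicodedata
from collections import Counter

def charset_distribution(text):
    """Count alphabetic characters per script, using the Unicode name prefix."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation and whitespace
        name = unicodedata.name(ch, "")
        counts[name.split(" ")[0] if name else "UNKNOWN"] += 1
    return counts

# "Hello" followed by the Cyrillic word for "world", written with escapes.
mixed = charset_distribution("Hello \u043c\u0438\u0440")
```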
####Date gap analyzer
- Ability to identify gaps in a time series, especially when records represent incremental changes
- It will allow you to identify if there are unexpected gaps in the data.
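
A minimal sketch of gap detection, assuming one record is expected per day and that a gap is any jump between consecutive dates larger than `max_step`. Both the function name and the `max_step` parameter are assumptions made for this example.

```python
from datetime import date, timedelta

def find_date_gaps(dates, max_step=timedelta(days=1)):
    """Return (last_seen, next_seen) pairs where the jump exceeds max_step."""
    ordered = sorted(set(dates))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > max_step:
            gaps.append((previous, current))
    return gaps

observed = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3),
            date(2024, 1, 6), date(2024, 1, 7)]
gaps = find_date_gaps(observed)  # one gap, between 3 January and 6 January
```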
####Number analyzer
- Ability to get the Min, Max, Sum, and Avg for a numeric column
####Reference data matcher
- The 'Reference data matcher' analyzer provides an easy means to match several columns against several dictionaries and/or several string patterns.
- The result is a matrix of match information for all columns and all matched resources.
####Referential integrity analyzer
- With the 'Referential integrity' analyzer you can check that key relationships between records are intact. The analyzer will work with relationships within a single table, between tables and even between tables of different datastores.
- Configuration
- Cache lookups - Whether the analyzer should speed up referential integrity checking by caching previous lookup results. Whether this improves performance depends on the amount of repetition in the keys to be checked. If all foreign key values are more or less unique, caching should be turned off. But if there is a fair amount of duplication in the foreign keys (e.g. order lines referring to the same products or customers), caching makes the lookups faster.
- Ignore null values - Defines whether or not "null" values should be ignored or if they should be considered as an integrity issue. When ignored, all records with null foreign key values will simply be discarded by the analyzer.
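
The interplay of the two options can be sketched as below. The function `check_referential_integrity` and the `key_exists` callback (standing in for a lookup against the parent table) are hypothetical names introduced for this example.

```python
def check_referential_integrity(rows, fk_column, key_exists,
                                cache_lookups=True, ignore_nulls=True):
    """Return the rows whose foreign key has no match in the parent table."""
    cache = {}
    violations = []
    for row in rows:
        key = row.get(fk_column)
        if key is None:
            # Null keys are skipped, or reported when ignore_nulls is False.
            if not ignore_nulls:
                violations.append(row)
            continue
        if cache_lookups:
            # Repeated keys hit the cache instead of the (possibly slow) lookup.
            if key not in cache:
                cache[key] = key_exists(key)
            found = cache[key]
        else:
            found = key_exists(key)
        if not found:
            violations.append(row)
    return violations
```

With many repeated foreign keys the cache saves lookups; with mostly unique keys it only adds memory overhead, which matches the guidance above.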
####String analyzer
- The string analyzer provides general-purpose profiling metrics for string columns. It focuses on the number of words, characters, special signs and diacritics, along with other metrics that are vital to understanding what kind of string values occur in the data.
####Unique key check
- The 'Unique key check' analyzer provides an easy way to verify that keys/IDs are unique, as is usually expected.
- Configuration
- Column - Pick the column that this analyzer should perform the uniqueness check on.
- Buffer size - The buffer represents the internal resource for sorting and comparison of keys. Having a large buffer makes the analyzer run faster and take up fewer resources on disk, but at the expense of using memory. If your job is not already memory intensive, we recommend increasing the buffer size up to 1M.
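
The core of the check can be sketched in a few lines. This version counts keys entirely in memory and does not model the buffer or the disk-based sorting described above; the function name is an assumption made for this example.

```python
from collections import Counter

def find_duplicate_keys(records, column):
    """Return {key: occurrences} for every key that appears more than once."""
    counts = Counter(r[column] for r in records)
    return {key: n for key, n in counts.items() if n > 1}

customers = [{"id": 1}, {"id": 2}, {"id": 2}, {"id": 3}, {"id": 2}]
duplicates = find_duplicate_keys(customers, "id")  # key 2 occurs three times
```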
####Value distribution
- The value distribution (often also referred to as 'Frequency analysis') allows you to identify all the values of a particular column. Furthermore you can investigate which rows pertain to specific values.
- Configuration
- Top n most frequent values - An optional number, used if the analysis should only display e.g. the "top 5 most frequent values". If this property is supplied, the result of the analysis will only contain the top/bottom n most frequent values.
- Bottom n most frequent values - An optional number, used if the analysis should only display e.g. the "bottom 5 most frequent values". If this property is supplied, the result of the analysis will only contain the top/bottom n most frequent values.
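
A frequency analysis with optional top/bottom limits might look like the sketch below. The function names and the `top_n`/`bottom_n` parameters mirror the configuration above but are assumptions made for this example, as is the shape of the returned dictionary.

```python
from collections import Counter

def value_distribution(records, column, top_n=None, bottom_n=None):
    """Return value frequencies, optionally limited to the top/bottom n values."""
    ranked = Counter(r.get(column) for r in records).most_common()
    if top_n is None and bottom_n is None:
        return {"all": ranked}
    result = {}
    if top_n is not None:
        result["top"] = ranked[:top_n]
    if bottom_n is not None:
        # max(...) guards against bottom_n exceeding the number of values.
        result["bottom"] = ranked[max(len(ranked) - bottom_n, 0):]
    return result

def rows_with_value(records, column, value):
    """Drill down: return the rows that carry a specific value."""
    return [r for r in records if r.get(column) == value]
```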
####Value matcher
- The value matcher works much like the Value distribution analyzer, except that it takes a list of expected values and puts everything else into a group of 'unexpected values'. This division of values means a couple of things:
- You get a built-in validation mechanism. You expect maybe only 'M' and 'F' values for your 'gender' column, and everything else is in a sense invalid, since it is unexpected.
- The division makes it easier to monitor specific values in the data quality monitoring web application.
- This analyzer scales much better for large datasets, since the groupings are deterministic and thus can be prepared for in the batch run.
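
The split into expected and unexpected values can be sketched as follows, using the 'gender' example from above. The function name `match_values` and its return shape are assumptions made for this illustration.

```python
def match_values(records, column, expected_values):
    """Count occurrences of each expected value; group everything else as unexpected."""
    # The expected groups are known up front, so they can be prepared before the run.
    expected = {value: 0 for value in expected_values}
    unexpected = {}
    for r in records:
        value = r.get(column)
        if value in expected:
            expected[value] += 1
        else:
            unexpected[value] = unexpected.get(value, 0) + 1
    return expected, unexpected

people = [{"gender": "M"}, {"gender": "F"}, {"gender": "M"},
          {"gender": "x"}, {"gender": None}]
expected, unexpected = match_values(people, "gender", ["M", "F"])
```

Any entry in `unexpected` can be treated as a validation failure for the column.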
####Weekday distribution
- The weekday distribution provides a frequency analysis for date columns, where you can easily identify which weekdays a date field represents.
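
A small sketch of the weekday frequency analysis, using a fixed English weekday list so the result does not depend on the locale; the function name is an assumption made for this example.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def weekday_distribution(dates):
    """Count how many dates fall on each weekday, skipping null values."""
    counts = Counter(WEEKDAYS[d.weekday()] for d in dates if d is not None)
    # Report every weekday, including those with a count of zero.
    return {day: counts.get(day, 0) for day in WEEKDAYS}

orders = [date(2024, 1, 1), date(2024, 1, 6), date(2024, 1, 8), None]
distribution = weekday_distribution(orders)
```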
####Consistency
- Ability to define and apply a set of rules that determine consistency of the record when applied to each row of the dataset.
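
Row-level consistency rules could be modelled as named predicates, as in this sketch. The rule names, column names and the `check_consistency` function are all hypothetical and only illustrate the idea.

```python
def check_consistency(records, rules):
    """Apply each rule to each row and return {rule_name: [violating rows]}."""
    violations = {name: [] for name in rules}
    for row in records:
        for name, rule in rules.items():
            if not rule(row):
                violations[name].append(row)
    return violations

# Each rule is a function that returns True when the row is consistent.
rules = {
    "end_not_before_start": lambda r: r["end"] >= r["start"],
    "total_matches_lines": lambda r: r["total"] == r["quantity"] * r["unit_price"],
}
orders = [
    {"start": 1, "end": 5, "quantity": 2, "unit_price": 10, "total": 20},
    {"start": 4, "end": 2, "quantity": 1, "unit_price": 15, "total": 15},
    {"start": 1, "end": 1, "quantity": 3, "unit_price": 5, "total": 10},
]
report = check_consistency(orders, rules)
```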