- Use a tool of your choice to create a test data set containing at least 1000 tuples of email addresses, phone numbers, and names. The data should show evidence of at least 5 distinct data quality problems per column, which you will address with a data transformation script.
- Write a transformation script (e.g., Python, R, SAS) to clean up your test data. Each of the 15 problems you identified should be clearly identifiable in the code. The output should be a CSV file.
==================================================================================================================
The test data is located in the data folder.

Run the transformation with:

    sbt run

The results of the run are written to the data folder as two files: one containing the cleaned data and one containing the rejected rows.
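
To illustrate the split into cleaned and rejected rows, here is a minimal Scala sketch. It is not the project's actual code: the names `CleanSketch`, `Row`, `cleanName`, `cleanEmail`, `cleanPhone`, `clean`, and `split` are hypothetical, and the rules (a simplified email pattern, 10-digit phone numbers with an optional leading 1) are assumptions chosen only for the example. Reading and writing the CSV files is left out.

```scala
// Hypothetical sketch: names and rules are illustrative assumptions,
// not the implementation that `sbt run` executes.
object CleanSketch {
  final case class Row(name: String, email: String, phone: String)

  // Deliberately simple shape check (assumption; far looser than RFC 5322).
  private val EmailPattern = "^[a-z0-9._%+-]+@[a-z0-9.-]+\\.[a-z]{2,}$".r

  // Name: trim and collapse runs of whitespace; an empty result is invalid.
  def cleanName(raw: String): Option[String] = {
    val s = raw.trim.replaceAll("\\s+", " ")
    if (s.nonEmpty) Some(s) else None
  }

  // Email: trim and lower-case, then check against the simplified pattern.
  def cleanEmail(raw: String): Option[String] = {
    val s = raw.trim.toLowerCase
    if (EmailPattern.pattern.matcher(s).matches()) Some(s) else None
  }

  // Phone: keep digits only, drop a leading country code 1,
  // and require exactly 10 digits (assumed number format).
  def cleanPhone(raw: String): Option[String] = {
    val digits = raw.filter(_.isDigit)
    val national =
      if (digits.length == 11 && digits.startsWith("1")) digits.drop(1)
      else digits
    if (national.length == 10) Some(national) else None
  }

  // Right = cleaned row, Left = reason for rejection (first failing column).
  def clean(row: Row): Either[String, Row] =
    for {
      n <- cleanName(row.name).toRight("invalid name")
      e <- cleanEmail(row.email).toRight("invalid email")
      p <- cleanPhone(row.phone).toRight("invalid phone")
    } yield Row(n, e, p)

  // Partition the input into cleaned rows and (original row, reason) pairs.
  def split(rows: Seq[Row]): (Seq[Row], Seq[(Row, String)]) = {
    val results = rows.map(r => clean(r).left.map(reason => (r, reason)))
    val cleaned = results.collect { case Right(r) => r }
    val rejected = results.collect { case Left(pair) => pair }
    (cleaned, rejected)
  }
}
```

Calling `CleanSketch.split(rows)` returns the two collections that would back the two output files. Keeping one small function per column and per rule is one way to make each data quality problem easy to point to in the code.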