# Quick Start Guide
dbaAlex edited this page Feb 14, 2013 · 4 revisions
Still a work in progress, but this page serves as a quick start to help you get the Automated Profiling Tool up and running.

# Requirements
- xmlstarlet - for parsing XML files (on Windows, install Cygwin from http://www.cygwin.com/ to use xmlstarlet; this requirement will go away with Data Cleaner 3.x support)
- Data Cleaner 2.x (3.x is not yet fully supported; support will arrive in the next major release)
- Pentaho Data Integrator (aka Kettle) 4.3 or higher
- MySQL 5.1 or higher
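Before moving on to setup, it can help to confirm the command-line prerequisites are installed. The sketch below is not part of the tool itself; it simply checks that the required commands are on the PATH. PDI/Kettle is not checked here because it ships as scripts (e.g. spoon.sh) in its own install directory rather than as a PATH command.

```shell
#!/bin/sh
# Hypothetical prerequisite check -- not part of the profiling tool.
# Reports whether each required command-line tool is on the PATH.
for tool in xmlstarlet mysql; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: NOT FOUND - install it before continuing"
  fi
done
```

On Windows, run this from a Cygwin shell, since that is where xmlstarlet will live.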
# Setup
- In the database folder, there are two SQL scripts. First run profile_schema.sql from the MySQL command line, then run profile_unknown_record.sql.
- Now that the profile tool's database is set up, we can continue with configuration. The profile_customization folder holds two spreadsheets: one for data source information and the other for source query information. For now, open profile_data_source_list.xls and fill in a source or two. For details, see the documentation section titled 'Loading Customized Queries and Sources'.
- In the main folder, there is a file listing parameters that need to be added to kettle.properties. Copy the contents of ap_kettle.properties into ~/.kettle/kettle.properties. If the ~/.kettle folder does not exist yet, run PDI once and it will create it.
- The other major configuration piece is creating the database connection scripts. There is a template script at /generic/source_connector/retrieve_CONNECTION_NAME_table_sample_tr_sample.ktr, where CONNECTION_NAME is the name used in the profile_data_source_list.xls column 'source_connection_name'. Open this transformation in PDI, save a copy for each source connection, and modify the Table Input step's database connection.
- Move the conf_template.xml file to where Data Cleaner 2.x is installed. This file is used as a workaround until 3.x support is implemented.
- Run `sh /<code_location_parent_directory>/etl_code/quality/generic/profile_sample_generator_start.sh process_custom` so that the new data sources are loaded into the profiling tool.
- Setup should now be complete.
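The kettle.properties step above can be sketched as a few shell commands. This is a minimal sketch under two assumptions not stated by the guide: that ~/.kettle is PDI's configuration directory on your machine, and that you run it from the repository's main folder, where ap_kettle.properties lives.

```shell
#!/bin/sh
# Sketch of the kettle.properties step. Assumes the current directory is
# the repo's main folder containing ap_kettle.properties.
KETTLE_DIR="$HOME/.kettle"
mkdir -p "$KETTLE_DIR"                  # PDI creates this on first run; mkdir -p is a safe stand-in
touch "$KETTLE_DIR/kettle.properties"   # ensure the target file exists before appending
if [ -f ap_kettle.properties ]; then
  cat ap_kettle.properties >> "$KETTLE_DIR/kettle.properties"
  echo "Parameters appended to $KETTLE_DIR/kettle.properties"
else
  echo "ap_kettle.properties not found - run this from the repo's main folder" >&2
fi
```

Appending (rather than overwriting) keeps any parameters already defined in an existing kettle.properties; review the merged file afterwards for duplicate keys.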