dbaAlex edited this page Feb 14, 2013 · 4 revisions

This page is still a work in progress, but it serves as a quick start for getting the Automated Profiling Tool up and running.

Requirements

  • xmlstarlet, for parsing XML files. On Windows, install Cygwin (http://www.cygwin.com/) to use xmlstarlet. This requirement will go away with Data Cleaner 3.x support.
  • Data Cleaner 2.x. Version 3.x is not yet fully supported; support is planned for the next major release.
  • Pentaho Data Integration (PDI, aka Kettle) 4.3 or higher
  • MySQL 5.1 or higher
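
As a quick sanity check before setup, the command-line prerequisites can be probed from a shell. This is a minimal sketch, not part of the tool: `check_tool` is a hypothetical helper, and it assumes the `mysql` client is on the PATH. xmlstarlet installs its binary as either `xmlstarlet` or `xml` depending on the package, so both names are tried. PDI and Data Cleaner are not checked because their install locations vary.

```shell
# Hypothetical helper: report whether a prerequisite is available on the PATH.
# $1 = label to print, remaining arguments = candidate command names to try.
check_tool() {
    label=$1
    shift
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "found: $label ($candidate)"
            return 0
        fi
    done
    echo "missing: $label"
    return 1
}

# The binary name for xmlstarlet differs between packages, so try both.
check_tool xmlstarlet xmlstarlet xml || true
# Assumes the MySQL command-line client is installed alongside the server.
check_tool mysql mysql || true
```

Version numbers still need to be checked by hand, for example with `mysql --version`.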

Setup

  1. In the database folder, there are two SQL scripts. First run profile_schema.sql from the mysql command line. Next, run profile_unknown_record.sql.
  2. Now that the profile tool's database is set up, we can continue with configuration. The profile_customization folder holds two spreadsheets, one for data source information and the other for source query information. For now, open profile_data_source_list.xls and fill in a source or two. For details, see the documentation section titled 'Loading Customized Queries and Sources'.
  3. In the main folder, the file ap_kettle.properties lists parameters that need to be added to kettle.properties. Copy its contents into ~/.kettle/kettle.properties. If the ~/.kettle folder does not exist yet, run PDI and it will create the folder.
  4. The other major configuration piece is creating the database connection scripts. There is a template script under /generic/source_connector/retrieve_CONNECTION_NAME_table_sample_tr_sample.ktr, where CONNECTION_NAME is the name used in the 'source_connection_name' column of profile_data_source_list.xls. Open this transformation in PDI, save a copy for each source connection, and modify the Table Input step's database connection in each copy.
  5. Move the conf_template.xml file to where Data Cleaner 2.x is installed. This file is used as a workaround until support for 3.x is implemented.
  6. Run sh /<code_location_parent_directory>/etl_code/quality/generic/profile_sample_generator_start.sh process_custom so that the new data sources are populated into the profiling tool.
  7. Setup should now be complete.
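
Steps 1, 3, and 4 above can be sketched as shell helpers. This is only an illustration under stated assumptions, not something shipped with the tool: the function names and the marker comment are hypothetical, the MySQL user is supplied by you, and the SQL scripts are assumed to create or select their own database (if they do not, add a database name to the `mysql` calls). Copying the template with `cp` only creates the file; the Table Input step's database connection still has to be changed in PDI.

```shell
# Step 1 (sketch): run both SQL scripts through the mysql client.
# $1 = MySQL user, $2 = path to the database folder.
# Assumption: the scripts create or select their own database.
load_profile_database() {
    mysql -u "$1" -p < "$2/profile_schema.sql" &&
    mysql -u "$1" -p < "$2/profile_unknown_record.sql"
}

# Step 3 (sketch): append the tool's parameters to kettle.properties once.
# $1 = path to ap_kettle.properties, $2 = path to kettle.properties.
append_kettle_params() {
    src=$1
    dest=$2
    # Hypothetical marker line, used only to avoid appending twice.
    marker="# automated profiling tool parameters"
    if [ ! -d "$(dirname "$dest")" ]; then
        echo "missing $(dirname "$dest"); run PDI once so it creates the folder" >&2
        return 1
    fi
    if [ -f "$dest" ] && grep -qF "$marker" "$dest"; then
        echo "parameters already present in $dest"
        return 0
    fi
    { echo "$marker"; cat "$src"; } >> "$dest"
    echo "appended $src to $dest"
}

# Step 4 (sketch): copy the connector template for one source connection.
# $1 = path to the source_connector folder, $2 = source_connection_name value.
copy_connector_template() {
    dir=$1
    name=$2
    template="$dir/retrieve_CONNECTION_NAME_table_sample_tr_sample.ktr"
    target="$dir/retrieve_${name}_table_sample_tr_sample.ktr"
    if [ -e "$target" ]; then
        echo "exists, not overwriting: $target"
        return 0
    fi
    cp "$template" "$target"
    echo "created $target"
}

# Example calls (user, paths, and connection name are placeholders):
#   load_profile_database my_user ./database
#   append_kettle_params ./ap_kettle.properties "$HOME/.kettle/kettle.properties"
#   copy_connector_template ./generic/source_connector my_connection
```

The marker line makes the append safe to repeat, which helps when the setup is rerun after adding more sources.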
