KBO Extractor
`entity.py` defines the `Entity` class, which handles the extraction and transformation of data from XML files into CSV format. Its key components:
- The `Entity` class is initialized with parameters like `output_dir`, `name`, `ns` (namespace), `xpath`, `parent_tags`, `direct_tags`, and `list_of_columns_and_types`.
- It creates a CSV file in the specified output directory and writes the header row based on the provided tags.
- It also initializes counters and lists for handling nested entities (L2 entities).
- `find_nodes` and `find_value`: These methods extract data from XML nodes using XPath expressions.
- `write_values`: Writes the extracted values to the CSV file. It handles date format conversion and ensures that certain text fields start with a "0" if required.
- `convert_date_format`: Converts date strings from `dd/mm/yyyy` format to `yyyy-mm-dd`.
- `prepend_0`: Ensures that certain identifiers start with a "0".
- `close`: Closes the CSV file after writing is complete.
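The two transformation helpers can be illustrated with a minimal sketch. This is not the repository's code: the standalone function form, the fallback of returning unparseable dates unchanged, and the exact rule for when a "0" is added are all assumptions made for illustration.

```python
from datetime import datetime


def convert_date_format(value: str) -> str:
    """Convert a dd/mm/yyyy string to yyyy-mm-dd.

    Assumption: values that do not parse are returned unchanged.
    """
    try:
        return datetime.strptime(value, "%d/%m/%Y").strftime("%Y-%m-%d")
    except ValueError:
        return value


def prepend_0(value: str) -> str:
    """Make sure a non-empty identifier starts with "0".

    Assumption: the only rule is "add one 0 if the value does not start with 0".
    """
    if value and not value.startswith("0"):
        return "0" + value
    return value
```

With these definitions, `convert_date_format("31/12/2020")` yields `"2020-12-31"` and `prepend_0("123")` yields `"0123"`; a value that already starts with "0" is left alone.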
`reader.py` reads XML files, processes the data, and writes it to CSV files using the `Entity` class. Its key components:
- The `write` function writes data to the CSV files by calling the `write_values` method of the `Entity` class.
- It handles nested entities by recursively calling itself for L2 entities.
- This function creates an instance of the `Entity` class based on the configuration provided in `config.py`.
- The `fast_iter` function iterates over XML elements efficiently, processing each element and then clearing it from memory to save resources.
- It uses the `process_element` function to handle different types of XML elements (e.g., `Enterprise`, `BusinessUnit`, etc.).
- The `process_element` function processes different XML elements based on their tags (e.g., `Enterprise`, `BusinessUnit`, `CancelledBusinessUnits`, etc.).
- It uses the `write` function to write data to the appropriate CSV files.
- `convert_xml_to_csv` is the main function that orchestrates the conversion process.
- It initializes the `Entity` instances, processes the XML files, and writes the data to CSV files.
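The iterate-then-clear pattern used for streaming large XML files can be sketched with the standard library. This is only an illustration: the real `fast_iter` may use a different parser (for example `lxml`) and a different signature, and the `handle` callback, the `Nbr` child tag, and the namespace-free sample document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def fast_iter(source, tag, handle):
    """Stream `source`, call `handle` on each finished `tag` element, then free it."""
    count = 0
    # "end" events fire once an element and all of its children are fully parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            handle(elem)
            count += 1
            # Drop the element's children, text and attributes so memory
            # use does not grow with the size of the file.
            elem.clear()
    return count


# Hypothetical sample data; the real files use namespaces and other tags.
SAMPLE = b"""<Root>
  <Enterprise><Nbr>0123</Nbr></Enterprise>
  <Enterprise><Nbr>0456</Nbr></Enterprise>
</Root>"""

numbers = []
processed = fast_iter(
    io.BytesIO(SAMPLE), "Enterprise", lambda e: numbers.append(e.findtext("Nbr"))
)
```

Only matching elements are cleared, and only after the callback has run, so the callback always sees the complete subtree of the element it receives.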
`config.py` contains configuration data for the entities and their corresponding XML tags, database column names, and data types. Its key components:
- Defines constants for data types like `INT`, `TEXT`, `DATE`, etc.
- The `entity_config` dictionary contains configurations for various entities like `enterprise`, `header`, `denomination`, etc.
- Each entity configuration includes:
  - `parent_tags`: Tags that are inherited from parent entities.
  - `direct_tags`: Tags that are specific to the entity.
  - `xpath`: The XPath expression used to locate the entity in the XML file.
  - `l2_entities`: Nested entities that are children of the current entity.
- The `code_entity_config` dictionary contains configurations for code-related entities like `activitygroupcode`, `addresscode`, etc.
- These entities are used to map codes to their descriptions in different languages.
- This function generates a list of column names and their corresponding data types for a given entity.
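To make the shape of such a configuration concrete, here is a small hypothetical example. The keys `parent_tags`, `direct_tags`, `xpath`, and `l2_entities` come from the description above; the tuple layout `(xml_tag, column_name, data_type)`, the tag and column names, the constant values, and the function name `columns_and_types` are assumptions and may differ from the real `config.py`.

```python
# Data type constants (the real module may define more, or use other values).
INT = "INT"
TEXT = "TEXT"
DATE = "DATE"

# Hypothetical layout: each tag entry is (xml_tag, column_name, data_type).
entity_config = {
    "enterprise": {
        "xpath": "Enterprise",
        "parent_tags": [],
        "direct_tags": [
            ("Nbr", "enterprise_number", TEXT),
            ("StartDate", "start_date", DATE),
        ],
        "l2_entities": ["denomination"],
    },
    "denomination": {
        "xpath": "Denomination",
        # Inherited from the parent entity so child rows can be joined back to it.
        "parent_tags": [("Nbr", "enterprise_number", TEXT)],
        "direct_tags": [("Name", "denomination", TEXT)],
        "l2_entities": [],
    },
}


def columns_and_types(entity_name):
    """Return (column_name, data_type) pairs, inherited columns first."""
    cfg = entity_config[entity_name]
    tags = cfg["parent_tags"] + cfg["direct_tags"]
    return [(column, dtype) for _tag, column, dtype in tags]
```

Under these assumptions, `columns_and_types("denomination")` returns the inherited `enterprise_number` column followed by the entity's own `denomination` column, both typed as `TEXT`.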
`helper.py` sets up logging for the application. It configures both console and file logging with different log levels.
- Logs are written to both the console and a file (`application.log`).
- The console logs all levels (`DEBUG`, `INFO`, `WARNING`, `ERROR`, `CRITICAL`), while the file logs only `INFO`, `WARNING`, `ERROR`, and `CRITICAL`.
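A two-handler setup of this kind can be written with the standard `logging` module. The sketch below is an assumption about how `helper.py` might be organised: the function name `build_logger`, the logger name, and the message format are hypothetical, while the log levels and the `application.log` file name follow the description above.

```python
import logging


def build_logger(name="kbo_extractor", log_file="application.log"):
    """Return a logger that writes DEBUG+ to the console and INFO+ to a file."""
    logger = logging.getLogger(name)
    # The logger itself must let DEBUG through; the handlers filter further.
    logger.setLevel(logging.DEBUG)

    if not logger.handlers:  # avoid attaching duplicate handlers on repeated calls
        console = logging.StreamHandler()
        console.setLevel(logging.DEBUG)

        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(logging.INFO)

        formatter = logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        for handler in (console, file_handler):
            handler.setFormatter(formatter)
            logger.addHandler(handler)

    return logger
```

A caller would obtain the logger once with `build_logger()` and then use `logger.debug(...)` or `logger.info(...)`; debug messages appear only on the console, while info and above also reach the file.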
- `reader.py` uses the `Entity` class from `entity.py` to handle the extraction and writing of data.
- `config.py` provides the necessary configuration for `reader.py` and `entity.py` to map XML tags to database columns and data types.
- `helper.py` is used across the application for logging purposes.
- Initialization:
  - The `convert_xml_to_csv` function in `reader.py` initializes the `Entity` instances based on the configuration in `config.py`.
- XML Processing:
  - The `fast_iter` function processes the XML files, extracting data using XPath expressions defined in `config.py`.
  - The `process_element` function handles different XML elements and writes the data to CSV files using the `Entity` class.
- Data Transformation:
  - The `Entity` class handles the transformation of data (e.g., date format conversion, prepending "0" to identifiers) before writing it to the CSV files.
- Logging:
  - The application logs its progress and any errors using the logger configured in `helper.py`.
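The processing and writing steps above, including the recursion into nested (L2) entities, can be sketched as follows. This is a reduced, hypothetical stand-in: the constructor arguments, the `parent_id` column, the way a parent key is passed down, and the sample XML are all assumptions, and the real `Entity` class and `write` function handle namespaces, typed columns, and files on disk.

```python
import csv
import io
import xml.etree.ElementTree as ET


class Entity:
    """Hypothetical, reduced stand-in for the real Entity class."""

    def __init__(self, name, xpath, direct_tags, out, l2_entities=()):
        self.name = name
        self.xpath = xpath
        self.direct_tags = direct_tags
        self.l2_entities = list(l2_entities)
        self.writer = csv.writer(out, lineterminator="\n")
        # Header row: the parent key followed by the entity's own tags.
        self.writer.writerow(["parent_id"] + direct_tags)

    def write_values(self, node, parent_id):
        row = [node.findtext(tag, default="") for tag in self.direct_tags]
        self.writer.writerow([parent_id] + row)


def write(entity, parent_node, parent_id=""):
    """Write every match of entity.xpath, then recurse into its L2 entities."""
    for node in parent_node.findall(entity.xpath):
        entity.write_values(node, parent_id)
        # Assumption: the first direct tag identifies the node for its children.
        node_id = node.findtext(entity.direct_tags[0], default="")
        for child in entity.l2_entities:
            write(child, node, node_id)


xml_doc = ET.fromstring(
    "<Root><Enterprise><Nbr>0123</Nbr>"
    "<Denomination><Name>Acme</Name></Denomination>"
    "<Denomination><Name>Acme BV</Name></Denomination>"
    "</Enterprise></Root>"
)
denomination_csv, enterprise_csv = io.StringIO(), io.StringIO()
denomination = Entity("denomination", "Denomination", ["Name"], denomination_csv)
enterprise = Entity("enterprise", "Enterprise", ["Nbr"], enterprise_csv, [denomination])
write(enterprise, xml_doc)
```

Each entity owns its own CSV writer, so one pass over an `Enterprise` element produces one row in the enterprise output and one row per nested `Denomination` in the denomination output, each carrying the parent's number.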
- Error Handling: The code could benefit from more robust error handling, especially in the `convert_date_format` and `prepend_0` methods, where exceptions are caught but not handled gracefully.
- Performance: The `fast_iter` function is designed for performance, but further optimizations could be explored for very large XML files.
- Configuration Management: The configuration in `config.py` is extensive and could be split into multiple files or managed using a configuration management tool.
The provided code is a well-structured XML to CSV converter that handles complex nested XML structures. It uses a combination of XPath for data extraction, configuration files for mapping, and logging for monitoring the process. The code is modular, with clear separation of concerns between data extraction, transformation, and logging.