This project provides a Python-based tool for anonymizing sensitive data, specifically designed to address data transfer restrictions (e.g., preventing data from being moved outside of a specific geographic region like the UK). The tool leverages the pandas library to efficiently process and anonymize data by replacing original values with pre-defined "dump" values.
- Problem Statement
- Solution
- Features
- Anonymization Logic
- Current Anonymization Dump
- Installation
- Usage
- Configuration
- Contributing
- License
Many organizations face strict regulations regarding data residency and transfer, particularly for personally identifiable information (PII) or sensitive business data. For instance, data generated or stored within the UK might be legally restricted from being transferred outside the UK. Manually anonymizing large datasets can be time-consuming, error-prone, and inefficient. This tool automates the anonymization process, allowing for secure data handling while maintaining compliance with such regulations.
This project anonymizes data in place or before transfer. It uses the pandas library for data manipulation and a mapping approach that replaces sensitive data fields with values from pre-defined "dump" lists.
- Efficient Data Processing: Built on `pandas`, ensuring efficient handling of large datasets.
- Dump-Based Anonymization: Replaces original values with entries from provided "dump" lists, ensuring controlled and consistent anonymization.
- Configurable Anonymization: Easily extended to other columns and dump files.
- Email Suffix Addition: Automatically generates anonymized email addresses based on a specified suffix.
- Python-based: Easy to integrate into existing Python workflows.
- Secure Data Handling: Helps in maintaining compliance with data residency regulations.
The core of the anonymization is the mask function. This function takes a DataFrame, a column to be anonymized, and a "dump" list of replacement values.
The `mask` function works as follows:
- It identifies all unique values in the target column (`col_to_mask`).
- It creates a `final_dump` list for mapping:
  - If the number of unique items in the column is less than the length of the provided `dump`, it takes the first `N` items from the `dump` (where `N` is the number of unique items).
  - If the number of unique items is greater than or equal to the `dump` length, it cycles through the `dump` list to ensure a replacement for every unique item. This means `dump` values can be reused if there are more unique original values than dump values.
- It then iterates through each value in the target column. For each value, it finds its index within `unique_mask_items` and uses that index to fetch the corresponding replacement value from `final_dump`.
- Finally, the original column in the DataFrame is updated with the new, anonymized values.
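The steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the signature `mask(df, col_to_mask, dump)` is inferred from the description, the per-value index lookup is expressed as an equivalent dictionary mapping, and leaving missing values untouched is an assumption.

```python
from itertools import cycle, islice

import pandas as pd


def mask(df: pd.DataFrame, col_to_mask: str, dump: list) -> pd.DataFrame:
    """Replace each unique value in `col_to_mask` with an entry from `dump`."""
    if not dump:
        raise ValueError("dump must contain at least one replacement value")

    # Unique values in order of first appearance (missing values are skipped;
    # this is an assumption, the original behaviour is not documented).
    unique_mask_items = list(df[col_to_mask].dropna().unique())

    # First N dump entries, cycling through the dump when it is shorter
    # than the number of unique values.
    final_dump = list(islice(cycle(dump), len(unique_mask_items)))

    # The i-th unique value maps to the i-th entry of final_dump.
    mapping = dict(zip(unique_mask_items, final_dump))
    df[col_to_mask] = df[col_to_mask].map(mapping)
    return df


# Illustrative data only.
example = pd.DataFrame({"REPORTED_BY": ["alice", "bob", "alice", "carol"]})
masked = mask(example, "REPORTED_BY", ["X", "Y"])
```

Here there are three unique names and only two dump entries, so the third name reuses the first dump value.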
Additionally, the add_suffix function creates a new column by concatenating the values from a source column with a specified suffix, primarily used for generating anonymized email addresses.
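A possible shape for `add_suffix` is shown below. The parameter names (`source_col`, `new_col`, `suffix`) are hypothetical, since the description does not give the real signature.

```python
import pandas as pd


def add_suffix(df: pd.DataFrame, source_col: str, new_col: str, suffix: str) -> pd.DataFrame:
    """Create `new_col` by appending `suffix` to every value of `source_col`."""
    # astype(str) keeps the concatenation safe for non-string columns;
    # note that missing values would become the literal text "nan".
    df[new_col] = df[source_col].astype(str) + suffix
    return df


# Illustrative data only.
people = pd.DataFrame({"REPORTED_BY": ["user1", "user2"]})
people = add_suffix(people, "REPORTED_BY", "REPORTED_BY_EMAIL", "@cognizant.com")
```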
The tool currently utilizes the following "dump" files for anonymization. These files contain lists of replacement values.
- `dump/names.csv`: Used to anonymize the `REPORTED_BY` column. The column in this CSV should be named `name`.
- `dump/organizations.csv`: Used to anonymize the `REPORTED_BY_ORGANISATION` column. The column in this CSV should be named `organization`.
- `dump/sites.csv`: This file contains two columns:
  - `site_id`: Used to anonymize the `SITE` column (likely site IDs).
  - `site_name`: Used to anonymize the `LOCATION` column (likely site names or locations).
Important: Ensure these dump files exist in the dump/ directory relative to your script and contain the specified column headers.
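One way to catch a missing header early is to check the columns right after loading a dump file. The helper `load_dump` below is hypothetical and not part of the project; the example reads from an in-memory string so that it runs without any files, but a path such as `dump/sites.csv` could be passed instead.

```python
import io

import pandas as pd


def load_dump(source, required_columns):
    """Read a dump CSV and fail fast if an expected column is missing."""
    dump_df = pd.read_csv(source)
    missing = [col for col in required_columns if col not in dump_df.columns]
    if missing:
        raise ValueError(f"dump is missing columns: {missing}")
    return dump_df


# Made-up contents standing in for dump/sites.csv.
sites_csv = io.StringIO("site_id,site_name\nS001,Site Alpha\nS002,Site Beta\n")
sites = load_dump(sites_csv, ["site_id", "site_name"])
site_ids = sites["site_id"].tolist()
site_names = sites["site_name"].tolist()
```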
- Clone the repository (or set up your project structure):

      git clone https://github.com/your_username/your_project_name.git
      cd your_project_name

- Create a virtual environment (recommended):

      python -m venv venv
      source venv/bin/activate  # On Windows: `venv\Scripts\activate`

- Install dependencies:

      pip install pandas

- Set up your data and dump files:
  - Place your input CSV file (e.g., `SCCD-filtered_anon_v2.csv`) in the `data/` directory.
  - Ensure your `names.csv`, `organizations.csv`, and `sites.csv` files are in the `dump/` directory with the correct column names as described above.
The anonymization process is straightforward. Ensure your data and dump files are correctly placed as described in the installation steps.
The main script (mask.py, assuming the provided code snippet is in it) performs the following:
- Reads the input CSV data.
- Loads the anonymization dumps.
- Applies the `mask` function to the `REPORTED_BY`, `REPORTED_BY_ORGANISATION`, `SITE`, and `LOCATION` columns using the respective dumps.
- Adds an email suffix (`@cognizant.com`) to the `REPORTED_BY` column to create a new `REPORTED_BY_EMAIL` column.
- Saves the anonymized DataFrame to a new CSV file with `_annon.csv` appended to the original filename.
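The flow above might look roughly like the sketch below. It is an assumption-laden outline rather than the contents of `mask.py`: the function names `anonymize` and `run` are hypothetical, `mask` is a compact stand-in for the project's function, and the output name is built by replacing the `.csv` extension with `_annon.csv`, which is only one reading of "appended to the original filename".

```python
from itertools import cycle, islice
from pathlib import Path

import pandas as pd

EMAIL_SUFFIX = "@cognizant.com"


def mask(df, col_to_mask, dump):
    # Compact stand-in: map each unique value to a dump entry, cycling if needed.
    unique_mask_items = list(df[col_to_mask].dropna().unique())
    final_dump = list(islice(cycle(dump), len(unique_mask_items)))
    df[col_to_mask] = df[col_to_mask].map(dict(zip(unique_mask_items, final_dump)))
    return df


def anonymize(df, names, organizations, site_ids, site_names):
    """Apply the masking steps to a copy of `df` and add the email column."""
    df = df.copy()
    mask(df, "REPORTED_BY", names)
    mask(df, "REPORTED_BY_ORGANISATION", organizations)
    mask(df, "SITE", site_ids)
    mask(df, "LOCATION", site_names)
    # The email is derived from the already anonymized REPORTED_BY value.
    df["REPORTED_BY_EMAIL"] = df["REPORTED_BY"].astype(str) + EMAIL_SUFFIX
    return df


def run(input_path, dump_dir="dump"):
    """Read the input and dump files, anonymize, and write the result."""
    input_path, dump_dir = Path(input_path), Path(dump_dir)
    names = pd.read_csv(dump_dir / "names.csv")["name"].tolist()
    orgs = pd.read_csv(dump_dir / "organizations.csv")["organization"].tolist()
    sites = pd.read_csv(dump_dir / "sites.csv")
    result = anonymize(
        pd.read_csv(input_path),
        names,
        orgs,
        sites["site_id"].tolist(),
        sites["site_name"].tolist(),
    )
    # Assumed naming scheme: data/input.csv -> data/input_annon.csv
    output_path = input_path.with_name(input_path.stem + "_annon.csv")
    result.to_csv(output_path, index=False)
    return output_path


# Example call (requires the files described in the installation steps):
# run("data/SCCD-filtered_anon_v2.csv")
```

Keeping the in-memory step (`anonymize`) separate from file handling (`run`) makes the masking logic easy to check on a small DataFrame without touching real data.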
To run the anonymization:

    python mask.py