This repository was archived by the owner on Dec 8, 2025. It is now read-only.

pratikkabade/Data-Anonymization-Tool


Data Anonymization Tool

This project provides a Python-based tool for anonymizing sensitive data. It is aimed at datasets subject to data transfer restrictions (e.g., rules that prevent data from being moved outside a specific geographic region such as the UK). The tool uses the pandas library to process the data and anonymizes it by replacing original values with pre-defined "dump" values.

Problem Statement

Many organizations face strict regulations regarding data residency and transfer, particularly for personally identifiable information (PII) or sensitive business data. For instance, data generated or stored within the UK might be legally restricted from being transferred outside the UK. Manually anonymizing large datasets can be time-consuming, error-prone, and inefficient. This tool automates the anonymization process, allowing for secure data handling while maintaining compliance with such regulations.

Solution

This project anonymizes data in place or before transfer. It uses the pandas library for data manipulation and a mapping approach that replaces each value in a sensitive field with a value from a pre-defined "dump" list.

Features

  • Efficient Data Processing: Built on pandas, ensuring efficient handling of large datasets.
  • Dump-Based Anonymization: Replaces original values with entries from provided "dump" lists, so that within a run the same original value always maps to the same replacement.
  • Configurable Anonymization: Can be extended to other columns and dump files.
  • Email Suffix Addition: Automatically generates anonymized email addresses based on a specified suffix.
  • Python-based: Easy to integrate into existing Python workflows.
  • Secure Data Handling: Helps in maintaining compliance with data residency regulations.

Anonymization Logic

The core of the anonymization is the mask function. This function takes a DataFrame, a column to be anonymized, and a "dump" list of replacement values.

The mask function works as follows:

  1. It identifies all unique values in the target column (col_to_mask).
  2. It creates a final_dump list for mapping:
    • If the number of unique items in the column is less than the length of the provided dump, it takes the first N items from the dump (where N is the number of unique items).
    • If the number of unique items is greater than or equal to the dump length, it cycles through the dump list so that every unique item gets a replacement. Dump values are therefore reused when there are more unique original values than dump values, in which case distinct original values can end up with the same replacement.
  3. It then iterates through each value in the target column. For each value, it finds its index within the unique_mask_items and uses that index to fetch the corresponding replacement value from final_dump.
  4. Finally, the original column in the DataFrame is updated with the new, anonymized values.
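
The steps above can be sketched as follows. This is a minimal illustration, not necessarily the repository's implementation: the names col_to_mask, unique_mask_items, and final_dump come from the description above, while the exact signature, the empty-dump check, and the use of a dictionary lookup in place of a per-value index search are assumptions made for this example.

```python
from itertools import cycle, islice

import pandas as pd


def mask(df: pd.DataFrame, col_to_mask: str, dump: list) -> pd.DataFrame:
    """Replace every value in col_to_mask with an entry from dump (sketch)."""
    if not dump:
        # Assumption: an empty dump is treated as an error.
        raise ValueError("dump must contain at least one replacement value")

    # Step 1: unique values, in order of first appearance.
    unique_mask_items = list(df[col_to_mask].unique())

    # Step 2: one replacement per unique value. islice takes the first N
    # entries; cycle repeats the dump when it is shorter than N.
    final_dump = list(islice(cycle(dump), len(unique_mask_items)))

    # Step 3: pair each unique value with the replacement at the same index.
    mapping = dict(zip(unique_mask_items, final_dump))

    # Step 4: overwrite the original column with the anonymized values.
    df[col_to_mask] = df[col_to_mask].map(mapping)
    return df


if __name__ == "__main__":
    data = pd.DataFrame({"REPORTED_BY": ["alice", "bob", "alice", "carol"]})
    print(mask(data, "REPORTED_BY", ["Person A", "Person B"]))
```

With three unique names and a two-entry dump, "alice" and "carol" both receive "Person A", which illustrates the reuse described in step 2.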

Additionally, the add_suffix function creates a new column by concatenating the values from a source column with a specified suffix, primarily used for generating anonymized email addresses.
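
A possible shape for add_suffix is sketched below. The parameter names (source_col, new_col, suffix) are hypothetical, since the function's signature is not shown in this README; only the behavior (concatenating a source column with a suffix into a new column) is taken from the description above.

```python
import pandas as pd


def add_suffix(df: pd.DataFrame, source_col: str, new_col: str, suffix: str) -> pd.DataFrame:
    """Create new_col by appending suffix to each value of source_col (sketch)."""
    # astype(str) lets non-string values be concatenated as well.
    df[new_col] = df[source_col].astype(str) + suffix
    return df


if __name__ == "__main__":
    data = pd.DataFrame({"REPORTED_BY": ["Person A", "Person B"]})
    add_suffix(data, "REPORTED_BY", "REPORTED_BY_EMAIL", "@example.com")
    print(data)
```

Because this is plain concatenation, any spaces in the source values carry over into the generated addresses.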

Current Anonymization Dump

The tool currently utilizes the following "dump" files for anonymization. These files contain lists of replacement values.

  • dump/names.csv: Used to anonymize the REPORTED_BY column. The column in this CSV should be named name.
  • dump/organizations.csv: Used to anonymize the REPORTED_BY_ORGANISATION column. The column in this CSV should be named organization.
  • dump/sites.csv: This file contains two columns:
    • site_id: Used to anonymize the SITE column (likely site IDs).
    • site_name: Used to anonymize the LOCATION column (likely site names or locations).

Important: Ensure these dump files exist in the dump/ directory relative to your script and contain the specified column headers.
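
One way to check these requirements up front is sketched below. The helper name load_dumps and the EXPECTED_COLUMNS table are hypothetical and not part of the repository; the file names and column headers are the ones listed above.

```python
from pathlib import Path

import pandas as pd

# Dump file name -> column headers it must contain (taken from the list above).
EXPECTED_COLUMNS = {
    "names.csv": ["name"],
    "organizations.csv": ["organization"],
    "sites.csv": ["site_id", "site_name"],
}


def load_dumps(dump_dir="dump"):
    """Return {column header: list of replacement values}, failing early on problems."""
    dumps = {}
    for filename, columns in EXPECTED_COLUMNS.items():
        path = Path(dump_dir) / filename
        if not path.exists():
            raise FileNotFoundError(f"missing dump file: {path}")
        frame = pd.read_csv(path)
        missing = [c for c in columns if c not in frame.columns]
        if missing:
            raise ValueError(f"{path} is missing column(s): {missing}")
        for column in columns:
            # Empty cells are dropped so they are never used as replacements.
            dumps[column] = frame[column].dropna().tolist()
    return dumps
```

Calling load_dumps() before any masking surfaces a missing file or a misnamed header as a clear error before any data is modified.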

Installation

  1. Clone the repository (or set up your project structure):

    git clone https://github.com/your_username/your_project_name.git
    cd your_project_name
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: `venv\Scripts\activate`
  3. Install dependencies:

    pip install pandas
  4. Set up your data and dump files:

    • Place your input CSV file (e.g., SCCD-filtered_anon_v2.csv) in the data/ directory.
    • Ensure your names.csv, organizations.csv, and sites.csv files are in the dump/ directory with the correct column names as described above.

Usage

The anonymization process is straightforward. Ensure your data and dump files are correctly placed as described in the installation steps.

The main script (assumed here to be named mask.py) performs the following steps:

  1. Reads the input CSV data.
  2. Loads the anonymization dumps.
  3. Applies the mask function to the REPORTED_BY, REPORTED_BY_ORGANISATION, SITE, and LOCATION columns using the respective dumps.
  4. Adds an email suffix (@cognizant.com) to the REPORTED_BY column to create a new REPORTED_BY_EMAIL column.
  5. Saves the anonymized DataFrame to a new CSV file with _annon.csv appended to the original filename.

To run the anonymization:

python mask.py
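
For orientation, the five steps could be combined roughly as in the sketch below. It is an illustration under stated assumptions rather than the contents of mask.py: the function name anonymize_file and the COLUMN_DUMPS table are hypothetical, and the output name is assumed to be the input file's stem followed by _annon.csv.

```python
from itertools import cycle, islice
from pathlib import Path

import pandas as pd

# (input column, dump file, dump column), as listed in this README.
COLUMN_DUMPS = [
    ("REPORTED_BY", "names.csv", "name"),
    ("REPORTED_BY_ORGANISATION", "organizations.csv", "organization"),
    ("SITE", "sites.csv", "site_id"),
    ("LOCATION", "sites.csv", "site_name"),
]


def anonymize_file(input_path, dump_dir="dump", email_suffix="@cognizant.com"):
    """Anonymize one CSV file and return the path of the written copy (sketch)."""
    input_path = Path(input_path)
    df = pd.read_csv(input_path)  # step 1: read the input data

    for column, dump_file, dump_column in COLUMN_DUMPS:
        # Step 2: load the replacement values for this column.
        dump = pd.read_csv(Path(dump_dir) / dump_file)[dump_column].dropna().tolist()
        # Step 3: map each unique value to a dump entry, cycling if needed.
        uniques = df[column].unique()
        replacements = islice(cycle(dump), len(uniques))
        df[column] = df[column].map(dict(zip(uniques, replacements)))

    # Step 4: build the email column from the already anonymized names.
    df["REPORTED_BY_EMAIL"] = df["REPORTED_BY"].astype(str) + email_suffix

    # Step 5 (assumed naming): data/input.csv -> data/input_annon.csv
    output_path = input_path.with_name(input_path.stem + "_annon.csv")
    df.to_csv(output_path, index=False)
    return output_path
```

The email column is derived after masking, so it is built from the replacement names rather than the original ones.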
