This tool allows users to process and analyze Python repositories to create structured datasets for further analysis.
- Dataset Creation: Parse repositories to generate structured datasets in JSON format similar to the methods2test format.
- Flexible Filters: Apply customizable filters during dataset creation.
- Provide Code Samples: Provide either the relevant methods only or the entire repository code, together with its dependencies in a requirements.txt file, to allow for dynamic analysis.
You can either use Poetry or pip to install the required dependencies. We recommend using Poetry for a clean and isolated environment on your local machine.
To set up the project and its dependencies, follow these steps:
- Clone this repository to your local machine:
  git clone https://gitlab.infosun.fim.uni-passau.de/se2/pynguin/strds.git
  cd strds
- Install Python 3.13 and pip, if you haven't already.
- Create a virtual environment and install the project's dependencies using Poetry:
  poetry install
- Activate the virtual environment:
  poetry shell
- For developers only: Activate pre-commit hooks:
  pre-commit install
Mines repositories from PyPI and GitHub and writes them to a CSV file.
poetry run mine --sample-size <size> --random-seed <seed> --project-list-file <file> [--use-top-packages] [--top-packages-count <count>] [--redirect-github-urls] [--remove-duplicates] [--remove-no-github-url-found] [--csv-output <file>]

Options:
- --sample-size: Number of projects to sample. If not set, all projects are mined.
- --random-seed: Random seed for reproducibility.
- --project-list-file: Path to the project list file. If not set, all projects are considered.
- --use-top-packages: Use top PyPI packages from hugovk.github.io/top-pypi-packages/ instead of fetching all PyPI projects (default: False).
- --redirect-github-urls: Follow GitHub redirects (default: True).
- --remove-duplicates: Remove duplicate projects (default: True).
- --remove-no-github-url-found: Remove projects without GitHub URLs (default: True).
- --csv-output: Path to store the CSV output (default: output/repos.csv).
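How --sample-size and --random-seed interact can be illustrated with a minimal sketch. This is not the tool's implementation: the helper name sample_projects and the use of random.Random are assumptions made purely for illustration.

```python
import random


def sample_projects(projects, sample_size=None, random_seed=None):
    """Return a reproducible random sample of project names (hypothetical helper)."""
    if sample_size is None:
        # Mirrors the documented behaviour: without a sample size, keep everything.
        return list(projects)
    # A dedicated generator keeps the global random state untouched.
    rng = random.Random(random_seed)
    pool = list(projects)
    return rng.sample(pool, min(sample_size, len(pool)))


names = ["requests", "numpy", "flask", "attrs", "rich"]
first = sample_projects(names, sample_size=2, random_seed=42)
second = sample_projects(names, sample_size=2, random_seed=42)
print(first == second)  # True: the same seed yields the same sample
```

Fixing the seed is what makes a mined sample repeatable across runs, provided the underlying project list is unchanged.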
Examples:
# Sample 10 random PyPI projects
poetry run mine --sample-size 10 --random-seed 42 --csv-output output/repos.csv
# Use top 50 PyPI packages
poetry run mine --use-top-packages --top-packages-count 50 --csv-output output/top_repos.csv

To allow for more GitHub requests, generate and add a GitHub token:
- Generate a GitHub token.
- Set the token in your environment variables or add a .github_token file in the root directory of the project.
  - Windows:
    setx GITHUB_TOKEN <your_token>
  - Linux/macOS (Bash):
    echo 'export GITHUB_TOKEN=<your_token>' >> ~/.bashrc
    source ~/.bashrc
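The two ways of supplying a token can be sketched as a small lookup routine. The helper name resolve_github_token and the lookup order (environment variable first, then the .github_token file) are assumptions for illustration; the tool itself may resolve the token differently.

```python
import os
from pathlib import Path


def resolve_github_token(project_root=".", environ=None):
    """Return a GitHub token from GITHUB_TOKEN or a .github_token file, else None.

    Hypothetical helper: checking the environment before the file is an assumption.
    """
    env = os.environ if environ is None else environ
    token = env.get("GITHUB_TOKEN", "").strip()
    if token:
        return token
    token_file = Path(project_root) / ".github_token"
    if token_file.is_file():
        # Strip the trailing newline that editors usually append.
        return token_file.read_text(encoding="utf-8").strip() or None
    return None


print("token found" if resolve_github_token() else "no token configured")
```

Running this from the project root is a quick way to check whether either mechanism is set up before starting a long mining run.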
Creates a JSON file for the dataset.
poetry run dataset --csv-file <csv> --tmp-dir <dir> [--keep-tmp-dir] [--output <json>] [--filters <filters>]

Options:
- --csv-file: Path to the CSV file containing project definitions (required).
- --tmp-dir: Temporary directory to clone repositories (required).
- --keep-tmp-dir: Retain the temporary directory after execution (optional).
- --output: Path to the output JSON file (default: output.json).
- --filters: Comma-separated list of filters to apply (default: NoStringTypeFilter,EmptyFilter).
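As a rough illustration of how a comma-separated --filters value might be interpreted, the sketch below splits the string and checks each name against the four filters named in this README. The function parse_filters and the decision to reject unknown names are assumptions, not the tool's documented behaviour.

```python
# Filter names as listed in this README; the default mirrors the documented one.
KNOWN_FILTERS = {
    "EmptyFilter",
    "NoStringTypeFilter",
    "PrivateModuleFilter",
    "TestModuleFilter",
}
DEFAULT_FILTERS = "NoStringTypeFilter,EmptyFilter"


def parse_filters(value=DEFAULT_FILTERS):
    """Split a comma-separated filter string into a list of names (hypothetical helper)."""
    names = [part.strip() for part in value.split(",") if part.strip()]
    unknown = [name for name in names if name not in KNOWN_FILTERS]
    if unknown:
        # Assumption: unknown names are treated as an error rather than ignored.
        raise ValueError(f"unknown filter(s): {', '.join(unknown)}")
    return names


print(parse_filters())  # ['NoStringTypeFilter', 'EmptyFilter']
```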
Example:
poetry run dataset --csv-file src/res/repos.csv --tmp-dir tmp --output output/dataset.json

The following filters can be applied during dataset creation:
- EmptyFilter: Removes empty modules and classes from the dataset.
  - A class is considered empty if it has no methods.
  - A module is considered empty if it has no functions and no classes.
- NoStringTypeFilter: Keeps only functions and methods that have a string parameter or return type.
  - Only considers exact str type annotations, not container types like list[str] or dict[str, str].
  - Functions/methods without any string parameters or return types are removed.
- PrivateModuleFilter: Removes all non-public modules from the dataset.
  - In Python, modules that start with an underscore are considered non-public (private, package private, internal, etc.).
  - Only modules with names that don't start with an underscore are kept.
- TestModuleFilter: Removes all test modules from the dataset.
  - Identifies test modules by checking module names starting with 'test_'.
  - Also removes modules located in directories that start with 'test' or contain '/test' in their path.
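The rules above can be expressed as simple predicates over dataset records. The sketch below is one possible reading, not the tool's code: all function names are hypothetical, records are assumed to be plain dictionaries using the field names from the dataset format, and checking only the last dotted component of a module name is an assumption.

```python
def is_empty_class(cls):
    """A class is empty if it has no methods."""
    return not cls.get("methods")


def is_empty_module(module):
    """A module is empty if it has neither functions nor classes."""
    return not module.get("functions") and not module.get("classes")


def has_exact_str_type(func):
    """True if the return type or any parameter is annotated exactly as 'str'."""
    if func.get("return") == "str":
        return True
    return any(param.get("type") == "str" for param in func.get("parameters", []))


def is_private_module(module_name):
    """True if the module name starts with an underscore.

    Assumption: for dotted names, only the last component is checked.
    """
    return module_name.rsplit(".", 1)[-1].startswith("_")


def is_test_module(module_name, file_path):
    """True if the name starts with 'test_' or the path points into a test directory."""
    if module_name.rsplit(".", 1)[-1].startswith("test_"):
        return True
    path = file_path.replace("\\", "/")
    directory = path.rpartition("/")[0]
    return directory.startswith("test") or "/test" in path


func = {"identifier": "join", "return": "str",
        "parameters": [{"identifier": "parts", "type": "list[str]"}]}
print(has_exact_str_type(func))  # True: the return type is exactly 'str'
print(is_test_module("conftest", "tests/conftest.py"))  # True: lives under tests/
```

Note that list[str] on its own would not count as a string type under this reading, which matches the exact-annotation rule stated for NoStringTypeFilter.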
You can specify multiple filters by separating them with commas:
poetry run dataset --csv-file src/res/repos.csv --tmp-dir tmp --filters NoStringTypeFilter,EmptyFilter

From the dataset JSON file, the tool can provide either the relevant methods only or the entire repository code along with its dependencies in a requirements.txt file.
poetry run provide methods --dataset <dataset> [--output-dir <dir>] [--without-type-annotations]

Options:
- --dataset: Path to the dataset JSON file (required).
- --output-dir: Directory to store extracted methods (default: all_code).
- --without-type-annotations: Remove type annotations from methods (optional).
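To give an idea of what removing type annotations from a method involves, here is a minimal sketch based on the standard ast module. It is an illustration only and not necessarily how --without-type-annotations is implemented: the names AnnotationStripper and strip_annotations are hypothetical, only parameter and return annotations are handled, and ast.unparse normalises formatting and drops comments.

```python
import ast


class AnnotationStripper(ast.NodeTransformer):
    """Remove parameter and return annotations from function definitions."""

    def visit_FunctionDef(self, node):
        self.generic_visit(node)  # visits parameters and nested functions first
        node.returns = None
        return node

    # Async functions carry the same fields, so the same handler applies.
    visit_AsyncFunctionDef = visit_FunctionDef

    def visit_arg(self, node):
        node.annotation = None
        return node


def strip_annotations(source):
    """Return the source with signature annotations removed (hypothetical helper)."""
    tree = AnnotationStripper().visit(ast.parse(source))
    return ast.unparse(ast.fix_missing_locations(tree))


annotated = "def greet(name: str, times: int = 1) -> str:\n    return name * times\n"
print(strip_annotations(annotated))
```

Variable annotations inside function bodies are deliberately left untouched in this sketch, since dropping a bare annotation such as x: int could leave an empty block behind.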
Example:
poetry run provide methods --dataset src/res/dataset.json --output-dir output

poetry run provide repositories --dataset <dataset> [--output-dir <dir>]

Options:
- --dataset: Path to the dataset JSON file (required).
- --output-dir: Directory to store extracted repository code (default: all_code).
- --without-type-annotations: Remove type annotations from the code (optional).
Example:
poetry run provide repositories --dataset src/res/dataset.json --output-dir output

repository:
name: string # Name of the repository (matches the PyPI name)
url: string # Repository URL
pypi_tag: string # PyPI release tag of the repository
git_commit_hash: string # Specific commit hash of the repository
modules: list # List of modules in the repository

module:
name: string # Module name
file_path: string # Relative path to the file containing the module
functions: list # List of standalone functions in the module
classes: list # List of classes in the module

function:
identifier: string # Function name
parameters: list # List of parameters of the function
annotations: string # Function annotations
return: string | null # Return type of the function
body: string # Source code of the function
signature: string # Function signature (name + parameters + return type)
full_signature: string # Full function signature (annotations + name + parameters + return type)
file: string # Relative path to the file containing the function

class:
identifier: string # Class name
methods: list # List of methods in the class
superclasses: list # Superclasses of the class
fields: list # Class fields (attributes)
file: string # Relative path to the file containing the class

method:
identifier: string # Method name
parameters: list # List of parameters of the method
annotations: string # Method annotations
return: string | null # Return type of the method
body: string # Source code of the method
signature: string # Method signature (name + parameters + return type)
full_signature: string # Full method signature (annotations + name + parameters + return type)
constructor: boolean # Whether the method is a constructor

parameter:
identifier: string # Parameter name
type: string | null # Type annotation of the parameter
line_number: int # Line number of the parameter definition
col_offset: int # Column offset of the parameter definition
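The nesting described above (repository, modules, functions and classes, methods, parameters) can be walked with a few loops. The sketch below assumes a single repository record has already been loaded as a Python dictionary; the helper iter_callables and all sample values are made up for illustration, and how repository records are arranged at the top level of the JSON file is not specified here.

```python
def iter_callables(repository):
    """Yield (module name, qualified identifier, record) for each function and method."""
    for module in repository.get("modules", []):
        for function in module.get("functions", []):
            yield module["name"], function["identifier"], function
        for cls in module.get("classes", []):
            for method in cls.get("methods", []):
                qualified = f'{cls["identifier"]}.{method["identifier"]}'
                yield module["name"], qualified, method


# Made-up sample record that follows the field names of the format above.
sample_repository = {
    "name": "example",
    "url": "https://example.org/example.git",
    "pypi_tag": "1.0.0",
    "git_commit_hash": "0000000",
    "modules": [
        {
            "name": "example.core",
            "file_path": "example/core.py",
            "functions": [
                {"identifier": "slugify", "return": "str",
                 "parameters": [{"identifier": "text", "type": "str",
                                 "line_number": 1, "col_offset": 12}]},
            ],
            "classes": [
                {"identifier": "Greeter", "superclasses": [], "fields": [],
                 "methods": [{"identifier": "__init__", "return": None,
                              "parameters": [], "constructor": True}]},
            ],
        }
    ],
}

for module_name, identifier, record in iter_callables(sample_repository):
    print(module_name, identifier, record.get("return"))
```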