Strds: String Dataset Creation Tool

This tool allows users to process and analyze Python repositories to create structured datasets for further analysis.

Features

Dataset Creation: Parse repositories to generate structured datasets in JSON format similar to the methods2test format.
Flexible Filters: Apply customizable filters during dataset creation.
Provide Code Samples: Export either only the relevant methods or the entire repository code, together with its dependencies in a requirements.txt file, to enable dynamic analysis.

Installation

You can use either Poetry or pip to install the required dependencies. We recommend Poetry because it gives you a clean, isolated environment on your local machine.

To set up the project and its dependencies, follow these steps:

Clone this repository to your local machine:

git clone https://gitlab.infosun.fim.uni-passau.de/se2/pynguin/strds.git
cd strds

Install Python 3.13 and pip, if you haven't already.

Windows: Python, Pip

Ubuntu:

sudo apt-get update
sudo apt-get install -y python3.13 python3-pip

macOS:
```
brew install python@3.13
```

Create a virtual environment and install the project's dependencies using Poetry:
```
poetry install
```
Activate the virtual environment:
```
poetry shell
```
For Developers only: Activate pre-commit hooks:
```
pre-commit install
```

Command Line Interface (CLI)

1. Dataset Mining

Mines repositories from PyPI and GitHub and writes them to a CSV file.

poetry run mine --sample-size <size> --random-seed <seed> --project-list-file <file> [--use-top-packages] [--top-packages-count <count>] [--redirect-github-urls] [--remove-duplicates] [--remove-no-github-url-found] [--csv-output <file>]

Options:

--sample-size: Number of projects to sample. If not set, all projects are mined.
--random-seed: Random seed for reproducibility.
--project-list-file: Path to the project list file. If not set, all projects are considered.
--use-top-packages: Use top PyPI packages from hugovk.github.io/top-pypi-packages/ instead of fetching all PyPI projects (default: False).
--top-packages-count: Number of top PyPI packages to use when --use-top-packages is set.
--redirect-github-urls: Follow GitHub redirects (default: True).
--remove-duplicates: Remove duplicate projects (default: True).
--remove-no-github-url-found: Remove projects without GitHub URLs (default: True).
--csv-output: Path to store the CSV output (default: output/repos.csv).

Examples:

# Sample 10 random PyPI projects
poetry run mine --sample-size 10 --random-seed 42 --csv-output output/repos.csv

# Use top 50 PyPI packages
poetry run mine --use-top-packages --top-packages-count 50 --csv-output output/top_repos.csv
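
The effect of combining --sample-size with --random-seed can be pictured with a short sketch. This is not the tool's implementation: the function name `sample_projects` and the project names are hypothetical, and the sketch only illustrates why a fixed seed yields the same sample on every run.

```python
import random
from typing import Optional


def sample_projects(projects: list[str], sample_size: Optional[int], seed: int) -> list[str]:
    """Return a reproducible random sample of project names (illustrative only)."""
    if sample_size is None or sample_size >= len(projects):
        # Mirrors "If not set, all projects are mined".
        return list(projects)
    # A dedicated Random instance leaves the global random state untouched.
    rng = random.Random(seed)
    return rng.sample(projects, sample_size)


# Hypothetical project list, used only for the demonstration.
projects = ["requests", "numpy", "flask", "pytest", "rich"]
first = sample_projects(projects, sample_size=2, seed=42)
second = sample_projects(projects, sample_size=2, seed=42)
print(first == second)  # the same seed gives the same sample
```

Recording the seed next to the CSV output makes it possible to regenerate the same project selection later.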

GitHub Token

To allow for more GitHub requests, generate and add a GitHub token:

Generate a GitHub token.
Set the token in your environment variables or add a .github_token file in the root directory of the project.
- Windows:
```
setx GITHUB_TOKEN <your_token>
```
- Linux/macOS (Bash):
```
echo 'export GITHUB_TOKEN=<your_token>' >> ~/.bashrc
source ~/.bashrc
```
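
A script that works alongside the tool may need the same token. The following is a minimal sketch of how the two locations described above could be checked. The function name `load_github_token` and the lookup order (environment variable first, then the file) are assumptions, not the tool's documented behavior.

```python
import os
from pathlib import Path
from typing import Optional


def load_github_token(project_root: Path = Path(".")) -> Optional[str]:
    """Look for a token in GITHUB_TOKEN, then in a .github_token file."""
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        return token.strip()
    # Assumed fallback: a plain-text file in the project root.
    token_file = project_root / ".github_token"
    if token_file.is_file():
        return token_file.read_text(encoding="utf-8").strip() or None
    return None


token = load_github_token()
print("token found" if token else "no token configured")
```

If you use the file variant, keep .github_token out of version control so the token is not committed by accident.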

2. Dataset Creation

Creates a JSON file for the dataset.

poetry run dataset --csv-file <csv> --tmp-dir <dir> [--keep-tmp-dir] [--output <json>] [--filters <filters>]

Options:

--csv-file: Path to the CSV file containing project definitions (required).
--tmp-dir: Temporary directory to clone repositories (required).
--keep-tmp-dir: Retain the temporary directory after execution (optional).
--output: Path to the output JSON file (default: output.json).
--filters: Comma-separated list of filters to apply (default: NoStringTypeFilter,EmptyFilter).

Example:

poetry run dataset --csv-file src/res/repos.csv --tmp-dir tmp --output output/dataset.json

Available Filters

The following filters can be applied during dataset creation:

EmptyFilter: Removes empty modules and classes from the dataset.
- A class is considered empty if it has no methods.
- A module is considered empty if it has no functions and no classes.
NoStringTypeFilter: Keeps only functions and methods that have a string parameter or return type.
- Only considers exact str type annotations, not container types like list[str] or dict[str, str].
- Functions/methods without any string parameters or return types are removed.
PrivateModuleFilter: Removes all non-public modules from the dataset.
- In Python, modules that start with an underscore are considered non-public (private, package private, internal, etc.).
- Only modules with names that don't start with an underscore are kept.
TestModuleFilter: Removes all test modules from the dataset.
- Identifies test modules by checking module names starting with 'test_'.
- Also removes modules located in directories that start with 'test' or contain '/test' in their path.

You can specify multiple filters by separating them with commas:

poetry run dataset --csv-file src/res/repos.csv --tmp-dir tmp --filters NoStringTypeFilter,EmptyFilter
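
The filter rules above can be read as simple predicates. The sketch below restates them in Python over dictionaries that use the field names from the Dataset Structure section. It is one possible reading of the rules, not the tool's code: all function names are hypothetical, and taking only the last component of a dotted module name is an assumption.

```python
def is_private_module(module_name: str) -> bool:
    """PrivateModuleFilter rule: the module name starts with an underscore."""
    return module_name.split(".")[-1].startswith("_")


def is_test_module(module_name: str, file_path: str) -> bool:
    """TestModuleFilter rule: 'test_' prefix, or a test directory in the path."""
    if module_name.split(".")[-1].startswith("test_"):
        return True
    path = file_path.replace("\\", "/")
    parts = path.split("/")
    in_test_dir = len(parts) > 1 and parts[0].startswith("test")
    return in_test_dir or "/test" in path


def has_string_type(function: dict) -> bool:
    """NoStringTypeFilter rule: a parameter or the return type is exactly 'str'."""
    if function.get("return") == "str":
        return True
    return any(p.get("type") == "str" for p in function.get("parameters", []))


def is_empty_class(cls: dict) -> bool:
    """EmptyFilter rule for classes: no methods."""
    return not cls.get("methods")


def is_empty_module(module: dict) -> bool:
    """EmptyFilter rule for modules: no functions and no classes."""
    return not module.get("functions") and not module.get("classes")


# list[str] is a container type, so it does not count as a string type.
container_only = {"parameters": [{"identifier": "xs", "type": "list[str]"}], "return": None}
print(has_string_type(container_only))  # False
print(is_test_module("helpers", "tests/helpers.py"))  # True
```

Under this reading, the order of the filters can matter: removing functions without string types first may leave classes and modules that are empty only afterwards.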

3. Provide Dataset

From the dataset JSON file, the tool can provide either only the relevant methods or the entire repository code, along with the dependencies in a requirements.txt file.

Providing Methods

poetry run provide methods --dataset <dataset> [--output-dir <dir>] [--without-type-annotations]

Options:

--dataset: Path to the dataset JSON file (required).
--output-dir: Directory to store extracted methods (default: all_code).
--without-type-annotations: Remove type annotations from methods (optional).

Example:

poetry run provide methods --dataset src/res/dataset.json --output-dir output
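
To illustrate what removing type annotations from a method means, here is a minimal sketch built on Python's standard `ast` module. It is not how the tool implements --without-type-annotations. The names `AnnotationStripper` and `strip_annotations` are hypothetical, and the sketch only handles parameter and return annotations, leaving annotated variables untouched.

```python
import ast


class AnnotationStripper(ast.NodeTransformer):
    """Drop parameter and return annotations from function definitions."""

    def _strip(self, node):
        node.returns = None
        args = node.args
        for arg in [*args.posonlyargs, *args.args, *args.kwonlyargs]:
            arg.annotation = None
        for special in (args.vararg, args.kwarg):
            if special is not None:
                special.annotation = None
        # Recurse so nested functions and methods are handled as well.
        self.generic_visit(node)
        return node

    visit_FunctionDef = _strip
    visit_AsyncFunctionDef = _strip


def strip_annotations(source: str) -> str:
    """Parse the source, strip annotations, and render it back to code."""
    tree = AnnotationStripper().visit(ast.parse(source))
    return ast.unparse(tree)


code = "def greet(name: str, times: int = 1) -> str:\n    return name * times\n"
print(strip_annotations(code))
```

Note that `ast.unparse` regenerates the code from the syntax tree, so comments and the original formatting are lost in this approach.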

Providing Repositories

poetry run provide repositories --dataset <dataset> [--output-dir <dir>]

Options:

--dataset: Path to the dataset JSON file (required).
--output-dir: Directory to store extracted repository code (default: all_code).
--without-type-annotations: Remove type annotations from the code (optional).

Example:

poetry run provide repositories --dataset src/res/dataset.json --output-dir output

Dataset Structure

Repository

repository:
  name: string  # Name of the repository (matches the PyPI name)
  url: string  # Repository URL
  pypi_tag: string  # PyPI release tag of the repository
  git_commit_hash: string  # Specific commit hash of the repository
  modules: list  # List of modules in the repository

Module

module:
  name: string  # Module name
  file_path: string  # Relative path to the file containing the module
  functions: list  # List of standalone functions in the module
  classes: list  # List of classes in the module

Function

function:
  identifier: string  # Function name
  parameters: list  # List of parameters of the function
  annotations: string  # Function annotations
  return: string | null  # Return type of the function
  body: string  # Source code of the function
  signature: string  # Function signature (name + parameters + return type)
  full_signature: string  # Full function signature (annotations + name + parameters + return type)
  file: string  # Relative path to the file containing the function

Class

class:
  identifier: string  # Class name
  methods: list  # List of methods in the class
  superclasses: list  # Superclasses of the class
  fields: list  # Class fields (attributes)
  file: string  # Relative path to the file containing the class

Method

method:
  identifier: string  # Method name
  parameters: list  # List of parameters of the method
  annotations: string  # Method annotations
  return: string | null  # Return type of the method
  body: string  # Source code of the method
  signature: string  # Method signature (name + parameters + return type)
  full_signature: string  # Full method signature (annotations + name + parameters + return type)
  constructor: boolean  # Whether the method is a constructor

Parameter

parameter:
  identifier: string  # Parameter name
  type: string | null  # Type annotation of the parameter
  line_number: int  # Line number of the parameter definition
  col_offset: int  # Column offset of the parameter definition
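
The nesting described above (repository, modules, functions and classes, methods) can be walked with a few loops. The sketch below assumes a single repository object already loaded as a Python dictionary. How repositories are arranged at the top level of the JSON file is not specified here, so the sketch leaves that part out. The helper name `iter_callables` and all sample values are hypothetical.

```python
from typing import Any, Iterator


def iter_callables(repository: dict[str, Any]) -> Iterator[tuple[str, dict[str, Any]]]:
    """Yield (qualified name, entry) for every function and method."""
    for module in repository.get("modules", []):
        for function in module.get("functions", []):
            yield f"{module['name']}.{function['identifier']}", function
        for cls in module.get("classes", []):
            for method in cls.get("methods", []):
                name = f"{module['name']}.{cls['identifier']}.{method['identifier']}"
                yield name, method


# Made-up repository entry that follows the field names listed above.
repository = {
    "name": "example-pkg",
    "url": "https://example.com/example-pkg.git",
    "modules": [
        {
            "name": "example",
            "file_path": "src/example.py",
            "functions": [{"identifier": "slugify", "parameters": [], "return": "str"}],
            "classes": [
                {
                    "identifier": "Greeter",
                    "methods": [
                        {"identifier": "__init__", "parameters": [], "return": None, "constructor": True},
                        {"identifier": "greet", "parameters": [], "return": "str", "constructor": False},
                    ],
                }
            ],
        }
    ],
}

names = [name for name, _ in iter_callables(repository)]
print(names)  # ['example.slugify', 'example.Greeter.__init__', 'example.Greeter.greet']
```

Because functions and methods share most of their fields, a single loop like this is usually enough for statistics such as counting entries whose `return` is `str`.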
