Strds: String Dataset Creation Tool

This tool allows users to process and analyze Python repositories to create structured datasets for further analysis.

Features

  • Dataset Creation: Parse repositories to generate structured datasets in JSON format similar to the methods2test format.
  • Flexible Filters: Apply customizable filters during dataset creation.
  • Provide Code Samples: Provides either the relevant methods only or the entire repository code, together with its dependencies in a requirements.txt file, to enable dynamic analysis.

Installation

You can use either Poetry or pip to install the required dependencies. We recommend Poetry for a clean and isolated environment on your local machine.

To set up the project and its dependencies, follow these steps:

  1. Clone this repository to your local machine:

    git clone https://gitlab.infosun.fim.uni-passau.de/se2/pynguin/strds.git
    cd strds
  2. Install Python 3.13 and pip, if you haven't already.

    • Windows: Python, Pip
    • Ubuntu:
      sudo apt-get update
      sudo apt-get install -y python3.13 python3-pip
    • macOS:
      brew install python@3.13
  3. Create a virtual environment and install the project's dependencies using Poetry:

    poetry install
  4. Activate the virtual environment:

    poetry shell
  5. For Developers only: Activate pre-commit hooks:

    pre-commit install

Command Line Interface (CLI)

1. Dataset Mining

Mines repositories from PyPI and GitHub and writes them to a CSV file.

poetry run mine --sample-size <size> --random-seed <seed> --project-list-file <file> [--use-top-packages] [--top-packages-count <count>] [--redirect-github-urls] [--remove-duplicates] [--remove-no-github-url-found] [--csv-output <file>]

Options:

  • --sample-size: Number of projects to sample. If not set, all projects are mined.
  • --random-seed: Random seed for reproducibility.
  • --project-list-file: Path to the project list file. If not set, all projects are considered.
  • --use-top-packages: Use top PyPI packages from hugovk.github.io/top-pypi-packages/ instead of fetching all PyPI projects (default: False).
  • --top-packages-count: Number of top PyPI packages to use together with --use-top-packages.
  • --redirect-github-urls: Follow GitHub redirects (default: True).
  • --remove-duplicates: Remove duplicate projects (default: True).
  • --remove-no-github-url-found: Remove projects without GitHub URLs (default: True).
  • --csv-output: Path to store the CSV output (default: output/repos.csv).

Examples:

# Sample 10 random PyPI projects
poetry run mine --sample-size 10 --random-seed 42 --csv-output output/repos.csv

# Use top 50 PyPI packages
poetry run mine --use-top-packages --top-packages-count 50 --csv-output output/top_repos.csv

GitHub Token

To allow for more GitHub requests, generate and add a GitHub token:

  1. Generate a GitHub token.
  2. Set the token in your environment variables or add a .github_token file in the root directory of the project.
    • Windows:
      setx GITHUB_TOKEN <your_token>
    • Linux/macOS (Bash):
      echo 'export GITHUB_TOKEN=<your_token>' >> ~/.bashrc
      source ~/.bashrc

2. Dataset Creation

Creates a JSON file for the dataset.

poetry run dataset --csv-file <csv> --tmp-dir <dir> [--keep-tmp-dir] [--output <json>] [--filters <filters>]

Options:

  • --csv-file: Path to the CSV file containing project definitions (required).
  • --tmp-dir: Temporary directory to clone repositories (required).
  • --keep-tmp-dir: Retain the temporary directory after execution (optional).
  • --output: Path to the output JSON file (default: output.json).
  • --filters: Comma-separated list of filters to apply (default: NoStringTypeFilter,EmptyFilter).

Example:

poetry run dataset --csv-file src/res/repos.csv --tmp-dir tmp --output output/dataset.json

Available Filters

The following filters can be applied during dataset creation:

  • EmptyFilter: Removes empty modules and classes from the dataset.

    • A class is considered empty if it has no methods.
    • A module is considered empty if it has no functions and no classes.
  • NoStringTypeFilter: Keeps only functions and methods that have a string parameter or return type.

    • Only considers exact str type annotations, not container types like list[str] or dict[str, str].
    • Functions/methods without any string parameters or return types are removed.
  • PrivateModuleFilter: Removes all non-public modules from the dataset.

    • In Python, modules that start with an underscore are considered non-public (private, package private, internal, etc.).
    • Only modules with names that don't start with an underscore are kept.
  • TestModuleFilter: Removes all test modules from the dataset.

    • Identifies test modules by checking module names starting with 'test_'.
    • Also removes modules located in directories that start with 'test' or contain '/test' in their path.

You can specify multiple filters by separating them with commas:

poetry run dataset --csv-file src/res/repos.csv --tmp-dir tmp --filters NoStringTypeFilter,EmptyFilter

3. Provide Dataset

From the dataset JSON file, the tool can provide either the relevant methods only or the entire repository code along with its dependencies in a requirements.txt file.

Providing Methods

poetry run provide methods --dataset <dataset> [--output-dir <dir>] [--without-type-annotations]

Options:

  • --dataset: Path to the dataset JSON file (required).
  • --output-dir: Directory to store extracted methods (default: all_code).
  • --without-type-annotations: Remove type annotations from methods (optional).

Example:

poetry run provide methods --dataset src/res/dataset.json --output-dir output

Providing Repositories

poetry run provide repositories --dataset <dataset> [--output-dir <dir>] [--without-type-annotations]

Options:

  • --dataset: Path to the dataset JSON file (required).
  • --output-dir: Directory to store extracted repository code (default: all_code).
  • --without-type-annotations: Remove type annotations from the code (optional).

Example:

poetry run provide repositories --dataset src/res/dataset.json --output-dir output

Dataset Structure

Repository

repository:
  name: string  # Name of the repository (matches the PyPI name)
  url: string  # Repository URL
  pypi_tag: string  # PyPI release tag of the repository
  git_commit_hash: string  # Specific commit hash of the repository
  modules: list  # List of modules in the repository

Module

module:
  name: string  # Module name
  file_path: string  # Relative path to the file containing the module
  functions: list  # List of standalone functions in the module
  classes: list  # List of classes in the module

Function

function:
  identifier: string  # Function name
  parameters: list  # List of parameters of the function
  annotations: string  # Function annotations
  return: string | null  # Return type of the function
  body: string  # Source code of the function
  signature: string  # Function signature (name + parameters + return type)
  full_signature: string  # Full function signature (annotations + name + parameters + return type)
  file: string  # Relative path to the file containing the function

Class

class:
  identifier: string  # Class name
  methods: list  # List of methods in the class
  superclasses: list  # Superclasses of the class
  fields: list  # Class fields (attributes)
  file: string  # Relative path to the file containing the class

Method

method:
  identifier: string  # Method name
  parameters: list  # List of parameters of the method
  annotations: string  # Method annotations
  return: string | null  # Return type of the method
  body: string  # Source code of the method
  signature: string  # Method signature (name + parameters + return type)
  full_signature: string  # Full method signature (annotations + name + parameters + return type)
  constructor: boolean  # Whether the method is a constructor

Parameter

parameter:
  identifier: string  # Parameter name
  type: string | null  # Type annotation of the parameter
  line_number: int  # Line number of the parameter definition
  col_offset: int  # Column offset of the parameter definition
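
As a sketch of how the structure above could be consumed, the following walks one repository record and lists its functions and methods. It assumes that the nested lists contain objects shaped like the definitions above; the helper `iter_callables` is hypothetical, only a subset of the fields is shown, and all values in the sample record are made up for illustration.

```python
from collections.abc import Iterator


def iter_callables(repository: dict) -> Iterator[tuple[str, dict]]:
    """Yield (qualified name, record) pairs for all functions and methods."""
    for module in repository["modules"]:
        for function in module["functions"]:
            yield f'{module["name"]}.{function["identifier"]}', function
        for cls in module["classes"]:
            for method in cls["methods"]:
                name = f'{module["name"]}.{cls["identifier"]}.{method["identifier"]}'
                yield name, method


# Made-up sample record; real datasets contain more fields per entry.
repository = {
    "name": "example",
    "url": "https://example.com/example.git",
    "pypi_tag": "1.0.0",
    "git_commit_hash": "0000000",
    "modules": [
        {
            "name": "text",
            "file_path": "example/text.py",
            "functions": [
                {"identifier": "shout", "return": "str",
                 "parameters": [{"identifier": "s", "type": "str"}]},
            ],
            "classes": [
                {"identifier": "Greeter", "methods": [
                    {"identifier": "__init__", "return": None,
                     "parameters": [], "constructor": True},
                ]},
            ],
        }
    ],
}

for qualified_name, record in iter_callables(repository):
    print(qualified_name, "->", record["return"])
# text.shout -> str
# text.Greeter.__init__ -> None
```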
