GitHub Diffs
Description
The dataset is stored on BigQuery as a table of commit hashes and messages.
Procedure
From each commit hash and message, produce a dict containing:
- Raw files before changes
- Commit message
- Diff file
For each commit, this requires downloading the files as they are after the change and applying the patch in reverse to recover the files as they were before it.
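The reverse-patch step can be sketched as follows. This is a minimal illustration and not the code from the gist: the function name `reverse_apply` is hypothetical, and it assumes a unified diff covering a single file, with file contents given as lists of lines without trailing newlines.

```python
import re

# Matches hunk headers such as "@@ -3,7 +3,7 @@".
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def reverse_apply(after_lines, diff_text):
    """Rebuild the pre-commit file from the post-commit lines and a unified diff.

    Hypothetical helper: assumes diff_text describes exactly one file.
    """
    before = []
    pos = 0  # 0-based index into after_lines
    in_hunk = False
    for line in diff_text.splitlines():
        header = HUNK_HEADER.match(line)
        if header:
            tgt_start = int(header.group(3))
            tgt_len = int(header.group(4)) if header.group(4) is not None else 1
            # With a zero-length target range, the start is the line *before* the hunk.
            hunk_begin = tgt_start - 1 if tgt_len else tgt_start
            # Lines between hunks are unchanged, so copy them straight across.
            before.extend(after_lines[pos:hunk_begin])
            pos = hunk_begin
            in_hunk = True
        elif not in_hunk:
            continue  # skip the "---" / "+++" file headers
        elif line.startswith("+"):
            pos += 1  # added by the commit: absent from the before file
        elif line.startswith("-"):
            before.append(line[1:])  # removed by the commit: restore it
        elif line.startswith(" "):
            before.append(after_lines[pos])  # context line: present in both
            pos += 1
    before.extend(after_lines[pos:])  # unchanged tail after the last hunk
    return before


after = ["setup(", "  version = '0.26.3',", "  license='MIT',", ")"]
diff = (
    "--- a/setup.py\n"
    "+++ b/setup.py\n"
    "@@ -1,4 +1,4 @@\n"
    " setup(\n"
    "-  version = '0.26.1',\n"
    "+  version = '0.26.3',\n"
    "   license='MIT',\n"
    " )\n"
)
before = reverse_apply(after, diff)
```

In practice `git apply -R` or a patch library does the same job more robustly; the sketch only shows why the after file plus the diff is enough to recover the before file.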
We also need to choose a length threshold to filter on: most or all of the before file has to fit in the context window, which significantly restricts the number of lines a file can have.
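One way such a filter might look is sketched below. The function name `within_budget`, the whitespace-based token count, and the default of 2048 tokens are all placeholder assumptions; the actual threshold is still to be decided, and a real tokenizer would give different counts.

```python
def within_budget(before_file, diff_text, max_tokens=2048):
    """Return True if the before file plus the diff fit in a token budget.

    Hypothetical filter: whitespace splitting is a crude stand-in for the
    model's tokenizer, and max_tokens is an arbitrary placeholder.
    """
    n_tokens = sum(len(line.split()) for line in before_file)
    n_tokens += len(diff_text.split())
    return n_tokens <= max_tokens


example_before = ["from setuptools import setup, find_packages\n", "\n", "setup(\n"]
example_diff = "- version = '0.26.1',\n+ version = '0.26.3',\n"
keep = within_budget(example_before, example_diff, max_tokens=2048)
```

Counting both the before file and the diff reflects that both have to share the same context window.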
Minimal working example here: https://gist.github.com/herbiebradley/b08d2e13775384fe4b5353e831dac43a
Example
An example of the columns and data:
| before_file | commit_message | diff |
| --- | --- | --- |
| ['from setuptools import setup, find_packages\n', '\n', 'setup(\n', ... ] | Change version | [{'addition_count': 1, 'deletion_count': 1, 'hunks': [[[3, 7], [3, 7], '', ' setup(', " name = 'denoising-diffusion-pytorch',", ' packages = find_packages(),', "- version = '0.26.1',", "+ version = '0.26.3',", " license='MIT',", " description = 'Denoising Diffusion Probabilistic " "Models - Pytorch',", " author = 'Phil Wang',"]], 'patch_info': <PatchInfo: diff --git a/setup.py b/setup.py>, 'src_file': 'a/setup.py', 'tgt_file': 'b/setup.py'}] |