140 changes: 140 additions & 0 deletions REWRITE-README.md
@@ -0,0 +1,140 @@
# Git History Rewrite for Open-Sourcing

This directory contains scripts to rewrite git history for open-sourcing the h2 repository.

## What It Does

The rewrite process performs the following transformations on **all commits** in the repository history:

1. **Adds BSD-3-Clause-Clear copyright headers** to all source files (.c, .h, .S, .py, .sh, .pl, etc.)
2. **Normalizes email addresses** to @qti.qualcomm.com
3. **Removes internal references** from commit messages (github.qualcomm.com, Q6Auto, JIRA)
4. **Adds Signed-off-by lines** to all commit messages
5. **Sets committer = author** for all commits

## Files

- **rewrite-history.sh** - Master script that orchestrates the entire rewrite
- **add-copyright-file-callback.py** - Adds copyright headers to source files
- **email-fixes.py** - Normalizes email addresses
- **commit-callback.py** - Fixes author/committer names and adds Signed-off-by
- **sanitize-commit-messages.py** - Removes internal references from commit messages
- **git-filter-repo** - The git-filter-repo tool

## Usage

### For a Single Branch

```bash
# 1. Clone the repository (or checkout the branch you want to rewrite)
git clone <repo-url> h2-rewrite
cd h2-rewrite

# 2. Copy all the rewrite scripts to the repository root
cp /path/to/scripts/* .

# 3. Run the rewrite script
./rewrite-history.sh

# Or skip confirmation prompt:
./rewrite-history.sh --force
```

### For Multiple Branches

To rewrite multiple branches, either rewrite each branch in its own clone (Method 1) or rewrite every branch at once in a single clone (Method 2):

```bash
# Method 1: Rewrite each branch in a separate clone
for branch in work develop feature-x; do
    echo "Processing branch: $branch"
    git clone <repo-url> "h2-$branch"
    cd "h2-$branch"
    git checkout "$branch"
    cp /path/to/scripts/* .
    ./rewrite-history.sh --force
    cd ..
done

# Method 2: Rewrite all branches in one go (advanced)
# This rewrites ALL branches at once since git-filter-repo processes all refs
git clone <repo-url> h2-all-branches
cd h2-all-branches
cp /path/to/scripts/* .
./rewrite-history.sh --force
# All branches will be rewritten
```

## Important Notes

### Before Running

1. **Make a backup!** This operation rewrites git history and cannot be easily undone
2. **Use a fresh clone** - Don't run this on your working repository
3. **Ensure all required files are present** - The script will check for this

### After Running

1. The `origin` remote will be removed (this is normal for git-filter-repo)
2. You'll need to add a new remote and force-push:
```bash
git remote add new-origin <new-repo-url>
git push new-origin --all --force
git push new-origin --tags --force
```

### Expected Results

- **Commit count**: May be slightly lower than the original (typically 4-5 commits are dropped; see "Commit count decreased" under Troubleshooting)
- **Copyright headers**: Present in all source files throughout entire history
- **Internal references**: Completely removed from commit messages
- **Email addresses**: All normalized to @qti.qualcomm.com

## Validation

The script automatically validates the rewrite and reports:
- ✓ Copyright headers present
- ✓ No internal references found
- ✓ Number of unique committers

You can also manually check:

```bash
# Check copyright in a file
git show HEAD:path/to/file.c | head -10

# Check for internal references (%B prints the full message, not just the subject)
git log --all --format='%B' | grep -iE 'github\.qualcomm\.com|Q6Auto|JIRA'

# List all committers
git log --all --format='%cn <%ce>' | sort -u
```
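
The internal-reference check can also be scripted. The following is a minimal sketch, not one of the shipped scripts: the names `find_internal_references` and `read_commit_messages` are hypothetical, and the patterns are assumptions based on the reference types listed under "What It Does" (github.qualcomm.com, Q6Auto, JIRA).

```python
import re
import subprocess

# Assumed patterns, taken from the reference types named in this README.
INTERNAL_PATTERNS = [
    re.compile(r"github\.qualcomm\.com", re.IGNORECASE),
    re.compile(r"Q6Auto", re.IGNORECASE),
    re.compile(r"JIRA", re.IGNORECASE),
]


def find_internal_references(messages):
    """Return (index, pattern) pairs for every message that matches a pattern."""
    hits = []
    for index, message in enumerate(messages):
        for pattern in INTERNAL_PATTERNS:
            if pattern.search(message):
                hits.append((index, pattern.pattern))
    return hits


def read_commit_messages():
    """Read all commit messages from the current repository (hypothetical helper)."""
    # %B prints the full commit message; %x00 separates commits with NUL bytes.
    log = subprocess.run(
        ["git", "log", "--all", "--format=%B%x00"],
        capture_output=True, text=True, errors="replace", check=True,
    ).stdout
    return log.split("\0")
```

Calling `find_internal_references(read_commit_messages())` inside the rewritten clone should return an empty list if the sanitization worked. Note that a bare `JIRA` pattern is deliberately broad and may flag harmless text.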

## Troubleshooting

### "Not in a git repository"
Make sure you're in the root of a git repository.

### "Required file not found"
Ensure all script files are in the current directory.

### "origin remote removed"
This is expected. Add a new remote to push to the new repository.

### Commit count decreased
This is normal. A few commits (typically 4-5) are filtered out because they are phantom references to non-existent commits in merge messages.

## Technical Details

The rewrite uses `git-filter-repo` with multiple callbacks:

1. **file-info-callback**: Modifies file contents to add copyright headers
2. **email-callback**: Normalizes email addresses
3. **commit-callback**: Fixes names and adds Signed-off-by
4. **message-callback**: Sanitizes commit messages

Each callback is applied to every commit in the repository history, ensuring consistent transformations throughout.
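
As an illustration of the commit-level steps (committer = author, Signed-off-by), here is a minimal sketch. It is an assumption about what such a callback could look like, not the contents of `commit-callback.py`: the function name `fix_commit` is hypothetical, and a `SimpleNamespace` stands in for the commit object. The attribute names (`author_name`, `author_email`, `committer_name`, `committer_email`, `message`, all bytes) follow the commit objects that git-filter-repo passes to commit callbacks.

```python
from types import SimpleNamespace


def fix_commit(commit):
    """Set committer = author and append a Signed-off-by trailer if missing."""
    commit.committer_name = commit.author_name
    commit.committer_email = commit.author_email

    trailer = (b"Signed-off-by: " + commit.author_name
               + b" <" + commit.author_email + b">")
    if trailer not in commit.message:
        # Separate the trailer from the message body with one blank line.
        commit.message = commit.message.rstrip(b"\n") + b"\n\n" + trailer + b"\n"
    return commit


# Stand-in commit object, for demonstration only.
demo = SimpleNamespace(
    author_name=b"Jane Doe",
    author_email=b"jane@qti.qualcomm.com",
    committer_name=b"Build Bot",
    committer_email=b"bot@example.com",
    message=b"Fix timer rollover\n",
)
fix_commit(demo)
print(demo.message.decode())
```

The sketch leaves commit dates untouched and does not merge the new trailer into an existing trailer block; whether the real callback does either is not stated here.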

## Copyright

All scripts include the BSD-3-Clause-Clear copyright header that will be added to source files.
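
For reference, the sketch below renders the header for a C-style file and for a shell script with a shebang line. The copyright text and comment delimiters are the ones used by the callback scripts in this change; the helper names `render_header` and `prepend_header` are hypothetical and exist only for this example.

```python
COPYRIGHT_LINES = [
    b"Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.",
    b"SPDX-License-Identifier: BSD-3-Clause-Clear",
]


def render_header(start, middle, end):
    """Wrap the copyright lines in the given comment delimiters."""
    return start + middle.join(COPYRIGHT_LINES) + end


def prepend_header(content, header):
    """Place the header after a leading shebang line, if there is one."""
    if content.startswith(b"#!"):
        shebang, _, rest = content.partition(b"\n")
        return shebang + b"\n" + header + rest
    return header + content


c_header = render_header(b"/*\n * ", b"\n * ", b"\n */\n\n")
sh_header = render_header(b"# ", b"\n# ", b"\n\n")

print(prepend_header(b"int x;\n", c_header).decode())
print(prepend_header(b"#!/bin/sh\necho hi\n", sh_header).decode())
```

Keeping the shebang on the first line matters because the kernel only honors `#!` at the very start of a script.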
77 changes: 77 additions & 0 deletions add-copyright-blob-callback.py
@@ -0,0 +1,77 @@
#!/usr/bin/env python3
"""
Blob callback for git-filter-repo to add copyright headers to source files.
This modifies file contents in git history to add BSD-3-Clause-Clear headers.
"""

# Copyright text
COPYRIGHT_TEXT = b"""Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.
SPDX-License-Identifier: BSD-3-Clause-Clear"""

# File extensions and their comment styles (as bytes)
COMMENT_STYLES = {
    b'.c': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.h': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.cpp': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.hpp': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.cc': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.S': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.s': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.py': (b'# ', b'\n# ', b'\n\n'),
    b'.sh': (b'# ', b'\n# ', b'\n\n'),
    b'.pl': (b'# ', b'\n# ', b'\n\n'),
    b'.java': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.js': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
}

def has_copyright(content):
    """Check if file already has a copyright header."""
    first_part = content[:500].lower()
    return b'copyright' in first_part or b'spdx-license-identifier' in first_part

def get_file_extension(filename):
    """Get file extension as bytes."""
    if b'.' not in filename:
        return None
    return b'.' + filename.rsplit(b'.', 1)[1]

def add_copyright_to_blob(blob):
    """Add copyright header to blob content."""
    # Get filename from blob.
    # NOTE: git-filter-repo blob objects normally carry no filename (the same
    # blob can appear at several paths), so this usually falls back to b''
    # and the blob is left unchanged. The file-info callback avoids this.
    filename = blob.filename if hasattr(blob, 'filename') else b''

    # Get file extension
    ext = get_file_extension(filename)
    if ext not in COMMENT_STYLES:
        return  # Not a file type we handle

    # Get original content
    original_data = blob.data

    # Check if already has copyright
    if has_copyright(original_data):
        return  # Already has copyright

    # Get comment style
    start, middle, end = COMMENT_STYLES[ext]

    # Handle shebang for scripts
    shebang = b""
    content = original_data
    if content.startswith(b'#!'):
        lines = content.split(b'\n', 1)
        shebang = lines[0] + b'\n'
        content = lines[1] if len(lines) > 1 else b""

    # Create copyright header
    copyright_lines = COPYRIGHT_TEXT.split(b'\n')
    header = start + middle.join(copyright_lines) + end

    # Combine: shebang + copyright + original content
    new_data = shebang + header + content

    # Update blob data
    blob.data = new_data

# This is the callback function that git-filter-repo will call
add_copyright_to_blob(blob)
101 changes: 101 additions & 0 deletions add-copyright-file-callback.py
@@ -0,0 +1,101 @@
#!/usr/bin/env python3
"""
File-info callback for git-filter-repo to add copyright headers to source files.
This modifies file contents in git history to add BSD-3-Clause-Clear headers.
"""

import re

# Copyright text (use explicit newline to avoid indentation issues)
COPYRIGHT_TEXT = b"Copyright (c) Qualcomm Technologies, Inc. and/or its subsidiaries.\nSPDX-License-Identifier: BSD-3-Clause-Clear"

# File extensions and their comment styles (as bytes)
COMMENT_STYLES = {
    b'.c': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.h': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.cpp': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.hpp': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.cc': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.S': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.s': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.py': (b'# ', b'\n# ', b'\n\n'),
    b'.sh': (b'# ', b'\n# ', b'\n\n'),
    b'.pl': (b'# ', b'\n# ', b'\n\n'),
    b'.java': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
    b'.js': (b'/*\n * ', b'\n * ', b'\n */\n\n'),
}

def has_new_copyright(content):
    """Check if file already has the NEW copyright header."""
    first_part = content[:500]
    return b'SPDX-License-Identifier: BSD-3-Clause-Clear' in first_part

def remove_old_copyright(content):
    """Remove old copyright headers from content."""
    # Pattern 1: Old Qualcomm copyright blocks with ====== borders.
    # These typically start with /*====== and end with ======*/,
    # so match from /*====== to the closing ======*/.
    pattern1 = rb'/\*={5,}.*?={5,}\*/'
    content = re.sub(pattern1, b'', content, flags=re.DOTALL)

    # Pattern 2: Simple copyright lines like "Copyright (c) 2013 by Qualcomm..."
    # Remove standalone copyright comments
    pattern2 = rb'/\*\s*Copyright \(c\).*?\*/'
    content = re.sub(pattern2, b'', content, flags=re.DOTALL)

    # Clean up multiple blank lines that may result
    content = re.sub(rb'\n\n\n+', b'\n\n', content)

    # Remove leading blank lines
    content = content.lstrip(b'\n')

    return content

def get_file_extension(filename):
    """Get file extension as bytes."""
    if b'.' not in filename:
        return None
    return b'.' + filename.rsplit(b'.', 1)[1]

# Skip symbolic links (mode 120000 in octal)
if mode == b'120000':
    return (filename, mode, blob_id)

# Get file extension
ext = get_file_extension(filename)

# Only process files with known extensions
if ext in COMMENT_STYLES:
    # Get original content
    original_data = value.get_contents_by_identifier(blob_id)

    # Check if already has the NEW copyright
    if not has_new_copyright(original_data):
        # Remove any old copyright headers first
        content = remove_old_copyright(original_data)

        # Get comment style
        start, middle, end = COMMENT_STYLES[ext]

        # Handle shebang for scripts
        shebang = b""
        if content.startswith(b'#!'):
            lines = content.split(b'\n', 1)
            shebang = lines[0] + b'\n'
            content = lines[1] if len(lines) > 1 else b""

        # Create copyright header
        copyright_lines = COPYRIGHT_TEXT.split(b'\n')
        header = start + middle.join(copyright_lines) + end

        # Combine: shebang + copyright + original content
        new_data = shebang + header + content

        # Insert new blob and get new blob_id
        blob_id = value.insert_file_with_contents(new_data)

# Return the (possibly modified) file info
return (filename, mode, blob_id)