21 commits
All commits are by arifinahmad99-cloud:

- `72af7eb` Implement multi-language support for UI (Jan 10, 2026)
- `d4a3a35` Add GitHub Actions workflow for Azure Node.js deployment (Jan 11, 2026)
- `0112e22` Merge pull request #1 from arifinahmad99-cloud/patch-1 (Jan 13, 2026)
- `f4f7155` Update privacy policy last updated date (Jan 15, 2026)
- `af40f75` Revert "corrected crawler README.md MySQL link" (Jan 18, 2026)
- `dd7f06e` Create README.md (Jan 18, 2026)
- `453469b` Create CONTRIBUTING.md (Jan 18, 2026)
- `ea3c366` Create node.yml (Jan 18, 2026)
- `c81bf0c` Create CONTRIBUTING.md (Jan 19, 2026)
- `bc35966` Create LICENSE (Jan 19, 2026)
- `aa2de45` Merge pull request #2 from arifinahmad99-cloud/revert-1-main (Jan 19, 2026)
- `0862923` Add Python CI/CD workflow with JSON processing (Feb 11, 2026)
- `21efb24` Add JSON data processor and validator script (Feb 11, 2026)
- `5950012` Add requirements.txt for project dependencies (Feb 11, 2026)
- `9be26c5` Revise README for ExplorePi project details (Feb 11, 2026)
- `d4a50ae` Remove author section from README (Feb 11, 2026)
- `69d5107` Add FastAPI JSON Data API service (Feb 11, 2026)
- `fa5a1cf` Add JSON data synchronization workflow (Feb 11, 2026)
- `2abeb65` Add .gitignore for Python and general project files (Feb 11, 2026)
- `35b88ff` Update and rename azure-webapps-node.yml to docker-compose.yml (Feb 11, 2026)
- `8e8020e` Add initial devcontainer configuration (Feb 14, 2026)
4 changes: 4 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,4 @@
{
"image": "mcr.microsoft.com/devcontainers/universal:2",
"features": {}
}
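
The configuration above uses the prebuilt universal image with no extra features. If the project later needs a pinned interpreter version, a devcontainer feature could be declared — a sketch, where the feature id is the standard devcontainers Python feature and the version shown is an assumption:

```json
{
  "image": "mcr.microsoft.com/devcontainers/universal:2",
  "features": {
    "ghcr.io/devcontainers/features/python:1": {
      "version": "3.11"
    }
  }
}
```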
127 changes: 127 additions & 0 deletions .github/workflows/.gitignore
@@ -0,0 +1,127 @@
# Python
__pycache__/
*.py[cod]
*$py.class
*.so
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
*.manifest
*.spec
pip-log.txt
pip-delete-this-directory.txt

# Virtual Environment
venv/
ENV/
env/
.venv

# IDE
.vscode/
.idea/
*.swp
*.swo
*~
.DS_Store

# Testing
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.log
.hypothesis/
.pytest_cache/
htmlcov/
.mypy_cache/
.dmypy.json
dmypy.json

# Documentation
docs/_build/
.sphinx/

# Environment
.env
.env.local
.env.*.local
*.env

# Database
*.db
*.sqlite
*.sqlite3

# Backups
backups/
*.tar.gz
*.zip
*.bak

# Logs
logs/
*.log
npm-debug.log*
yarn-debug.log*
yarn-error.log*

# OS
.DS_Store
Thumbs.db
desktop.ini

# Temporary
tmp/
temp/
*.tmp

# Reports
reports/*.html
reports/*.json
!reports/.gitkeep

# Processed data (keep structure, ignore content)
processed/*
!processed/.gitkeep

# Security
*.pem
*.key
*.cert
secrets/

# Node (if using any JS tools)
node_modules/
package-lock.json

# Jupyter Notebook
.ipynb_checkpoints
*.ipynb

# Profiling
*.prof
*.lprof
.profiling/

# Docker
docker-compose.override.yml
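
The "keep structure, ignore content" patterns above rely on `!` negation. Their behavior can be sanity-checked with `git check-ignore` in a throwaway repository — a sketch; the path names are illustrative:

```shell
# Verify the reports/ and processed/ negation patterns in a scratch repo.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
printf '%s\n' 'reports/*.html' '!reports/.gitkeep' \
              'processed/*' '!processed/.gitkeep' > .gitignore

# -v prints which pattern matched; exit status 0 means "ignored".
git check-ignore -v reports/out.html      # matched by reports/*.html
git check-ignore -v processed/data.json   # matched by processed/*
git check-ignore -q processed/.gitkeep || echo "processed/.gitkeep is kept"
```

Note that `!processed/.gitkeep` only works because the ignore rule is `processed/*` rather than `processed/`; Git cannot re-include a file inside a directory that is itself excluded.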
224 changes: 224 additions & 0 deletions .github/workflows/CONTRIBUTING.md
@@ -0,0 +1,224 @@
# Contributing to Crawler

Thank you for your interest in contributing to the Crawler project! We welcome contributions from everyone. This document provides guidelines and instructions for contributing.

## Code of Conduct

We are committed to providing a welcoming and inspiring community for all. Please be respectful and constructive in all interactions. Harassment, discrimination, or disruptive behavior will not be tolerated.

## How to Contribute

There are many ways to contribute to this project:

- **Report bugs** by opening an issue with detailed information
- **Suggest features** with clear use cases and expected behavior
- **Improve documentation** by fixing typos or clarifying confusing sections
- **Submit code changes** by creating pull requests with meaningful improvements
- **Review pull requests** and provide constructive feedback to other contributors

## Getting Started

### Prerequisites

- Python 3.8 or higher
- Git
- A MySQL database for testing (optional but recommended)
- A code editor or IDE of your choice

### Setting Up Your Development Environment

1. Fork the repository on GitHub
2. Clone your fork locally:
   ```bash
   git clone https://github.com/your-username/crawler.git
   cd crawler
   ```
3. Create a virtual environment:
   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```
4. Install development dependencies:
   ```bash
   pip install -r requirements-dev.txt
   ```
5. Create a local `.env` file for testing:
   ```bash
   cp .env.example .env
   ```
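
The contents of `requirements-dev.txt` are not shown in this diff. Based on the tools this guide actually invokes (pytest, `pytest --cov`) and the `requests` library used in the code examples, a minimal version might look like the following — the exact pins are assumptions:

```
pytest>=7.0
pytest-cov>=4.0
requests>=2.28
```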

## Making Changes

### Branch Naming

Create a descriptive branch name for your changes:
- `feature/add-proxy-support`
- `bugfix/fix-mysql-connection-timeout`
- `docs/improve-readme`
- `test/add-crawler-tests`

```bash
git checkout -b feature/your-feature-name
```

### Code Style

Follow these guidelines to maintain consistent code quality:

- Use PEP 8 style guide for Python code
- Keep lines under 100 characters when possible
- Use meaningful variable and function names
- Add docstrings to functions and classes
- Use type hints where applicable

Example:
```python
def fetch_url(url: str, timeout: int = 10) -> str:
    """
    Fetch content from a given URL.

    Args:
        url: The URL to fetch
        timeout: Request timeout in seconds (default: 10)

    Returns:
        The HTML content of the page

    Raises:
        requests.exceptions.RequestException: If the request fails
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

### Testing

Before submitting a pull request, ensure your code passes all tests:

```bash
# Run all tests
pytest

# Run tests with coverage
pytest --cov=crawler

# Run specific test file
pytest tests/test_crawler.py
```

Write tests for new features:
```python
def test_fetch_url_success():
    """Test that fetch_url returns content for valid URLs."""
    result = fetch_url("https://example.com")
    assert result is not None
    assert len(result) > 0
```
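
A test like the one above depends on `example.com` being reachable, so it can fail in offline CI. A network-free variant can mock the HTTP call instead. This sketch inlines a stand-in `requests` namespace purely so the example is self-contained; in the project itself you would patch `requests.get` on the real module:

```python
import types
from unittest import mock

# Stand-in for the requests module so this snippet runs without the
# real dependency installed (an assumption for illustration only).
requests = types.SimpleNamespace(get=None)

def fetch_url(url: str, timeout: int = 10) -> str:
    """Minimal copy of the documented helper, for a self-contained example."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text

def test_fetch_url_success_mocked():
    """fetch_url returns the response body without touching the network."""
    fake = mock.Mock(text="<html>ok</html>")
    with mock.patch.object(requests, "get", return_value=fake) as get:
        result = fetch_url("https://example.com")
    assert result == "<html>ok</html>"
    get.assert_called_once_with("https://example.com", timeout=10)
```

Mocked tests like this also let you exercise failure paths (timeouts, HTTP errors) by configuring `raise_for_status` to raise.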

### Commits

Write clear, descriptive commit messages:

```bash
# Good
git commit -m "Add proxy support to crawler

- Add ProxyManager class to handle proxy rotation
- Update fetch_url to accept proxy configuration
- Add tests for proxy connection handling"

# Avoid
git commit -m "fix stuff"
git commit -m "changes"
```

## Submitting Changes

### Pull Request Process

1. Ensure all tests pass and code is formatted correctly
2. Push your branch to your fork:
```bash
git push origin feature/your-feature-name
```
3. Open a pull request on GitHub with:
- A clear title describing the change
- A detailed description of what was changed and why
- Reference to any related issues (e.g., "Fixes #123")
- Screenshots or examples if applicable
4. Address review comments and make requested changes
5. Ensure the CI/CD pipeline passes
6. Once approved, your PR will be merged

### Pull Request Template

```markdown
## Description
Brief explanation of what this PR does.

## Changes Made
- Change 1
- Change 2
- Change 3

## Related Issues
Fixes #123

## Testing
Describe how you tested these changes.

## Checklist
- [ ] Code follows style guidelines
- [ ] Tests pass locally
- [ ] Documentation is updated
- [ ] No breaking changes (or documented in PR)
```

## Reporting Bugs

When reporting bugs, please include:

- **Description**: What you were trying to do
- **Expected behavior**: What should have happened
- **Actual behavior**: What actually happened
- **Environment**: Python version, OS, MySQL version
- **Steps to reproduce**: Clear steps to replicate the issue
- **Error message**: Full error traceback if available
- **Screenshots**: If applicable

Example:
```
Title: Crawler fails with timeout on large datasets

Description: When crawling more than 10,000 pages, the crawler
consistently times out.

Steps to reproduce:
1. Configure crawler with 15,000 pages
2. Run `python crawler.py`
3. After ~8,000 pages, connection fails

Expected: Crawler should complete all 15,000 pages
Actual: Crawler crashes with timeout error

Environment: Python 3.9, Ubuntu 20.04, MySQL 8.0
```

## Suggesting Features

When suggesting features, explain:

- **Use case**: Why this feature is needed
- **Expected behavior**: How it should work
- **Alternative approaches**: Other possible implementations
- **Impact**: How it affects existing functionality

## Documentation

Help improve documentation by:

- Fixing typos and grammatical errors
- Adding missing sections or examples
- Clarifying confusing explanations
- Adding inline code comments for complex logic