42 changes: 42 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,42 @@
{
  "name": "ExplorePi Python Environment",
  "image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye",

  "features": {
    "ghcr.io/devcontainers/features/git:1": {},
    "ghcr.io/devcontainers/features/github-cli:1": {}
  },

  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance",
        "ms-python.black-formatter",
        "ms-python.isort",
        "ms-python.flake8",
        "charliermarsh.ruff",
        "github.copilot",
        "eamodio.gitlens"
      ],
      "settings": {
        "python.defaultInterpreterPath": "/usr/local/bin/python",
        "python.linting.enabled": true,
        "python.linting.flake8Enabled": true,
        "python.formatting.provider": "black",
        "editor.formatOnSave": true,
        "editor.codeActionsOnSave": {
          "source.organizeImports": "explicit"
        },
        "python.testing.pytestEnabled": true,
        "python.testing.unittestEnabled": false
      }
    }
  },

  "forwardPorts": [8000, 5000, 3000],

  "postCreateCommand": "pip install --upgrade pip && pip install -r requirements.txt",

  "remoteUser": "vscode"
}
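Note that the `postCreateCommand` above assumes a `requirements.txt` exists at the repository root; that file is not part of this diff. A hypothetical minimal version for a crawler project might look like the following (package choices are illustrative only, not taken from this PR):

```text
# Illustrative only -- the actual requirements.txt is not shown in this diff
requests>=2.31
beautifulsoup4>=4.12
mysql-connector-python>=8.0
```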
224 changes: 224 additions & 0 deletions crawler/CONTRIBUTING.md
@@ -0,0 +1,224 @@
# Contributing to Crawler

Thank you for your interest in contributing to the Crawler project! We welcome contributions from everyone. This document provides guidelines and instructions for contributing.

## Code of Conduct

We are committed to providing a welcoming and inspiring community for all. Please be respectful and constructive in all interactions. Harassment, discrimination, or disruptive behavior will not be tolerated.

## How to Contribute

There are many ways to contribute to this project:

- **Report bugs** by opening an issue with detailed information
- **Suggest features** with clear use cases and expected behavior
- **Improve documentation** by fixing typos or clarifying confusing sections
- **Submit code changes** by creating pull requests with meaningful improvements
- **Review pull requests** and provide constructive feedback to other contributors

## Getting Started

### Prerequisites

- Python 3.8 or higher
- Git
- A MySQL database for testing (optional but recommended)
- A code editor or IDE of your choice

### Setting Up Your Development Environment

1. Fork the repository on GitHub
2. Clone your fork locally:
```bash
git clone https://github.com/your-username/crawler.git
cd crawler
```
3. Create a virtual environment:
```bash
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
4. Install development dependencies:
```bash
pip install -r requirements-dev.txt
```
5. Create a local `.env` file for testing:
```bash
cp .env.example .env
```

## Making Changes

### Branch Naming

Create a descriptive branch name for your changes:
- `feature/add-proxy-support`
- `bugfix/fix-mysql-connection-timeout`
- `docs/improve-readme`
- `test/add-crawler-tests`

```bash
git checkout -b feature/your-feature-name
```

### Code Style

Follow these guidelines to maintain consistent code quality:

- Use PEP 8 style guide for Python code
- Keep lines under 100 characters when possible
- Use meaningful variable and function names
- Add docstrings to functions and classes
- Use type hints where applicable

Example:
```python
import requests

def fetch_url(url: str, timeout: int = 10) -> str:
"""
Fetch content from a given URL.

Args:
url: The URL to fetch
timeout: Request timeout in seconds (default: 10)

Returns:
The HTML content of the page

Raises:
requests.exceptions.RequestException: If the request fails
"""
response = requests.get(url, timeout=timeout)
response.raise_for_status()
return response.text
```

### Testing

Before submitting a pull request, ensure your code passes all tests:

```bash
# Run all tests
pytest

# Run tests with coverage
pytest --cov=crawler

# Run specific test file
pytest tests/test_crawler.py
```

Write tests for new features:
```python
def test_fetch_url_success():
"""Test that fetch_url returns content for valid URLs."""
result = fetch_url("https://example.com")
assert result is not None
assert len(result) > 0
```
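The test above performs a real network request, which can make CI runs slow and flaky. A self-contained alternative is to mock the network layer. The sketch below uses the standard library's `urllib` instead of `requests` so the example has no third-party dependency; the `fetch_url` here is a stand-in for the one in the style example, not the project's actual implementation:

```python
import urllib.request
from unittest import mock

def fetch_url(url: str, timeout: int = 10) -> str:
    """Stand-in for the crawler's fetch_url, built on urllib for illustration."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")

def test_fetch_url_with_mocked_network():
    """Exercise fetch_url without touching the network."""
    fake = mock.MagicMock()
    fake.read.return_value = b"<html>ok</html>"
    fake.__enter__.return_value = fake  # urlopen's result is used as a context manager
    with mock.patch("urllib.request.urlopen", return_value=fake):
        assert fetch_url("https://example.com") == "<html>ok</html>"
```

With this pattern, the test asserts on the decoding and return logic rather than on whether example.com happens to be reachable.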

### Commits

Write clear, descriptive commit messages:

```bash
# Good
git commit -m "Add proxy support to crawler

- Add ProxyManager class to handle proxy rotation
- Update fetch_url to accept proxy configuration
- Add tests for proxy connection handling"

# Avoid
git commit -m "fix stuff"
git commit -m "changes"
```

## Submitting Changes

### Pull Request Process

1. Ensure all tests pass and code is formatted correctly
2. Push your branch to your fork:
```bash
git push origin feature/your-feature-name
```
3. Open a pull request on GitHub with:
- A clear title describing the change
- A detailed description of what was changed and why
- Reference to any related issues (e.g., "Fixes #123")
- Screenshots or examples if applicable
4. Address review comments and make requested changes
5. Ensure the CI/CD pipeline passes
6. Once approved, your PR will be merged

### Pull Request Template

```markdown
## Description
Brief explanation of what this PR does.

## Changes Made
- Change 1
- Change 2
- Change 3

## Related Issues
Fixes #123

## Testing
Describe how you tested these changes.

## Checklist
- [ ] Code follows style guidelines
- [ ] Tests pass locally
- [ ] Documentation is updated
- [ ] No breaking changes (or documented in PR)
```

## Reporting Bugs

When reporting bugs, please include:

- **Description**: What you were trying to do
- **Expected behavior**: What should have happened
- **Actual behavior**: What actually happened
- **Environment**: Python version, OS, MySQL version
- **Steps to reproduce**: Clear steps to replicate the issue
- **Error message**: Full error traceback if available
- **Screenshots**: If applicable

Example:
```
Title: Crawler fails with timeout on large datasets

Description: When crawling more than 10,000 pages, the crawler
consistently times out.

Steps to reproduce:
1. Configure crawler with 15,000 pages
2. Run `python crawler.py`
3. After ~8,000 pages, connection fails

Expected: Crawler should complete all 15,000 pages
Actual: Crawler crashes with timeout error

Environment: Python 3.9, Ubuntu 20.04, MySQL 8.0
```

## Suggesting Features

When suggesting features, explain:

- **Use case**: Why this feature is needed
- **Expected behavior**: How it should work
- **Alternative approaches**: Other possible implementations
- **Impact**: How it affects existing functionality

## Documentation

Help improve documentation by:

- Fixing typos and grammatical errors
- Adding missing sections or examples
- Clarifying confusing explanations
- Adding inline code comments for complex logic
21 changes: 21 additions & 0 deletions crawler/LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 Crawler Project Contributors

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
2 changes: 1 addition & 1 deletion crawler/README.md
@@ -61,7 +61,7 @@ npm start

## ⛏️ Built Using <a name = "built_using"></a>

- [MYSQL](https://www.mysql.com/) - Database
+ [MYSQL](https://www.mongodb.com/) - Database
- [NodeJs](https://nodejs.org/en/) - Server Environment
- [StellarSDK](https://github.com/stellar/js-stellar-sdk) - BlockchainTool
## ✍️ Authors <a name = "authors"></a>