jooservices/crawlerx

CrawlerX

CrawlerX is a ready-to-run Laravel 12 crawler service. It exposes one generic provider-based API, validates common crawl input, queues a crawl job, and lets provider crawler classes parse responses behind the shared crawling contract.

Official API

POST /api/v1/{provider}

Example:

curl -X POST http://127.0.0.1:8000/api/v1/onejav \
  -H "Accept: application/json" \
  -d "url=https://onejav.com/new" \
  -d "callback_url=https://client-app.test/webhooks/crawlerx"

Queued response:

{
  "success": true,
  "message": "Crawl job queued.",
  "data": {
    "provider": "onejav",
    "url": "https://onejav.com/new",
    "status": "queued"
  }
}
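For clients that are not shell scripts, the same request can be built programmatically. The following is a minimal Python sketch using only the standard library. The helper names `build_crawl_request` and `is_queued` are illustrative and not part of CrawlerX, and `BASE_URL` assumes the local `php artisan serve` address used in the curl example.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the service runs locally, as in the curl example.
BASE_URL = "http://127.0.0.1:8000"


def build_crawl_request(provider: str, url: str, callback_url: str) -> Request:
    """Build the same form-encoded POST that the curl example sends."""
    body = urlencode({"url": url, "callback_url": callback_url}).encode("ascii")
    return Request(
        f"{BASE_URL}/api/v1/{provider}",  # the provider goes in the path, not the body
        data=body,
        headers={"Accept": "application/json"},
        method="POST",
    )


def is_queued(response_text: str) -> bool:
    """Return True when a response body has the shape of the queued response."""
    payload = json.loads(response_text)
    return payload.get("success") is True and payload.get("data", {}).get("status") == "queued"
```

Passing the returned object to `urllib.request.urlopen` sends the request. A queued response only means the job was accepted; the crawl result itself arrives later at `callback_url`.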

When the job finishes, it posts the completed or failed crawl result to callback_url. callback_url is required because crawl requests and results are never stored in a database, so the callback is the only way to receive a result. Unsupported providers return a clean JSON error and do not dispatch a job. The provider is always read from the route path, never from the request body.
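Since results are only delivered by callback, the client application needs an endpoint that accepts them. Below is a minimal receiver sketch in Python. It assumes the callback is an HTTP POST with a JSON body, which this README does not confirm, and it makes no assumption about the fields inside that body. The names `parse_callback`, `CallbackHandler`, and `serve`, and the port, are illustrative.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback(raw_body: bytes):
    """Decode a callback body. Assumption: the body is UTF-8 encoded JSON."""
    return json.loads(raw_body.decode("utf-8"))


class CallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", "0"))
        try:
            result = parse_callback(self.rfile.read(length))
        except (UnicodeDecodeError, json.JSONDecodeError):
            # Reject bodies that are not valid JSON.
            self.send_response(400)
            self.end_headers()
            return
        # Hand the result to your own code here; this sketch only logs it.
        print("crawl result received:", result)
        self.send_response(204)
        self.end_headers()


def serve(port: int = 9000) -> None:
    """Listen for callbacks until interrupted. The port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), CallbackHandler).serve_forever()
```

Calling `serve()` starts the listener. The callback_url sent with each crawl request would then point at this host and port.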

Architecture

POST /api/v1/{provider} -> CrawlerController -> CrawlRequest -> CrawlJob -> CrawlerService -> CrawlingResolver -> provider crawler -> AbstractBaseCrawling -> jooservices/client -> provider parse() -> CrawlingResultDto -> callback delivery.

The controller does not crawl or parse; it only passes the validated request to the queue. The queued job performs the crawl and sends the callback. Provider classes own only their endpoint, options, site code, and parsing.
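The split between the request step and the job step can be illustrated with a small, language-neutral sketch, written here in Python. The project's real classes are PHP and their signatures are not shown in this README, so every name below (`PROVIDERS`, `handle_request`, `run_job`), the error messages, and the callback result shape are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Parser = Callable[[str], dict]  # stand-in for a provider's parse()
Job = Tuple[str, str, str]      # (provider, url, callback_url)

# Hypothetical registry; the project keeps its real one in config/crawlerx.php.
PROVIDERS: Dict[str, Parser] = {"onejav": lambda html: {"length": len(html)}}

queue: List[Job] = []           # stand-in for the Laravel queue


def handle_request(provider: str, form: dict) -> dict:
    """Request step: validate and queue. It never crawls or parses."""
    if provider not in PROVIDERS:
        # Unsupported provider: JSON error, nothing is queued.
        return {"success": False, "message": "Unsupported provider."}
    if not form.get("url") or not form.get("callback_url"):
        return {"success": False, "message": "url and callback_url are required."}
    queue.append((provider, form["url"], form["callback_url"]))
    return {
        "success": True,
        "message": "Crawl job queued.",
        "data": {"provider": provider, "url": form["url"], "status": "queued"},
    }


def run_job(job: Job, fetch: Callable[[str], str], send: Callable[[str, dict], None]) -> None:
    """Job step: fetch, parse with the provider's parser, deliver to callback_url."""
    provider, url, callback_url = job
    try:
        data = PROVIDERS[provider](fetch(url))
        result = {"provider": provider, "url": url, "status": "completed", "data": data}
    except Exception as exc:  # a failed crawl is still reported to the callback
        result = {"provider": provider, "url": url, "status": "failed", "error": str(exc)}
    send(callback_url, result)
```

Here `fetch` and `send` are injected so that the HTTP client and the callback delivery can be replaced with fakes in tests, mirroring how the pipeline keeps crawling and delivery out of the controller.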

Local Setup

PHP target version: PHP 8.5.

composer install
cp .env.example .env
php artisan key:generate
php artisan migrate
php artisan serve

In a separate terminal, run the worker for the database queue:

php artisan queue:work --queue=crawlerx

Horizon

Horizon is installed for queue monitoring. Horizon requires Redis-backed queues, so keep the default database queue for simple local development unless Redis is configured.

To use Horizon, set the following in .env:

QUEUE_CONNECTION=redis
CRAWLERX_QUEUE=crawlerx

Then run:

php artisan horizon

Quality

composer lint
composer test

Never commit failing lint or tests.

Git Workflow

Work on the current branch unless asked otherwise. Before committing:

git status
git branch --show-current
git config user.name "Viet Vu"
git config user.email "jooservices@gmail.com"
git config user.name
git config user.email
composer update
composer lint
composer test

Use short, meaningful commit messages and group commits by feature area. If composer update changes composer.lock, commit it with the relevant change. Completed work must be committed locally after successful checks; do not leave finished work in git status.

Detailed workflow docs:

  • docs/01-development/04-git-workflow.md
  • docs/01-development/05-dependency-policy.md
  • .github/skills/git-workflow/SKILL.md
  • .github/skills/dependency-and-package-policy/SKILL.md

Add A Provider

  1. Create a provider crawler under app/Services/Crawling/Sites/.
  2. Extend AbstractBaseCrawling.
  3. Implement endpoint, options, site code, and parse().
  4. Register the provider in config/crawlerx.php.
  5. Add mocked client tests with fixture HTML.
  6. Update docs when the public API or contracts change.
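The shape of steps 2, 3, and 5 can be sketched as follows. This is a Python illustration only: the project's real base class is the PHP class AbstractBaseCrawling, whose method signatures are not shown in this README. `BaseCrawling`, `ExampleCrawler`, `fake_fetch`, the example.test URL, and the result shape are all hypothetical.

```python
from abc import ABC, abstractmethod
from html.parser import HTMLParser
from typing import Callable, List


class BaseCrawling(ABC):
    """Stand-in for AbstractBaseCrawling; the real PHP signatures may differ."""

    def __init__(self, fetch: Callable[[str, dict], str]) -> None:
        self._fetch = fetch  # injected HTTP client, so tests can pass a fake

    @abstractmethod
    def endpoint(self) -> str: ...

    @abstractmethod
    def options(self) -> dict: ...

    @abstractmethod
    def site_code(self) -> str: ...

    @abstractmethod
    def parse(self, html: str) -> dict: ...

    def crawl(self) -> dict:
        """Shared flow: fetch with the provider's settings, then parse."""
        html = self._fetch(self.endpoint(), self.options())
        return {"site": self.site_code(), **self.parse(html)}


class _TitleCollector(HTMLParser):
    """Collect the text of every <h2> element."""

    def __init__(self) -> None:
        super().__init__()
        self.titles: List[str] = []
        self._in_h2 = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_h2 = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_h2 = False

    def handle_data(self, data):
        if self._in_h2 and data.strip():
            self.titles.append(data.strip())


class ExampleCrawler(BaseCrawling):
    """Hypothetical provider: owns endpoint, options, site code, and parsing."""

    def endpoint(self) -> str:
        return "https://example.test/new"

    def options(self) -> dict:
        return {"timeout": 10}

    def site_code(self) -> str:
        return "example"

    def parse(self, html: str) -> dict:
        collector = _TitleCollector()
        collector.feed(html)
        collector.close()
        return {"titles": collector.titles}


# Fixture HTML plus a fake client: the test never touches the network.
FIXTURE_HTML = "<html><body><h2>First</h2><p>x</p><h2> Second </h2></body></html>"


def fake_fetch(url: str, options: dict) -> str:
    return FIXTURE_HTML
```

With the fake client injected, `ExampleCrawler(fake_fetch).crawl()` exercises only the provider's own parsing against the fixture, which is the point of testing with mocked clients.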

Non-Goals

  • No Laravel Modules.
  • No persistence for crawl results unless explicitly requested.
  • No repositories for crawl results unless explicitly requested.
  • No result/status/history/retry endpoints.
  • No provider-specific controllers.
  • No page-specific response contracts.
  • No primary item concept.
  • No relation concept.
