The Elastic Open Web Crawler CLI is a command-line interface for use in the terminal or in scripts. It is the only user interface for interacting with the Open Crawler.
Before using the CLI, make sure you can run the crawler either in Docker or from source.
For details on how to configure the crawler, see CONFIG.md.
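As a quick orientation, here is a minimal sketch of what a crawl config can look like; the `domains` and `output_sink` field names are assumptions used for illustration, and CONFIG.md remains the authoritative reference:

```bash
# Hypothetical minimal config: crawl one domain and print results to stdout.
# Field names are assumed; see CONFIG.md for the real schema.
cat > crawl-config.yml <<'EOF'
domains:
  - url: https://example.com
output_sink: console
EOF
```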
To run individual CLI commands in Docker, you can use the following format:

```bash
docker run -it docker.elastic.co/integrations/crawler:latest jruby \
  bin/crawler <command> <args>
```

For an interactive shell with the crawler, simply change the entrypoint to `/bin/bash`:

```bash
docker run -it --entrypoint /bin/bash docker.elastic.co/integrations/crawler:latest
```

If you need to mount a file into the container, for example `crawl-config.yml`, use the `-v` option:

```bash
docker run -it -v ./crawl-config.yml:/crawl-config.yml docker.elastic.co/integrations/crawler:latest jruby \
  bin/crawler crawl /crawl-config.yml
```
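The same pattern extends to multiple mounts. For example, a sketch that also mounts an Elasticsearch settings file and passes it with the `--es-config` option described below (both file names are placeholders):

```bash
# Mount a crawler config and an Elasticsearch config, then pass both
# to the crawl command. File names here are placeholders.
docker run -it \
  -v ./crawl-config.yml:/crawl-config.yml \
  -v ./es-config.yml:/es-config.yml \
  docker.elastic.co/integrations/crawler:latest jruby \
  bin/crawler crawl /crawl-config.yml --es-config=/es-config.yml
```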
## Available commands
### Getting help
Use the `--help` or `-h` option with any command to get more information.
For example:
```bash
$ bin/crawler --help
> Commands:
> crawler crawl CRAWL_CONFIG # Run a crawl of the site
> crawler schedule CRAWL_CONFIG # Schedule a recurrent crawl of the site
> crawler urltest CRAWL_CONFIG # Test a URL against a configuration
> crawler validate CRAWL_CONFIG # Validate crawler configuration
> crawler version # Print version
```

### crawl

Crawls the configured domain in the provided config file. It can optionally take a second configuration file for Elasticsearch settings. See CONFIG.md for details on the configuration files.
```bash
# crawl using only crawler config
$ bin/crawler crawl config/examples/parks-australia.yml

# crawl using crawler config and optional --es-config
$ bin/crawler crawl config/examples/parks-australia.yml --es-config=config/es.yml
```

### schedule

Creates a schedule to recurrently crawl the configured domain in the provided config file.
The scheduler uses a cron expression, configured in the crawler configuration file via the `schedule.pattern` field. See scheduling recurring crawl jobs for details. The command can optionally take a second configuration file for Elasticsearch settings; see CONFIG.md for details on the configuration files.
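As a sketch, assuming `schedule.pattern` nests as a `pattern` key under `schedule`, a daily crawl at 12:00 could be configured like this (the cron expression is only an example):

```bash
# Hypothetical fragment: append a schedule to an existing crawler config.
# "0 12 * * *" is standard cron syntax: minute 0, hour 12, every day.
cat >> crawl-config.yml <<'EOF'
schedule:
  pattern: "0 12 * * *"
EOF
```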
```bash
# schedule crawls using only crawler config
$ bin/crawler schedule config/examples/parks-australia.yml

# schedule crawls using crawler config and optional --es-config
$ bin/crawler schedule config/examples/parks-australia.yml --es-config=config/es.yml
```

### urltest

Crawls a single URL against the provided crawler config and optional Elasticsearch config, and provides a brief summary of the crawl as well as the downloaded document.
The downloaded document will appear under `./crawled_docs` unless otherwise specified with the `output_dir` field in your crawler config.
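For instance, here is a sketch of pointing `urltest` output at a custom directory; the directory name is a placeholder, and appending to the config is just one way to set the field:

```bash
# Hypothetical: set output_dir so urltest writes the downloaded document
# to ./my-crawled-docs instead of the default ./crawled_docs.
cat >> config/my-crawler.yml <<'EOF'
output_dir: ./my-crawled-docs
EOF
bin/crawler urltest config/my-crawler.yml https://example.com
```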
A sample run:

```bash
$ bin/crawler urltest config/my-crawler.yml https://www.speedhunters.com/2025/01/project-964-hitting-the-touge-for-the-first-time-in-rwb-form/
[2025-04-10T09:26:10.806Z] [crawl:67f7c6f2714375360db0a1b8] [primary] Initialized an in-memory URL queue for up to 10000 URLs
[2025-04-10T09:26:10.810Z] [crawl:67f7c6f2714375360db0a1b8] [primary] ... // logs truncated for brevity
[2025-04-10T09:26:15.100Z] [crawl:67f7c6f2714375360db0a1b8] [primary] Finished a crawl. Result: failure; Successfully finished the primary crawl with an empty crawl queue |
---- URL Test Results ----
- Attempted to crawl https://www.speedhunters.com/2025/01/project-964-hitting-the-touge-for-the-first-time-in-rwb-form/
- Status code: 200
- Content type: text/html; charset=UTF-8
- Crawl duration (seconds): 2.8990111351013184
- Extracted links:
- http://store.speedhunters.com
- http://store.speedhunters.com
- http://www.speedhunters.com/category/carfeatures/
- http://www.speedhunters.com/tag/car-spotlight/
- http://www.speedhunters.com/tag/dragracing/
- https://www.speedhunters.com
- https://www.speedhunters.com
- https://www.speedhunters.com/2025/01/project-964-hitting-the-touge-for-the-first-time-in-rwb-form/#content
- https://www.speedhunters.com/category/content/
- https://www.speedhunters.com/category/content/special-feature/
```

You can find the downloaded document under `./crawled_docs`.

### validate

Checks the configured domains in `domain_allowlist` to see if they can be crawled.
```bash
# when valid
$ bin/crawler validate path/to/crawler.yml
> Domain https://www.elastic.co is valid

# when invalid (e.g. has a redirect)
$ bin/crawler validate path/to/invalid-crawler.yml
> Domain https://elastic.co is invalid:
> The web server at https://elastic.co redirected us to a different domain URL (https://www.elastic.co/).
> If you want to crawl this site, please configure https://www.elastic.co as one of the domains.
```

### version

Prints the product version of the crawler.
```bash
$ bin/crawler version
> v0.2.0
```