Update DuckDB queries and parquet file loading (#24)
* fix: load the parquet files recursively and use the query of duckdb to select the crawl and the subset, as was done in the Java Tour
* chore(docs): update documentation with instructions on how to download the crawl data with and without the AWS CLI
* fix: parametrize the crawl name
* fix: remove scripts and AWS in favour of cc-downloader
* fix: refer to the cc-downloader repo in case cargo is not available
* docs: more details on cc-downloader
* feat: update index_download_advice to recommend cc-downloader and check local files
* fix: trailing slash
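The recursive parquet loading and crawl/subset selection described in the first bullet can be sketched with a DuckDB query over a recursive glob. This is an illustrative sketch, not the repository's actual code: the directory path, the selected columns, and the `crawl`/`subset` column names are assumptions based on the Common Crawl columnar index layout.

```python
# Sketch: build a DuckDB query that reads all index parquet files
# recursively and filters by crawl and subset, as the commit describes.
# Paths and column names here are assumptions for illustration.

def build_index_query(local_dir: str, crawl: str, subset: str) -> str:
    """Return SQL selecting one crawl/subset from a local parquet tree."""
    # The '**/*.parquet' glob makes DuckDB recurse into subdirectories,
    # and hive_partitioning exposes directory names as columns.
    files = f"{local_dir}/**/*.parquet"
    return (
        f"SELECT url, warc_filename "
        f"FROM read_parquet('{files}', hive_partitioning=true) "
        f"WHERE crawl = '{crawl}' AND subset = '{subset}'"
    )

query = build_index_query("/data/cc-index", "CC-MAIN-2024-33", "warc")
print(query)
# To execute it (requires the duckdb package):
#   import duckdb
#   duckdb.sql(query).show()
```

Parametrizing the crawl name this way (rather than hard-coding it) is what the third bullet refers to.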
README.md
@@ -546,9 +546,46 @@ The program then writes that one record into a local Parquet file, does a second
### Bonus: download a full crawl index and query with DuckDB

In case you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly.

> [!IMPORTANT]
> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```

To download the crawl index, please use [cc-downloader](https://github.com/commoncrawl/cc-downloader), which is the official and recommended downloader for Common Crawl data.

The simplest way to install `cc-downloader` is through cargo, the Rust package manager. If you have Rust installed, you can run:

```shell
cargo install cc-downloader
```

> [!WARNING]
> `cc-downloader` will not be set up on your path by default, but you can run it by prepending the right path.

If cargo is not available or the install fails, you can download prebuilt binaries; check [the cc-downloader official repository](https://github.com/commoncrawl/cc-downloader).
Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.
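The commit also mentions checking for local files before querying. A minimal shell sketch of such a check follows; the default directory name is an assumption, not something defined by the Makefile.

```shell
# Minimal sketch: verify local index files exist before querying them.
# LOCAL_DIR's default here is an assumed example, not part of the Makefile.
LOCAL_DIR="${LOCAL_DIR:-./cc-index}"
count=$(find "$LOCAL_DIR" -name '*.parquet' 2>/dev/null | wc -l)
if [ "$count" -eq 0 ]; then
  echo "No parquet files found in $LOCAL_DIR; download the index first."
else
  echo "Found $count parquet files in $LOCAL_DIR."
fi
```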
If the files aren't already downloaded, this command will give you