README.md: 3 additions & 6 deletions (+3 -6)
@@ -790,8 +790,7 @@ The program then writes that one record into a local Parquet file, does a second
### Bonus: download a full crawl index and query with DuckDB

-If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run
-All of these scripts run the same SQL query and should return the same record (written as a parquet file).
+If you want to run many of these queries, and you have a lot of disk space, you'll want to download the 300 gigabyte index and query it repeatedly. Run:
```shell
mkdir -p 'crawl=CC-MAIN-2024-22/subset=warc'
@@ -822,7 +821,7 @@ rm cc-index-table.paths
cd -
```

-The structure should be something like this:
+In both ways, the file structure should be something like this:
```shell
tree my_data
my_data
@@ -835,10 +834,8 @@ my_data
Then, you can run `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` to run the same query as above, but this time using your local copy of the index files.

-> [!IMPORTANT]
-> If you happen to be using the Common Crawl Foundation development server, we've already downloaded these files, and you can run ```make duck_ccf_local_files```
+Both `make duck_ccf_local_files` and `make duck_local_files LOCAL_DIR=/path/to/the/downloaded/data` run the same SQL query and should return the same record (written as a parquet file).

-All of these scripts run the same SQL query and should return the same record (written as a parquet file).