
Commit 08dcfbe

Merge pull request #23 from commoncrawl/laurie/aws-removal

2 parents aabebc0 + 05a3ce7

3 files changed: 6 additions and 10 deletions


Makefile

Lines changed: 2 additions & 2 deletions
```diff
@@ -57,8 +57,8 @@ download_collinfo:
 	curl -O https://index.commoncrawl.org/collinfo.json
 
 CC-MAIN-2024-22.warc.paths.gz:
-	@echo "downloading the list from s3, requires s3 auth even though it is free"
-	@echo "note that this file should be in the repo"
+	@echo "downloading the list from S3 requires S3 auth (even though it is free)"
+	@echo "note that this file should already be in the repo"
 	aws s3 ls s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-22/subset=warc/ | awk '{print $$4}' | gzip -9 > CC-MAIN-2024-22.warc.paths.gz
 
 duck_local_files:
```

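Since the Makefile now notes that the paths file should already be in the repo, the `aws s3 ls` step can be skipped entirely. A minimal sketch of reading the committed listing locally — the filename comes from the Makefile target above; everything else is illustrative:

```python
# Read the committed paths listing locally instead of listing the S3 bucket.
# The filename matches the Makefile target above; no S3 credentials needed.
import gzip

def read_paths(path='CC-MAIN-2024-22.warc.paths.gz'):
    """Return the parquet file names stored in the gzipped listing."""
    with gzip.open(path, 'rt') as f:
        return [line.strip() for line in f if line.strip()]
```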
README.md

Lines changed: 4 additions & 7 deletions
```diff
@@ -17,6 +17,7 @@ flowchart TD
 The goal of this whirlwind tour is to show you how a single webpage appears in all of these different places. That webpage is [https://an.wikipedia.org/wiki/Escopete](https://an.wikipedia.org/wiki/Escopete), which we crawled on the date 2024-05-18T01:58:10Z. On the way, we'll also explore the file formats we use and learn about some useful tools for interacting with our data!
 
 In the Whirlwind Tour, we will:
+
 1) explore the WARC, WET and WAT file formats used to store Common Crawl's data.
 2) play with some useful Python packages for interacting with the data: [warcio](https://github.com/webrecorder/warcio), [cdxj-indexer](https://github.com/webrecorder/cdxj-indexer),
 [cdx_toolkit](https://github.com/cocrawler/cdx_toolkit),
```

```diff
@@ -58,10 +59,6 @@ Next, let's install the necessary software for this tour:
 
 This command will print out a screen-full of output and install the Python packages in `requirements.txt` to your venv.
 
-### Install and configure AWS-CLI
-
-We will use the AWS Command Line Interface (CLI) later in the tour to access the data stored in Common Crawl's S3 bucket. Instructions on how to install the AWS-CLI and configure your account are available on the [AWS website](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html).
-
 ## Task 1: Look at the crawl data
 
 Common Crawl's website includes a [Get Started](https://commoncrawl.org/get-started) guide which summarises different ways to access the data and the file formats. We can use the dropdown menu to access the links for downloading crawls over HTTP(S):
```

````diff
@@ -179,7 +176,7 @@ The output has three sections, one each for the WARC, WET, and WAT. For each one
 
 ## Task 3: Index the WARC, WET, and WAT
 
-The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
+The example WARC files we've been using are tiny and easy to work with. The real WARC files are around a gigabyte in size and contain about 30,000 webpages each. What's more, we have around 24 million of these files! To read all of them, we could iterate, but what if we wanted random access so we could read just one particular record? We do that with an index.
 ```mermaid
 flowchart LR
 warc --> indexer --> cdxj & columnar
````

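One way to picture what the index buys you: a CDXJ line records the filename, offset, and length of a single record, which is exactly what an HTTP Range request needs for random access. A sketch — the field names follow the CDXJ convention and the prefix is Common Crawl's HTTPS endpoint, but treat the details as assumptions:

```python
# Sketch: a CDXJ line is "<SURT key> <timestamp> <JSON>", and the JSON holds
# filename/offset/length -- enough for random access with an HTTP Range request.
import json
import urllib.request

def parse_cdxj(line):
    """Return (filename, offset, length) from one CDXJ index line."""
    fields = json.loads(line.split(' ', 2)[2])
    return fields['filename'], int(fields['offset']), int(fields['length'])

def fetch_record_bytes(line, prefix='https://data.commoncrawl.org/'):
    """Fetch the gzipped bytes of just one WARC record (needs network access)."""
    filename, offset, length = parse_cdxj(line)
    req = urllib.request.Request(
        prefix + filename,
        headers={'Range': f'bytes={offset}-{offset + length - 1}'})
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```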
```diff
@@ -344,7 +341,7 @@ python ./warcio-iterator.py testing.warc.gz
 
 Make sure you compress WARCs the right way!
 
-## Task 6: Use cdx_toolkit to query the full CDX index and download those captures from AWS S3
+## Task 6: Use cdx_toolkit to query the full CDX index and download those captures
 
 Some of our users only want to download a small subset of the crawl. They want to run queries against an index, either the CDX index we just talked about, or in the columnar index, which we'll talk about later.
```

```diff
@@ -404,7 +401,7 @@ Next, we use the `cdxt` command `warc` to retrieve the content and save it locally
 
 Finally, we run `cdxj-indexer` on this new WARC to make a CDXJ index of it as in Task 3, and then iterate over the WARC using `warcio-iterator.py` as in Task 2.
 
-## Task 7: Find the right part of the columnar index
+## Task 7: Find the right part of the columnar index
 
 Now let's look at the columnar index, the other kind of index that Common Crawl makes available. This index is stored in parquet files so you can access it using SQL-based tools like AWS Athena and duckdb as well as through tables in your favorite table packages such as pandas, pyarrow, and polars.
```

duck.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -70,7 +70,6 @@ def get_files(algo, crawl):
         files = f'https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl={crawl}/subset=warc/*.parquet'
         raise NotImplementedError('duckdb will throw an error because it cannot glob this')
     elif algo == 'cloudfront':
-        prefix = f's3://commoncrawl/cc-index/table/cc-main/warc/crawl={crawl}/subset=warc/'
         external_prefix = f'https://data.commoncrawl.org/cc-index/table/cc-main/warc/crawl={crawl}/subset=warc/'
         file_file = f'{crawl}.warc.paths.gz'
 
```

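The `cloudfront` branch above can still feed duckdb despite the no-glob limitation flagged by the `NotImplementedError`: expand the committed paths file into explicit HTTPS URLs and inline them into the query. A sketch under those assumptions — the column names are assumed from the cc-index table schema, so verify them before relying on this:

```python
# Sketch: duckdb cannot glob over HTTPS, so expand the committed
# {crawl}.warc.paths.gz listing into explicit URLs and inline them in the SQL.
import gzip

def parquet_urls(crawl, paths_file):
    """Join each listed file name onto the data.commoncrawl.org prefix."""
    prefix = (f'https://data.commoncrawl.org/cc-index/table/cc-main/warc/'
              f'crawl={crawl}/subset=warc/')
    with gzip.open(paths_file, 'rt') as f:
        return [prefix + line.strip() for line in f if line.strip()]

def build_query(urls, url_pattern):
    """Build SQL text; column names are assumed from the cc-index table schema."""
    file_list = ', '.join(f"'{u}'" for u in urls)
    return ('SELECT url, warc_filename, warc_record_offset, warc_record_length '
            f'FROM read_parquet([{file_list}]) '
            f"WHERE url LIKE '{url_pattern}'")
```

Running the result with `duckdb.sql(build_query(...))` needs network access to fetch the parquet files.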