59 changes: 59 additions & 0 deletions METADATA_ANALYSIS.md
@@ -0,0 +1,59 @@
# Image Metadata Analysis Implementation

## Overview

This enhancement adds automatic image metadata extraction to the book scraping workflow. It consists of one new function, an extended scraping function, and a CSV report.

## New Functions

### `extractImageMetadata[image_]`

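Extracts basic metadata from a Mathematica `Image` object:

- **ImageDimensions**: Width and height in pixels
- **Head**: Expression head of the argument (`Image` for any valid image; this is not the file format)
- **ImageResolution**: Image resolution (`ImageResolution` is documented as an option, so `ImageResolution[image]` may return unevaluated)
- **ByteCount**: In-memory size of the expression in bytes, not the size of an exported file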

Returns an Association with structured metadata.
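
For readers who do not want to decode the box form stored in `last3.nb`, the definition can be written as plain input roughly as follows. This is a transcription sketch, not a tested replacement. The helper `resolutionViaOptions` is hypothetical and not part of the change; it assumes that `Options[image, ImageResolution]` returns the stored option value.

```mathematica
(* Plain-text transcription of extractImageMetadata from last3.nb *)
extractImageMetadata[image_] := Module[
  {dims, type, resolution, fileSize},
  dims = ImageDimensions[image];        (* {width, height} in pixels *)
  type = Head[image];                   (* Image for any valid image *)
  resolution = ImageResolution[image];  (* as written in the notebook *)
  fileSize = ByteCount[image];          (* in-memory size in bytes *)
  Association[
    "Width" -> dims[[1]],
    "Height" -> dims[[2]],
    "Type" -> ToString[type],
    "Resolution" -> resolution,
    "FileSize" -> fileSize
  ]
]

(* Hypothetical alternative (assumption): read the resolution option *)
resolutionViaOptions[image_] :=
  ImageResolution /. Options[image, ImageResolution]

(* Usage on a synthetic 4x3 image *)
sample = ConstantImage[0, {4, 3}];
meta = extractImageMetadata[sample];
Lookup[meta, {"Width", "Height", "Type"}]  (* {4, 3, "Image"} *)
```

The synthetic image only exercises the function shape; real captures come from the browser session inside `func`.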

### Enhanced `func[lista_]`

The main scraping function now:

1. Initializes a `metadataList` to collect data
2. For each processed page:
   - Extracts left and right page images
   - Analyzes each image with `extractImageMetadata`
   - Collects metadata with page context
   - Records a `Failed` row with `N/A` values when the capture did not load
3. Exports all metadata to `book_images_metadata.csv`
4. Prints the number of records exported
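
The collection steps above can be sketched in isolation, without a browser session. In this sketch `spreads` and `sizeOf` are hypothetical stand-ins: `spreads` maps a page number to an already cropped `{left, right}` image pair, and `sizeOf` is a reduced substitute for `extractImageMetadata`. As in the notebook, the right-hand image is recorded as page `x + 1`.

```mathematica
(* Reduced stand-in for extractImageMetadata (assumption) *)
sizeOf[img_] := AssociationThread[{"Width", "Height"} -> ImageDimensions[img]]

(* Hypothetical input: page number -> {left image, right image} *)
spreads = <|
  1 -> {ConstantImage[0, {4, 3}], ConstantImage[1, {4, 3}]},
  3 -> {ConstantImage[0, {5, 2}], ConstantImage[1, {5, 2}]}
|>;

metadataList = {};

KeyValueMap[
  Function[{x, res},
    (* Left image keeps page x, right image is recorded as x + 1 *)
    AppendTo[metadataList,
      Join[<|"Page" -> ToString[x], "ImageType" -> "Left",
             "Status" -> "Success"|>, sizeOf[res[[1]]]]];
    AppendTo[metadataList,
      Join[<|"Page" -> ToString[x + 1], "ImageType" -> "Right",
             "Status" -> "Success"|>, sizeOf[res[[2]]]]];
  ],
  spreads
];

Print["Collected " <> ToString[Length[metadataList]] <> " records"]
```

`AppendTo` copies the list on every call, which is acceptable for a few hundred pages; `Sow` and `Reap` would be a possible alternative for much longer runs.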

## CSV Output Format

| Column | Description |
|--------|-------------|
| Page | Source page number (`x` for the left image, `x + 1` for the right) |
| ImageType | "Left" or "Right" page position |
| Status | "Success" or "Failed" processing status |
| Width | Image width in pixels |
| Height | Image height in pixels |
| Type | Expression head as a string (`Image`), not a file format |
| Resolution | Image resolution value |
| FileSize | In-memory size in bytes (from `ByteCount`) |
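
How `Export` lays out a list of Associations in a CSV file may depend on the Wolfram Language version, so the column order above is not guaranteed by the call in the notebook. A minimal sketch that makes the header row explicit is shown here. The names `columns`, `rows` and `table` are illustrative, and the two records contain made-up values.

```mathematica
(* Column order as listed in the table above *)
columns = {"Page", "ImageType", "Status", "Width", "Height",
           "Type", "Resolution", "FileSize"};

(* Made-up sample records, one successful and one failed *)
metadataList = {
  <|"Page" -> "1", "ImageType" -> "Left", "Status" -> "Success",
    "Width" -> 4, "Height" -> 3, "Type" -> "Image",
    "Resolution" -> 72, "FileSize" -> 1000|>,
  <|"Page" -> "2", "ImageType" -> "Right", "Status" -> "Failed"|>
};

(* Missing keys fall back to "N/A" so every row has eight fields *)
rows = Lookup[#, columns, "N/A"] & /@ metadataList;
table = Prepend[rows, columns];

(* ExportString shows the CSV text without touching the file system *)
csv = ExportString[table, "CSV"];
```

Replacing `ExportString[table, "CSV"]` with `Export["book_images_metadata.csv", table]` would write the same rows to disk.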

## Error Handling

- Failed captures generate N/A values for technical metadata
- Status field tracks success/failure
- Process continues even if individual pages fail
- Final summary shows total records processed
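
The failure branch can be sketched as a small helper. The notebook performs this inline and tests `StringQ[foto]`; the version below swaps in an `ImageQ` check on the capture instead, which is an assumption about what a more defensive guard could look like. `failedRecord` and `recordFor` are hypothetical names.

```mathematica
technicalKeys = {"Width", "Height", "Type", "Resolution", "FileSize"};

(* Row for a capture that did not load: every technical field is "N/A" *)
failedRecord[page_, side_] := Join[
  <|"Page" -> ToString[page], "ImageType" -> side, "Status" -> "Failed"|>,
  AssociationMap["N/A" &, technicalKeys]
]

(* Build a success row only when the capture is a real image *)
recordFor[page_, side_, capture_] := If[ImageQ[capture],
  Join[
    <|"Page" -> ToString[page], "ImageType" -> side, "Status" -> "Success"|>,
    AssociationThread[{"Width", "Height"} -> ImageDimensions[capture]]
  ],
  failedRecord[page, side]
]

(* A string stands in for the error message returned by a failed capture *)
bad = recordFor[7, "Left", "No captura"];
good = recordFor[7, "Left", ConstantImage[0, {4, 3}]];
{bad["Status"], good["Status"]}  (* {"Failed", "Success"} *)
```

Checking the capture before cropping would also keep `ImageTake` from being applied to an error string.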

## Integration

The metadata collection runs inside the existing page loop:
- Adds two metadata records per captured page
- Makes no changes to the existing PDF export
- Preserves all original scraping behavior
32 changes: 32 additions & 0 deletions README.md
@@ -1,2 +1,34 @@
# wolfram-mathematica-codigo
Web Scraping Internet Archive Books Using Mathematica

## Features

- Scrapes books from Internet Archive
- Captures page screenshots
- Processes images by extracting specific sections
- **NEW**: Extracts image metadata (dimensions, type, resolution, in-memory size)
- **NEW**: Generates CSV report with metadata for each processed image

## Image Metadata Analysis

The scraping workflow now analyzes each captured image and records the following metadata:

- **Width and Height**: Image dimensions in pixels
- **Type**: Expression head of the image object (`Image`)
- **Resolution**: Image resolution
- **File Size**: In-memory size of the image in bytes (from `ByteCount`)
- **Status**: Processing status (Success/Failed)
- **Page**: Source page number
- **Image Type**: Position indicator (Left/Right page)

### Output

The system generates a `book_images_metadata.csv` file containing:
- One row per processed image
- All metadata fields as columns
- Status tracking for successful/failed processing
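
Once the report exists, it can be summarized with a few lines. The sketch below assumes the CSV starts with a header row containing a `Status` column. It parses an inline sample string with made-up rows instead of the real file so that it runs on its own.

```mathematica
(* Made-up sample standing in for book_images_metadata.csv *)
sampleCSV = StringRiffle[{
  "Page,ImageType,Status,Width",
  "1,Left,Success,4",
  "2,Right,Success,4",
  "3,Left,Failed,N/A"
}, "\n"];

data = ImportString[sampleCSV, "CSV"];
header = First[data];
rows = Rest[data];

(* Locate the Status column by name rather than by position *)
statusCol = First[FirstPosition[header, "Status"]];
counts = Counts[rows[[All, statusCol]]];

{counts["Success"], counts["Failed"]}  (* {2, 1} *)
```

For the real report, `Import["book_images_metadata.csv", "CSV"]` would replace the `ImportString` call.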

### Functions Added

- `extractImageMetadata[image]`: Extracts metadata from an image object
- Enhanced `func[lista]`: Main scraping function now collects and exports metadata
192 changes: 153 additions & 39 deletions last3.nb
@@ -214,32 +214,77 @@ Cell[BoxData[
CellLabel->"In[38]:=",ExpressionUUID->"281b0cfd-c2d2-654f-a043-c46fbcd7ed29"]
}, Open ]],

Cell[CellGroupData[{

Cell["Capture pages", "Section",
CellChangeTimes->{{3.961967124424393*^9, 3.961967133263727*^9},
3.961977211620045*^9},ExpressionUUID->"50116fd0-d927-8e4b-94f2-\
Cell[CellGroupData[{

Cell["Image Metadata Analysis Functions", "Section",
CellChangeTimes->{{3.961967124424393*^9, 3.961967133263727*^9},
3.961977211620045*^9, {3.962018300000000*^9, 3.962018310000000*^9}},
ExpressionUUID->"metadata-analysis-section"],

Cell["Function to extract image metadata including size, format, and resolution.", "Text",
CellChangeTimes->{{3.962018320000000*^9, 3.962018330000000*^9}},
ExpressionUUID->"metadata-description"],

Cell[BoxData[
RowBox[{
RowBox[{"extractImageMetadata", "[", "image_", "]"}], " ", ":=", " ",
RowBox[{"Module", "[",
RowBox[{
RowBox[{"{",
RowBox[{"dims", ",", " ", "type", ",", " ", "resolution", ",", " ", "fileSize"}], "}"}], ",", "\n", " ",
RowBox[{
RowBox[{"dims", " ", "=", " ",
RowBox[{"ImageDimensions", "[", "image", "]"}]}], ";", "\n", " ",
RowBox[{"type", " ", "=", " ",
RowBox[{"Head", "[", "image", "]"}]}], ";", "\n", " ",
RowBox[{"resolution", " ", "=", " ",
RowBox[{"ImageResolution", "[", "image", "]"}]}], ";", "\n", " ",
RowBox[{"fileSize", " ", "=", " ",
RowBox[{"ByteCount", "[", "image", "]"}]}], ";", "\n", " ",
RowBox[{"Association", "[", "\n", " ",
RowBox[{
RowBox[{"\"\<Width\>\"", " ", "->", " ",
RowBox[{"dims", "[",
RowBox[{"[", "1", "]"}], "]"}]}], ",", "\n", " ",
RowBox[{"\"\<Height\>\"", " ", "->", " ",
RowBox[{"dims", "[",
RowBox[{"[", "2", "]"}], "]"}]}], ",", "\n", " ",
RowBox[{"\"\<Type\>\"", " ", "->", " ",
RowBox[{"ToString", "[", "type", "]"}]}], ",", "\n", " ",
RowBox[{"\"\<Resolution\>\"", " ", "->", " ", "resolution"}], ",", "\n", " ",
RowBox[{"\"\<FileSize\>\"", " ", "->", " ", "fileSize"}]}], "\n", " ", "]"}]}]}], "\n", " ", "]"}]}]], "Input",
CellChangeTimes->{{3.962018340000000*^9, 3.962018400000000*^9}},
ExpressionUUID->"extract-metadata-function"]
}, Open ]],

Cell[CellGroupData[{

Cell["Capture pages", "Section",
CellChangeTimes->{{3.961967124424393*^9, 3.961967133263727*^9},
3.961977211620045*^9},ExpressionUUID->"50116fd0-d927-8e4b-94f2-\
992f6543a073"],

Cell[BoxData[
RowBox[{
RowBox[{"func", "[", "lista_", "]"}], " ", ":=", " ",
RowBox[{"Module", "[",
RowBox[{
RowBox[{"{",
RowBox[{
"sesion", ",", " ", "loginURL", ",", " ", "email", ",", " ",
"password"}], "}"}], ",", "\n", " ",
RowBox[{"{",
RowBox[{
"sesion", ",", " ", "loginURL", ",", " ", "email", ",", " ",
"password", ",", " ", "metadataList"}], "}"}], ",", "\n", " ",
RowBox[{"(*",
RowBox[{
"Open", " ", "the", " ", "url", " ", "of", " ", "the", " ", "page"}],
"*)"}], "\n", " ",
RowBox[{
RowBox[{"sesion", " ", "=", " ",
RowBox[{"StartWebSession", "[",
RowBox[{"\"\<Firefox\>\"", ",", " ",
RowBox[{"Visible", " ", "->", " ", "False"}]}], "]"}]}], ";", "\n",
" ",
RowBox[{
RowBox[{"metadataList", " ", "=", " ",
RowBox[{"{", "}"}]}], ";", "\n", " ",
RowBox[{"sesion", " ", "=", " ",
RowBox[{"StartWebSession", "[",
RowBox[{"\"\<Firefox\>\"", ",", " ",
RowBox[{"Visible", " ", "->", " ", "False"}]}], "]"}]}], ";", "\n",
" ",
RowBox[{
"loginURL", " ", "=", " ", "\"\<https://archive.org/account/login\>\""}],
";", "\n", " ",
@@ -286,13 +331,13 @@ click();\>\""}]}], "]"}], ";", "\n", " ",
RowBox[{
RowBox[{"Module", "[",
RowBox[{
RowBox[{"{",
RowBox[{
RowBox[{"x", " ", "=", " ", "#"}], ",", " ", "pags", ",", " ",
RowBox[{"tiem", " ", "=", " ", "False"}], ",", " ",
RowBox[{"intentos", " ", "=", " ", "0"}], ",", " ",
RowBox[{"maxIntentos", " ", "=", " ", "40"}], ",", " ", "foto",
",", " ", "res"}], "}"}], ",", " ",
RowBox[{"{",
RowBox[{
RowBox[{"x", " ", "=", " ", "#"}], ",", " ", "pags", ",", " ",
RowBox[{"tiem", " ", "=", " ", "False"}], ",", " ",
RowBox[{"intentos", " ", "=", " ", "0"}], ",", " ",
RowBox[{"maxIntentos", " ", "=", " ", "40"}], ",", " ", "foto",
",", " ", "res", ",", " ", "metadata1", ",", " ", "metadata2"}], "}"}], ",", " ",
RowBox[{
RowBox[{"pags", " ", "=", " ",
RowBox[{
@@ -341,22 +386,83 @@ return img.src && img.complete && img.offsetWidth>0 && img.offsetHeight>0;});\
",", " ",
"\"\<No captura, la p\[AAcute]gina no carg\[OAcute]\>\""}],
"]"}]}], ";", "\n", " ",
RowBox[{"res", " ", "=", " ",
RowBox[{"{",
RowBox[{
RowBox[{"ImageTake", "[",
RowBox[{"foto", ",", " ",
RowBox[{"{",
RowBox[{"180", ",", " ", "2980"}], "}"}], ",", " ",
RowBox[{"{",
RowBox[{"400", ",", " ", "2220"}], "}"}]}], "]"}], ",", " ",
RowBox[{"ImageTake", "[",
RowBox[{"foto", ",", " ",
RowBox[{"{",
RowBox[{"180", ",", " ", "2980"}], "}"}], ",", " ",
RowBox[{"{",
RowBox[{"2180", ",", " ", "3960"}], "}"}]}], "]"}]}],
"}"}]}], ";", "\n", " ",
RowBox[{"res", " ", "=", " ",
RowBox[{"{",
RowBox[{
RowBox[{"ImageTake", "[",
RowBox[{"foto", ",", " ",
RowBox[{"{",
RowBox[{"180", ",", " ", "2980"}], "}"}], ",", " ",
RowBox[{"{",
RowBox[{"400", ",", " ", "2220"}], "}"}]}], "]"}], ",", " ",
RowBox[{"ImageTake", "[",
RowBox[{"foto", ",", " ",
RowBox[{"{",
RowBox[{"180", ",", " ", "2980"}], "}"}], ",", " ",
RowBox[{"{",
RowBox[{"2180", ",", " ", "3960"}], "}"}]}], "]"}]}],
"}"}]}], ";", "\n", " ",
RowBox[{"(*", " ",
RowBox[{"Extract", " ", "metadata", " ", "for", " ", "each", " ", "image"}], " ", "*)"}], "\n", " ",
RowBox[{"If", "[",
RowBox[{
RowBox[{"StringQ", "[", "foto", "]"}], ",", " ",
RowBox[{
RowBox[{"metadata1", " ", "=", " ",
RowBox[{"Association", "[",
RowBox[{
RowBox[{"\"\<Page\>\"", " ", "->", " ",
RowBox[{"ToString", "[", "x", "]"}]}], ",", " ",
RowBox[{"\"\<ImageType\>\"", " ", "->", " ", "\"\<Left\>\""}], ",", " ",
RowBox[{"\"\<Status\>\"", " ", "->", " ", "\"\<Failed\>\""}], ",", " ",
RowBox[{"\"\<Width\>\"", " ", "->", " ", "\"\<N/A\>\""}], ",", " ",
RowBox[{"\"\<Height\>\"", " ", "->", " ", "\"\<N/A\>\""}], ",", " ",
RowBox[{"\"\<Type\>\"", " ", "->", " ", "\"\<N/A\>\""}], ",", " ",
RowBox[{"\"\<Resolution\>\"", " ", "->", " ", "\"\<N/A\>\""}], ",", " ",
RowBox[{"\"\<FileSize\>\"", " ", "->", " ", "\"\<N/A\>\""}]}], "]"}]}], ";", "\n", " ",
RowBox[{"metadata2", " ", "=", " ",
RowBox[{"Association", "[",
RowBox[{
RowBox[{"\"\<Page\>\"", " ", "->", " ",
RowBox[{"ToString", "[",
RowBox[{"x", " ", "+", " ", "1"}], "]"}]}], ",", " ",
RowBox[{"\"\<ImageType\>\"", " ", "->", " ", "\"\<Right\>\""}], ",", " ",
RowBox[{"\"\<Status\>\"", " ", "->", " ", "\"\<Failed\>\""}], ",", " ",
RowBox[{"\"\<Width\>\"", " ", "->", " ", "\"\<N/A\>\""}], ",", " ",
RowBox[{"\"\<Height\>\"", " ", "->", " ", "\"\<N/A\>\""}], ",", " ",
RowBox[{"\"\<Type\>\"", " ", "->", " ", "\"\<N/A\>\""}], ",", " ",
RowBox[{"\"\<Resolution\>\"", " ", "->", " ", "\"\<N/A\>\""}], ",", " ",
RowBox[{"\"\<FileSize\>\"", " ", "->", " ", "\"\<N/A\>\""}]}], "]"}]}]}], ",", "\n", " ",
RowBox[{
RowBox[{"metadata1", " ", "=", " ",
RowBox[{"Join", "[",
RowBox[{
RowBox[{"Association", "[",
RowBox[{
RowBox[{"\"\<Page\>\"", " ", "->", " ",
RowBox[{"ToString", "[", "x", "]"}]}], ",", " ",
RowBox[{"\"\<ImageType\>\"", " ", "->", " ", "\"\<Left\>\""}], ",", " ",
RowBox[{"\"\<Status\>\"", " ", "->", " ", "\"\<Success\>\""}]}], "]"}], ",", " ",
RowBox[{"extractImageMetadata", "[",
RowBox[{"res", "[",
RowBox[{"[", "1", "]"}], "]"}], "]"}]}], "]"}]}], ";", "\n", " ",
RowBox[{"metadata2", " ", "=", " ",
RowBox[{"Join", "[",
RowBox[{
RowBox[{"Association", "[",
RowBox[{
RowBox[{"\"\<Page\>\"", " ", "->", " ",
RowBox[{"ToString", "[",
RowBox[{"x", " ", "+", " ", "1"}], "]"}]}], ",", " ",
RowBox[{"\"\<ImageType\>\"", " ", "->", " ", "\"\<Right\>\""}], ",", " ",
RowBox[{"\"\<Status\>\"", " ", "->", " ", "\"\<Success\>\""}]}], "]"}], ",", " ",
RowBox[{"extractImageMetadata", "[",
RowBox[{"res", "[",
RowBox[{"[", "2", "]"}], "]"}], "]"}]}], "]"}]}]}]}], "]"}], ";", "\n", " ",
RowBox[{"AppendTo", "[",
RowBox[{"metadataList", ",", " ", "metadata1"}], "]"}], ";", "\n", " ",
RowBox[{"AppendTo", "[",
RowBox[{"metadataList", ",", " ", "metadata2"}], "]"}], ";", "\n", " ",
RowBox[{"Export", "[",
RowBox[{
RowBox[{"\"\<hoj\>\"", " ", "<>", " ",
@@ -373,7 +479,15 @@ return img.src && img.complete && img.offsetWidth>0 && img.offsetHeight>0;});\
RowBox[{"res", "[",
RowBox[{"[", "2", "]"}], "]"}]}], "]"}], ";", "\n", " ",
RowBox[{"Remove", "[", "res", "]"}]}]}], "]"}], " ", "&"}], ",",
"\n", " ", "lista"}], "]"}]}]}], "\n", " ", "]"}]}]], "Input",
"\n", " ", "lista"}], "]"}], ";", "\n", " ",
RowBox[{"(*", " ",
RowBox[{"Export", " ", "metadata", " ", "to", " ", "CSV"}], " ", "*)"}], "\n", " ",
RowBox[{"Export", "[",
RowBox[{"\"\<book_images_metadata.csv\>\"", ",", " ", "metadataList"}], "]"}], ";", "\n", " ",
RowBox[{"Print", "[",
RowBox[{"\"\<Metadata exported to book_images_metadata.csv with \>\"", " ", "<>", " ",
RowBox[{"ToString", "[",
RowBox[{"Length", "[", "metadataList", "]"}], "]"}], " ", "<>", " ", "\"\< records\>\""}], "]"}]}]}], "\n", " ", "]"}]}]], "Input",
CellChangeTimes->{{3.961980875015821*^9, 3.961980883823658*^9}, {
3.9619810994615*^9, 3.961981109756647*^9}, {3.962018269104084*^9,
3.9620182737440205`*^9}},ExpressionUUID->"7129dd94-e5f2-324b-b994-\