Frédéric Vergnaud edited this page Feb 16, 2022 · 38 revisions

Presentation

Extractify is a free extension for Chromium, developed in JavaScript with the Atom editor, whose purpose is to scrape structured data from the web. It is particularly designed for collecting comments and online conversations, such as forum threads.

It allows you to:

  • Select structured information on a web page (like tables with rows and columns), either by clicking directly on the web page, or by manually entering HTML tags and the related CSS selectors
  • Select the pagination of pages with the same structure and level
  • Repeat the process as many times as desired for lower levels
  • Scrape the whole selection
  • Finally, obtain a file in json format that can be easily imported into other software, L@ME for example.

What it does not allow: everything else!

Manual installation for Chrome

  • Go to Releases to download the latest version and unzip it
  • In the Chrome address bar, go to the extensions page by typing chrome://extensions/ and load the extractify folder as an unpacked extension

Usage

1. Introduction

From a technical point of view, Extractify uses the identification of HTML elements within the structure of a web page to extract the data. To find these HTML elements, called tags, Extractify relies on the CSS code (style sheets) that will manage their formatting (see the selection possibilities in appendices).

Conceptually, for Extractify, the level is the depth of a page. The page on which you run the plugin for the first time will be level 0. Deeper pages reached through hyperlinks will therefore be at levels -1, then -2, then -3, and so on.

On each page, Extractify considers that the data you want to extract is structured in rows and columns, i.e. in the form of a table. You will therefore first select rows, then columns within those rows. You can then select pagination links, i.e. links to pages of the same depth as the observed page and structured in the same way.

Finally, you can add a lower level by selecting links to that lower level, located within the rows already selected.

Extractify was initially developed to extract conversational data from online forums. Taking such a typical example of its use, you will start by running Extractify on the page that displays all the forums to be scraped as links. This page will be considered as level 0. Each link on this page, leading to each forum, will be a link to a lower level (level -1) of that forum. This lower level will display the topics. Then each link to each topic will itself be a link to a lower level (level -2) of that topic. This lower level will display the messages. Thus, scraping a web page displaying links to forums amounts to scraping 3 levels of depth:

  1. Level 0: the page displaying the forums, from which, in addition to the forum link, we can extract the forum title, its number of views, its number of responses, etc.
  2. Level -1: the pages displaying the topics of these forums, from which, in addition to the topic links, we can extract the title of the topic, its number of views, its number of responses, etc.
  3. Level -2: the pages displaying the messages of these topics, from which we can extract the author's name, the date of the message, the message itself, etc.

2. Presentation of the interface

Interface diagram

3. Opening

Click on the plugin icon in the chromium extension bar:

Opening Extractify

The plugin opens on the dialog for adding level 0. You can:

  1. Add a level: then enter a level type (required) and click on "Select this level type":

Select a level type

  2. Or open a json file previously saved with Extractify: click on "Cancel", then on "deJSONize", and select the desired file.

4. Adding rows

When you have created level 0, you can add a new row by clicking on "Add row".

Add a row

This opens the row selection dialog.

Add a row dialog

When the dialog opens, you can:

  1. Click directly on the "Add row" button to select the lines automatically: then move your mouse over the web page in your browser and select a line by clicking on an area that is highlighted when hovering over the HTML structure of the page:

Select a row automatically

On click, the identical lines, i.e. lines that have the same HTML and CSS structure, are also highlighted:

Automatically selected row

And the Extractify interface is updated:

Automatically selected row : effects on Extractify

  2. Or enter a CSS selector and click on "Add row" to select the lines manually. In this case, you can use the Chromium HTML inspector, which you open by right-clicking on the element you are looking for and choosing "Inspect":

Inspect webpage HTML and CSS structure

By hovering over the elements of the HTML structure on the left of the inspector, you can identify the desired line and the CSS class attached to it:

Localize HTML and CSS structure you target

In this example, the desired line is identified by the HTML tag <tr> whose CSS class is roweven (in a CSS selector, you write the tag, then a dot, then the class name, i.e. here tr.roweven). All you have to do is enter this information in the Extractify interface and click on "Add row":

Enter HTML and CSS structure you target

Unfortunately, in this example, only every second line has been highlighted.

Result of HTML and CSS wrong targeting

The solution here is to use the CSS attribute selector class*=, which means "whose class contains the text ...". You can now write:

Enter HTML and CSS structure you target

And all the desired lines are well highlighted.

Result of HTML and CSS targeting
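The difference between the exact class selector and the substring selector can be illustrated in plain JavaScript. This is only a sketch of the matching semantics (in the browser you would simply call `document.querySelectorAll("tr[class*=row]")`); the sample class names mimic the roweven/rowodd alternation from the example above:

```javascript
// Class attributes of four hypothetical <tr> elements, alternating
// between "roweven" and "rowodd" as in the forum example, plus one
// unrelated "header" row.
const rowClasses = ["roweven", "rowodd", "roweven", "header"];

// tr.roweven matches only rows whose class is exactly "roweven"...
const exactMatches = rowClasses.filter(cls => cls === "roweven");
console.log(exactMatches.length); // 2 — only every second line

// ...while tr[class*=row] matches any row whose class CONTAINS "row".
const substringMatches = rowClasses.filter(cls => cls.includes("row"));
console.log(substringMatches.length); // 3 — all the desired lines
```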

5. Adding columns

Adding columns follows the same principle as adding rows, either automatically or manually. Click on the "Add column" button within a row entry.

Adding columns

Then enter a column title.

Adding columns

Here you can:

  1. Click directly on the "Add column" button to select the columns automatically;
  2. Enter the CSS selector and click on "Add column" to select the columns manually.

Columns added

Columns added : effects on Extractify

6. Adding a lower level via depth links

In order to add a lower level, you must first collect links to this lower level: these are depth links. By clicking on the "Add depth link" button within a row entry, the selection procedure is the same as for a column, in automatic or manual mode, except that you are restricted to selecting a hyperlink.

Add deeper links

Add deeper links

Add deeper links

Once you have collected your depth links, click on "Add level", taking care to select a url that points to a page containing pagination links if you expect to find any:

Create a lower level from a hyperlink

Create a lower level from a hyperlink

Choose a url to a lower level among the collected urls, then click on "Select this url" to display the dialog for adding a lower level:

Create a lower level : choose the level type

Create a lower level : new level created

7. Adding a pagination

You may want to scrape pages of the same level, but located at different addresses accessible via pagination links. You can do this by clicking on "Select Pagination" which follows the same principle as adding rows or columns, either automatically or manually.

Add pagination

Add pagination

Add pagination

8. Adding a custom pagination

For various reasons, pagination elements can be difficult to select on a web page. In such cases, the custom pagination selection feature may be useful.

On a website, the pagination addresses are visible at the bottom left of your browser when you hover your mouse over the links.

Add custom pagination

You can also inspect the HTML structure using the inspector.

Add custom pagination

These addresses are mostly formed as follows:

  • a url in the form of a character string consisting of the protocol (in our example https://), the domain (forum.openstreetmap.org) and the path to the page to be viewed (/viewforum.php).
  • a variable indicating the page number to be displayed. In our example, it is p=2, p=3, p=221, etc.
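This anatomy can be checked with JavaScript's built-in URL class; the address below is the one from the screenshots:

```javascript
// Decompose one of the pagination addresses from the example into the
// parts described above: protocol, domain, path, and page variable.
const u = new URL("https://forum.openstreetmap.org/viewforum.php?id=56&p=2");

console.log(u.protocol);              // "https:"
console.log(u.hostname);              // "forum.openstreetmap.org"
console.log(u.pathname);              // "/viewforum.php"
console.log(u.searchParams.get("p")); // "2" — the page number variable
```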

Click on "Select a custom pagination".

Add custom pagination

The custom pagination selection dialog opens.

You must enter:

  • the constant url: https://forum.openstreetmap.org/viewforum.php?id=56&p=
  • 4 stars representing the variable: ****
  • the start number of the pagination: 2
  • the step number between two pages: 1
  • the end number of the pagination: 9
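The five fields above combine into a list of page addresses. This is a sketch of that combination, assuming the stars simply stand in for the page number appended to the constant url (how Extractify builds the list internally is not documented here):

```javascript
// Expand a custom pagination pattern into concrete page URLs.
// The four stars (****) in the pattern stand for the page number.
function paginationUrls(constantUrl, start, step, end) {
  const urls = [];
  for (let page = start; page <= end; page += step) {
    urls.push(constantUrl + page);
  }
  return urls;
}

const urls = paginationUrls(
  "https://forum.openstreetmap.org/viewforum.php?id=56&p=", 2, 1, 9
);
console.log(urls.length); // 8 — pages 2 through 9
console.log(urls[0]);     // ends with "p=2"
console.log(urls[7]);     // ends with "p=9"
```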

Add custom pagination

9. Scraping

Extractify provides two options for scraping:

  • Request latency: scraping a website may overload the server that hosts it if requests are sent too frequently. This setting adds a time gap between scraping requests.

  • Scrap pages in its own tab: on some dynamic websites, you have to scroll the page to display all the information. Since Extractify's scraping is done by default in a new tab, which cancels any previous scroll, this option lets you launch the scraping directly in the tab where you made your selection by scrolling.
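The request latency option amounts to pausing between successive page requests. The sketch below shows the general pattern; the `fetchPage` helper and the delay value are illustrative assumptions, not Extractify's actual code:

```javascript
// Pause helper: resolves after the given number of milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Stand-in for the real page request; a real scraper would use fetch().
async function fetchPage(url) {
  return `scraped ${url}`;
}

// Scrape a list of URLs with a fixed delay between requests, so the
// target server is not flooded.
async function scrapeWithLatency(urls, delayMs) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url));
    await sleep(delayMs); // gap between two requests
  }
  return results;
}
```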

Scraping options

Once you have set the options, click on the Scrap button to scrape the information contained in the website elements you have selected.

Scrap

A new tab will open to allow you to monitor the progress of the scraping. At the same time, within Extractify you can see a counter of the objects you have defined.

Scrap

At the end of the scraping, the browser tab will close.

10. Scraping tips

  • Some pages may freeze when loading: refreshing the page or stopping its loading will unblock the scraping.

  • Remember to save your selection work regularly by clicking on the JSONize button. The resulting file can also be modified at will, then imported back into Extractify using the deJSONize button.

11. Appendices

Extractify supports CSS selectors. See here for a complete description.

Selector overview

  • tagname: find elements by tag, e.g. a
  • #id: find elements by ID, e.g. #id_name
  • .class: find elements by class name, e.g. .class_name
  • [attribute]: elements with attribute, e.g. [href]
  • [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
  • [attr=value]: elements with attribute value, e.g. [width=500] (also quotable, like [data-name='launch sequence'])
  • [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]

Selector combinations

  • el#id: elements with ID, e.g. div#id_name
  • el.class: elements with class, e.g. div.class_name
  • el[attr]: elements with attribute, e.g. a[href]
  • Any combination, e.g. a[href].highlight
  • ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
  • parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements
  • el, el, el: group multiple selectors, find unique elements that match any of the selectors; e.g. div.class_name, div.other_class_name

Pseudo selectors

  • :has(selector): find elements that contain elements matching the selector; e.g. div:has(p)
  • :not(selector): find elements that do not match the selector; e.g. div:not(.logo)
  • :contains(text): find elements that contain the given text. The search is case-insensitive; e.g. p:contains('hello world')