Extractify is a free extension for Chromium, developed in JavaScript with the Atom editor, whose purpose is to scrape structured data from the web. It is particularly designed for collecting comments or online conversations, such as forum threads.
It allows you to:
- Select structured information on a web page (like tables with rows and columns), by direct selection on the web page, or manual selection by entering HTML tags and related CSS code
- Select the pagination of pages with the same structure and level
- Repeat the process as many times as desired for lower levels
- Scrape the whole selection
- Finally, obtain a file in JSON format that can easily be imported into other software, such as L@ME.
What it does not allow: everything else!
- Go to Releases to download the latest version and unzip it
- In the Chrome address bar, go to the extensions page by typing chrome://extensions/ and load the extractify folder as an unpacked extension
From a technical point of view, Extractify uses the identification of HTML elements within the structure of a web page to extract the data. To find these HTML elements, called tags, Extractify relies on the CSS code (style sheets) that manages their formatting (see the selection possibilities in the appendices).
Conceptually, for Extractify, the level is the depth of a page. The page on which you are going to run the plugin for the first time will be level 0. The deeper pages that can be reached by hyperlinks will therefore be at levels -1, then -2, then -3 ...etc.
On each page, Extractify considers that the data you want to extract is structured in rows and columns, i.e. in the form of a table. You will therefore first have to select rows, then columns within these rows. Then, you can select pagination links, i.e. links to pages of equivalent depth to the observed page and structured in the same way as the observed page.
Finally, you can add a lower level by selecting links to this lower level that will be located within the lines already selected.
Extractify was initially developed to extract conversational data from online forums. Using such a typical example of its use, you will start by running Extractify on the page that displays all the forums to be scraped as links. This page will be considered as level 0. Each link on this page, leading to each forum, will be a link to a lower level (level -1) of that forum. This lower level will display the topics. Then each link to each topic will itself be a link to a lower level (level -2) of that topic. This lower level will display the messages. Thus, scraping a web page displaying links to forums will be like scraping 3 levels of depth:
- Level 0: the page displaying the forums, from which, in addition to the forum link, we can extract the forum title, its number of views, its number of responses, etc.
- Level -1: the pages displaying the topics of these forums, from which, in addition to the topic links, we can extract the title of the topic, its number of views, its number of responses, etc.
- Level -2: the pages displaying the messages of these topics, from which we can extract the author's name, the date of the message, the message itself, etc.
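A three-level scrape like this yields a nested JSON file. As a rough sketch only (the actual field names depend on the level types and column titles you choose, and are hypothetical here), the structure looks something like this:

```json
{
  "type": "forums",
  "rows": [
    {
      "title": "Forum A",
      "views": "1024",
      "depth": {
        "type": "topics",
        "rows": [
          {
            "title": "Topic 1",
            "depth": {
              "type": "messages",
              "rows": [
                { "author": "alice", "date": "2020-01-01", "message": "Hello" }
              ]
            }
          }
        ]
      }
    }
  ]
}
```

Each level nests inside the rows of the level above it, mirroring the forum → topic → message hierarchy.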

Click on the plugin icon in the Chromium extension bar:
The plugin opens on the dialog for adding level 0. You can:
- Add a level: then enter a level type (required) and click on "Select this level type":

- Or open a json file previously saved with Extractify: Click on "Cancel" then on "deJSONize" and select the desired file.
When you have created level 0, you can add a new line by clicking on "Add row".

It opens the selection row dialog.

When the dialog opens, you can:
- Click directly on the "Add row" button to select the rows automatically: move your mouse over the web page in your browser and select a row by clicking on an area that is highlighted as you hover over the HTML structure of the page:

On click, the identical rows, i.e. rows that have the same HTML and CSS structure, are also highlighted:

And the Extractify interface is updated:

- Or enter a CSS selector and click on "Add row" to select the rows manually. In this case, you can use the Chromium HTML inspector, which you can open by right-clicking on the element you are looking for, then choosing "Inspect":

By hovering over the elements of the HTML structure on the left of the inspector, you can identify the desired row and the CSS class attached to it:

In this example, the desired row is identified by the HTML tag <tr> whose CSS class is roweven (in a CSS selector, you write the tag, then a dot, then the class name, i.e. here tr.roweven). All you have to do is enter this information in the Extractify interface and click on "Add row":

Unfortunately, in this example, only every second row has been highlighted.

The solution here is to use the CSS attribute selector class*=, which means "whose class contains the text ...". You can now write:

And all the desired rows are now highlighted.
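To recap, here are the two selectors side by side. This assumes the alternating rows carry a class such as rowodd, which is typical of forum markup but not shown in the example:

```css
tr.roweven      /* matches only the rows with class "roweven", i.e. every second row */
tr[class*=row]  /* matches any <tr> whose class contains "row": roweven, rowodd, ... */
```

The attribute selector trades precision for coverage, which is exactly what you want when a site alternates row classes purely for styling.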

Adding columns follows the same principle as adding rows, either automatically or manually. Click on the "Add column" button within a row line.

Then enter a column title.

Here you can:
- Click directly on the "Add column" button to select the columns automatically;
- Enter the CSS selector and click on "Add column" to select the columns manually.


In order to add a lower level, you must first collect links to this lower level: these are depth links. By clicking on the "Add depth link" button within a row line, the selection procedure is the same as for a column, in automatic or manual mode, except that you are restricted to selecting a hyperlink.



Once you have collected your depth links, click on "Add level", taking care to select a URL that points to a page containing pagination links if you expect to find any:


Choose a URL to a lower level among the collected URLs, then click on "Select this url" to display the dialog for adding a lower level:


You may want to scrape pages of the same level, but located at different addresses accessible via pagination links. You can do this by clicking on "Select Pagination", which follows the same principle as adding rows or columns, either automatically or manually.



On some websites, pagination elements are difficult to select directly on the page. In that case, the custom pagination selection feature can be useful.
On a website, the pagination addresses are visible at the bottom left of your browser when you hover your mouse over the links.

You can also inspect the HTML structure using the inspector.

These addresses are mostly formed as follows:
- a URL in the form of a character string consisting of the protocol (in our example https://), the domain (forum.openstreetmap.org) and the path to the page to be viewed (/viewforum.php);
- a variable indicating the page number to be displayed: in our example, p=2, p=3, p=221, etc.
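This anatomy can be checked programmatically. As a small sketch, the standard URL class decomposes the example address into exactly these parts:

```javascript
// Decompose a pagination address into its parts using the built-in URL class.
const address = new URL("https://forum.openstreetmap.org/viewforum.php?id=56&p=2");

console.log(address.protocol);              // "https:" — the protocol
console.log(address.hostname);              // "forum.openstreetmap.org" — the domain
console.log(address.pathname);              // "/viewforum.php" — the path
console.log(address.searchParams.get("p")); // "2" — the page number variable
```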
Click on "Select a custom pagination".

The custom pagination selection dialog opens. Entering a custom pagination selection works as follows. You must enter:
- the constant URL: https://forum.openstreetmap.org/viewforum.php?id=56&p=
- 4 stars representing the variable: ****
- the start number of the pagination: 2
- the step between two pages: 1
- the end number of the pagination: 9
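Taken together, these five inputs simply enumerate page addresses. A hypothetical sketch of the idea (buildPaginationUrls is an illustration, not part of Extractify's code):

```javascript
// Expand a constant URL plus the "****" placeholder into the list of page
// addresses, from `start` to `end` inclusive, in increments of `step`.
function buildPaginationUrls(constantUrl, start, step, end) {
  const urls = [];
  for (let page = start; page <= end; page += step) {
    urls.push(constantUrl + page); // the "****" variable is replaced by the page number
  }
  return urls;
}

const urls = buildPaginationUrls(
  "https://forum.openstreetmap.org/viewforum.php?id=56&p=", 2, 1, 9
);
console.log(urls[0]);     // "https://forum.openstreetmap.org/viewforum.php?id=56&p=2"
console.log(urls.length); // 8 — pages p=2 through p=9
```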

Extractify provides 2 options for scraping:
- Request latency: scraping a website may overload the server hosting it if requests are sent too frequently. This setting allows you to add a time gap between scraping requests.
- Scrap pages in its own tab: on some dynamic websites, you have to scroll the page to display all the information. Since Extractify's scraping is done by default in a new tab, which cancels a previous scroll, this option allows you to launch the scraping directly in the tab you used to make your selection by scrolling.
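The request-latency option amounts to pausing between requests. A minimal sketch of the idea, assuming a fetchPage function that loads one page (hypothetical, not Extractify's internal API):

```javascript
// Pause helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch pages one by one, waiting `latencyMs` between requests so the
// target server is not flooded with back-to-back requests.
async function scrapeWithLatency(urls, latencyMs, fetchPage) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchPage(url));
    await sleep(latencyMs); // the time gap the option adds
  }
  return results;
}
```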

Once you have set the options, click on the Scrap button to scrape the information contained in the website elements you have selected.

A new tab will open to allow you to monitor the progress of the scraping. At the same time, within Extractify you can see a counter of the objects you have defined.

At the end of the scraping, the browser tab will close.
- Some pages may freeze when loading: refreshing or stopping the loading of the page will unblock the scraping.
- Remember to save your selection work regularly by clicking on the JSONize button. This file can also be modified at will, then imported back into Extractify using the DeJSONize button.
Extractify supports CSS selectors. See here for a complete description.
- tagname: find elements by tag, e.g. a
- #id: find elements by ID, e.g. #id_name
- .class: find elements by class name, e.g. .class_name
- [attribute]: elements with attribute, e.g. [href]
- [^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
- [attr=value]: elements with attribute value, e.g. [width=500] (also quotable, like [data-name='launch sequence'])
- [attr^=value], [attr$=value], [attr*=value]: elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
- el#id: elements with ID, e.g. div#id_name
- el.class: elements with class, e.g. div.class_name
- el[attr]: elements with attribute, e.g. a[href]
- Any combination, e.g. a[href].highlight
- ancestor child: child elements that descend from an ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
- parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements
- el, el, el: group multiple selectors, find unique elements that match any of the selectors, e.g. div.class_name, div.other_class_name
- :has(selector): find elements that contain elements matching the selector, e.g. div:has(p)
- :not(selector): find elements that do not match the selector, e.g. div:not(.logo)
- :contains(text): find elements that contain the given text. The search is case-insensitive, e.g. p:contains('hello world')