
Web Scraping: The easy way to collect and structure data from the Web

Published on 02/09/2022

Have you ever been in a situation where you needed to analyze tons of data from a website? You may then have run into some blocking factors, such as an overabundance of data spread across too many web pages to retrieve manually, and/or completely unstructured data when you tried to copy/paste it into an Excel spreadsheet, for instance.

The data are there, you can clearly see them, but you cannot use them efficiently. This kind of situation can be very frustrating!

You will be happy to learn that the Web Scraping technique can easily help you solve your problem!

The good news is that you do not need to be a professional hacker or have a Master’s Degree in IT to do Web Scraping. There are many tools available on the market, and you can easily start with the basics – already very useful – by installing the free ‘Web Scraper’ extension for Chrome on your laptop:

  • Go to ‘Customize and Control Google Chrome’ (top right of any Chrome page) > More Tools > Developer Tools (or Ctrl + Shift + I).
  • You should be able to see the option ‘Web Scraper’ in the toolbar of your Dev Tools on Chrome.
Note: If the Dev Tools are not displayed at the bottom of your Chrome page, you can change the display by going to ‘Customize and Control Dev Tools’ (top right of your Dev Tools toolbar) > Dock side > choose your display.

Now you are ready to start Web Scraping!

As you may know, most websites are built as a ‘tree structure’, meaning that a root URL redirects you to multiple web pages containing a lot of data of different types: text, links, images, tables, and so on.

The web scraping tool lets you select the data that you want to extract and repeat the operation on every page under the root URL.
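To make that idea concrete, here is a minimal Python sketch of the same loop: start at a root URL, extract items from each page, and follow the ‘Next’ link until there is none. The `fetch_page` function and the example.com URLs are invented stand-ins for real HTTP requests, so the sketch stays self-contained.

```python
def fetch_page(url):
    # Hypothetical stand-in for a real HTTP request: returns the items
    # found on the page plus the URL of the 'Next' page (or None).
    fake_site = {
        "https://example.com/bags?page=1": (["bag A", "bag B"], "https://example.com/bags?page=2"),
        "https://example.com/bags?page=2": (["bag C"], None),
    }
    return fake_site[url]

def scrape_all(root_url):
    """Loop from the root URL through every 'Next' page, collecting items."""
    items, url = [], root_url
    while url is not None:
        page_items, url = fetch_page(url)
        items.extend(page_items)
    return items

print(scrape_all("https://example.com/bags?page=1"))  # ['bag A', 'bag B', 'bag C']
```

This loop is exactly what the ‘page’ Selector you are about to build will do for you, without any code.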

Let’s start with a simple, practical example: your client has requested an analysis of the average cost per brand of the school bags offered on Amazon’s website. There are dozens of pages in the ‘school bags’ category, with an average of 60 products on each page.

  • First, you have to create a Sitemap, which is the structure (or ‘map’) of the pages and information that your Web Scraper will go through to extract the data:
    • Go to page 1 of Amazon’s listing of School Bags.
    • Open your Dev Tools and click on your ‘Web Scraper’ option.
    • Click on ‘Create new Sitemap’.
    • Name your Sitemap. In this case, let’s call it ‘amazon_bag’.
      • Note: The name of your Sitemap cannot contain any uppercase letters.
    • In Start URL, copy/paste the URL of the current webpage that will be your URL root for this Sitemap.
    • Click on ‘Create Sitemap’.
  • Now you can create your Selectors, which are the elements that the Web Scraper will use to interact with the website.
    Let’s start with the ‘page’ Selector that will guide your Web Scraper to navigate from page to page:
  • Click on ‘Add new Selector’.
  • In Id, give a name to your Selector, for instance ‘page’.
  • As Type of Selector, select ‘Pagination (Beta)’.
  • In Selector:
    • Click on ‘Select’. A selection command should appear at the top of the Dev Tools toolbar.
    • Then click on ‘Next’ (page) on Amazon’s webpage. If done correctly, the ‘Next’ element will be highlighted in red.
    • Click on ‘Done selecting’ (green button).
  • In Parent Selectors, make sure that both ‘_root’ and ‘page’ are selected.

Note: In this case, we are going to create a loop through the web pages by repeatedly going to the next page (until the last one). It is therefore important to select ‘page’ in addition to ‘_root’ as Parent Selectors, because every new page is the child of the previous page and the parent of the next one.

  • Click on ‘Save Selector’.

You just created a loop through all of Amazon’s web pages in the ‘School Bags’ category. You are on the right track!

  • You still need to select the data that you want to extract, and you will be ready to launch your Web Scraping.
    As for the ‘page’ Selector, you will need to build a ‘Data’ Selector:


  • Click on ‘Add new Selector’.
  • In Id, name your Selector, for instance ‘product_description’.
  • Here comes the delicate part as you have to select:
    • The appropriate Type of your ‘Data’ Selector.

Unlike the ‘page’ Selector, whose Type is clearly defined in the Chrome Web Scraper tool, your data can come in, and be extracted in, many formats: Text, Link, Image, Table, HTML… Depending on your needs and the Type that you select, it will be more or less complicated to use your data efficiently once they are extracted.

For this example, we will just define the Type as ‘Text’, meaning that the Web Scraper will only extract the Text information in the Selector elements that we are going to select.
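To illustrate what a ‘Text’-type extraction does, here is a small sketch using Python’s standard `html.parser`: it keeps only the text inside a matched element and drops all the tags. The HTML snippet, brand, and price are invented examples, not real Amazon markup.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content of the HTML it is fed, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

# Hypothetical product frame, similar in spirit to what the Selector matches.
html = '<div class="product"><span class="brand">Eastpak</span> <span class="price">49,99 €</span></div>'
parser = TextExtractor()
parser.feed(html)
print(" ".join(parser.chunks))  # Eastpak 49,99 €
```

The extension performs the equivalent stripping for you on every element matched by the Selector.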

  • The Selector itself, or more concretely, the representative elements that you want to extract from the webpages:
    • In Selector, click on ‘Select’. A selection command should appear at the top of the Dev Tools toolbar.
    • The tricky part is to click/select – more or less precisely – the elements that you want to extract on Amazon’s webpage. The first element(s) you click on will be used as reference(s) by the Web Scraper to try to find the same (structure of) elements across the webpages.
      In our case, we are going to select the frame containing the image, description, price, etc. of some products (here, the bags) until ALL the frames on the current webpage are automatically highlighted in red by the Web Scraper.
      Note: Sometimes, even if they look identical to the others, the elements of some products may be structured in a specific way that prevents the Web Scraper from identifying them. In this case, you can always add another Selector afterwards that focuses on extracting the data of these specific elements.
    • Click on ‘Done selecting’ (green button).

  • In Parent Selectors, as you will find data on the root page and the next pages, make sure that both ‘_root’ and ‘page’ are selected.
  • Click on ‘Save Selector’.

You just finalized your Sitemap!
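Under the hood, Web Scraper stores a Sitemap as a JSON document that you can export and share (Sitemap menu > Export Sitemap). For the Sitemap built above it looks roughly like the sketch below; the exact type names and fields may differ between extension versions, and the start URL and CSS selectors are illustrative, not Amazon’s real ones.

```json
{
  "_id": "amazon_bag",
  "startUrl": ["https://www.amazon.com/s?k=school+bags"],
  "selectors": [
    {
      "id": "page",
      "type": "SelectorPagination",
      "parentSelectors": ["_root", "page"],
      "selector": ".pagination-next"
    },
    {
      "id": "product_description",
      "type": "SelectorText",
      "parentSelectors": ["_root", "page"],
      "selector": "div.product",
      "multiple": true
    }
  ]
}
```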

Now you can launch the Web Scraping process itself:

  • Go to the Toolbar and click on ‘Sitemap amazon_bag’.
  • Click on ‘Scrape’: a configuration table with ‘Request interval (ms)’ and ‘Page load delay (ms)’ should appear.
  • Click on ‘Start Scraping’.

Note: By default, the interval between each Scraping request and the delay to load every page through the Loop process is defined as 2000 ms. Many websites are monitoring their traffic in order to identify and interrupt hacking or any other suspicious automatic activities that are typically faster than human abilities. This default configuration simulates the average speed of a normal user.
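The throttling behind those two settings can be sketched in a few lines of Python: wait out the remainder of the interval between requests so the traffic looks like a person browsing rather than a burst of automated hits. The `polite_fetch` helper and its `fetch` parameter are illustrative names, not part of the extension.

```python
import time

REQUEST_INTERVAL_MS = 2000  # default 'Request interval (ms)' in Web Scraper

def seconds_to_wait(interval_ms, elapsed_ms):
    """Remaining wait before the next request, given time already spent on this one."""
    return max(0, (interval_ms - elapsed_ms) / 1000)

def polite_fetch(urls, fetch):
    # Fetch each URL, then sleep whatever is left of the interval.
    results = []
    for url in urls:
        start = time.monotonic()
        results.append(fetch(url))
        elapsed_ms = (time.monotonic() - start) * 1000
        time.sleep(seconds_to_wait(REQUEST_INTERVAL_MS, elapsed_ms))
    return results
```

If a request itself already took longer than the interval, `seconds_to_wait` returns 0 and the next request starts immediately.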

  • A new window showing Amazon’s web pages continuously updating should appear. That is the Web Scraper running its scraping program. Do not close it!
  • Once the program is done and the window is closed, you can click on ‘Sitemap amazon_bag’ > Extract data > Download as .XLSX or .CSV.

Now you should have a big Excel file full of data that you can use however you like.

If needed, do not hesitate to check our article ‘5 Excel Tips & Tricks you absolutely need’ to find some tips to make your input data usable.
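For the example above, the client’s question – the average cost per brand – can also be answered straight from the exported file with a few lines of Python. The column names ‘brand’ and ‘price’ and the sample rows below are assumptions for the sketch; check the headers of your own export.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the downloaded file; a real run would use open("amazon_bag.csv").
csv_data = io.StringIO(
    "brand,price\n"
    "Eastpak,49.99\n"
    "Eastpak,59.99\n"
    "Nike,39.99\n"
)

totals, counts = defaultdict(float), defaultdict(int)
for row in csv.DictReader(csv_data):
    totals[row["brand"]] += float(row["price"])
    counts[row["brand"]] += 1

# Average price per brand, rounded to cents.
averages = {brand: round(totals[brand] / counts[brand], 2) for brand in totals}
print(averages)  # {'Eastpak': 54.99, 'Nike': 39.99}
```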

Congratulations, you have just run your first Web Scraping program!

For those of you who are wondering, Web Scraping is not illegal. However, you have to be careful not to use the collected data in an illegal way, such as using non-transformed data from another website to build a similar commercial offer on your own website…

In conclusion, Web Scraping is a specialized technique for extracting data from a website via a script or a program. The data are collected and exported in a structured way into a format that is more convenient to use, depending on your objective (e.g. the .csv format for analytical purposes). This technique is especially useful for bid comparison (prices, services), lead generation, e-commerce, website content crawling, retail and brand monitoring, Business Intelligence, machine learning…

Now it is up to you to run it to discover all its specificities and apply it to your own needs!
