What is Python BeautifulSoup

Introduction to Python Web Scraping and the Beautiful Soup Library

Objective

Learn how to extract information from an HTML page using Python and the Beautiful Soup library.

Requirements

  • Understand the basics of Python and object-oriented programming

Conventions

  • # - requires the given Linux command to be executed with root privileges, either directly as the root user or by using the sudo command
  • $ - requires the given Linux command to be executed as a regular non-privileged user

Introduction

Web scraping is a technique that consists of extracting data from a website using dedicated software. This tutorial shows how to do basic web scraping using Python and the Beautiful Soup library. As the source of information for our exercise we will target the homepage of Rotten Tomatoes, the famous aggregator of reviews and news for movies and TV shows.

Installation of the Beautiful Soup library

To do our scraping we are going to use the Beautiful Soup Python library, so we need to install it first. The library is available in the repositories of all major GNU/Linux distributions, so we can install it either with our preferred package manager or with pip, the native Python way of installing packages.

If we prefer the distribution package manager and we are using Fedora:

  $ sudo dnf install python3-beautifulsoup4

On Debian and its derivatives the package is called python3-bs4:

  $ sudo apt-get install python3-bs4

On Arch Linux we can install it via pacman:

  $ sudo pacman -S python-beautifulsoup4

If we want to use pip instead, we can just run:

  $ pip3 install --user BeautifulSoup4

By running the above command with the --user flag, we install the latest version of the Beautiful Soup library just for our user, so no root privileges are needed. Of course, you can choose to install the package globally using pip, but personally I prefer per-user installations when not using the distribution package manager.
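Whichever installation method is used, a quick way to confirm that the library is importable is to print its version from Python. This is only a small check added for illustration; the variable name is arbitrary.

```python
# The package is installed as "beautifulsoup4" but imported as "bs4".
import bs4
from bs4 import BeautifulSoup

# __version__ holds the installed library version as a string.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```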

The BeautifulSoup object

Let's start: the first thing we want to do is create a BeautifulSoup object. The BeautifulSoup constructor accepts either a string or a filehandle as its first argument. The latter is what interests us: we have the URL of the page we want to scrape, so we use the urlopen method of the urllib.request library (installed by default), which returns a file-like object.
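A minimal sketch of this step follows. The helper name soup_from_url, the choice of the built-in "html.parser" parser and the sample HTML string are assumptions made for this illustration; the sample string stands in for the live page so that the snippet also runs offline.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def soup_from_url(url):
    """Fetch a page and parse it. Calling this performs a network request."""
    # urlopen returns a file-like object, which the constructor accepts directly.
    with urlopen(url) as page:
        return BeautifulSoup(page, "html.parser")


# Offline demonstration: the constructor also accepts a plain string.
SAMPLE_HTML = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(soup.title.get_text())
```

To work on the real page, one would call soup_from_url with the address of the Rotten Tomatoes homepage instead of parsing the sample string.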

At this point our soup is ready: the object represents the document in its entirety. We can start navigating it and extracting the data we want using the built-in methods and properties. Suppose we want to extract all the links contained in the page: we know that links are represented by the <a> tag in HTML and that the actual link is contained in the href attribute of the tag, so we can use the find_all method of the object we just built to do our job.
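The link extraction described above can be sketched as follows. The small HTML fragment and its URLs are invented for this example so that it runs without network access; on the real page the same loop would be applied to the soup built from the homepage.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment standing in for the downloaded page.
SAMPLE_HTML = """
<ul>
  <li><a href="/movies">Movies</a></li>
  <li><a href="/tv">TV Shows</a></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all("a") returns every <a> tag; get("href") reads the link target.
hrefs = []
for link in soup.find_all("a"):
    hrefs.append(link.get("href"))
    print(link.get("href"))
```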

By calling the find_all method and specifying the name of the tag as its first argument, we searched for all the links on the page. For each link, we then retrieved and printed the value of the href attribute. In BeautifulSoup, the attributes of an element are stored in a dictionary, so they are very easy to retrieve. In this case we used the get method, but we could also have accessed the value of the href attribute with the dictionary syntax link['href']. The full attribute dictionary itself is contained in the attrs property of the element. On the real homepage the resulting list of links is very long, so it is not reproduced here. The find_all method returns all objects that match the specified filter. In our case, we just provided the name of the tag to match and no other criteria, so all links are returned. We'll see in a moment how to narrow our search further.
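The different ways of reading an attribute can be compared in a short sketch. The markup below is made up for the illustration; the point is only the difference between get, the square-bracket syntax and attrs.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> tag has no href attribute.
SAMPLE_HTML = '<a href="/one" title="First">One</a><a name="anchor">Two</a>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first, second = soup.find_all("a")

# Three ways to reach the same value on a tag that has the attribute.
via_get = first.get("href")
via_key = first["href"]
via_attrs = first.attrs["href"]
print(via_get, via_key, via_attrs)

# On a tag without the attribute, get() returns None,
# while the square-bracket syntax raises KeyError.
missing = second.get("href")
try:
    second["href"]
    raised = False
except KeyError:
    raised = True
print(missing, raised)
```

Using get is the safer choice when iterating over every link on a page, since not all <a> tags are guaranteed to carry an href.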

A test case: retrieving all "Top Box Office" titles

Let's do some more targeted scraping. For example, let's say we want to get the titles of all the movies that appear in the Top Box Office section of the Rotten Tomatoes homepage. The first thing we want to do is analyze the HTML of the page for this section: this way we can observe that the elements we need are all contained in a table element with the id "Top-Box-Office".

We can also observe that each row of the table holds information about a movie: the movie's score is contained as text in an element with the class "tMeterScore" in the first cell of the row, while the string representing the title of the movie is contained in the second cell, as the text of the <a> tag. Finally, the last cell contains a link whose text represents the movie's box office results. With these references we can easily retrieve all the data we want.
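A sketch of this extraction, based only on the structure described above, could look like the following. The table markup is a simplified stand-in written for this example (two rows, placeholder links), not the actual Rotten Tomatoes HTML, and the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical stand-in for the "Top-Box-Office" table.
SAMPLE_HTML = """
<table id="Top-Box-Office">
  <tr>
    <td><span class="tMeterScore">93%</span></td>
    <td><a href="#">Crazy Rich Asians</a></td>
    <td><a href="#">$24.9M</a></td>
  </tr>
  <tr>
    <td><span class="tMeterScore">46%</span></td>
    <td><a href="#">The Meg</a></td>
    <td><a href="#">$12.9M</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate the table by tag name and id, then walk its rows.
table = soup.find("table", {"id": "Top-Box-Office"})

results = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # First cell: score; second cell: title; last cell: box office takings.
    score = cells[0].find(class_="tMeterScore").get_text(strip=True)
    title = cells[1].find("a").get_text(strip=True)
    money = cells[2].find("a").get_text(strip=True)
    results.append((title, money, score))
    print(f"{title} - {money} (TomatoMeter: {score})")
```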

The above code produces the following result:

Crazy Rich Asians - $24.9M (TomatoMeter: 93%)
The Meg - $12.9M (TomatoMeter: 46%)
The Happytime Murders - $9.6M (TomatoMeter: 22%)
Mission: Impossible - Fallout - $8.2M (TomatoMeter: 97%)
Mile 22 - $6.5M (TomatoMeter: 20%)
Christopher Robin - $6.4M (TomatoMeter: 70%)
Alpha - $6.1M (TomatoMeter: 83%)
BlacKkKlansman - $5.2M (TomatoMeter: 95%)
Slender Man - $2.9M (TomatoMeter: 7%)
AXL - $2.8M (TomatoMeter: 29%)

We introduced some new elements, so let's take a look at them. The first thing we did is retrieve the table with the id "Top-Box-Office" using the find method. This method works similarly to find_all, but while the latter returns a list that contains the matches found, or is empty if there are none, the former always returns the first result, or None if no element matching the specified criteria is found.
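The difference between the two methods is easy to see on a tiny document. The markup here is invented purely for the comparison.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div><p>first</p><p>second</p></div>", "html.parser")

# find returns only the first match; find_all returns every match.
first_paragraph = soup.find("p")
all_paragraphs = soup.find_all("p")
print(first_paragraph.get_text(), len(all_paragraphs))

# When nothing matches, find returns None and find_all an empty list.
no_table = soup.find("table")
no_tables = soup.find_all("table")
print(no_table, no_tables)
```

Because find may return None, it is worth checking its result before calling further methods on it when the page structure is not guaranteed.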

The first argument passed to the find method in this case is the name of the tag to be considered in the search. As the second argument we passed a dictionary in which each key represents an attribute of the tag, with its corresponding value. The key-value pairs given in the dictionary represent the criteria that must be satisfied for our search to produce a match. In this case we looked for the id attribute with the value "Top-Box-Office". Note that since each id must be unique in an HTML page, we could just omit the tag name and use the alternative keyword syntax find(id='Top-Box-Office').
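As a brief illustration, both forms of the call can be run against the same document and compared. The one-line table below is a placeholder written for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical minimal document containing the table id used in the text.
SAMPLE_HTML = '<div><table id="Top-Box-Office"><tr><td>row</td></tr></table></div>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Tag name plus a dictionary of attribute criteria.
by_dict = soup.find("table", {"id": "Top-Box-Office"})

# Keyword form: the tag name is omitted because an id is unique in a page.
by_keyword = soup.find(id="Top-Box-Office")

# Both calls locate the same element in the parsed tree.
print(by_dict is by_keyword)
```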

After getting our table object, we used the find_all method to find all of its rows and iterate over them. We used the same principles to retrieve the other elements. We also used a new method, get_text: it returns just the text contained in a tag or, when called on the soup object itself, the text of the entire page. For example, knowing that the movie's score percentage is represented by the text contained in the element with the tMeterScore class, we used the get_text method on that element to retrieve it.
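The behaviour of get_text on a single element and on the whole document can be sketched like this. The fragment is invented for the example; the separator and strip arguments are optional parameters of get_text that are not used in the text above.

```python
from bs4 import BeautifulSoup

SAMPLE_HTML = '<div><h1>Top Box Office</h1><span class="tMeterScore">93%</span></div>'
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Called on one element, get_text returns only that element's text.
score_text = soup.find(class_="tMeterScore").get_text()

# Called on the soup object, it returns the text of the whole document.
page_text = soup.get_text()

# Optional: join the pieces with a separator and strip surrounding whitespace.
spaced_text = soup.get_text(" ", strip=True)

print(score_text)
print(page_text)
print(spaced_text)
```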

In this example, we've just displayed the retrieved data with very simple formatting. In a real-world scenario, we would probably want to manipulate the data further or store it in a database.
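As one possible direction, the scraped strings could be converted to numbers and stored with the sqlite3 module from the standard library. This is only a sketch: the table name, column names and helper functions are hypothetical, the two rows are taken from the output shown earlier, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3

# Rows as (title, box office, score) strings, as produced by the scraping step.
scraped_rows = [
    ("Crazy Rich Asians", "$24.9M", "93%"),
    ("The Meg", "$12.9M", "46%"),
]


def parse_score(text):
    """Turn a string such as '93%' into the integer 93."""
    return int(text.strip().rstrip("%"))


def parse_millions(text):
    """Turn a string such as '$24.9M' into the float 24.9."""
    return float(text.replace("$", "").replace("M", "").strip())


connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE box_office (title TEXT, gross_millions REAL, score INTEGER)"
)
connection.executemany(
    "INSERT INTO box_office VALUES (?, ?, ?)",
    [(t, parse_millions(m), parse_score(s)) for t, m, s in scraped_rows],
)
connection.commit()

# Example query: the best-rated movie among the stored rows.
best = connection.execute(
    "SELECT title, score FROM box_office ORDER BY score DESC LIMIT 1"
).fetchone()
print(best)
```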

Conclusions

In this tutorial, we've just scratched the surface of what we can do with Python and the Beautiful Soup library for web scraping. The library contains many methods that you can use to refine your searches or to navigate the page more easily. I therefore strongly recommend consulting the very well-written official documentation.
