Now that we can download our test page, we need to parse the HTML so that our program can read it easily. This tutorial gives a basic overview of the Python module Beautiful Soup.
Before we do anything else, we need to install the BeautifulSoup module, as this is what allows us to parse HTML.
sudo -H pip install bs4
sudo apt-get install python-lxml
The site that I will be scraping from can be found here. I have downloaded a copy of the webpage so I don’t have to download it every time I run the program.
#!/usr/bin/env python
from bs4 import BeautifulSoup


def openFile():
    """
    Returns the HTML in the webscraping.html file
    """
    htmlFile = open("webscraping.html", "r")
    html = htmlFile.read()
    htmlFile.close()
    return html


html = openFile()
soup = BeautifulSoup(html, "lxml")
The above code opens an HTML file called webscraping.html, reads the contents and then closes the file again. It then creates a soup from the HTML. A soup is a BeautifulSoup object. To print it, you can use the .prettify() method to make the HTML readable. Being able to print the HTML is all well and good, but what we actually want is to pull data from it. We don’t want to print the whole thing, so let’s just print the first 300 characters.
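As a minimal sketch of that step: the snippet below uses a short inline HTML string as a hypothetical stand-in for webscraping.html, and Python's built-in "html.parser" instead of "lxml" so that it runs without the extra install. Both of these are assumptions made for illustration; with the real page you would keep the soup created earlier.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the contents of webscraping.html
html = "<html><body><h1>Test page</h1><p>Some text</p></body></html>"

# "html.parser" ships with Python; swap in "lxml" if you have it installed
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the whole document as an indented string
pretty = soup.prettify()

# Slicing the string keeps only the first 300 characters
preview = pretty[:300]
print(preview)
```

Because prettify() returns an ordinary string, the slice works like any other string slice, and a page shorter than 300 characters is simply printed in full.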
We can search the “soup” for tags using the find_all function. This function returns a list containing every instance of the specified tag found in the HTML. So, say we want to find all of the links in the HTML:
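# find_all never puts None in its results, so no None check is needed
links = soup.find_all("a")
for link in links:
    # print() with a single argument works on both Python 2 and Python 3
    print(link)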
With the above code you will see that all the links in the page have been found and printed. However, we aren’t interested in all the links. What we really want is just the names of the products on the page. If you look at each product, they all have one thing in common: they all have the title class. Using this, we can fine-tune our find_all call.
links = soup.find_all("a", class_="title")
By replacing the existing line with this, we only get the products that we are interested in. When you run the new code you can still see the HTML tag and attributes. To get rid of these we use the .contents attribute, which gives us the contents of the tag, i.e. anything between the opening and closing tags. Because we know that there is nothing inside the tag other than the product name, we can take the first item, link.contents[0], to get just that.
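Here is a small sketch of the filtered search together with .contents. The markup, product names and URLs are made up for illustration, and "html.parser" is used only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the product listing page
html = """
<div><a class="title" href="/p/1">Laptop One</a></div>
<div><a class="title" href="/p/2">Laptop Two</a></div>
<div><a href="/about">About us</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

names = []
# class_ (with a trailing underscore) filters on the HTML class attribute
for link in soup.find_all("a", class_="title"):
    # contents[0] is the text node inside the tag; str() makes it a plain string
    name = str(link.contents[0])
    names.append(name)
    print(name)
```

The "About us" link is skipped because it has no title class, so only the two product names are printed.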
The [0] is required because .contents is a list. In this example the list only has one item in it, so we want the first item, which has an index of 0.
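To see why the index matters, here is a rough sketch using a made-up tag that has more than one child. The markup is hypothetical and not taken from the test page. It also shows get_text(), which is another way to get at the text when a tag contains nested tags.

```python
from bs4 import BeautifulSoup

# Hypothetical tag with two children: a text node and a nested <b> tag
html = '<a class="title" href="/p/1">Laptop One <b>New</b></a>'
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", class_="title")

# .contents is a list of the tag's direct children
children = link.contents
print(len(children))

# The first child is the text that comes before the <b> tag
first = str(children[0])
print(first)

# get_text() joins the text of all children, including nested tags
full_text = link.get_text()
print(full_text)
```

On the test page each product link holds only its name, so taking index 0 is enough there.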
In the next post we will build a dictionary of products using details on the page, such as price and processor type.