Web Scraping 5: Parsing HTML

Now that we can download our test page, we need to parse the HTML so that our program can read it easily. This tutorial will give you a basic overview of the Python module Beautiful Soup.

Before we do anything else we need to install the Beautiful Soup module, as this is what will allow us to parse HTML. We will also install lxml, the parser that Beautiful Soup will use under the hood.

sudo -H pip install bs4
sudo apt-get install python-lxml
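If you want to confirm that both modules installed correctly, a quick sanity check is to import them from a short script; this isn't part of the scraper itself, just a way to catch install problems early.

#!/usr/bin/env python
# Sanity check: both imports should succeed without raising ImportError.
from bs4 import BeautifulSoup
import lxml

print "bs4 and lxml are installed"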

The site that I will be scraping from can be found here. I have downloaded a copy of the webpage so I don’t have to download it every time I run the program.
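If you want to make your own local copy, something along these lines would do it; the URL below is just a placeholder, as the actual download step was covered earlier in this series.

#!/usr/bin/env python
# Sketch: download the page once and save it locally.
# The URL is a placeholder -- replace it with the page you are scraping.
import urllib2

url = "http://example.com/products"
response = urllib2.urlopen(url)
with open("webscraping.html", "w") as htmlFile:
    htmlFile.write(response.read())

With the file saved locally, we can load and parse it: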

#!/usr/bin/env python
from bs4 import BeautifulSoup

def openFile():
    """ Returns the HTML in the webscraping.html file """
    htmlFile = open("webscraping.html", "r")
    html = htmlFile.read()
    htmlFile.close()
    return html

html = openFile()
soup = BeautifulSoup(html, "lxml")

The above code begins by opening an HTML file called webscraping.html, reading its contents, and then closing the file again. The code then creates a soup from the HTML. A soup is a BeautifulSoup object; to print it in a readable form you can use the .prettify() method. Being able to print the HTML is all well and good, but what we actually want is to pull data out of it. We don’t want to print the whole thing, so let’s just print the first 300 characters.

print soup.prettify()[:300]

We can search the “soup” for tags using the find_all method. This method returns a list containing every instance of the specified tag found in the HTML. So, say we want to find all of the links in the HTML:

# find_all returns a list of matching Tag objects, so we can print each directly
links = soup.find_all("a")
for link in links:
    print link

With the above code you will see that all the links in the page have been found and printed. However, we aren’t interested in all of them; what we really want is just the names of the products on the page. If you look at each product link, they all have one thing in common: the title class. Using this we can fine-tune our find_all call.

links = soup.find_all("a", class_="title")

By replacing the existing find_all line with this one, we will only get the products that we are interested in. When you run the new code you can still see the HTML tag and its attributes. To get rid of these we use the .contents attribute, which gives us the contents of the tag, i.e. anything between the opening and closing tags. Because we know that there is nothing inside each tag other than the product name, we can use the line below inside the loop to print just that.

print link.contents[0]

The reason the [0] is required is that the .contents attribute returns a list. This list only has one item in it for this example, so we want the first item, which has an index of 0.
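Putting all of this together, the finished script looks something like the following. The title class comes from the example page above; a different site would use whatever class name its product links carry.

#!/usr/bin/env python
from bs4 import BeautifulSoup

def openFile():
    """ Returns the HTML in the webscraping.html file """
    htmlFile = open("webscraping.html", "r")
    html = htmlFile.read()
    htmlFile.close()
    return html

html = openFile()
soup = BeautifulSoup(html, "lxml")

# Each product link is assumed to look like:
#   <a class="title" href="...">Product Name</a>
links = soup.find_all("a", class_="title")
for link in links:
    print link.contents[0]

Note that .contents returns every child of a tag, nested elements included; if a tag held anything more than plain text, link.string or link.get_text() would be safer ways to pull out just the text.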

In the next post we will look at building ourselves a dictionary of products using details on the page such as price and processor type.

2 thoughts on “Web Scraping 5: Parsing HTML”

  1. Nice guide! Some of those modules look pretty nifty. I’ll keep them in mind as I expand the scraper that I’m currently building. Are headers really necessary though?

    1. Sometimes headers are required to make your request look more like a normal browser, as some websites actively block requests that look like they come from bots.
