Web Scraping 2: Downloading Pages

In this part of the series we will download a page, with options for using a proxy or a custom user agent.

Before we can begin downloading pages we need a particular module called requests.

sudo -H pip install requests

The function we are about to write can go just beneath our checkUrl function.

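import requests  # belongs with the other imports at the top of the file

def download(url, header = None, proxy = None, timeout = 5):
    """ Downloads a page using requests and returns the page content """
    try:
        page = requests.get(url, headers = header, proxies = proxy, timeout = timeout)
        return page.content
    except requests.RequestException:
        # Connection errors, timeouts, invalid URLs and similar request failures
        return False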
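
Since download returns the raw bytes of the response on success and False on failure, the calling code has to tell the two apart. The sketch below is one way to do that. fetch_text and its downloader parameter are hypothetical names introduced for illustration, and a stand-in function takes the place of download so the example runs without a network connection.

```python
def fetch_text(url, downloader, encoding="utf-8"):
    """Return the page as text, or None if the download failed.

    fetch_text is a hypothetical helper; downloader is any function with the
    same contract as download (bytes on success, False on failure).
    """
    content = downloader(url)
    # Compare against False explicitly: an empty page (b"") is also falsy
    if content is False:
        return None
    return content.decode(encoding, errors="replace")


def fake_download(url):
    """Stand-in for download so the example needs no network access."""
    return b"<html>hello</html>" if url.startswith("http") else False


print(fetch_text("http://example.com", fake_download))  # <html>hello</html>
print(fetch_text("not a url", fake_download))           # None
```

In real use you would pass download itself as the second argument.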

Called with only a URL, our download function retrieves the page with the requests defaults: the library's own User-Agent header and whatever proxy settings are configured in the environment. You can pass extra arguments such as a custom header and a proxy. The proxy argument requires a dictionary following the pattern below.

proxy = {
    "http": "<example url>",
    "https": "<example url>"
}

If your proxy server requires a username and password, use the following dictionary instead.

proxy = {
    "http": "http://<user>:<pass>@<hostname>",
    "https": "https://<user>:<pass>@<hostname>"
}
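
A username or password that contains characters such as @ or : has to be percent-encoded before it goes into the proxy URL. Here is a minimal sketch of building the dictionary, assuming a hypothetical helper named build_proxies and a made-up proxy host. It uses only the standard library.

```python
from urllib.parse import quote


def build_proxies(hostname, user=None, password=None):
    """Build a proxies dictionary following the pattern shown above.

    build_proxies is a hypothetical helper, not part of requests.
    """
    auth = ""
    if user is not None and password is not None:
        # safe="" also encodes "/", so nothing in the credentials
        # can be mistaken for URL structure
        auth = "{}:{}@".format(quote(user, safe=""), quote(password, safe=""))
    return {
        "http": "http://{}{}".format(auth, hostname),
        "https": "https://{}{}".format(auth, hostname),
    }


# proxy.example.com:8080 is a placeholder, not a real proxy server
proxy = build_proxies("proxy.example.com:8080", "alice", "p@ss:word")
print(proxy["http"])  # http://alice:p%40ss%3Aword@proxy.example.com:8080
```

The resulting dictionary can be passed to download as the proxy argument.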

The custom header relies on another module, fake-useragent, which needs to be installed.

sudo -H pip install fake-useragent

This module needs to be imported along with requests.

import fake_useragent

USERAGENT = fake_useragent.UserAgent()

We can now create a user agent from the USERAGENT constant and pass it to the function. requests expects the header argument to be a dictionary, so the string goes under the "User-Agent" key, for example {"User-Agent": USERAGENT.random}. Several default user agents are available as attributes, listed below. (The best one to use is the USERAGENT.random option.)

USERAGENT.opera # Opera/9.80
USERAGENT.chrome # Chrome/22.0.1216.0
USERAGENT.google # Chrome/24.0.1290.1
USERAGENT.firefox # Firefox/16.0.1
USERAGENT.ff # Firefox/15.0.1
USERAGENT.safari # Safari/8536.25
USERAGENT.random # Random user agent
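
To keep the header handling in one place, the user agent string can be wrapped by a small function before it is handed to download. This is only a sketch: build_headers is a hypothetical helper, and the hard-coded string stands in for USERAGENT.random so the example runs without fake-useragent installed.

```python
def build_headers(user_agent):
    """Wrap a user agent string in the dictionary shape requests expects.

    build_headers is a hypothetical helper introduced for illustration.
    """
    if not user_agent or not user_agent.strip():
        raise ValueError("user_agent must be a non-empty string")
    return {"User-Agent": user_agent.strip()}


# Placeholder string; in the tutorial's code this would be USERAGENT.random
header = build_headers("Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0")
print(header)

# The result is then passed along, e.g. download(url, header=header)
```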

Continue on to the next part of this tutorial where we will see about adding some retries to our download function.
