Web Scraping 4: Testing The Download Function

Now that we have the beginnings of a web scraper, we need to test that it works. The following site is designed specifically for practising web scraping:

http://webscraper.io/test-sites/e-commerce/allinone/computers/laptops

Our program should currently look a bit like this:

#!/usr/bin/env python
from urllib.parse import urlparse

import requests
import fake_useragent

USERAGENT = fake_useragent.UserAgent()

def validUrl(url):
    """ Returns True/False depending on whether the url is valid or not """
    parsed = urlparse(url)
    # A URL without a scheme (http/https) is not usable by requests
    return bool(parsed.scheme)

def download(url, header=None, proxy=None, timeout=5):
    """ Downloads a page using requests and returns the page content """
    try:
        page = requests.get(url, headers=header, proxies=proxy, timeout=timeout)
        return page.content
    except requests.RequestException:
        # Connection error, timeout, etc.: signal failure to the caller
        return False

if __name__ == "__main__":
    print("Enter a URL")
    url = input(">>> ")
    if validUrl(url):
        print("[+] URL is valid")
    else:
        print("[-] URL is invalid")
        raise SystemExit

    content = download(url)
    if content:
        print(content[:300])

When you run the program, you should get output that looks a bit like the following:

Enter a URL
>>>

At this point, enter the URL at the top of this post and you should get output like the one below. Only the first 300 characters are shown, courtesy of the [:300] slice at the end of our print statement; if you want to print all of it, just remove the [:300].

[+] URL is valid
<!DOCTYPE html>
<html>
<head>
    <title>Web Scraper</title>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
    <meta name="keywords" content="web scraping,Web Scraper,Chrome extension,Crawling,Cross platform scraper, " />
    <meta name="description" content="Web Sc

To add a custom header, define it above the download call and pass it in. This makes requests send a randomised User-Agent, so the request looks more like it came from a real web browser:

header = {"User-Agent": USERAGENT.random}
content = download(url, header=header)
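If you want to confirm the header is really being sent, one quick check (just a sketch, using the public httpbin.org echo service, which is not part of this series) is to download a page that reflects your request headers back at you. The User-Agent you sent should appear in the JSON it returns:

# Sketch: httpbin.org/headers is a public echo service that returns
# the request headers it received as JSON
header = {"User-Agent": USERAGENT.random}
echoed = download("https://httpbin.org/headers", header=header)
if echoed:
    print(echoed)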

If you want to specify just a proxy, you need to change your function call to match. This is because the function expects its parameters in a specific order: if you pass the proxy argument second, the function will use it as the header.

download(url, None, proxyServer)
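Keyword arguments sidestep the ordering problem entirely and are worth preferring here. One thing the snippet above glosses over is that requests expects proxies as a dictionary mapping scheme to proxy URL, so proxyServer should look something like the sketch below (the address is a placeholder, not a real proxy):

# Placeholder proxy address for illustration only; substitute your own
proxyServer = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:3128",
}

# Passing proxy by keyword means argument order no longer matters
content = download(url, proxy=proxyServer)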

Check out the next post in this tutorial to find out how to actually start parsing HTML.
