Countless times I have bought a CD and ripped it to my music collection, only to find that it has no metadata: no artist, no album and, worst of all, no album art (I like my album art, alright?). There are plenty of tools out there that could fix this for me, but I wanted one that I had made myself, because why not?
Now that we can download our test page, we need to parse the HTML so that our program can read it easily. This tutorial gives you a basic overview of the Python module Beautiful Soup.
The optparse module is a great way to add command line arguments to your Python scripts. It has an easy syntax, a built-in help option and plenty of ways to customize how your program handles its options.
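To give a feel for that syntax, here is a minimal sketch of an optparse setup for a scraper script. The option names (`--user-agent`, `--retries`, `--verbose`), their defaults and the `build_parser` helper are my own assumptions for illustration, not something taken from the full post.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with a few hypothetical scraper options."""
    parser = OptionParser(usage="usage: %prog [options] url")
    # A string option with a default value (the option name is an assumption).
    parser.add_option("-u", "--user-agent", dest="user_agent",
                      default="my-scraper",
                      help="user agent string to send [default: %default]")
    # type="int" makes optparse convert the value and reject non-integers.
    parser.add_option("-r", "--retries", dest="retries", type="int",
                      default=3,
                      help="number of download attempts [default: %default]")
    # A simple on/off flag.
    parser.add_option("-v", "--verbose", action="store_true",
                      dest="verbose", default=False,
                      help="print progress messages")
    return parser


# Passing a list here instead of reading sys.argv keeps the example repeatable.
options, args = build_parser().parse_args(["-r", "5", "-v", "http://example.com"])
print(options.retries, options.verbose, args)
```

Running the script with `-h` or `--help` prints the usage line and the help text for each option, with no extra code needed.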
Now that we have the beginnings of a web scraper, we need to test that it works. The following site is a test site for web scraping.
Our download function currently doesn’t do much in the way of retrying downloads. In this next part we will add some code so that the function tries to download the page up to 3 times if it fails.
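As a rough idea of what that retry logic could look like, here is a minimal sketch built on the standard library's `urllib`. The signature of `download`, the `num_retries` and `opener` parameters, and the reading of "3 times" as three attempts in total are all assumptions on my part rather than the code from the post.

```python
import urllib.error
import urllib.request


def download(url, num_retries=3, opener=urllib.request.urlopen):
    """Return the page body as bytes, or None if every attempt fails.

    `opener` is a hypothetical hook so the fetch step can be swapped out,
    for example with a fake in tests.
    """
    for attempt in range(1, num_retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.URLError as err:
            # HTTPError is a subclass of URLError, so HTTP errors land here too.
            print("Attempt %d of %d failed: %s" % (attempt, num_retries, err))
    return None
```

Calling `download("http://example.com")` would then make up to three requests before giving up and returning `None`. A fuller version might only retry on server errors or wait between attempts, but that is beyond this sketch.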
In the next part of this series we will deal with downloading a page with options for using proxies or custom user agents.
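For a taste of what those options might involve, here is a small sketch using `urllib.request` from the standard library. The helper names `build_request` and `make_opener`, the default user agent string and the proxy address in the usage lines are made up for illustration, and the post itself may well take a different approach.

```python
import urllib.request


def build_request(url, user_agent="my-scraper"):
    """Create a request that sends a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def make_opener(proxy=None):
    """Create an opener that routes traffic through `proxy` when one is given."""
    handlers = []
    if proxy:
        # Use the same (hypothetical) proxy for both http and https URLs.
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    return urllib.request.build_opener(*handlers)


# Nothing is sent over the network until opener.open(request) is called.
request = build_request("http://example.com", user_agent="test-agent")
opener = make_opener(proxy="http://127.0.0.1:8080")
```

Keeping the request and the opener separate means the user agent and the proxy can be changed independently of each other.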
Web scraping is a very useful technique, and Python makes it really easy to do. In this part of the series I will deal with downloading pages and extracting the HTML.