CodeIgniter is a powerful PHP framework with a very small footprint, built for developers who need a simple and elegant toolkit to create full-featured web applications.
Now that we can download our test page, we need to parse the HTML so that our program can read it easily. This tutorial will give you a basic overview of the Python module Beautiful Soup.
The optparse module is a great way to add command line arguments to your Python scripts. It has an easy syntax, a built-in help option, and plenty of ways to customize how your program handles its options.
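As a rough illustration of what that looks like, here is a minimal sketch using optparse from the standard library. The option names (`--retries`, `--verbose`) and the `build_parser` helper are made up for this example and are not taken from the tutorial itself.

```python
from optparse import OptionParser


def build_parser():
    """Build a parser with two illustrative options (names are hypothetical)."""
    parser = OptionParser(usage="usage: %prog [options] url")
    # An integer option with a default; %default is expanded in the help text.
    parser.add_option("-r", "--retries", dest="retries", type="int", default=3,
                      help="number of download attempts [default: %default]")
    # A boolean flag that is False unless -v / --verbose is passed.
    parser.add_option("-v", "--verbose", dest="verbose", action="store_true",
                      default=False, help="print progress messages")
    return parser


if __name__ == "__main__":
    # parse_args() reads sys.argv[1:] when called without arguments.
    options, args = build_parser().parse_args()
    print(options.retries, options.verbose, args)
```

Running the script with `-h` prints the usage line and the help text for each option, which is the built-in help mentioned above.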
Now that we have the beginnings of a web scraper we need to test that it works. The following site is a test site for web scraping.
Our download function currently doesn’t do much in the way of retrying downloads. In this next part we will add some code to make our function try to download the page 3 times if it fails.
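One way the retry logic might look is sketched below. This is not the function from the series: the name `download`, the `num_retries` parameter, and the `opener` parameter are assumptions for illustration, and the sketch reads "3 times" as three attempts in total.

```python
import urllib.error
import urllib.request


def download(url, num_retries=3, opener=urllib.request.urlopen):
    """Return the page body as text, or None if every attempt fails.

    `opener` is a hypothetical hook (defaulting to urlopen) so the retry
    loop can be exercised without touching the network.
    """
    for attempt in range(1, num_retries + 1):
        try:
            with opener(url) as response:
                return response.read().decode("utf-8", errors="replace")
        except urllib.error.URLError as err:
            # HTTPError is a subclass of URLError, so HTTP failures land here too.
            print(f"Attempt {attempt} of {num_retries} failed: {err}")
    return None
```

A fixed number of attempts keeps the function from looping forever on a page that is permanently unavailable. A real scraper would probably also want to pause between attempts.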
In the next part of this series we will deal with downloading a page with options for using proxies or custom user agents.
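For a feel of what those options involve, here is a small sketch using only `urllib.request` from the standard library. The helper names `build_request` and `make_opener`, the default user agent string, and the proxy address in the usage note are all placeholders invented for this example.

```python
import urllib.request


def build_request(url, user_agent="my-scraper/0.1"):
    """Create a Request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def make_opener(proxy=None):
    """Return an opener that routes HTTP and HTTPS traffic through `proxy`.

    With proxy=None, build_opener() falls back to its default handlers,
    which pick up any proxy settings from the environment.
    """
    handlers = []
    if proxy:
        handlers.append(urllib.request.ProxyHandler({"http": proxy, "https": proxy}))
    return urllib.request.build_opener(*handlers)
```

The two pieces combine as `make_opener("http://127.0.0.1:8080").open(build_request(url))`, which sends the request through the given proxy with the custom user agent attached.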
Web scraping is a very useful technique, and Python makes it easy to do. In this part of the series I will deal with downloading pages and extracting the HTML.