Now that we can download our test page, we need to parse the HTML so that our program can read it easily. This tutorial will give you a basic overview of the Python module Beautiful Soup.
Now that we have the beginnings of a web scraper we need to test that it works. The following site is a test site for web scraping.
Our download function currently doesn’t do much in the way of retrying downloads. In this next part we will add some code to make our function try to download the page up to 3 times if it fails.
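As a rough idea of what retrying could look like, here is a minimal sketch using only the standard library. The function name `download`, the `num_retries` parameter and the `opener` parameter are assumptions made for illustration and are not taken from the series itself. The sketch also assumes that "3 times" means three attempts in total.

```python
import urllib.error
import urllib.request


def download(url, num_retries=3, opener=urllib.request.urlopen):
    """Return the page body as text, or None if every attempt fails.

    `opener` is a hypothetical hook so the function can be tried out
    without a network connection; by default it is urllib's urlopen.
    """
    for attempt in range(1, num_retries + 1):
        try:
            with opener(url) as response:
                # Decode leniently so an odd byte does not abort the download.
                return response.read().decode("utf-8", errors="replace")
        except urllib.error.URLError as error:
            # HTTPError is a subclass of URLError, so both are caught here.
            print(f"Attempt {attempt} of {num_retries} failed: {error.reason}")
    return None
```

Calling `download(url)` with the address of your test page returns the HTML as a string, or `None` once all attempts have failed. A real scraper would probably also wait a little between attempts; that is left out here to keep the sketch short.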
In the next part of this series we will deal with downloading a page with options for using proxies or custom user agents.
Web scraping is a very useful technique and Python can make it really easy to do. In this part of the series I will deal with downloading pages and extracting the HTML.
Pydio stands for “Put Your Data in Orbit”, which essentially means that your data is accessible from the web. This is especially useful if your server isn’t easily accessible.
A class in Python is a blueprint for creating objects. Objects created from a class have attributes and methods, and each object is distinct from every other object. For this reason classes can be extremely useful in your programs.
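To make the idea of attributes, methods and distinct objects concrete, here is a small sketch. The class name `Page` and its fields are made up for illustration and do not come from the tutorial.

```python
class Page:
    """A hypothetical class representing a downloaded web page."""

    def __init__(self, url, html=""):
        # Attributes: each object stores its own values.
        self.url = url
        self.html = html

    def size(self):
        # Method: behaviour shared by every Page object.
        return len(self.html)


# Two objects built from the same class hold different data.
first = Page("http://example.com/a", "<p>hi</p>")
second = Page("http://example.com/b")
```

Here `first` and `second` share the `size` method but keep their own `url` and `html`, so changing one does not affect the other.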