Interact with the Web Using Python and the HTTP Library

Date Published: 25/05/2009 11:41

The Internet has become so ingrained in modern life that it is now possible to find almost anything within its pages. These pages are written in such a way that they can reach many different people on many different platforms. This common standard (or as standard as it's going to get, anyway) connects people who would otherwise be incompatible, and as application developers we can take advantage of that. I have recently been working with Python's HTTP library (httplib), which can be used to make HTTP requests to servers on the net and retrieve information in the same way a common browser does. This article looks at how you can use httplib to add extra functionality to your applications and scripts by integrating them with the world wide web.

Why Would I Want to Access Web Pages from a Python Script?

There are various potential uses for this technology depending on what you are trying to achieve. I personally have used it to make POST requests to scripts on other systems and to collect and parse RSS feeds. You could use it to include RSS feeds in your application, access external web pages containing valuable information, or even create a web spider to test your site or search other sites.

Making a Basic GET Request

Before we can start firing off requests across the web we need to import the HTTP library (httplib).

from httplib import *

Next we must connect to the domain to which we shall be making requests by creating an HTTPConnection object. The domain must be specified as a string parameter when constructing the connection object and must exclude the scheme prefix ("http://"). Other optional parameters are the port (if not specified the default HTTP port 80 will be used) and, as of Python 2.6, a timeout, which is the number of seconds to wait on a blocking connection before giving up.

connection = HTTPConnection("example.com")
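If the server listens on a non-standard port, or you want requests to give up after a set time, both can be passed to the constructor. A minimal sketch, assuming Python 2.6+ for the timeout parameter (the port and timeout values here are just placeholders):

connection = HTTPConnection("example.com", 8080, timeout=10)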

Now that we have made our connection we can start making requests. In this simple example I am just dealing with GET requests; later I will cover POST requests. Requests are made using the request() method of the HTTPConnection object, which takes two required parameters: the first is the request method (GET, POST, HEAD etc.) and the second is the URL relative to the domain. Here I will just make a GET request to the index page of the site.

connection.request("GET", "/index.php")

With the request made we need to get the response and handle it. In this example I am going to check the status code to ensure the page was found and then output the response body accordingly. The status code can be accessed as an attribute of the HTTPResponse object.

response = connection.getresponse()

if response.status == 200:
   print "Page Found Successfully, Outputting Request Body"
   print response.read()
elif response.status == 404:
   print "Page Not Found"
else:
   print response.status, response.reason

With the request made and the output handled we can close the connection.

connection.close()
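As an aside, the HEAD method mentioned earlier lets you check a page without downloading its body, which is handy for link checking. A small sketch reusing the same pattern (the path is a placeholder):

connection = HTTPConnection("example.com")
connection.request("HEAD", "/index.php")
response = connection.getresponse()
# a HEAD response carries the status and headers but no body
print response.status, response.getheader("Content-Length")
connection.close()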

Making a Basic POST Request

The POST request syntax is of course very similar to the GET syntax, so I will just cover the main differences. First it is wise to import the URL library (urllib), which we will use to encode the parameters we send in the request.

from httplib import *
from urllib import *

Next we form the connection to the domain where we will be directing our requests.

connection = HTTPConnection("example.com")

Before we can make the request we must set up some headers and some parameters. The headers tell the HTTP server what kind of content our request includes, and the parameters are what we want to POST to the server. Both headers and parameters are passed as Python dictionaries with the key representing the header or parameter name. In this example I will be passing URL-encoded plain text, which is what the headers dictionary states. To URL encode the parameters dictionary we can use the urlencode() function from the URL library we just imported.

head = {"Content-Type" : "application/x-www-form-urlencoded", "Accept" : "text/plain"}
parameters = urlencode({"strUserName" : "username", "strPassword" : "userpasswd"})
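At this point parameters holds a URL-encoded string along the lines of "strUserName=username&strPassword=userpasswd" (the exact key order from a dictionary is not guaranteed). The third and fourth arguments to request() below are the request body and headers respectively, so the encoded parameters become the body of the POST.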

connection.request("POST", "/index.php", parameters, head)

Now we have sent the POST request we can handle its response in the same way we did the GET request and then close the connection.

response = connection.getresponse()

if response.status == 200:
   print "Page Found Successfully, Outputting Request Body"
   print response.read()
elif response.status == 404:
   print "Page Not Found"
else:
   print response.status, response.reason

connection.close()

Now that we have established the basics of making HTTP requests with Python's HTTP library, we can move on to some extra issues you may want to take into consideration.

Handling Compressed Data

A lot of web servers compress plain text to reduce bandwidth usage, so here is a quick bit of code you can use to decompress that data. This is important because by allowing servers to gzip their content we reduce the amount of bandwidth used while accessing it, which helps both them and us.

I will start with a demonstration of Python's gzip module and then put it all together with the code from above. First we have to import the gzip library along with the StringIO library, which will be used to wrap the compressed data in a file-like object that gzip can decompress.

from StringIO import *
from gzip import *

Next we can take the body content of the HTTPResponse object returned when the HTTP request is sent and feed it into a StringIO object. This StringIO object can then be decompressed with the gzip library's GzipFile() class, which creates an object whose read() method outputs the decompressed content.

raw_data = response.read()               # the compressed response body
stream = StringIO(raw_data)              # wrap it in a file-like object
decompressor = GzipFile(fileobj=stream)  # treat the stream as a gzip file
data = decompressor.read()               # read out the decompressed content
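Note that a server is not obliged to compress its response even when asked, so it is sensible to check the Content-Encoding response header before decompressing. A small sketch of that guard, under the assumption that anything other than gzip is passed through untouched:

raw_data = response.read()
if response.getheader("Content-Encoding", "") == "gzip":
   data = GzipFile(fileobj=StringIO(raw_data)).read()
else:
   data = raw_data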

Using compression does cost more in processing power, but the advantage of reduced bandwidth usage far outweighs this small disadvantage. In order for a server to transmit gzipped data you must first set the "Accept-Encoding" request header, which is covered in the next section. Here is a complete example using a GET request to receive gzipped data.

#!/usr/bin/python

from httplib import *
from urllib import *
from StringIO import *
from gzip import *

connection = HTTPConnection("example.com")
head = {"Accept-Encoding" : "gzip,deflate", "Accept-Charset" : "UTF-8,*"}
connection.request("GET", "/index.php", headers=head)
response = connection.getresponse()

if response.status == 200:
   print "Page Found Successfully, Outputting Request Body"
   raw_data = response.read()
   stream = StringIO(raw_data)
   decompressor = GzipFile(fileobj=stream)
   print decompressor.read()
elif response.status == 404:
   print "Page Not Found"
else:
   print response.status, response.reason

connection.close()

Utilizing Request Headers

A common feature of more popular web sites is for the server to deny requests which appear to be automated, based on the headers they send. Google regularly does this to stop automated systems from interacting with its search engine and results. These measures are in place for good reason and should never be abused, but you may want to modify your headers to better represent your application's intent. To assist you with this I have compiled a list of headers which you may want to consider setting in your application.

  • User-Agent - Specifying your user agent informs the server what you are accessing their site from. A lot of spam bots hide themselves by masquerading as a common user agent, but I would recommend giving yourself a custom user agent which can be used to identify you amongst the rest (e.g. Mozilla/5.0 (ExampleSpider; http://www.example.com/spider.html)).
  • Accept - Use this to tell the server what MIME types you are prepared to accept, as servers may serve different data based on this header (e.g. text/html,application/xhtml+xml,application/xml).
  • Accept-Language - This header is used by multi-national sites more than people realise. A lot of servers redirect you to the localised version of their site based on this header rather than purchasing a database of IP address locales (e.g. en-gb,en).
  • Accept-Encoding - This defines whether the client making the request can handle compression techniques such as gzip, which are used to reduce bandwidth usage (e.g. gzip,deflate).
  • Accept-Charset - This states which character sets are supported by the client (e.g. UTF-8,*).

All these headers may be used by web servers to decide what content they serve, so it is important to set them accurately based on your needs, as in the sketch below.
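Putting those suggestions together, a hypothetical spider might build its headers dictionary along these lines (the user agent string, URL and path are all placeholders):

head = {
   "User-Agent"      : "Mozilla/5.0 (ExampleSpider; http://www.example.com/spider.html)",
   "Accept"          : "text/html,application/xhtml+xml,application/xml",
   "Accept-Language" : "en-gb,en",
   "Accept-Encoding" : "gzip,deflate",
   "Accept-Charset"  : "UTF-8,*"
}

connection = HTTPConnection("example.com")
connection.request("GET", "/index.php", headers=head)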

Conclusion

There are so many potential uses for this technology that you really have the freedom to do with it as you please. Some people use it to access data which would not otherwise be accessible, but remember that if you can interact with that data in a more native manner it will probably be more secure and efficient than going through an HTTP server. Use this with caution and respect those whose servers and data you are accessing. If you do not have explicit permission to access content with an automated system, make sure you abide by the rules laid out in the server's robots.txt (Python can even check these for you, as sketched below) and do not abuse their bandwidth or servers. Experiment and see what exciting new features you can provide for your users with web integration.
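The standard library's robotparser module can handle that robots.txt check for you. A minimal sketch, with the domain, path and user agent name as placeholders:

from httplib import *
import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

# only proceed if the site's rules permit automated access to this path
if rp.can_fetch("ExampleSpider", "/index.php"):
   connection = HTTPConnection("example.com")
   connection.request("GET", "/index.php")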
