Many interesting services are available through web-sites. In most cases, manually accessing the sites is fine, but when we want to integrate the results of a service in a script, this is too cumbersome.
Similar to calling external programs from scripts, we want to be able to access a web-site from a script. This is especially useful when we need to look up some information, needed for further processing, in a public database accessible through the web.
If you are lucky, you can get the resource you need in an easy-to-parse format, e.g. XML, with a structure that makes it easy to extract the relevant data. More often the resource is accessed through an HTML page that can be more or less structured, and where the structure does nothing to help distinguish relevant from irrelevant information.
In this lecture, we cover how to extract information from HTML pages. Although this should always be your last resort--almost any other structured format is preferable--you will find that it is very often your only resort.
We start out with a very simple problem: we want to extract information about the course plan from the course web-pages.
The course plan page contains a table with information about: material covered, exercises, and additional information (notes or projects). We want to write a script that, given the week number, gives us the information for that particular week.
The problem can be broken into two sub-problems: downloading the web-page and extracting the relevant information.
The address of a web-page is in the form of a URI (Uniform Resource Identifier). For web-pages, the term URL (Uniform Resource Locator) is often used instead of URI, but see the W3C description of URIs and URLs for the full story. Here, we will follow convention and use URL.
The URL for the course plan page is http://www.daimi.au.dk/~chili/CSS/plan.html. The first part of the URL, the http: string, informs us that the page should be accessed over the HyperText Transfer Protocol. On the web, this is usually the protocol we use, but you will also see ftp: URLs (for the File Transfer Protocol), https: URLs (for HTTP over an encrypted connection), mailto: URLs (for emails), and so forth.
The interpretation of the remainder of the URL depends on the protocol selected in the first part, but for HTTP, the next part of the URL, the www.daimi.au.dk string, is the address of the site where the page is located, and the last part, the ~chili/CSS/plan.html string, is the location of the page on that site.
The module urlparse can be used for manipulating URLs. In this course, we will not consider URLs in more detail.
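As a small illustration of the URL anatomy described above, the sketch below splits the course plan URL into its parts. It assumes Python 3, where the functionality of the urlparse module is found in urllib.parse; in Python 2 the corresponding call is urlparse.urlparse.

```python
from urllib.parse import urlparse

url = 'http://www.daimi.au.dk/~chili/CSS/plan.html'
parts = urlparse(url)

print(parts.scheme)  # the protocol part of the URL
print(parts.netloc)  # the address of the site
print(parts.path)    # the location of the page on that site
```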
Given the URL for the course plan, we can download the page using the urllib module:
import urllib

page = urllib.urlopen('http://www.daimi.au.dk/~chili/CSS/plan.html')
for line in page.readlines():
    print line,
page.close()
This script opens a connection to the page, reads all the lines in the page, and prints them out. (The "," in the print statement is used because the line already contains a newline character.)
With the html-page in hand, it is now a matter of finding the relevant information.
Unfortunately, html is not particularly easy to extract information from; it is not as well-structured as XML (one of the reasons people are now moving towards using XHTML) and there is no semantics associated with the structure. Therefore, we cannot simply ask for the information we need. We need to figure out how the information is represented in the html (which often involves a bit of guessing), and then write a parser for extracting the information (and ignoring irrelevant data).
Parsers like these are notoriously unstable; when the html on the web-site changes (and at some point it will), the parser will no longer be able to extract the relevant information, and it has to be updated. This is far from optimal, but until a project such as the semantic web is successful, it is probably all we've got.
But enough talk, let us examine the plan web page. If we examine the html source of the page, we see that the information we are interested in is found in a table (the only table on the page), and that the information is ordered in four columns, where the first column is the date of the lecture, the second column is the lecture session, the third column is the exercises, and the fourth column is for notes and projects.
<table align=center border=4 cellspacing=4 cellpadding=4>
  <tr>
    <td valign=top> </td>
    <td valign=top> Lecture sessions</td>
    <td valign=top> Exercises</td>
    <td valign=top> Notes/Projects </td>
  </tr>
  <tr>
    <td valign=top> Week 35<p> 28/8</td>
    <td valign=top> Introduction to the course.<br></td>
    <td valign=top> 5 push-ups.</td>
    <td valign=top> </td>
  </tr>
  ...
</table>
This is easy to parse: just read one table row at a time, split it into the four columns, and there you are. This is not that different from the kinds of parsers that you have already written, and you should be able to handle this parser in a similar way.
We can get a bit of help from the HTMLParser module, however. The HTMLParser class from this module parses html files--to the extent that it recognizes tags, but not to the extent that it matches begin-/end-tags or recognizes implicit end tags (for this see htmllib, a more complicated module with a slightly better parser).
To use the HTMLParser, you must write a class derived from HTMLParser that overrides the methods handle_starttag, handle_endtag, and handle_data. These methods will be called when the parser encounters a start tag, an end tag, and raw text, respectively. The derived class will let us respond to these events.
Just for kicks, consider the simple parser below, and run it on the course plan web-page. That should show you the information you get from the events generated by HTMLParser and delivered, through the overridden methods, to SimpleHTMLParser.
import urllib
from HTMLParser import HTMLParser

class SimpleHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attr):
        print 'begin', tag, attr

    def handle_endtag(self, tag):
        print 'end', tag

    def handle_data(self, data):
        print 'data', data

parser = SimpleHTMLParser()
page = urllib.urlopen('http://www.daimi.au.dk/~chili/CSS/plan.html')
parser.feed(page.read())
page.close()
EXERCISE N.1: Run the above script and compare it with the html code on the course plan web-page. Can you relate the output of the script to the html code?
The simple parser above is not enough for our purposes. We want to use our parser for extracting the information in the table. To do this, we must perform some action when we see <tr>, </tr>, <td>, and </td> tags, and extract text data when inside a <td> ... </td> pair.
The parser below does exactly that: when it sees a tag, it tests whether it is a td tag or a tr tag, and if so it performs an appropriate action; otherwise it simply ignores the tag.
The action performed when it sees a start-tag is to install a "handler" for the block enclosed by the tag-pair. When it sees an end-tag, it removes the current handler--and thereby restores the previous handler--and, in the case of a </td> tag, hands over the data collected to the handler for the enclosing tag-block (assumed to be a row handler).
When the parser receives data--in the handle_data method--it simply forwards it to the current handler.
from HTMLParser import HTMLParser

class CSSPlanParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__stack = [AbstractHandler()]

    def handle_starttag(self, tag, attr):
        if tag == 'tr':
            # make new row handler and put it on top of the stack
            self.__stack.append(RowHandler())
        elif tag == 'td':
            # make new column handler and put it on top of the stack
            self.__stack.append(ColHandler())
        else:
            pass # ignore all other tags

    def handle_endtag(self, tag):
        if tag == 'tr':
            # at the end of a row we simply pop the row handler
            self.__stack.pop()
        elif tag == 'td':
            # at the end of a column we give the column data to the row
            col = self.__stack.pop()
            self.__stack[-1].handle_col(col.get_data())
        else:
            pass # ignore all other tags

    def handle_data(self, data):
        # make top of stack handle the data
        self.__stack[-1].handle_data(data)
The handlers used above are simply instances of classes we write, capable of responding to a certain set of events that our parser generates.
The parser expects handlers to respond to data and columns--the parser calls the methods handle_data and handle_col in handlers--so our handlers must implement these two methods.
An abstract handler that does nothing when handling data or columns would look like this:
class AbstractHandler:
    """Handler used when we are not expecting real data."""
    def __init__(self):
        pass

    def handle_data(self,data):
        pass

    def handle_col(self,col):
        pass
A column handler should respond to handle_data by collecting the text and a row handler should respond to handle_col by collecting the columns.
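A minimal sketch of such handlers could look as follows. It is written for Python 3, where the HTMLParser module is named html.parser, and it is not necessarily how get-week-plan.py does it: the bodies of ColHandler and RowHandler, the whitespace clean-up in get_data, the PlanParser class with its rows attribute, and the sample table are all assumptions made for illustration.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class AbstractHandler:
    """Handler that ignores both data and columns."""
    def handle_data(self, data):
        pass

    def handle_col(self, col):
        pass


class ColHandler(AbstractHandler):
    """Collects the text found inside one <td> ... </td> block."""
    def __init__(self):
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def get_data(self):
        # join the collected text and normalise whitespace (an assumption)
        return ' '.join(''.join(self._chunks).split())


class RowHandler(AbstractHandler):
    """Collects the columns found inside one <tr> ... </tr> block."""
    def __init__(self):
        self.cols = []

    def handle_col(self, col):
        self.cols.append(col)


class PlanParser(HTMLParser):
    """Hypothetical variant of CSSPlanParser that keeps the finished rows."""
    def __init__(self):
        HTMLParser.__init__(self)
        self._stack = [AbstractHandler()]
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._stack.append(RowHandler())
        elif tag == 'td':
            self._stack.append(ColHandler())

    def handle_endtag(self, tag):
        if tag == 'tr':
            # the row is complete: pop its handler and keep its columns
            self.rows.append(self._stack.pop().cols)
        elif tag == 'td':
            col = self._stack.pop()
            self._stack[-1].handle_col(col.get_data())

    def handle_data(self, data):
        self._stack[-1].handle_data(data)


# A cut-down copy of the course plan table, used instead of a download.
sample = """
<table>
<tr> <td> </td> <td> Lecture sessions</td> <td> Exercises</td> <td> Notes/Projects </td> </tr>
<tr> <td> Week 35<p> 28/8</td> <td> Introduction to the course.<br></td> <td> 5 push-ups.</td> <td> </td> </tr>
</table>
"""

parser = PlanParser()
parser.feed(sample)
parser.close()
for row in parser.rows:
    print(row)
```

Each row comes out as a list of four strings, so picking the plan for a given week is a matter of searching the first column for the week number.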
The design using handlers like this is generally useful when you have an object that has to respond to a number of events--like our parser that has to respond to start-/end-tags and data--and where the response depends on the current state of the object--for example, whether we are inside a row, a column, or somewhere else.
Instead of testing for the state in each method, you can then just forward the event to the current handler, and update the handler when the state changes.
We use a stack of handlers in the parser, because of the nested nature of html.
A complete script for extracting weekly plans can be downloaded here: get-week-plan.py
EXERCISE N.2: Read and understand the get-week-plan.py script. Be sure that you understand how the handlers work.
EXERCISE N.3: Write a script that extracts the headings on this page (i.e. the <h1>...</h1>, <h2>...</h2>, and <h3>...</h3> blocks) and write them out indented in the right scope, that is, such that an h3 header is indented under the enclosing h2 header.
The exercises on this page are all found in paragraphs (<p>...</p> blocks) with the class attribute set to "exercises". We can use this information to extract all exercises, as shown in the parser below. (Get the full script here).
from HTMLParser import HTMLParser

class ExParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        global dummy
        self.__handler = dummy

    def handle_starttag(self, tag, attr):
        if tag == 'p' and ('class', 'exercises') in attr:
            self.__handler = ExHandler()

    def handle_endtag(self, tag):
        if tag == 'p':
            global dummy
            self.__handler.end()
            self.__handler = dummy

    def handle_data(self, data):
        self.__handler.handle_data(data)
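The parser above relies on a dummy handler and an ExHandler class that are only found in the full script. The sketch below shows one way they might be written, for Python 3 (module html.parser). The names DummyHandler and ExerciseParser, the exercises list, and the sample page are hypothetical, and the full get-exercises.py script may well do this differently.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class DummyHandler:
    """Used outside exercise paragraphs; ignores everything."""
    def handle_data(self, data):
        pass

    def end(self):
        pass


class ExHandler:
    """Collects the text of one exercise paragraph."""
    def __init__(self, results):
        self._results = results
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def end(self):
        # store the paragraph text with normalised whitespace
        self._results.append(' '.join(''.join(self._chunks).split()))


class ExerciseParser(HTMLParser):
    """Hypothetical variant of ExParser that stores the exercises found."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.exercises = []
        self._dummy = DummyHandler()
        self._handler = self._dummy

    def handle_starttag(self, tag, attrs):
        if tag == 'p' and ('class', 'exercises') in attrs:
            self._handler = ExHandler(self.exercises)

    def handle_endtag(self, tag):
        if tag == 'p':
            self._handler.end()
            self._handler = self._dummy

    def handle_data(self, data):
        self._handler.handle_data(data)


# Made-up page fragment with one exercise paragraph.
sample = """
<p>Ordinary text.</p>
<p class="exercises">EXERCISE N.1: Run the <b>above</b> script.</p>
<p class="other">Not an exercise.</p>
"""

ex_parser = ExerciseParser()
ex_parser.feed(sample)
ex_parser.close()
print(ex_parser.exercises)
```

Keeping the dummy handler as an attribute of the parser, rather than as a global, is a design choice of this sketch only; it avoids the global statements but is otherwise equivalent.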
EXERCISE N.4: Based on the get-exercises.py script, write a script that extracts all code blocks from this page.
EXERCISE N.5: Based on the get-exercises.py script, write a script that extracts all Web-resources blocks from this page.
We now turn to a little more difficult parsing...
EXERCISE N.6: Write a script that extracts the names and email-addresses from this page. Notice that there is more than one table, and that the tables we are interested in are nested within another table. One way of handling this is to have two table-handlers, one for the outer table and one for the inner tables. The outer handler ignores everything but the appearance of a new table-block, where it instantiates an inner handler. The inner handler extracts information similarly to the handlers used above.
The situation in exercise N.6, where the data we are interested in is nested deep within the html structure, is very common, especially since tables are widely used to lay out web-pages.
Extracting the data would be a lot easier if we could start parsing at the right nesting-level, i.e., skip past the outer levels and only parse the relevant block.
<html>
  ...
  <body>
    ...
    <table>
      ....
      <table>
        <!-- only parse this part -->
      </table>
      ....
    </table>
    ...
  </body>
</html>
If we write the block structure using slashes, we want to skip past /html/body/table/table/ and only parse the text inside the inner table.
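To make the slash notation concrete, the sketch below records the block path of every piece of text in a small document. It is written for Python 3 (module html.parser); the class name PathPrinter and the sample document are made up for illustration, and the code assumes that every start tag has a matching end tag, which real web-pages often violate.

```python
from html.parser import HTMLParser  # Python 3 name of the HTMLParser module


class PathPrinter(HTMLParser):
    """Records the slash-separated block path of every piece of text."""
    def __init__(self):
        HTMLParser.__init__(self)
        self._open = []   # stack of currently open tags
        self.found = []   # (path, text) pairs

    def handle_starttag(self, tag, attrs):
        self._open.append(tag)

    def handle_endtag(self, tag):
        # assumes well-formed input: every start tag has a matching end tag
        self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            path = '/' + '/'.join(self._open) + '/'
            self.found.append((path, text))


# Schematic document with the same nesting as the outline above.
sample = ('<html><body>intro<table>outer'
          '<table>inner</table></table></body></html>')

printer = PathPrinter()
printer.feed(sample)
printer.close()
for path, text in printer.found:
    print(path, text)
```

Only the text reported under /html/body/table/table/ is what we want our own handler to see.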
We can write a handler that skips past outer blocks, and dispatches to our own handler (for parsing the inner data) only when inside the right block-nesting.
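Such a handler can be in three different states: in a prefix of the path (some, but not all, of the enclosing blocks have been entered), inside the path (where events should reach our own handler), or on a wrong path (inside a block that is not part of the path).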
Using the handler design, we can implement this using three sub-handlers, one for each state. The main handler can be integrated with the parser, as shown below:
from HTMLParser import HTMLParser

class PathHandler(HTMLParser):
    """Class for parsing sub-structures of an html document."""
    def __init__(self, path, real_handler):
        """Initialise handler such that `real_handler' is called for
        the parts of the document nested within `path'."""
        HTMLParser.__init__(self)
        # some initialisation goes here

    def set_prefix(self):
        self.__current_handler = self.__prefix_handler

    def set_wrong_path(self):
        self.__current_handler = self.__wrong_path_handler

    def set_in_path(self):
        self.__current_handler = self.__in_path_handler

    def handle_starttag(self, tag, attr):
        self.__current_handler.handle_starttag(tag,attr)

    def handle_endtag(self, tag):
        self.__current_handler.handle_endtag(tag)

    def handle_data(self, data):
        self.__current_handler.handle_data(data)
The main handler has methods for changing between the three states--the three set_handler methods--and for dispatching the events--the three handle_event methods.
Responding to events, and the actual changing between states, is handled by the sub-handlers.
The prefix handler keeps track of where we are on the path, and changes to either the in-path handler or the wrong-path handler when necessary:
class PrefixHandler(AbstractHandler):
    """Class handling the state where the PathHandler is in a prefix of
    the path."""
    def __init__(self,parser,path):
        AbstractHandler.__init__(self)
        self.__parser = parser
        self.__path = path
        self.__level = 0

    def handle_starttag(self, tag, attr):
        if tag == self.__path[self.__level]:
            if self.__level == len(self.__path) - 1:
                # last element of the path: we are now inside the path
                self.__parser.set_in_path()
            else:
                self.__level += 1
        else:
            self.__parser.set_wrong_path()

    def handle_endtag(self, tag):
        self.__level -= 1
The in-path handler dispatches events to the real handler, and keeps track of the nesting level, so it can change back to the prefix handler when necessary:
class InPathHandler(AbstractHandler):
    """Class handling the state where the PathHandler is inside the
    path."""
    def __init__(self,parser,real_handler):
        AbstractHandler.__init__(self)
        self.__parser = parser
        self.__real_handler = real_handler
        self.__level = 0

    def handle_starttag(self, tag, attr):
        self.__level += 1
        self.__real_handler.handle_starttag(tag,attr)

    def handle_endtag(self, tag):
        if self.__level == 0:
            self.__parser.set_prefix()
        else:
            self.__level -= 1
            self.__real_handler.handle_endtag(tag)

    def handle_data(self, data):
        self.__real_handler.handle_data(data)
The wrong-path handler behaves similarly to the in-path handler, except that it does not dispatch to the real handler.
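A wrong-path handler along these lines might be sketched as below. This is an assumption about its shape, not the code from the downloadable module. FakeParser is a hypothetical stand-in for PathHandler that only records the state changes, so the sketch can be run on its own.

```python
class WrongPathHandler:
    """Skips a block that is not on the path, then returns to the
    prefix state."""
    def __init__(self, parser):
        self._parser = parser
        self._level = 0  # nesting depth below the tag that left the path

    def handle_starttag(self, tag, attrs):
        self._level += 1

    def handle_endtag(self, tag):
        if self._level == 0:
            # end of the block that took us off the path
            self._parser.set_prefix()
        else:
            self._level -= 1

    def handle_data(self, data):
        pass  # never dispatch to the real handler


class FakeParser:
    """Hypothetical stand-in for PathHandler, recording state changes."""
    def __init__(self):
        self.state = 'wrong-path'

    def set_prefix(self):
        self.state = 'prefix'


fake = FakeParser()
handler = WrongPathHandler(fake)

# Events seen after an off-path <div> start tag: <p> text </p> </div>
handler.handle_starttag('p', [])
handler.handle_data('ignored')
handler.handle_endtag('p')
state_inside = fake.state   # still on the wrong path
handler.handle_endtag('div')
state_after = fake.state    # back in the prefix state
print(state_inside, state_after)
```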
You can download the full path-handler module here.
EXERCISE N.7: Redo exercise N.6 using the path handler.
EXERCISE N.8: Run your solution to exercise N.7 on this page. What happened?
The page in exercise N.8 contains more tables inside the same path as the relevant tables. The path handler will select all of them. Can we find some way of selecting the right tables, and none of the wrong tables, using a path?
EXERCISE N.9*: Extend the path syntax with a "number" expression: Make /html/body/table/tr/td/table[1]/ refer to the first table in the path /html/body/table/tr/td/, /html/body/table/tr[2]/td/table/ any table in the second row of tables in the path /html/body/table/, and so on.
EXERCISE N.10*: Extend the path syntax with an "attribute" expression: Make /html/body/table/tr/td/table[@border="1"]/ refer to tables in path /html/body/table/tr/td/ with attribute "border" set to "1".
EXERCISE N.11: Will the path-handler work on the real birc page? Explain.
We have learnt how to download and parse web-pages. With this new knowledge, we are now ready to attack this week's exercises, concerning a module wrapping search on the NCBI web site.