Tool-Building in Bioinformatics

TBiB Q4/2006

Web-services

In this lecture we cover a few aspects of web-services. There are many aspects of web-services, but most of the complexity is on the server side, which we will not cover in this course. Here we will only concern ourselves with client access to web-services. We will learn:

  • How to access web-pages from python.
  • How to extract information from an XML page.
  • How to parse an HTML page.

Supplementary reading: Python—how to program, 20.1 and 20.2, pages 689-692, 15.1-15.4, pages 491-500, 16.1-16.4, pages 529-534, the urllib manual and the sgmllib manual.

I realize that some of the material covered here, and in the associated exercises note, has already been covered in Programming for Bioinformatics, but read on anyway; I promise there will also be material you have not seen before.

Motivation

Many interesting services are available through web-sites. In most cases, manually accessing the sites is fine, but when we want to integrate the results of a service in a script, this is too cumbersome.

Warning!
If you make extensive use of a service through a script, see if you cannot get to the service through a local stand-alone program. It is considered rather rude to have scripts performing massive requests of a site. The sites are usually designed only to cope with the workload created by human visitors, and a script can create workloads far exceeding the site's capacity. You should only use scripts to access a web-site if the workload you create is rather small.

Similar to calling external programs from scripts, we want to be able to access a web-site from a script. This is especially useful when we need to look up some information, needed for further processing, in a public database accessible through the web.

If you are lucky, you can get to the resource you need in an easy to parse format, e.g. XML, and with a structure where it is easy to extract relevant data. More often the resource is accessed through an HTML page, that can be more or less structured, and where the structure does nothing to help distinguish relevant from irrelevant information.

In this lecture, we cover how to extract information from HTML pages. Although this should always be your last resort — almost any other structured format is preferable — you will find that it is very often your only option.

The Course Announcements (Dealing with XML)

We start out with a very simple problem: we want to extract announcements for the course and information about the course plan from the course web-pages.

The announcements are shown, in a human readable form, on both the main page and the announcements page — with only the most recent announcements on the main page — and in a format more fitting for computer processing, on the Really Simple Syndication (RSS) feed.

The course schedule page contains a table with information about: material covered, exercises, and additional information (notes or projects). We want to write a script that, given the week number, gives us the information for that particular week.

The problem of extracting information from such web-pages can be broken into two sub-problems: downloading the web-page and extracting the relevant information.

Downloading a Web-Page

The address of a web-page is in the form of a URI (Uniform Resource Identifier). For web-pages, the term URL (Uniform Resource Locator) is often used instead of URI, but see The W3C description of URIs and URLs for the full story. Here, we will follow convention and use URL.

A URL has the form scheme://location/path?query#fragment, but usually not all of the components (scheme, location, path, query and fragment) are present.

Missing parts are recognized by missing punctuation. For example, the URL http://www.birc.au.dk is missing the punctuation characters for the path (/), query (?), and fragment (#), and thus only contains the scheme (http) and the location (www.birc.au.dk). The URL mailto:mailund@mailund.dk is missing // and thus contains no location but only a path, mailund@mailund.dk.

The URL for the course plan page is

http://www.birc.dk/~besen/TBiB2006/schedule.html

The first part of the URL, the scheme, is http which informs us that the page should be accessed over the HyperText Transfer Protocol. On the web, this is usually the protocol we use, but you will also see ftp URLs (for the File Transfer Protocol), https URLs (for Secure HTTP), mailto URLs (for emails), and so forth.

The www.birc.dk string is the location, the address of the site where the page is located, and the last part, the ~besen/TBiB2006/schedule.html string, is the path, the location of the page on that site.

For the course schedule, there is no query and fragment part.

The module urlparse can be used for manipulating URLs. In this course, we will not consider URLs in more detail, but in the exercises for this lecture we will use the query part of URLs to give parameters to web-services.
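As a small illustration of the URL anatomy described above, the sketch below splits the two example URLs into their components. It tries the Python 3 location of the functions (urllib.parse) first and falls back to the urlparse module used in these notes. The query parameters at the end (db and id) are made up for the example and do not belong to any particular service.

```python
try:
    from urllib.parse import urlsplit, urlencode   # Python 3
except ImportError:
    from urlparse import urlsplit                   # Python 2, as in these notes
    from urllib import urlencode

schedule = urlsplit('http://www.birc.dk/~besen/TBiB2006/schedule.html')
print(schedule.scheme)     # http
print(schedule.netloc)     # www.birc.dk (the "location")
print(schedule.path)       # /~besen/TBiB2006/schedule.html
print(repr(schedule.query), repr(schedule.fragment))   # both empty

# No '//' after the scheme, so there is no location, only a path.
mail = urlsplit('mailto:mailund@mailund.dk')
print(mail.scheme, repr(mail.netloc), mail.path)

# Building a query part from (hypothetical) parameters.
query = urlencode([('db', 'protein'), ('id', '12345')])
print(query)               # db=protein&id=12345
```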

Given the URL for the course plan, we can download the page using the urllib module:

import urllib
url = 'http://www.birc.dk/~besen/TBiB2006/schedule.html'
page = urllib.urlopen(url)
print page.read()
page.close() 

Try running the script above. The output looks rather nasty, eh?

Don't worry, though, it is not as bad as it looks (in this particular case) and you will be able to parse it shortly, but we will first have a look at processing the announcements, since this is a slightly easier task.

Getting Announcements

With a web-page in hand — downloaded using urllib as above — it is now a matter of finding the relevant information. For the announcements, this is fairly easy. We can download them from the RSS feed from the URL http://www.birc.dk/~besen/TBiB2006/rss.xml. The RSS feed is in a simple XML format that is easy to extract information from.

We can start by downloading it and have a look at it:

import urllib
url = 'http://www.birc.dk/~besen/TBiB2006/rss.xml'
print urllib.urlopen(url).read()

This will, depending on which and how many announcements have been made, look something like this:

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Tool-Building in Bioinformatics 2006</title>
    <link>http://www.bric.dk/~besen/TBiB2006/index.html</link>
    <description>A course on script programming</description>
    <language>en-us</language>

    <item>
      <title>Pages available</title>
      <pubDate>06/03/2006</pubDate>
      <description>The initial WWW pages are ready.</description>
      <link>http://www.daimi.au.dk/~besen/TBiB2006/announcements.html</link>
    </item>

    <item>
      <title>Mandatory project 1</title>
      <pubDate>19/4/2006</pubDate>
      <description>The first Mandatory Project should be handed in Thursday,
      May 4th</description>
      <link>http://www.daimi.au.dk/~besen/TBiB2006/announcements.html</link>
    </item>
  </channel>
</rss>

The first part of the RSS describes the channel — the source of the announcements which in this case is the course main page. An RSS file can actually contain several channels, but in this case there is only one.

The channel also contains a list of news-items, in <item>-tags, and these are the announcements we are after. Each item contains a title, a date, a description and a link to further information (which in this case is not really that interesting since all the links point to the same page: the main announcements page).

Extracting Announcements

A simple way to process this is through the Document Object Model (DOM) interface. This is a general interface for manipulating XML documents as if they were trees; the interface lets us traverse or modify the tree, accessing subtrees and attributes of each node. This interface to the document is at a rather abstract level, but it is useful for building more application-specific interfaces to XML documents.

We translate the raw text of the XML document into a DOM object using the xml.dom.minidom module; we can use either the parseString() method — to parse a string:

import urllib
import xml.dom.minidom

url = 'http://www.birc.dk/~besen/TBiB2006/rss.xml'
doc = xml.dom.minidom.parseString(urllib.urlopen(url).read())

or the parse() method — to parse the document from a stream, e.g. a file or a urllib connection:

import urllib
import xml.dom.minidom

url = 'http://www.birc.dk/~besen/TBiB2006/rss.xml'
doc = xml.dom.minidom.parse(urllib.urlopen(url)) 

With the DOM object in hand, we can manipulate the document tree, using the methods and attributes described in The W3C Document Object Model (DOM) Level 1 Specification or described in Python—How to Program pages 537-539.

Explicitly getting the items in the document could look like this:

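# <rss></rss> is the first (and only) child of the root
rss = doc.childNodes[0]
# in this case, <rss></rss> contains only a single <channel></channel>;
# we look it up by name, since the whitespace between the tags is stored
# as text nodes and the first child need not be the channel itself
channel = rss.getElementsByTagName('channel')[0]
# and the items are the children of the channel with tag <item>
items = channel.getElementsByTagName('item')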

In this particular case, we could actually exploit that all the items in the document are the ones we are after, and just do

items = doc.getElementsByTagName("item")  

but in general we will have to get hold of the right sub-tree before we extract the sub-nodes.
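One thing to watch for when indexing childNodes by position: the indentation between tags is kept in the tree as text nodes. The sketch below uses a cut-down copy of the feed shown earlier, embedded as a string so it runs without a network connection, and filters on nodeType to get hold of the element children only.

```python
import xml.dom.minidom

# A shortened copy of the course feed, embedded so the example runs offline.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Tool-Building in Bioinformatics 2006</title>
    <item>
      <title>Pages available</title>
      <pubDate>06/03/2006</pubDate>
    </item>
    <item>
      <title>Mandatory project 1</title>
      <pubDate>19/4/2006</pubDate>
    </item>
  </channel>
</rss>"""

doc = xml.dom.minidom.parseString(RSS)
rss = doc.childNodes[0]

# Whitespace, <channel>, whitespace: three children, only one is an element.
print(len(rss.childNodes))

def elementChildren(node):
    # keep the element nodes, skipping text nodes such as indentation
    return [n for n in node.childNodes if n.nodeType == n.ELEMENT_NODE]

channel = elementChildren(rss)[0]
items = [n for n in elementChildren(channel) if n.tagName == 'item']
print(len(items))
```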

This can be a bit of a pain, but luckily there is a way around it: using the XPath-language, which resembles regular expressions, but can be used to identify sub-parts of an XML document.

The xpath to the items we are after is this: /rss/channel[1]/item: moving from the root we want to enter the rss tag, then take the first channel, and get all the items that are children of that channel.

In our Python script, it would look like this:

import xml.xpath
items = xml.xpath.Evaluate('/rss/channel[1]/item', doc)  
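The xml.xpath module is, to my knowledge, part of the separate PyXML package rather than the Python standard library, so it may not be installed everywhere. As a hedged alternative, the sketch below uses xml.etree.ElementTree from the standard library, which supports a limited subset of XPath. Note that this is a different API from the DOM used in these notes: paths are written relative to the root element (here <rss>), so the path does not start with /rss.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the course feed, embedded so the example runs offline.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Tool-Building in Bioinformatics 2006</title>
    <item>
      <title>Pages available</title>
      <pubDate>06/03/2006</pubDate>
    </item>
    <item>
      <title>Mandatory project 1</title>
      <pubDate>19/4/2006</pubDate>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(RSS)                 # root is the <rss> element
items = root.findall('channel[1]/item')   # first channel, all its items

titles = [item.findtext('title') for item in items]
for title in titles:
    print(title)
```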

In any case, however we obtain the list of items, we want to print the information they contain. How does this work?

for item in items:
    print item
<DOM Element: item at 0xb7c3048c>
<DOM Element: item at 0xb7c3064c>
<DOM Element: item at 0xb7c3080c>

Hmm, not so well.

The problem here is that the items are DOM objects, and these are not necessarily printed in any meaningful format.

Let us instead extract the information contained in the items, and pretty-print each item title, together with the date of the announcement:

for item in items:
    print 'Title:', getTitle(item)
    print 'Date:', getDate(item)
    print

That was easy; now we just need to implement getTitle() and getDate().

To get the data we want, we need to access the title and pubDate sub-nodes of the items, and to extract the text they contain.

Getting the sub-nodes we already know how to do — we just use the getElementsByTagName() method.

To get the text, we need to know that a tag-pair that contains raw text — as is the case for the title and pubDate tags — contains a sub-node of type TEXT_NODE and this node stores the text in its data attribute.

Knowing this, we write:

def getTitle(item):
    # get the text-child of the title sub-node
    node = item.getElementsByTagName('title')[0].childNodes[0]
    # and then extract the text data
    assert node.nodeType == node.TEXT_NODE
    return node.data

def getDate(item):
    # get the text-child of the date sub-node
    node = item.getElementsByTagName('pubDate')[0].childNodes[0]
    # and then extract the text data
    assert node.nodeType == node.TEXT_NODE
    return node.data 

There is an ugly bit of redundancy there (the two functions look almost identical), but we can solve that with a higher-order function:

def getText(tag):
    def extractTextData(item):
        node = item.getElementsByTagName(tag)[0].childNodes[0]
        assert node.nodeType == node.TEXT_NODE
        return node.data
    return extractTextData

getTitle = getText('title')
getDate = getText('pubDate') 
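The extractor above assumes that the tag is present and that its first child is a text node. Feeds found in the wild do not always oblige, so here is a more forgiving variant, offered as a sketch rather than as the version used in the rest of these notes. The default parameter is an addition of mine: it is returned when the tag is missing, and all text children are joined instead of just the first.

```python
import xml.dom.minidom

def getText(tag, default=''):
    def extractTextData(item):
        nodes = item.getElementsByTagName(tag)
        if not nodes:
            return default          # the tag is missing from this item
        # join every text (or CDATA) child; an empty tag gives ''
        parts = [child.data for child in nodes[0].childNodes
                 if child.nodeType in (child.TEXT_NODE,
                                       child.CDATA_SECTION_NODE)]
        return ''.join(parts)
    return extractTextData

getTitle = getText('title')
getDate = getText('pubDate', default='(no date)')
getDescription = getText('description')

# An item with no <pubDate> and an empty <description>.
doc = xml.dom.minidom.parseString(
    '<item><title>Pages available</title><description/></item>')
item = doc.documentElement

print(getTitle(item))
print(getDate(item))
print(repr(getDescription(item)))
```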

And, voilà, we have a script for displaying the titles and dates of the announcements:

import urllib
import xml.dom.minidom
import xml.xpath

def getText(tag):
    def extractTextData(item):
        node = item.getElementsByTagName(tag)[0].childNodes[0]
        assert node.nodeType == node.TEXT_NODE
        return node.data
    return extractTextData

getTitle = getText('title')
getDate = getText('pubDate')

url = 'http://www.birc.dk/~besen/TBiB2006/rss.xml'
doc = xml.dom.minidom.parse(urllib.urlopen(url))
for item in xml.xpath.Evaluate('/rss/channel[1]/item', doc):
    print 'Title:', getTitle(item)
    print 'Date:', getDate(item)
    print 

EXERCISE WS.1: There are many sites on the net with an RSS feed. Not all of them use the same version of RSS (version 2.0) as we learned how to parse here, but some do. At New York Times, for instance, there is a whole catalogue of feeds available from http://www.nytimes.com/services/xml/rss/ .

Select a few of these news feeds, and write a script that collects the titles and dates for these feeds.

Modify it so it also prints the link to the news article associated to each item.

Dealing with Namespaces

BMC Bioinformatics also has an RSS feed, at http://www.biomedcentral.com/bmcbioinformatics/rss. (If you view this feed in your browser, it will be formatted according to a style-sheet and you will not see the raw XML, but you can see it by viewing the page source.)

This RSS feed is not RSS 2.0, but can you still extract the item titles and dates?

One would think that it was just a matter of using a different XPath, and slightly different tag-names for extracting the text, but for this feed we also have to deal with different XML namespaces.

You can see the set of namespaces used in the top of the RSS XML document:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://my.netscape.com/rdf/simple/0.9/"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:extra="http://www.biomedcentral.com/xml/schemas/extra/"> 

The xmlns: attributes associate namespaces with URIs — the URIs are the "real" namespaces, but the text after xmlns:, e.g. xmlns:rdf, associates a shorter name with a unique namespace identified by the URI; here, for example, rdf is being associated with the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#.

The namespaces declared here can be used in the XML document by using the short names as prefixes in tags. For example, <rdf:RDF...> specifies that the start tag is the tag RDF in the namespace rdf, which is the one identified by the URI http://www.w3.org/1999/02/22-rdf-syntax-ns#.

The xmlns="http://my.netscape.com/rdf/simple/0.9/" attribute specifies a default namespace; any tag without a namespace prefix belongs to this namespace.

We can use the DOM attribute tagName to get the tag from the XML node, and the attributes node.namespaceURI and node.localName to extract the two parts of an XML tag: the URI associated with the namespace, and the tag-name within the namespace.

Getting tag, URI, and local name for the root of the BMC RSS feed looks like this:

import urllib
import xml.dom.minidom

url = 'http://www.biomedcentral.com/bmcbioinformatics/rss'
doc = xml.dom.minidom.parse(urllib.urlopen(url))
rss = doc.documentElement
print rss.tagName, rss.namespaceURI, rss.localName 
rdf:RDF http://www.w3.org/1999/02/22-rdf-syntax-ns# RDF 

Notice that we use the documentElement attribute to get the RSS node — the doc object refers to a "Document Node" in the DOM specification and this node contains a single child called the document element, which is the root-tag of the document. Using documentElement is quite similar to the rss = doc.childNodes[0] we used earlier. It is needed here, since the document node does not contain a namespace and tag; the document element (root of the document tree) does.
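To experiment with these three attributes without downloading the feed, the sketch below parses a small stand-in document. The namespace declarations are copied from the feed header shown above, but the item and its contents are invented for the illustration.

```python
import xml.dom.minidom

# Namespace declarations as in the feed header; the item itself is made up.
RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://my.netscape.com/rdf/simple/0.9/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item>
    <title>An example article</title>
    <dc:date>2006-05-01</dc:date>
  </item>
</rdf:RDF>"""

def describe(node):
    # the tag as written, the namespace URI, and the name within the namespace
    return (node.tagName, node.namespaceURI, node.localName)

doc = xml.dom.minidom.parseString(RDF)
root = doc.documentElement
item = root.getElementsByTagName('item')[0]
date = item.getElementsByTagName('dc:date')[0]

for node in (root, item, date):
    print(describe(node))
```

The item has no prefix, so it ends up in the default namespace, while dc:date keeps its prefix in tagName but reports plain date as its localName.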

We can also, recursively, print all the tags, namespace-URIs, and local names:

import urllib
import xml.dom.minidom

url = 'http://www.biomedcentral.com/bmcbioinformatics/rss'
doc = xml.dom.minidom.parse(urllib.urlopen(url))

def recursiveWrite(node, level=0):
    if node.nodeType == node.ELEMENT_NODE:
        print ' '*level, node.tagName, node.namespaceURI, node.localName
        for child in node.childNodes:
            recursiveWrite(child,level+2)

recursiveWrite(doc.documentElement) 

EXERCISE WS.2: Can you use a similar recursive traversal to write a script that extracts titles and descriptions from BMC Bioinformatics' feed?

How about the date tags? The date tags belong to a different name-space than the default, but is this a problem? Will

getDate = getText('date')

work? How about:

getDate = getText('dc:date')

Did we actually use namespaces in solving exercise WS.2?

Not really, we could have made the recursive traversal without knowing much about namespaces.

Of course, the explicit traversal isn't as elegant as the XPath version we used earlier.

We did need to be careful with the date information, though: since we use getElementsByTagName, we have to look up the date tags by their tag-name, dc:date, and not by their local name, date.

And what would happen if the document had used another short name for the dc (http://purl.org/dc/elements/1.1/) namespace? If the short name changes, we need to access a different tag-name, even though the local name and namespace have not actually changed.

Let us deal with the last problem first.

To get a node from the local name and the namespace URI — rather than the tag-name which might not be the same for the same combination of tag and namespace — we use getElementsByTagNameNS() rather than getElementsByTagName.

This method takes the URI as the first argument, and the local name as the second. We can therefore change

def getText(tag):
    def extractTextData(item):
        node = item.getElementsByTagName(tag)[0].childNodes[0]
        assert node.nodeType == node.TEXT_NODE
        return node.data
    return extractTextData
getTitle = getText('title')
getDescription = getText('description')
getDate = getText('dc:date')

to

def getText(namespace,tag):
    def extractTextData(item):
        node = item.getElementsByTagNameNS(namespace,tag)[0].childNodes[0]
        assert node.nodeType == node.TEXT_NODE
        return node.data
    return extractTextData

default_namespace = 'http://my.netscape.com/rdf/simple/0.9/'
dc_namespace = 'http://purl.org/dc/elements/1.1/'
getTitle =       getText(default_namespace,'title')
getDescription = getText(default_namespace,'description')
getDate =        getText(dc_namespace,'date')

and avoid the problems with different tag-names for the same "real" tag.
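The difference between the two lookups can be seen on a pair of tiny documents that bind the same namespace URI to different short names. This is only a sketch: the dublin prefix and the item contents are made up for the comparison.

```python
import xml.dom.minidom

DC = 'http://purl.org/dc/elements/1.1/'

# The same namespace bound to two different short names.
with_dc = ('<item xmlns:dc="%s"><dc:date>2006-05-01</dc:date></item>' % DC)
with_other = ('<item xmlns:dublin="%s">'
              '<dublin:date>2006-05-01</dublin:date></item>' % DC)

def countDates(text):
    item = xml.dom.minidom.parseString(text).documentElement
    byTagName = item.getElementsByTagName('dc:date')     # depends on prefix
    byNamespace = item.getElementsByTagNameNS(DC, 'date')  # does not
    return (len(byTagName), len(byNamespace))

print(countDates(with_dc))
print(countDates(with_other))
```

Only the namespace-aware lookup finds the date in both documents; the tag-name lookup misses it as soon as the prefix changes.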

And now for the XPaths...

EXERCISE WS.3: What happens if you try to get the items, ignoring the rdf namespace?

for item in xml.xpath.Evaluate("/RDF/item",doc):
    print 'Title:', getTitle(item)
    print 'Date:', getDate(item)
    print

and what happens if you try to include it?

for item in xml.xpath.Evaluate("/rdf:RDF/item",doc):
    print 'Title:', getTitle(item)
    print 'Date:', getDate(item)
    print

The problem we run into is related to the tag-name/local-name issue from earlier. RDF is not a tag-name in the XML document, so we cannot look it up as such. On the other hand, rdf:RDF is a tag-name, but XPath does not recognize the namespace rdf and complains about that.

The solution is to specify a "context" in the call to Evaluate. Right now we do this implicitly by providing doc as the second argument to Evaluate, but we can also explicitly construct a Context object and in doing so specify the set of namespaces. This looks as follows:

namespaces = {
    # None is default
    None  : 'http://my.netscape.com/rdf/simple/0.9/',
    'rdf' : 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dc'  : 'http://purl.org/dc/elements/1.1/'
    }

con = xml.xpath.Context.Context(doc, processorNss=namespaces)
for item in xml.xpath.Evaluate("/rdf:RDF/item",context=con):
    pass  # ... process item ...

The namespaces are specified in a dictionary, mapping the short names to the URI. Here I have used the same short names as in the XML document, but this is not strictly needed; any short name can be used as long as the dictionary maps it to the right URI. This, therefore, behaves identically to the script above:

namespaces = {
    # None is default
    None  : 'http://my.netscape.com/rdf/simple/0.9/',
    'xx'  : 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dc'  : 'http://purl.org/dc/elements/1.1/'
    }

con = xml.xpath.Context.Context(doc, processorNss=namespaces)
for item in xml.xpath.Evaluate("/xx:RDF/item",context=con):
    pass  # ... process item ...

You do not need to specify all the namespaces used in the document; the xpath handling only needs to know the namespaces you use in the xpath specification. For example, the BMC Bioinformatics feed uses an extra namespace that I have not mapped above, since I do not use it in the xpath, and the dc namespace that I have included was not strictly needed, since it is not used in the xpath either.
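The same idea, a dictionary from short names to URIs, also exists in xml.etree.ElementTree from the standard library. The sketch below is an alternative under that assumption, not the xml.xpath approach used above. Two differences are worth noting: paths are relative to the root element, so rdf:RDF is not named in the path, and I use an explicit short name (rss, my own choice) for the feed's default namespace instead of None. The item contents are invented.

```python
import xml.etree.ElementTree as ET

# Namespace declarations as in the feed header; the item itself is made up.
RDF = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://my.netscape.com/rdf/simple/0.9/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item>
    <title>An example article</title>
    <dc:date>2006-05-01</dc:date>
  </item>
</rdf:RDF>"""

# Short names are local to this script; only the URIs have to match.
namespaces = {
    'rss': 'http://my.netscape.com/rdf/simple/0.9/',
    'dc':  'http://purl.org/dc/elements/1.1/',
}

root = ET.fromstring(RDF)     # root is the <rdf:RDF> element
found = []
for item in root.findall('rss:item', namespaces):
    title = item.findtext('rss:title', namespaces=namespaces)
    date = item.findtext('dc:date', namespaces=namespaces)
    found.append((title, date))
    print('Title: %s' % title)
    print('Date: %s' % date)
```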

EXERCISE WS.4: Complete the BMC Bioinformatics feed script to use XPaths.

EXERCISE WS.5: Extend the script you wrote in exercise WS.1 so it can download feeds in both of the two formats you can now parse, determine the correct format by inspecting the root-node in the XML document, and then print the titles and dates as earlier.

The Course Schedule (Dealing with HTML)

Extracting data from XML documents is pretty easy, as we have just seen.

Unfortunately, most web-pages are not yet in XML but in HTML, and HTML is not particularly easy to extract information from, and, if you remember how the HTML source for the course schedule looked, it can be rather nasty looking also.

There is no semantics associated with the structure, at least not until some project like the semantic web picks up speed. Therefore, we cannot simply ask for the information we need. We need to figure out how the information is represented in the HTML (which often involves a bit of guessing), and then write a parser for extracting the information (and ignoring irrelevant data).

Parsers like these are notoriously unstable; when the html on the web-site changes (and at some point it will), the parser will no longer be able to extract the relevant information, and it has to be updated. This is far from optimal, and far from state-of-the-art, but sadly state-of-the-craft.

Worse still, HTML is not as well-structured as XML (one of the reasons people are now moving towards using XHTML). This means that the powerful tools we have for processing XML will not work on (common) HTML, complicating the HTML processing even more.

The course schedule is not that bad, though — it is actually XHTML — but still, the markup in this page is for the visual layout, not the information contained in the page.

The reason that the HTML for the page looks as ugly as it does, is that it is badly formatted. This is because it has been computer-generated from another format. You can see this format at the URL: http://www.birc.dk/~besen/TBiB2006/schedule.xml.

Warm up: XML again

This document is a mix of HTML and more structured data, and as a warm up exercise — before trying to extract the information from the HTML page — we can try to extract the lecture schedule from this page.

import urllib
import xml.dom.minidom
import xml.xpath

url = 'http://www.birc.dk/~besen/TBiB2006/schedule.xml'
doc = xml.dom.minidom.parse(urllib.urlopen(url))

def getAllText(node):
    text = []
    def recursive(n):
        text.append((n.nodeType == n.TEXT_NODE and n.data) or "")
        for c in n.childNodes: recursive(c)
    recursive(node)
    return ' '.join(text).strip()

for week in xml.xpath.Evaluate('/body/schedule/week',doc):
    print 'Week', week.getAttribute('number')
    for lecture in week.getElementsByTagName('lecture'):
        print '\t', lecture.getAttribute('date'), '--',
        print getAllText(lecture)
    print 

This is not that different from the RSS parsing that we are familiar with by now, but there are two new things: We collect the text inside the <lecture> tags by a recursive traversal of the sub-tree, collecting all text in text nodes, and we use the getAttribute() method to extract the attributes of nodes, that is, the information in the

<tag key1="val1" key2="val2" ...>

pairs in the start-tags.

EXERCISE WS.6: The lecture-nodes we just parsed contain anchor (<a></a>) tags containing a link to lecture notes. Extend the script to extract those also.

The URL you get from this is a local URL. By prefixing it with the URL of the schedule page, except for the document part — that is, prefixing with http://www.birc.dk/~besen/TBiB2006/ — you get a complete URL you can use to get the lecture notes.

lectureURL = baseURL + localURL
lecture = xml.dom.minidom.parse(urllib.urlopen(lectureURL))

(See also urlparse.urljoin() for a more robust way of concatenating a base URL and a relative URL).
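A short sketch of what urljoin() does with different kinds of links follows. The relative paths are hypothetical, not actual pages on the course site, and the import is written so that it works both with the urlparse module used here and with urllib.parse in Python 3.

```python
try:
    from urllib.parse import urljoin    # Python 3
except ImportError:
    from urlparse import urljoin        # Python 2, as in these notes

# The base may include the document part; urljoin() strips it for us.
pageURL = 'http://www.birc.dk/~besen/TBiB2006/schedule.html'

relative = urljoin(pageURL, 'notes/web-services.html')
rooted = urljoin(pageURL, '/~besen/index.html')
absolute = urljoin(pageURL, 'http://www.daimi.au.dk/')

print(relative)   # http://www.birc.dk/~besen/TBiB2006/notes/web-services.html
print(rooted)     # http://www.birc.dk/~besen/index.html
print(absolute)   # http://www.daimi.au.dk/
```

Unlike plain string concatenation, this also gives the right result when the link is already a complete URL or starts at the root of the site.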

From the lecture, you can get the topics covered from the <div> block with class "goal". Using XPaths we can get this node by:

goal = xml.xpath.Evaluate('//div[@class="goal"]',lecture)[0]

The // basically means "at any level in the document", and the [@class="goal"] specifies that the class attribute must be "goal". The Evaluate always returns a list, but in this case it is a singleton and we take the first element from that to get the goal block.
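If xml.xpath is not available, a similar query can be written with xml.etree.ElementTree from the standard library, where "at any level" is spelled .// and attribute tests look the same. This is a sketch on an invented page; the real lecture notes may be structured differently.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page standing in for a lecture note.
LECTURE = """<html><body>
  <div class="menu">Home</div>
  <div class="content">
    <div class="goal"><ul><li>Parse XML</li><li>Parse HTML</li></ul></div>
  </div>
</body></html>"""

root = ET.fromstring(LECTURE)

# findall() always returns a list; here it has exactly one element.
goal = root.findall(".//div[@class='goal']")[0]
topics = [li.text for li in goal.iter('li')]

for topic in topics:
    print('* %s' % topic)
```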

Extend your script to extract the goals for each lecture in the schedule. Extra points for pretty-printing the goals.

Dealing with XHTML

Now take a look at the schedule HTML page again.

If we look at it a bit, we find that the information we are after is contained in a TABLE nested deep inside the document, but we can write a path to it: /HTML/BODY/DIV/TABLE/TR/TD[2]/TABLE. The [2] in the path means we want the second column in the rows of the outermost TABLE.

The schedule can then be extracted from the rows of this table:

pathToSchedule = '/HTML/BODY/DIV/TABLE/TR/TD[2]/TABLE'
scheduleTable = xml.xpath.Evaluate(pathToSchedule,doc)[0]

def unpackRow(row):
    # get the columns of the rows and extract the text
    return map(getAllText, row.getElementsByTagName('TD'))

for row in scheduleTable.getElementsByTagName('TR'):
    week, date, lecture = unpackRow(row)
    if len(week.strip()) != 0:
        print week
        print '\t', date, '--', lecture
    else:
        # no week, so belongs to already written week
        print '\t', date, '--', lecture 

So, dealing with XHTML isn't so bad after all. But notice that the path to the relevant information does not, in any way, tell us what the information we are getting is. Furthermore, the table can, at any time, be moved around, to change the visual markup of the page, and this will ruin the parser for us, even if the logical structure of the document is unchanged.

This is a major problem with extracting information from XHTML documents — but do not worry, I will keep the format of the schedule static while you do these exercises.

EXERCISE WS.7: Redo exercise WS.6, but by extracting the information from the HTML page instead of the XML page.

Dealing with HTML

But now let's have a look at the course plan for Programming in Bioinformatics.

What happens if we try this?

import urllib
import xml.dom.minidom
import xml.xpath

url = 'http://www.daimi.au.dk/~chili/PBI04/plan.html'
doc = xml.dom.minidom.parse(urllib.urlopen(url))

We get a parse error, and the reason for this is that the course plan, although perfectly valid HTML, is not well-formed XML.
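The failure is easy to reproduce offline. In the sketch below, the snippet of HTML is invented, but it contains the kind of construct that is common in HTML and illegal in XML: a <br> and a <p> that are never closed.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Fine for a browser, but <br> and <p> are never closed.
html = '<html><body><p>First line<br>Second line</body></html>'

try:
    xml.dom.minidom.parseString(html)
    wellFormed = True
except ExpatError as error:
    wellFormed = False
    print('Not well-formed XML: %s' % error)
```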

Our XML-fu is useless against a document that is not XML, so we need to parse this course plan in a different way.

One solution is to use the HTMLParser from htmllib or SGMLParser from sgmllib.

The two classes are used in roughly the same way: you install callbacks for start and end tags, and these callbacks are then called when the parser encounters the appropriate start and end tags.

For both classes, you install the callbacks by deriving a class and writing methods named start_tag, do_tag, and end_tag, where tag is the name of the tag you want to handle with the callback. start_tag and do_tag are called for start tags (the do_tag is intended for tags without a corresponding end-tag, but if you write both a start_tag and a do_tag handler, the start_tag always takes precedence).

The start_tag and do_tag methods are called with (in addition to self) a list of attributes, where the attributes are key/value pairs. The end_tag method is without arguments (except for self which, of course, is always an argument of a method).

You can use the two classes in exactly the same way, except that the HTMLParser needs a so-called formatter to be instantiated. To extract the non-relative URLs from the Programming in Bioinformatics course plan, we can use the classes as:

import urllib, urlparse
from sgmllib import SGMLParser
from htmllib import HTMLParser
from formatter import NullFormatter


def isRelativeURL(url):
    pieces = urlparse.urlparse(url)
    return pieces[1] == '' # is the location part empty?
    

class SGMLAnchorParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.seen = dict()
        
    def start_a(self, attributes):
        for name, value in attributes:
            if name == 'href' and \
               not isRelativeURL(value) and \
               value not in self.seen:
                print value
                self.seen[value] = True

class HTMLAnchorParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self,NullFormatter())
        self.seen = dict()
        
    def start_a(self, attributes):
        for name, value in attributes:
            if name == 'href' and \
               not isRelativeURL(value) and \
               value not in self.seen:
                print value
                self.seen[value] = True


url = 'http://www.daimi.au.dk/~chili/PBI04/plan.html'
doc = urllib.urlopen(url).read()

print 'SGMLAnchorParser'
parser = SGMLAnchorParser()
parser.feed(doc)
parser.close()
print

print 'HTMLAnchorParser'
parser = HTMLAnchorParser()
parser.feed(doc)
parser.close()

In this example, we do exactly the same for both the SGML and HTML parser: we define a callback for start anchor tags — by defining the method start_a — and output the href links if they are not relative.

To parse the document, we instantiate a parser object, .feed() it the document, and then .close() the parser to terminate and complete the parsing.
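The sgmllib and htmllib modules were removed in Python 3. For readers on a newer Python, here is a sketch of the same anchor extraction using html.parser.HTMLParser (called HTMLParser.HTMLParser in Python 2). This is a different class from htmllib's HTMLParser: instead of one start_tag method per tag, there is a single handle_starttag() callback that receives the tag name. The page is an invented example, and the links are collected in a list rather than printed directly.

```python
try:
    from html.parser import HTMLParser      # Python 3
    from urllib.parse import urlsplit
except ImportError:
    from HTMLParser import HTMLParser       # Python 2
    from urlparse import urlsplit

def isRelativeURL(url):
    return urlsplit(url).netloc == ''       # is the location part empty?

class AnchorParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.seen = set()
        self.links = []

    def handle_starttag(self, tag, attributes):
        # called for every start tag; tag and attribute names are lower-cased
        if tag != 'a':
            return
        for name, value in attributes:
            if name == 'href' and value and \
               not isRelativeURL(value) and value not in self.seen:
                self.seen.add(value)
                self.links.append(value)

# A made-up page that is not well-formed XML.
page = """<html><body>
<p>Unclosed paragraph
<a href="http://www.python.org/">Python</a><br>
<a href="notes.html">local notes</a>
<A HREF="http://www.python.org/">Python again</A>
<a href="http://www.w3.org/">W3C</a>
</body></html>"""

parser = AnchorParser()
parser.feed(page)
parser.close()
for link in parser.links:
    print(link)
```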

The difference between SGMLParser and HTMLParser is that HTMLParser provides a bit more functionality for processing HTML and provides default tag handlers for HTML tags.

Among the extra functionality provided by HTMLParser is a collection of anchors, so the above example could have been implemented as this:

parser = HTMLParser(NullFormatter())
parser.feed(doc)
parser.close()
seen = dict()
for url in parser.anchorlist:
    if url in seen: continue
    seen[url] = True
    if not isRelativeURL(url):
        print url 

where the links are collected in the parser's anchorlist attribute.

We will not use any of the extra functionality in HTMLParser, so in the following we will use SGMLParser to extract information from HTML pages.

Getting back to the Programming in Bioinformatics course plan, we want to extract the information in the table: the dates, lectures, exercises and remarks.

EXERCISE WS.8: Examine the source HTML for the page. Can you recognize the structure of the table? Which tags are the relevant ones to parse?

Each row in the table, except for the title row, corresponds to a lecture, and the columns in the rows are, from left to right, the date, lecture, exercises, and remarks. Thus, we are interested in the TR and TD tags.

EXERCISE WS.9: Write an SGMLParser sub-class that recognizes the start and end tags of the TR and TD tags.

One solution could look like this. Run it, and see what happens.

import urllib
from sgmllib import SGMLParser

class CoursePlanParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        
    def start_tr(self, attributes):
        print "Lecture:", 

    def end_tr(self):
        print

    def start_td(self, attributes):
        print "#", 

    def end_td(self):
        pass

url = 'http://www.daimi.au.dk/~chili/PBI04/plan.html'

parser = CoursePlanParser()
parser.feed(urllib.urlopen(url).read())
parser.close() 

The parser extracts the structure of the course plan table, but does not extract the actual content. To get at this, we need to override the handle_data() method of SGMLParser.

Try this out:

import urllib
from sgmllib import SGMLParser

class TextParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.data = []
    def handle_data(self,data):
        self.data.append(data)

def formatText(text):
    def getLines(words):
        line = []
        lineLength = 0
        for word in words:
            line.append(word)
            lineLength += len(word)+1
            if lineLength > 60:
                yield ' '.join(line)
                line = []
                lineLength = 0
        yield ' '.join(line)
    return '\n'.join(getLines(text.split()))

def getText(url):
    parser = TextParser()
    parser.feed(urllib.urlopen(url).read())
    parser.close()

    text = ''.join(parser.data)
    return formatText(text)

print getText('http://www.daimi.au.dk/~chili/PBI04/plan.html') 

The output of this script is all the text (stripped of any markup) from the web-page.

The TextParser class is the part of the script that extracts the text. It uses the handle_data method to collect the text in a list, which is later join()'ed into a string that is formatted and printed.

Collecting strings by appending to a list and join()ing the list at the end is more efficient than repeatedly concatenating strings. That probably doesn't matter in this example, but it can be significant for larger strings, so it is better to always stick to this idiom.
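
To make the idiom concrete, here is a small sketch that builds the same string both ways. The function names and the sample chunks are made up for illustration. The two functions return identical strings and differ only in how much copying may happen along the way.

```python
def collect_with_join(chunks):
    """Append each piece to a list and join once at the end."""
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return ''.join(parts)


def collect_with_concat(chunks):
    """Grow a string piece by piece; each += may copy everything so far."""
    text = ''
    for chunk in chunks:
        text += chunk
    return text


# Made-up pieces, standing in for the data a parser hands to handle_data()
chunks = ['Date', ': ', '2006', '-', '11']

joined = collect_with_join(chunks)
concatenated = collect_with_concat(chunks)
print(joined)
```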

The script collects all the text on the web-page, but we can restrict it to only collect the text between certain start and end tags, using this trick:

class AnchorTextParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.data = None

    def start_a(self,attributes):
        self.data = []

    def end_a(self):
        print ''.join(self.data)
        self.data = None

    def handle_data(self,data):
        if self.data is not None:
            self.data.append(data) 

Here, self.data is used to collect the text, just as before, but we set it to None outside of anchor tags (initially in the class constructor, and later on whenever we see a </a> tag), and we only collect data when self.data is not None.
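
The same None-as-switch trick can be tried without a network connection. The sketch below assumes Python 3, where sgmllib is not available, and therefore uses html.parser.HTMLParser with handle_starttag and handle_endtag in place of start_a and end_a. The class name AnchorText and the input string are made up for illustration, and the text is stored in a list rather than printed.

```python
from html.parser import HTMLParser


class AnchorText(HTMLParser):
    """Collects the text found between <a> and </a>, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.data = None          # None means "currently outside an anchor"

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.data = []        # switch collection on

    def handle_endtag(self, tag):
        if tag == 'a' and self.data is not None:
            self.texts.append(''.join(self.data))
            self.data = None      # switch collection off again

    def handle_data(self, data):
        if self.data is not None:
            self.data.append(data)


parser = AnchorText()
parser.feed('Read <a href="notes.html">the <b>lecture</b> notes</a> first.')
parser.close()
print(parser.texts)
```

Text inside nested tags such as <b> is still collected, because only the anchor tags switch collection on and off.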

As a slightly longer example, we can extract the links and text inside the anchors:

class AnchorTextParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.links = dict()
        self.data = None

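    def start_a(self,attributes):
        # named anchors (<a name=...>) have no href; .get() gives None for them
        self.link = dict(attributes).get('href')
        self.data = []

    def end_a(self):
        if self.link is not None:
            self.links.setdefault(self.link,[]).append(''.join(self.data))
        self.data = None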

    def handle_data(self,data):
        if self.data is not None:
            self.data.append(data)

def getLinks(url):
    parser = AnchorTextParser()
    parser.feed(urllib.urlopen(url).read())
    parser.close()
    return parser.links

links = getLinks('http://www.daimi.au.dk/~chili/PBI04/plan.html')
for link in links.keys():
    print link
    print '-' * len(link)
    for text in links[link]:
        print formatText(text)
        print
    print 

For the course plan, we want to extract the text in the columns, and collect the lectures as all the columns in a row.

By combining what we have learned so far, this should be fairly easy to do:

class CoursePlanParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.lectures = []
        self.data = None
        
    def start_tr(self, attributes):
        self.lecture = [] # prepare current row

    def end_tr(self):
        self.lectures.append(self.lecture)

    def start_td(self, attributes):
        self.data = [] # prepare current column

    def end_td(self):
        columnText = ''.join(self.data)
        self.lecture.append(columnText)
        self.data = None # reset

    def handle_data(self,data):
        if self.data is not None:
            self.data.append(data)

def getCoursePlan(url):
    parser = CoursePlanParser()
    parser.feed(urllib.urlopen(url).read())
    parser.close()
    return parser.lectures[1:]  # skipping header

lectures = getCoursePlan('http://www.daimi.au.dk/~chili/PBI04/plan.html')

for lecture in lectures:
    lecture = [formatText(text) for text in lecture]
    print 'Date:'
    print lecture[0]
    print
    print 'Lectures:'
    print lecture[1]
    print
    print 'Exercises:'
    print lecture[2]
    print
    if lecture[3] != '':
        print 'Remarks:'
        print lecture[3]
        print
    print '-' * 80
    print 
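
The row-and-column collection can also be tried offline on a small table. The sketch below assumes Python 3, where sgmllib is not available, so it uses html.parser.HTMLParser. The class name TableParser and the two-row table are made up for illustration and are not taken from the actual course plan.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the text of each TD, grouped into one list per TR."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = None     # current row, or None outside a TR
        self.data = None    # current column text, or None outside a TD

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row = []       # prepare current row
        elif tag == 'td':
            self.data = []      # prepare current column

    def handle_endtag(self, tag):
        if tag == 'td' and self.data is not None and self.row is not None:
            self.row.append(''.join(self.data).strip())
            self.data = None    # reset
        elif tag == 'tr' and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.data is not None:
            self.data.append(data)


# A made-up table with a title row and one lecture
doc = """
<table>
<tr><td>Date</td><td>Lecture</td><td>Exercises</td><td>Remarks</td></tr>
<tr><td>1/11</td><td><a href="ws.html">Web-services</a></td><td>WS.1-WS.14</td><td></td></tr>
</table>
"""

parser = TableParser()
parser.feed(doc)
parser.close()
lectures = parser.rows[1:]   # skipping the title row
print(lectures)
```

Unlike the SGMLParser version, this sketch also guards against TD tags that appear outside a TR, which is a design choice rather than something the course plan page is known to need.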

EXERCISE WS.10*: A lot of the text in the course plan is actually anchors. Can you extract the links together with the text, and display the links with the output of the script?

Putting it All Together...

The lecture-notes for this course contain a number of exercises. Wouldn't it be nice to be able to extract all the exercises so you could have a look at them, without reading through the whole lecture? It would make sense, at least, if you had already read the lecture and now just wanted to do the exercises.

EXERCISE WS.11: Examine the lecture-note pages. Do you recognize a structure that will let you pick out the exercises?

EXERCISE WS.12: Write a script that extracts the exercises from a lecture-note page, using XML parsing and XPaths.

EXERCISE WS.13: Do the same exercise, but using an SGMLParser class.

These scripts will let you extract all the exercises for a single lecture. But what about a script for extracting all the exercises for the course?

EXERCISE WS.14: At the main lecture-notes page there is a list of all the lectures. Use this list to get the URLs for all the lecture-note pages, and then extract the exercises for each page.

Do this exercise using both XML and XPaths, and an SGMLParser.

Summary

We have learnt how to download and parse web-pages. With this new knowledge, we are now ready to attack this week's exercises, concerning a module wrapping search on the NCBI web site.