Scripting 2005
Web-services

In this lecture we cover a few aspects of web-services. There are many aspects of web-services, but most of the complexity is on the server side, which we will not cover in this course. Here we will only concern ourselves with client access to web-services. We will learn how to download web-pages and how to extract information from them, both when they are XML and when they are HTML.
Supplementary reading: Python How to Program, 20.1 and 20.2, pages 689-692, 15.1-15.4, pages 491-500, 16.1-16.4, pages 529-534, the urllib manual and the sgmllib manual. I realize that some of the material covered here, and in the associated exercises-note, has already been covered in Programming for Bioinformatics, but read on anyway; I promise there will also be material you have not seen before.

Motivation

Many interesting services are available through web-sites. In most cases, manually accessing the sites is fine, but when we want to integrate the results of a service in a script, this is too cumbersome.
Warning!
If you make extensive use of a service through a script, see if you cannot get to the service through a local stand-alone program. It is considered rather rude to have scripts performing massive requests of a site; the sites are usually only designed to cope with the workload created by interaction with humans, and a script can create workloads far exceeding the site's capacity. You should only use scripts to access a web-site if the workload you create is rather small.

Similar to calling external programs from scripts, we want to be able to access a web-site from a script. This is especially useful when we need to look up some information, needed for further processing, in a public database accessible through the web.

If you are lucky, you can get to the resource you need in an easy to parse format, e.g. XML, and with a structure where it is easy to extract relevant data. More often the resource is accessed through an HTML page that can be more or less structured, and where the structure does nothing to help distinguish relevant from irrelevant information.

In this lecture, we cover how to extract information from HTML pages. Although this should always be your last resort (almost any other structured format is preferable), you will find that it is very often your only option.

The Course Announcements (Dealing with XML)

We start out with a very simple problem: we want to extract announcements for the course and information about the course plan from the course web-pages. The announcements are shown, in a human readable form, on both the main page and the announcements page (with only the most recent announcements on the main page), and, in a format more fitting for computer processing, on the Really Simple Syndication (RSS) feed. The course schedule page contains a table with information about: material covered, exercises, and additional information (notes or projects).
We want to write a script that, given the week number, gives us the information for that particular week. The problem of extracting information from such web-pages can be broken into two sub-problems: downloading the web-page and extracting the relevant information.

Downloading a Web-Page

The address of a web-page is in the form of a URI (Uniform Resource Identifier). For web-pages, the term URL (Uniform Resource Locator) is often used instead of URI, but see The W3C description of URIs and URLs for the full story. Here, we will follow convention and use URL.
A URL is of the form

scheme://location/path?query#fragment

Missing parts are recognized by missing punctuation.

The URL for the course plan page is http://www.daimi.au.dk/~mailund/scripting2005/schedule.html

The first part of the URL, the scheme, is http, which informs us that the page should be accessed over the HyperText Transfer Protocol. On the web, this is usually the protocol we use, but you will also see ftp URLs (for the File Transfer Protocol), https URLs (for Secure HTTP), mailto URLs (for emails), and so forth.
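As a small illustration of how a URL splits into its parts, here is a sketch that takes the course plan URL apart. It assumes Python 3, where the URL parsing functions live in urllib.parse rather than in the separate urlparse module of the Python version used in this note; the second URL is a made-up address, used only to show a query and a fragment.

```python
from urllib.parse import urlparse

url = 'http://www.daimi.au.dk/~mailund/scripting2005/schedule.html'
parts = urlparse(url)

print('scheme:  ', parts.scheme)          # how to access the page
print('location:', parts.netloc)          # which machine to ask
print('path:    ', parts.path)            # which document on that machine
print('query:   ', repr(parts.query))     # empty for the course plan
print('fragment:', repr(parts.fragment))  # empty for the course plan

# A hypothetical URL (not a real service) with a query and a fragment
search = urlparse('http://www.example.org/search?term=python#results')
print(search.path, search.query, search.fragment)
```

Parts that are missing from the URL simply come back as empty strings.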
For the course schedule, there is no query and fragment part. The module urlparse can be used for manipulating URLs. In this course, we will not consider URLs in more detail, but in the exercises for this lecture we will use the query part of URLs to give parameters to web-services.

Given the URL for the course plan, we can download the page using the urllib module:

import urllib
url = 'http://www.daimi.au.dk/~mailund/scripting2005/schedule.html'
page = urllib.urlopen(url)
print page.read()
page.close()

Try running the script above. The output looks rather nasty, eh? Don't worry, though, it is not as bad as it looks (in this particular case) and you will be able to parse it shortly, but we will first have a look at processing the announcements, since this is a slightly easier task.

Getting Announcements

With a web-page in hand, downloaded using urllib as above, it is now a matter of finding the relevant information. For the announcements, this is fairly easy. We can download them from the RSS feed at the URL http://www.daimi.au.dk/~mailund/scripting2005/rss.xml. The RSS feed is in a simple XML format that is easy to extract information from. We can start by downloading it and having a look at it:

import urllib
url = 'http://www.daimi.au.dk/~mailund/scripting2005/rss.xml'
print urllib.urlopen(url).read()

This will, depending on which and how many announcements have been made, look something like this:
<rss version="2.0">
<channel>
<title>Scripting 2005</title>
<link>
http://www.daimi.au.dk/~mailund/scripting2005/index.html
</link>
<description>
A course on script programming
</description>
<language>en-us</language>
<item>
<title>Schedule for week 14</title>
<pubDate>26/03/2005</pubDate>
<description>
The schedule for week 14 is available.
</description>
<link> ... </link>
</item>
<item>
<title>First lecture note available</title>
<pubDate>26/03/2005</pubDate>
<description>
Lecture notes on process management available.
</description>
<link> ... </link>
</item>
<item>
<title>Pages available</title>
<pubDate>25/03/2005</pubDate>
<description>
The initial WWW pages are ready.
</description>
<link> ... </link>
</item>
</channel>
</rss>
The first part of the RSS describes the channel, the source of the announcements, which in this case is the course main page. An RSS file can actually contain several channels, but in this case there is only one. The channel also contains a list of news-items, in <item>-tags, and these are the announcements we are after. Each item contains a title, a date, a description and a link to further information (which in this case is not really that interesting since all the links point to the same page: the main announcements page).

Extracting Announcements
Web-resources:
The W3C Document Object Model (DOM) Level 1 Specification
xml.dom.minidom: Lightweight DOM implementation

A simple way to process this is through the Document Object Model (DOM) interface. This is a general interface for manipulating XML documents as if they were trees; the interface lets us traverse or modify the tree, accessing subtrees and attributes of each node. This interface to the document is at a rather abstract level, but it is useful for building more application specific interfaces to XML documents.
We translate the raw text of the XML document into a DOM object using the parseString function

import urllib
import xml.dom.minidom
url = 'http://www.daimi.au.dk/~mailund/scripting2005/rss.xml'
doc = xml.dom.minidom.parseString(urllib.urlopen(url).read())

or the parse function, which reads directly from a file-like object:

import urllib
import xml.dom.minidom
url = 'http://www.daimi.au.dk/~mailund/scripting2005/rss.xml'
doc = xml.dom.minidom.parse(urllib.urlopen(url))

With the DOM object in hand, we can manipulate the document tree, using the methods and attributes described in The W3C Document Object Model (DOM) Level 1 Specification or described in Python How to Program pages 537-539. Explicitly getting the items in the document could look like this:
# <rss></rss> is the first (and only) child of the root
rss = doc.childNodes[0]
# in this case, <rss></rss> contains only a single <channel></channel>
channel = rss.childNodes[0]
# and the items are the children of the channel with tag <item>
items = channel.getElementsByTagName('item')
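One caveat with the explicit traversal above: it assumes that the first child of a node is the tag we are after. If the feed is pretty-printed, with line breaks and indentation between the tags, minidom keeps that whitespace as text nodes, and childNodes[0] can be such a text node. The sketch below, which assumes Python 3 and uses a small inline feed rather than the real one, skips the text nodes with a helper (element_children is a name made up for this example).

```python
import xml.dom.minidom

# A small pretty-printed feed, standing in for the downloaded document
FEED = """<rss version="2.0">
  <channel>
    <title>Scripting 2005</title>
    <item><title>Schedule for week 14</title></item>
    <item><title>Pages available</title></item>
  </channel>
</rss>"""

def element_children(node):
    """Return the children of node that are tags, skipping text nodes."""
    return [c for c in node.childNodes if c.nodeType == c.ELEMENT_NODE]

doc = xml.dom.minidom.parseString(FEED)
rss = doc.documentElement

first_child = rss.childNodes[0]        # the whitespace before <channel>
channel = element_children(rss)[0]     # the <channel> tag itself
items = channel.getElementsByTagName('item')

print(first_child.nodeType == first_child.TEXT_NODE)
print(channel.tagName, len(items))
```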
In this particular case, we could actually exploit that all the items in the document are the ones we are after, and just do
items = doc.getElementsByTagName("item")
but in general we will have to get hold of the right sub-tree before we extract the sub-nodes. This can be a bit of a pain, but luckily there is a way around it: using the XPath-language, which resembles regular expressions, but can be used to identify sub-parts of an XML document.
The xpath to the items we are after is this:

/rss/channel[1]/item

In our Python script, it would look like this:

import xml.xpath
items = xml.xpath.Evaluate('/rss/channel[1]/item', doc)
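The xml.xpath module is not part of the Python standard library. As a hedged alternative, the sketch below runs a similar path query with xml.etree.ElementTree, which supports a limited subset of XPath. It assumes Python 3 and an inline copy of part of the feed; note that ElementTree paths are relative to the element we search from, so the path starts at the children of the rss root rather than at the document root.

```python
import xml.etree.ElementTree as ET

# An inline excerpt of the feed, standing in for the downloaded document
FEED = """<rss version="2.0">
<channel>
<title>Scripting 2005</title>
<item>
<title>Schedule for week 14</title>
<pubDate>26/03/2005</pubDate>
</item>
<item>
<title>Pages available</title>
<pubDate>25/03/2005</pubDate>
</item>
</channel>
</rss>"""

root = ET.fromstring(FEED)                  # root is the <rss> element
items = root.findall('./channel[1]/item')   # items of the first channel

titles = [item.findtext('title') for item in items]
dates = [item.findtext('pubDate') for item in items]
for title, date in zip(titles, dates):
    print(date, title)
```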
In any case, however we obtain the list of items, we want to print the information they contain. How does this work?

for item in items:
    print item

<DOM Element: item at 0xb7c3048c>
<DOM Element: item at 0xb7c3064c>
<DOM Element: item at 0xb7c3080c>

Hmm, not so well. The problem here is that the items are DOM objects, and these are not necessarily printed in any meaningful format. Let us instead extract the information contained in the items, and pretty-print each item title, together with the date of the announcement:
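for item in items:
    print 'Title:', getTitle(item)
    print 'Date:', getDate(item)
    print

That was easy; now we just need to implement getTitle and getDate.
To get the data we want, we need to access the title and pubDate sub-nodes of the item, and extract the text they contain.
Getting the sub-nodes we already know how to do: we just use the getElementsByTagName method.
To get the text, we need to know that a tag-pair that contains raw text, as is the case for the title and pubDate tags, has the text as a child node of type TEXT_NODE, with the text itself in the data attribute of that node. Knowing this, we write: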
def getTitle(item):
    # get the text-child of the title sub-node
    node = item.getElementsByTagName('title')[0].childNodes[0]
    # and then extract the text data
    assert node.nodeType == node.TEXT_NODE
    return node.data

def getDate(item):
    # get the text-child of the date sub-node
    node = item.getElementsByTagName('pubDate')[0].childNodes[0]
    # and then extract the text data
    assert node.nodeType == node.TEXT_NODE
    return node.data
There is an ugly bit of redundancy there (the two functions look almost identical), but we can solve that with a higher-order function:
def getText(tag):
    def extractTextData(item):
        node = item.getElementsByTagName(tag)[0].childNodes[0]
        assert node.nodeType == node.TEXT_NODE
        return node.data
    return extractTextData

getTitle = getText('title')
getDate = getText('pubDate')
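To see the higher-order function at work without downloading anything, here is a small sketch that applies it to a single inline item. It assumes Python 3 (hence print as a function), and getDescription is an extra extractor added only for this example. Notice that the text of a tag comes back exactly as it stands in the document, including line breaks, so for the description we strip the surrounding whitespace.

```python
import xml.dom.minidom

def getText(tag):
    # Returns a function that extracts the text inside <tag> from an item
    def extractTextData(item):
        node = item.getElementsByTagName(tag)[0].childNodes[0]
        assert node.nodeType == node.TEXT_NODE
        return node.data
    return extractTextData

getTitle = getText('title')
getDate = getText('pubDate')
getDescription = getText('description')   # added for this example

ITEM = """<item><title>Pages available</title><pubDate>25/03/2005</pubDate><description>
The initial WWW pages are ready.
</description></item>"""

item = xml.dom.minidom.parseString(ITEM).documentElement
print('Title:', getTitle(item))
print('Date:', getDate(item))
print('Description:', getDescription(item).strip())
```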
And, voila, we have a script for displaying the titles and dates of the announcements:

import urllib
import xml.dom.minidom
import xml.xpath

def getText(tag):
    def extractTextData(item):
        node = item.getElementsByTagName(tag)[0].childNodes[0]
        assert node.nodeType == node.TEXT_NODE
        return node.data
    return extractTextData

getTitle = getText('title')
getDate = getText('pubDate')

url = 'http://www.daimi.au.dk/~mailund/scripting2005/rss.xml'
doc = xml.dom.minidom.parse(urllib.urlopen(url))
for item in xml.xpath.Evaluate('/rss/channel[1]/item', doc):
    print 'Title:', getTitle(item)
    print 'Date:', getDate(item)
    print
EXERCISE WS.1: There are many sites on the net with an RSS feed. Not all of them use the same version of RSS (version 2.0) as we learned how to parse here, but some do. At the New York Times, for instance, there is a whole catalogue of feeds available from http://www.nytimes.com/services/xml/rss/ . Select a few of these news feeds, and write a script that collects the titles and dates for these feeds. Modify it so it also prints the link to the news article associated with each item.

Dealing with Namespaces

BMC Bioinformatics also has an RSS feed, at http://www.biomedcentral.com/bmcbioinformatics/rss. (If you view this feed in your browser, it will be formatted according to a style-sheet and you will not see the raw XML, but you can see it by viewing the page source.) This RSS feed is not RSS 2.0, but can you still extract the item titles and dates? One would think that it was just a matter of using a different XPath, and slightly different tag-names for extracting the text, but for this feed we also have to deal with different XML namespaces. You can see the set of namespaces used at the top of the RSS XML document:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns="http://my.netscape.com/rdf/simple/0.9/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:extra="http://www.biomedcentral.com/xml/schemas/extra/">
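The xmlns attribute declares the default namespace, and each xmlns:name attribute declares a namespace together with the short name we can use to refer to it.
The namespaces declared here can be used in the XML document, using the short names as prefixes in tags, as for example the rdf:RDF tag above. Tags without a prefix belong to the default namespace.

We can use the DOM attribute namespaceURI to get the namespace URI of a node, and the attribute localName to get the name of the tag without the prefix. Getting tag, URI, and local name for the root of the BMC RSS feed looks like this:

import urllib
import xml.dom.minidom
url = 'http://www.biomedcentral.com/bmcbioinformatics/rss'
doc = xml.dom.minidom.parse(urllib.urlopen(url))
rss = doc.documentElement
print rss.tagName, rss.namespaceURI, rss.localName

rdf:RDF http://www.w3.org/1999/02/22-rdf-syntax-ns# RDF

Notice that we use the documentElement attribute to get the root of the document. We can also, recursively, print all the tags, namespace-URIs, and local names: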
import urllib
import xml.dom.minidom

url = 'http://www.biomedcentral.com/bmcbioinformatics/rss'
doc = xml.dom.minidom.parse(urllib.urlopen(url))

def recursiveWrite(node, level=0):
    if node.nodeType == node.ELEMENT_NODE:
        print ' '*level, node.tagName, node.namespaceURI, node.localName
        for child in node.childNodes:
            recursiveWrite(child,level+2)

recursiveWrite(doc.documentElement)
EXERCISE WS.2: Can you use a similar recursive traversal to write a script that extracts titles and descriptions from BMC Bioinformatics' feed? How about the date tags? The date tags belong to a different namespace than the default, but is this a problem? Will
getDate = getText('date')
work? How about:
getDate = getText('dc:date')
Did we actually use namespaces in solving exercise WS.2? Not really, we could have made the recursive traversal without knowing much about namespaces. Of course, the explicit traversal isn't as elegant as the XPath version we used earlier.
We needed to be careful with the date information, though, since we use the full tag-name, dc:date, including the short name of the namespace.
And what would happen if the document had used another short name for the http://purl.org/dc/elements/1.1/ namespace?

Let us deal with the last problem first.
To get a node from the local name and the namespace URI, rather than the tag-name, which might not be the same for the same combination of tag and namespace, we use the getElementsByTagNameNS method.
This method takes the URI as the first argument, and the local name as the second. We can therefore change
def getText(tag):
    def extractTextData(item):
        node = item.getElementsByTagName(tag)[0].childNodes[0]
        assert node.nodeType == node.TEXT_NODE
        return node.data
    return extractTextData

getTitle = getText('title')
getDescription = getText('description')
getDate = getText('dc:date')

to

def getText(namespace,tag):
    def extractTextData(item):
        node = item.getElementsByTagNameNS(namespace,tag)[0].childNodes[0]
        assert node.nodeType == node.TEXT_NODE
        return node.data
    return extractTextData

default_namespace = 'http://my.netscape.com/rdf/simple/0.9/'
dc_namespace = 'http://purl.org/dc/elements/1.1/'
getTitle = getText(default_namespace,'title')
getDescription = getText(default_namespace,'description')
getDate = getText(dc_namespace,'date')
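To make the difference concrete, the sketch below parses a small made-up document in which the Dublin Core namespace has the short name dublin instead of dc. It assumes Python 3; the document, the article title, and the date are invented for the example. Looking up the tag-name dc:date finds nothing, while looking up the namespace URI and the local name still works.

```python
import xml.dom.minidom

DEFAULT_NS = 'http://my.netscape.com/rdf/simple/0.9/'
DC_NS = 'http://purl.org/dc/elements/1.1/'

# Made-up feed: same namespaces as the BMC feed, but a different short name
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://my.netscape.com/rdf/simple/0.9/"
         xmlns:dublin="http://purl.org/dc/elements/1.1/">
  <item>
    <title>A made-up article</title>
    <dublin:date>2005-04-01</dublin:date>
  </item>
</rdf:RDF>"""

doc = xml.dom.minidom.parseString(DOC)
item = doc.getElementsByTagNameNS(DEFAULT_NS, 'item')[0]

by_tag_name = item.getElementsByTagName('dc:date')       # depends on prefix
by_namespace = item.getElementsByTagNameNS(DC_NS, 'date')  # does not

date_node = by_namespace[0]
print(len(by_tag_name), len(by_namespace))
print(date_node.tagName, date_node.localName, date_node.childNodes[0].data)
```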
and avoid the problems with different tag-names for the same "real" tag. And now for the XPaths...
EXERCISE WS.3: What happens if you try to get the items, ignoring the rdf prefix of the root tag,

for item in xml.xpath.Evaluate("/RDF/item",doc):
    print 'Title:', getTitle(item)
    print 'Date:', getDate(item)
    print

and what happens if you try to include it?

for item in xml.xpath.Evaluate("/rdf:RDF/item",doc):
    print 'Title:', getTitle(item)
    print 'Date:', getDate(item)
    print

The problem we run into is related to the tag-name/local-name issue from earlier.
The solution is to specify a "context" in the call to xml.xpath.Evaluate:
namespaces = {
    # None is default
    None : 'http://my.netscape.com/rdf/simple/0.9/',
    'rdf' : 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dc' : 'http://purl.org/dc/elements/1.1/'
    }
con = xml.xpath.Context.Context(doc, processorNss=namespaces)
for item in xml.xpath.Evaluate("/rdf:RDF/item",context=con):
    #... process item...

The namespaces are specified in a dictionary, mapping the short names to the URI. Here I have used the same short names as in the XML document, but this is not strictly needed; any short name can be used as long as the dictionary maps it to the right URI. This, therefore, behaves identically to the script above:

namespaces = {
    # None is default
    None : 'http://my.netscape.com/rdf/simple/0.9/',
    'xx' : 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dc' : 'http://purl.org/dc/elements/1.1/'
    }
con = xml.xpath.Context.Context(doc, processorNss=namespaces)
for item in xml.xpath.Evaluate("/xx:RDF/item",context=con):
    #... process item...
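The same idea, a dictionary from short names to namespace URIs handed to the path query, is also available in the standard library. The sketch below assumes Python 3 and uses xml.etree.ElementTree rather than xml.xpath; the document is made up, and the short name r for the default namespace is an arbitrary choice for this example.

```python
import xml.etree.ElementTree as ET

# Short names used in the paths below; only the URIs have to match
namespaces = {
    'r': 'http://my.netscape.com/rdf/simple/0.9/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

# Made-up feed in the same format as the BMC feed
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://my.netscape.com/rdf/simple/0.9/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item>
    <title>A made-up article</title>
    <dc:date>2005-04-01</dc:date>
  </item>
</rdf:RDF>"""

root = ET.fromstring(DOC)               # root is the rdf:RDF element

unqualified = root.findall('item')      # no namespace given: finds nothing
found = [(item.findtext('r:title', namespaces=namespaces),
          item.findtext('dc:date', namespaces=namespaces))
         for item in root.findall('r:item', namespaces)]

print(len(unqualified))
for title, date in found:
    print(date, title)
```

As with the context above, the short names in the dictionary need not be the ones used in the document.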
You do not need to specify all the namespaces used in the document; the xpath handling only needs to know the namespaces you use in the xpath specification. For example, the BMC Bioinformatics feed uses an extra namespace that is not in the dictionary above, and that is fine since we do not refer to it in the xpath.

EXERCISE WS.4: Complete the BMC Bioinformatics feed script to use XPaths.

EXERCISE WS.5: Extend the script you wrote in exercise WS.1 so it can download feeds in both of the two formats you can now parse, determine the correct format by inspecting the root-node in the XML document, and then print the titles and dates as earlier.

The Course Schedule (Dealing with HTML)

Extracting data from XML documents is pretty easy, as we have just seen. Unfortunately, most web-pages are not yet in XML but in HTML, and HTML is not particularly easy to extract information from, and, if you remember how the HTML source for the course schedule looked, it can be rather nasty looking also.

There is no semantics associated with the structure, at least not until some project like the semantic web picks up speed. Therefore, we cannot simply ask for the information we need. We need to figure out how the information is represented in the HTML (which often involves a bit of guessing), and then write a parser for extracting the information (and ignoring irrelevant data). Parsers like these are notoriously unstable; when the HTML on the web-site changes (and at some point it will), the parser will no longer be able to extract the relevant information, and it has to be updated. This is far from optimal, and far from state-of-the-art, but sadly state-of-the-craft.
Web-resources:
HTML/XHTML

Worse still, HTML is not as well-structured as XML (one of the reasons people are now moving towards using XHTML). This means that the powerful tools we have for processing XML will not work on (common) HTML, complicating the HTML processing even more.

The course schedule is not that bad, though (it is actually XHTML), but still, the markup in this page is for the visual layout, not the information contained in the page. The reason that the HTML for the page looks as ugly as it does is that it is badly formatted. This is because it has been computer-generated from another format. You can see this format at the URL: http://www.daimi.au.dk/~mailund/scripting2005/schedule.xml.

Warm up: XML again

This document is a mix of HTML and more structured data, and as a warm up exercise, before trying to extract the information from the HTML page, we can try to extract the lecture schedule from this page.
import urllib
import xml.dom.minidom
import xml.xpath

url = 'http://www.daimi.au.dk/~mailund/scripting2005/schedule.xml'
doc = xml.dom.minidom.parse(urllib.urlopen(url))

def getAllText(node):
    text = []
    def recursive(n):
        text.append((n.nodeType == n.TEXT_NODE and n.data) or "")
        for c in n.childNodes: recursive(c)
    recursive(node)
    return ' '.join(text).strip()

for week in xml.xpath.Evaluate('/body/schedule/week',doc):
    print 'Week', week.getAttribute('number')
    for lecture in week.getElementsByTagName('lecture'):
        print '\t', lecture.getAttribute('date'), '--',
        print getAllText(lecture)
    print
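This is not that different from the RSS parsing that we are familiar with by now, but there are two new things: We collect the text inside the lecture tags recursively, using getAllText, since the text can be nested inside other tags, and we use the getAttribute method to get the values of the name-value pairs in the start-tags.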
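Both new things can be tried out on a small inline document. The sketch below assumes Python 3, and the schedule fragment (week number, date, link, and text) is made up for the example. It also uses a slightly different getAllText than the script above: instead of joining the pieces with spaces, it joins them as they are and then normalizes the whitespace, which avoids double spaces around nested tags.

```python
import xml.dom.minidom

# A made-up fragment in the style of the schedule document
SCHEDULE = """<schedule><week number="14">
<lecture date="5/4">Process <a href="notes.html">management</a> and pipes</lecture>
</week></schedule>"""

def getAllText(node):
    """Collect all text below node, however deeply it is nested."""
    text = []
    def recursive(n):
        if n.nodeType == n.TEXT_NODE:
            text.append(n.data)
        for c in n.childNodes:
            recursive(c)
    recursive(node)
    # join the pieces as they are, then squeeze runs of whitespace
    return ' '.join(''.join(text).split())

doc = xml.dom.minidom.parseString(SCHEDULE)
week = doc.getElementsByTagName('week')[0]
lecture = week.getElementsByTagName('lecture')[0]

print('Week', week.getAttribute('number'))
print(lecture.getAttribute('date'), '--', getAllText(lecture))
```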
EXERCISE WS.6: The lecture-nodes we just parsed contain anchor (<a>) tags with an href attribute that holds the URL of the lecture note. Extend the script to extract this URL for each lecture.
The URL you get from this is a local URL. By prefixing it with the URL of the schedule page, except for the document part (that is, prefixing with http://www.daimi.au.dk/~mailund/scripting2005/), you get the full URL and can download the lecture note:

lectureURL = baseURL + localURL
lecture = xml.dom.minidom.parse(urllib.urlopen(lectureURL))

(See also the urlparse module mentioned earlier.)
From the lecture, you can get the topics covered from the div tag with class goal:

goal = xml.xpath.Evaluate('//div[@class="goal"]',lecture)[0]

Extend your script to extract the goals for each lecture in the schedule. Extra points for pretty-printing the goals.

Dealing with XHTML

Now take a look at the schedule HTML page again.
If we look at it a bit, we find that the information we are after is contained in a table that is nested inside the second column of an outer table. The schedule can then be extracted from the rows of this table:
pathToSchedule = '/HTML/BODY/DIV/TABLE/TR/TD[2]/TABLE'
scheduleTable = xml.xpath.Evaluate(pathToSchedule,doc)[0]

def unpackRow(row):
    # get the columns of the rows and extract the text
    return map(getAllText, row.getElementsByTagName('TD'))

for row in scheduleTable.getElementsByTagName('TR'):
    week, date, lecture = unpackRow(row)
    if len(week.strip()) != 0:
        print week
        print '\t', date, '--', lecture
    else:
        # no week, so belongs to already written week
        print '\t', date, '--', lecture
So, dealing with XHTML isn't so bad after all. But notice that the path to the relevant information does not, in any way, tell us what the information we are getting is. Furthermore, the table can, at any time, be moved around to change the visual markup of the page, and this will ruin the parser for us, even if the logical structure of the document is unchanged. This is a major problem with extracting information from XHTML documents, but do not worry, I will keep the format of the schedule static while you do these exercises.

EXERCISE WS.7: Redo exercise WS.6, but by extracting the information from the HTML page instead of the XML page.

Dealing with HTML

But now let's have a look at the course plan for Programming in Bioinformatics. What happens if we try this?

import urllib
import xml.dom.minidom
import xml.xpath
url = 'http://www.daimi.au.dk/~chili/PBI/plan.html'
doc = xml.dom.minidom.parse(urllib.urlopen(url))

We get a parse error, and the reason for this is that the course plan, although perfectly valid HTML, is not well-formed XML. Our XML-fu is useless against a document that is not XML, so we need to parse this course plan in a different way.

One solution is to use the HTMLParser from htmllib or SGMLParser from sgmllib. The two classes are used in roughly the same way: you install callbacks for start and end tags, and these callbacks are then called when the parser encounters the appropriate start and end tags.
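For both classes, you install the callbacks by deriving a class and writing methods named start_tag and end_tag, where tag is the name of the tag to handle, for example start_a and end_a for anchor tags.
The start-method is called with the attributes of the start tag, as a list of name-value pairs.

You can use the two classes in exactly the same way, except that the HTMLParser needs a so-called formatter to be instantiated. To extract the non-relative URLs from the Programming in Bioinformatics course plan, we can use the classes as: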
import urllib, urlparse
from sgmllib import SGMLParser
from htmllib import HTMLParser
from formatter import NullFormatter

def isRelativeURL(url):
    pieces = urlparse.urlparse(url)
    return pieces[1] == '' # is the location part empty?

class SGMLAnchorParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.seen = dict()
    def start_a(self, attributes):
        for name, value in attributes:
            if name == 'href' and \
               not isRelativeURL(value) and \
               value not in self.seen:
                self.seen[value] = True # remember it, to print it only once
                print value

class HTMLAnchorParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self,NullFormatter())
        self.seen = dict()
    def start_a(self, attributes):
        for name, value in attributes:
            if name == 'href' and \
               not isRelativeURL(value) and \
               value not in self.seen:
                self.seen[value] = True # remember it, to print it only once
                print value

url = 'http://www.daimi.au.dk/~chili/PBI/plan.html'
doc = urllib.urlopen(url).read()
print 'SGMLAnchorParser'
parser = SGMLAnchorParser()
parser.feed(doc)
parser.close()
print
print 'HTMLAnchorParser'
parser = HTMLAnchorParser()
parser.feed(doc)
parser.close()
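In this example, we do exactly the same for both the SGML and HTML parser: we define a callback for start anchor tags, by defining the method start_a, and in this callback we print the href attribute if it is a non-relative URL that we have not seen before.
To parse the document, we instantiate a parser object, give it the document with the feed method, and finish with the close method.
The difference between SGMLParser and HTMLParser is that HTMLParser, which is derived from SGMLParser, provides some extra functionality for handling HTML.
Among the extra functionality provided by HTMLParser is the collection of anchors, so we could also have written: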
parser = HTMLParser(NullFormatter())
parser.feed(doc)
parser.close()

seen = dict()
for url in parser.anchorlist:
    if url in seen: continue
    seen[url] = True
    if not isRelativeURL(url):
        print url
where the links are collected in the parser's anchorlist attribute.
We will not use any of the extra functionality in HTMLParser, so in the following we only use SGMLParser.
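The sgmllib and htmllib modules are not available in Python 3. As a hedged sketch of the same callback idea there, the standard html.parser module has an HTMLParser class with one handle_starttag callback for all tags, instead of one start-method per tag. The example assumes Python 3; AnchorCollector and the inline page are made up for the illustration, and the URLs are collected in a list rather than printed directly.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

def isRelativeURL(url):
    return urlparse(url).netloc == ''   # is the location part empty?

class AnchorCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.seen = set()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # called for every start tag; tag and attribute names are lowercased
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value \
               and not isRelativeURL(value) \
               and value not in self.seen:
                self.seen.add(value)
                self.urls.append(value)

# A made-up page that is fine as HTML but would not parse as XML
PAGE = """<html><body>
<p>An unclosed paragraph
<a href="http://www.python.org/">Python</a>
<a href="notes/week1.html">a local link</a>
<A HREF="http://www.python.org/">Python again</A>
</body></html>"""

parser = AnchorCollector()
parser.feed(PAGE)
parser.close()
print(parser.urls)
```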
Getting back to the Programming in Bioinformatics course plan, we want to extract the information in the table: the dates, lectures, exercises and remarks.

EXERCISE WS.8: Examine the source HTML for the page. Can you recognize the structure of the table? Which tags are the relevant ones to parse?
Each row in the table, except for the title row, corresponds to a lecture, and the columns in the rows are, from left to right, the date, lecture, exercises, and remarks. Thus, we are interested in the tr (row) and td (column) tags.

EXERCISE WS.9: Write an SGMLParser sub-class that recognizes the start and end tags of the rows and columns in the table.

One solution could look like this. Run it, and see what happens.
import urllib
from sgmllib import SGMLParser

class CoursePlanParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
    def start_tr(self, attributes):
        print "Lecture:",
    def end_tr(self):
        print
    def start_td(self, attributes):
        print "#",
    def end_td(self):
        pass

url = 'http://www.daimi.au.dk/~chili/PBI/plan.html'
parser = CoursePlanParser()
parser.feed(urllib.urlopen(url).read())
parser.close()
The parser extracts the structure of the course plan table, but does not extract the actual content. To get at this, we need to override the handle_data method, which the parser calls with the text it finds between the tags. Try this out:
import urllib
from sgmllib import SGMLParser

class TextParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.data = []
    def handle_data(self,data):
        self.data.append(data)

def formatText(text):
    def getLines(words):
        line = []
        lineLength = 0
        for word in words:
            line.append(word)
            lineLength += len(word)+1
            if lineLength > 60:
                yield ' '.join(line)
                line = []
                lineLength = 0
        yield ' '.join(line)
    return '\n'.join(getLines(text.split()))

def getText(url):
    parser = TextParser()
    parser.feed(urllib.urlopen(url).read())
    parser.close()
    # the text pieces are collected in the data attribute of the parser
    text = ''.join(parser.data)
    return formatText(text)

print getText('http://www.daimi.au.dk/~chili/PBI/plan.html')
The output of this script is all the text (stripped of any markup) from the web-page.
The TextParser class is the part of the script that extracts the text. It uses the handle_data method to collect the text it is given in the list data.
Collecting strings by appending to a list and later joining them is a common Python idiom; it is usually more efficient than repeatedly concatenating strings.

The script collects all the text on the web-page, but we can restrict it to only collect the text between certain start and end tags, using this trick:
class AnchorTextParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.data = None
    def start_a(self,attributes):
        self.data = []
    def end_a(self):
        print ''.join(self.data)
        self.data = None
    def handle_data(self,data):
        if self.data is not None:
            self.data.append(data)
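Here, data is None whenever we are outside an anchor, so handle_data only collects the text between a start and an end anchor tag.

As a slightly longer example, we can extract the links and text inside the anchors: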
class AnchorTextParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.links = dict()
        self.data = None
    def start_a(self,attributes):
        self.link = dict(attributes)['href']
        self.data = []
    def end_a(self):
        self.links.setdefault(self.link,[]).append(''.join(self.data))
        self.data = None
    def handle_data(self,data):
        if self.data is not None:
            self.data.append(data)

def getLinks(url):
    parser = AnchorTextParser()
    parser.feed(urllib.urlopen(url).read())
    parser.close()
    return parser.links

links = getLinks('http://www.daimi.au.dk/~chili/PBI/plan.html')
for link in links.keys():
    print link
    print '-' * len(link)
    for text in links[link]:
        print formatText(text)
        print
    print
For the course plan, we want to extract the text in the columns, and collect the lectures as all the columns in a row. By combining what we have learned so far, this should be fairly easy to do:
class CoursePlanParser(SGMLParser):
    def __init__(self):
        SGMLParser.__init__(self)
        self.lectures = []
        self.data = None
    def start_tr(self, attributes):
        self.lecture = [] # prepare current row
    def end_tr(self):
        self.lectures.append(self.lecture)
    def start_td(self, attributes):
        self.data = [] # prepare current column
    def end_td(self):
        columnText = ''.join(self.data)
        self.lecture.append(columnText)
        self.data = None # reset
    def handle_data(self,data):
        if self.data is not None:
            self.data.append(data)

def getCoursePlan(url):
    parser = CoursePlanParser()
    parser.feed(urllib.urlopen(url).read())
    parser.close()
    return parser.lectures[1:] # skipping header

lectures = getCoursePlan('http://www.daimi.au.dk/~chili/PBI/plan.html')
for lecture in lectures:
    lecture = [formatText(text) for text in lecture]
    print 'Date:'
    print lecture[0]
    print
    print 'Lectures:'
    print lecture[1]
    print
    print 'Exercises:'
    print lecture[2]
    print
    if lecture[3] != '':
        print 'Remarks:'
        print lecture[3]
        print
    print '-' * 80
    print
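For readers on Python 3, where sgmllib is gone, the row and column collection can be sketched with html.parser instead. This is not the script above, only the same technique under different assumptions: TableRowParser and the inline course plan are made up for the example, the single handle_starttag and handle_endtag callbacks replace the per-tag start and end methods, and the whitespace inside each column is normalized as the text is stored.

```python
from html.parser import HTMLParser

class TableRowParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []      # all finished rows
        self.row = None     # columns of the current row, or None
        self.cell = None    # text pieces of the current column, or None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row = []               # prepare current row
        elif tag == 'td':
            self.cell = []              # prepare current column

    def handle_endtag(self, tag):
        if tag == 'td' and self.cell is not None:
            if self.row is not None:
                self.row.append(' '.join(''.join(self.cell).split()))
            self.cell = None            # reset
        elif tag == 'tr' and self.row is not None:
            self.rows.append(self.row)
            self.row = None

    def handle_data(self, data):
        if self.cell is not None:       # only collect text inside columns
            self.cell.append(data)

# A made-up course plan with a header row and one lecture
PLAN = """<table>
<tr><td>Date</td><td>Lecture</td><td>Exercises</td><td>Remarks</td></tr>
<tr><td>1/2</td><td>Intro to <a href="x.html">Python</a></td><td>Ex 1</td><td></td></tr>
</table>"""

parser = TableRowParser()
parser.feed(PLAN)
parser.close()
lectures = parser.rows[1:]              # skipping header
for date, lecture, exercises, remarks in lectures:
    print(date, '--', lecture, '--', exercises)
```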
EXERCISE WS.10*: A lot of the text in the course plan is actually anchors. Can you extract the links together with the text, and display the links with the output of the script?

Putting it All Together...

The lecture-notes for this course contain a number of exercises. Wouldn't it be nice to be able to extract all the exercises so you could have a look at them, without reading through the whole lecture? It would make sense, at least, if you had already read the lecture and now just wanted to do the exercises.

EXERCISE WS.11: Examine the lecture-note pages. Do you recognize a structure that will let you pick out the exercises?

EXERCISE WS.12: Write a script that extracts the exercises from a lecture-note page, using XML parsing and XPaths.
EXERCISE WS.13: Do the same exercise, but using an SGMLParser.

These scripts will let you extract all the exercises for a single lecture. But what about a script for extracting all the exercises for the course?

EXERCISE WS.14: At the main lecture-notes page there is a list of all the lectures. Use this list to get the URLs for all the lecture-note pages, and then extract the exercises for each page. Do this exercise using both XML and XPaths, and an SGMLParser.
Summary

We have learnt how to download and parse web-pages. With this new knowledge, we are now ready to attack this week's exercises, concerning a module wrapping search on the NCBI web site.