| ![]() |
TBiB Q4/2006
|
Exercise: Searching NCBIIn this exercise, we wrap searching on the National Center for Biotechnology Information (NCBI) web site. We will write a module for searching for nucleotide sequences from a python script. Supplementary reading: Building Customized Data Pipelines Using the Entrez Programming Utilities (eUtils). MotivationOn NCBI we can search various databases. We want to access this functionality from our scripts so we can automatically process the search results. Since there were a lot of exercises at the lecture this week, we will only write a few functions in this exercise. From the examples you see in this exercise, you should be able to extend the NCBI searching functionality as needed. We will focus on searching for nucleotide sequences, and fetching sequences found in such searches. Searching
Web-resources:
National Center for Biotechnology Information (NCBI). Go to the NCBI web-site. At the top you see a search bar with a pull-down menu for selecting databases and a text-field for the search terms. Were we to search manually, this is the search bar we would use. Exercise NCBI.1: Search the nucleotide database for e.g. Sus scrofa — consult Entrez Nucleotide Help for the search syntax for more advanced searches. Try clicking on some of the hits and see the resulting pages. Have a look at the various output-formats supported (the drop-down menu next to the "Display" button). We are going to write a Python module for accessing the NCBI databases, and script that will let us perform the kind of searches in exercise NCBI.1 from Python.
Web-resources:
Entrez Programming Utilities. One approach is to extract the information we need from the HTML pages — and I am sure that you, by inspecting the HTML pages you just accessed will be able to extract the relevant information, as this was an exercise in Programming in Bioinformatic — but for this particular web-site there is a better approach: using the Entrez Programming Utilities (eUtils). The eUtils are a set of web-services, where queries are encoded in a URL and the query results returned through HTTP, the protocol used for the world wide web. For searching, the URL to use is http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi Query Strings
We provide arguments to the service through a query string
appended to this URL. A query string is started by the character
Building a query string from a list of key-value pairs '&'.join([key+'='+val for key,val in l]) or a dictionary '&'.join([key+'='+val for key,val in d.items()])
is trivial, but best handled using the method
For our searching, however, we will not bother with building
general query strings, but only send pre-determined parameters, so
don't worry about
The parameters for the search are We can build the query string from a template string 'db=%s&term=%s' and then build the search URL as
_searchURL = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
def search(db,term):
query = _searchURL + ('?db=%s&term=%s' % (db,term))
# perform query...
I have put the URL outside the Anyway, back to the code...
I am not quite satisfied with the way I translate my template
string into the query string. I use the
In this particular case it doesn't matter that much, but when you
have long template strings, with many variables to fill in,
depending the right order of arguments to the string-formatting is
risky, so I have gotten used to using the "dictionary" version of
This would look like:
_searchURL = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
def search(db,term):
query = _searchURL+('?db=%(db)s&term=%(term)s'%{'db':db,'term':term})
# perform query...
The template now specifies the keys of the values to be inserted and the arguments are given as the values of a dictionary.
Building the dictionary for the
The function
_searchURL = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
def search(db,term):
query = _searchURL + ('?db=%(db)s&term=%(term)s' % locals())
# perform query...
which is the idiom I prefer to use for filling out templates.
As I wrote earlier, it doesn't matter much in this case — I
doubt you will forget the order of arguments to Validating Input
We have two more problems with the
The first problem is easily solved by having a table of valid
databases, and raising an exception if
_databases = {'nucleotide':1,
# other databases
}
class UnknownDatabaseEx(LookupError):
pass
def search(db,term):
if db not in _databases: raise UnknownDatabaseEx
# ...
Quoting Strings
The second problem is solved, not by validating the parameter, but
by making sure that it is a valid string before using it in the
query string. For this we can use
def search(db,term):
if db not in _databases: raise UnknownDatabaseEx
term = urllib.quote_plus(term)
query = _searchURL + ('?db=%(db)s&term=%(term)s' % locals())
# perform query...
Extracting Search ResultsWith the URL constructed, we know how to get the resulting page.
Exercise NCBI.2: Extend the Can you extend your function to extract all the IDs in the result XML? The result of the query is an XML page, and we know how to deal with those. Let us, for starters, just extract the IDs from the search.
def _getText(node):
'''Extracts the text in the node (assuming it
contains only single, text, child)'''
assert node.firstChild.nodeType == node.TEXT_NODE
return node.firstChild.data
def _getTextList(doc, tag):
'''Extracts the text in the nodes with tagname "tag".'''
return map(_getText, doc.getElementsByTagName(tag))
def search(db,term):
'''Searches the database "db" for "term", returning the matching ids.'''
if db not in _databases: raise UnknownDatabaseEx
term = urllib.quote_plus(term)
query = _searchURL + ('?db=%(db)s&term=%(term)s' % locals())
doc = xml.dom.minidom.parse(urllib.urlopen(query))
return _getTextList(doc,'Id')
I have added two auxiliary functions to help me extract the text from the IDs, but otherwise there should be no surprises here. Since I am writing a module that I expect to reuse in the future, I have also added documentation strings to the function — for one-shot scripts I usually don't bother, but for anything I expect to get back to again a few weeks or months down the road, I know better than to leave documentation out. To be able to use my module both as an actual module and a script, I also add a "main part" that, when the module is run as a script, lets me search NCBI:
if __name__ == '__main__':
database = raw_input('database> ')
term = raw_input('search string> ')
for id in search(database, term):
print id
The Exercise NCBI.3: Try running this script with parameters you expect the search to fail for. Can you recognise warnings or exceptions in the resulting XML? Extend the function to handle these cases by raising exceptions. Result ObjectsThe XML document returned by a search contains more than just a list of IDs, and it would be nice to return the additional information also. One obvious way of doing this would be to return a tuple or list of values, but this solution is suboptimal: it is difficult to remember the order of such values, and in most cases, most of the returned values are not of interest to the caller anyway.
A better solution would be for
We could write a class
result = search(database, term)
print 'Search found', result.count, 'hits.'
print 'IDs from %d to %d:' % (result.start,result.stop)
for id in result.ids:
print '\t', id
However, we are not likely to ever instantiate a result object
outside of
class search(object):
'''A search in database "db" for "term".'''
_template = _searchURL + '?db=%(db)s&term=%(term)s'
def _query(self):
'''Perform a query and return the DOM object.'''
query = search._template % self.__dict__
return xml.dom.minidom.parse(urllib.urlopen(query))
def __init__(self, db, term):
'''Setting up a search in database "db" for "term".'''
if db not in _databases: raise UnknownDatabaseEx
self.db = db
self.term = urllib.quote_plus(term)
dom = self._query()
self.count = int(_getTextList(dom,'Count')[0])
self.start = int(_getTextList(dom,'RetStart')[0])
self.stop = self.start + int(_getTextList(dom,'RetMax')[0])
self.ids = _getTextList(dom,'Id')
In this class I use a class attribute,
The I liked the syntax
for id in search(database, term):
# process id...
though, and I am not happy with giving it up. No worries, I just
add the method
class search(object):
# ...other methods...
def __iter__(self):
return iter(self.ids)
The special method Here we just want to iterate over the IDs, so we return an iterator over the ID list. This is a kind of operator overloading in Python, for those of you who knows what that means. Getting All the IDsSo far, we have only processed the IDs returned with the default search, but those are only the first 20 of potentially thousands of hits. We want to be able to get all the hits.
We can add two parameters to the query string to get this:
Exercise NCBI.4: Extend the
If you use parameters to
_template = 'a:%(a)s, b:%(b)s, x:%(x)s y:%(y)s'
_defaults = {'a':'A', 'b':'B'}
def printXY(x,y):
d = _defaults.copy()
d.update(locals())
print _template % d
Using the new Actually, I would prefer the syntax result = search(database, term) print result[0,5] over result = search(database, term) print result.getIDs(0,5)
But that is easily handled using the special method
class search(object):
# ...other methods...
def __getitem__(self, idx):
start, stop = idx
return self.getIDs(start,stop)
and now I can write code such as
for id in search(database, term)[0,5]:
# ... process id ...
to extract the IDs of the first five hits. It is a bit wasteful, since we perform two queries here — one in the class constructor, the other in the sub-scripting — but let's not worry about that for now.
Warning!
Properties require the new type of classes introduced with Python 2.2, which means that to use them you have to inherit from object.
Exercise NCBI.5*: If you remove
the Use properties to catch access to the attributes that needs to be initialized, and initialize them with the first accessing of an attribute or any other query. What happens now if I write:
for id in search(database, term):
# process id...
I still only get the first 20 (or whatever number of hits I get), and all the hits, but that is what I want! Getting all the thousands of hits in one go, and saving them in a list, is probably not the best solution; is there a better? Sure there is!
Remember the iterator we returned earlier in the
An iterator is really just an object with a For our purpose here, we can let our iteration object store a buffer of hits and return hits from that; when the buffer is empty, it fetches more, until all hits have been fetched and returned, after which it terminates the iteration. My solution looks like this:
class search(object):
# ... other methods ...
class SearchIterator(object):
'''Iteration through search hits.'''
def __init__(self,searchObject,start=0,bufSize=20):
self.so = searchObject
self.latest = start
self.bufSize = bufSize
self.hits = []
def next(self):
'''Returns next hit.'''
if len(self.hits) == 0:
# fetch more hits
self.hits = self.so[self.latest, self.latest+self.bufSize]
if len(self.hits) == 0:
# no more hits, signal termination
raise StopIteration
self.hits.reverse()
self.latest += self.bufSize
return self.hits.pop()
def __iter__(self):
return search.SearchIterator(self)
I have made the iterator class local to the
I reverse the fetched hits since the Nifty, eh? Summarising Search ResultsOf course, just getting the IDs isn't particularly interesting; they are telling us nothing about the data we have found with our search. To get a summary of the hits, we can use the summary service. The URL to use is http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi And the query string will again specify the database we access.
The IDs we want summarized are provided in the query string in the
Thus, refactoring the URLs, we can easily write a function that gets the XML for a summary query:
_eUtilsURL = 'http://www.ncbi.nlm.nih.gov/entrez/eutils'
_searchURL = _eUtilsURL+'/esearch.fcgi'
_summaryURL = _eUtilsURL+'/esummary.fcgi'
def summaries(db,ids):
'''Fetches the summaries for a list of ids from database "db".'''
if db not in _databases: raise UnknownDatabaseEx
idList = ','.join(ids)
query = _summaryURL + ('?db=%(db)s&id=%(idList)s' % locals())
return urllib.urlopen(query).read()
Exercise NCBI.6: Extend the
database = 'nucleotide'
term = 'Sus scrofa [organism]'
hits = search(database, term)[8,12]
for title, caption in summaries(database, hits):
print title, '\n\t', caption, '\n'
prints
AY954688
Sus scrofa fatty acid synthase (FASN) mRNA, partial cds
AY952929
Sus scrofa fatty acid synthase (FASN) mRNA, partial cds
AY948544
Sus scrofa POU1F1 gene, exons 1, 2 and partial cds
AY248037
Sus scrofa UL16 binding protein (ULBP1) mRNA, complete cds
Hint: Using XPaths, it should be fairly simple to extract
the right items, based on the attribute
Unfortunately, the interface of Exercise NCBI.7: What would happen if I did this: summaries(database, search(database, term)) Please do not experiment here, just think about what might happen...before reading on.
Let me try to explain what happens. The
Warning!
Do not call join() using an iterator that potentially
iterates over thousands of elements.
So doing the above will actually work, but be careful! If the search term generates thousands of hits, the summary query will be quite long and this is likely to either take very long time or, more likely, result in an error. But we can try it out with a search that is guaranteed not to generate that many hits:
database = 'nucleotide'
term = 'Mikkel Schierup'
for title, caption in summaries(database, search(database, term)):
print title, '\n\t', caption, '\n'
See, it works!
But it is a bit silly to have to provide the database argument to
both the
Exercise NCBI.8: Extend the summaries(search(database, term)) How do we deal with the problem of large numbers of hits?
One possibility is to make
Fetching a single summary at a time can be done achieved by a
generator — which is a special kind of function that behaves
like an iterator, that is, calling the function automatically
produces an iterator. A generator looks like a function, but
whenever it executes the We can use a generator for our purposes like this:
def summaries(searchObject):
'''Fetches the summaries for a search.'''
def getInfo(id,db=searchObject.db):
'''Extracts caption and title of a hit.'''
# ... get the info — see exercise NCBI.6 ...
for id in searchObject:
yield getInfo(id)
Unfortunately, this is a very inefficient solution. It performs a query for each ID, and since this involves accessing a web-server somewhere in the states, this can be quite time consuming.
It would be much better to download batches of summaries, as we
did with batches of IDs in
To make it easier for ourselves to add this functionality to
Exercise NCBI.9: Refactor
Hint: You can translate the search object into an iterator
(the iterator = iter(searchObject)
and then get the next hit whenever this iterator's
The use-pattern from before —
class summaries(object):
# ...other methods ...
def __iter__(self):
return self
We inform Python on how to iterate over our new class, we can use it just like the generator, and the refactoring is completed.
The
Exercise NCBI.10: Refactor the
The refactoring we did by making Now that we have the functionality, we should consider whether the two services really should look alike. Are they actually supposed to look similar? And if they are, can they share any functionality? Exercise NCBI.11: Are there any similarities between the two classes that could reasonably be refactored and moved to a super-class and shared between the two? It also makes sense to try alternative solutions...
Exercise NCBI.12: Can you achieve
What about
Even if you can implement Fetching SequencesFinally, we wish to be able to get the actual sequences also, of course. For this, we use the fetch service ( More information here about fetching sequences). The URL for the fetch service is http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi Like the summary service, this service requires a database and a list of IDs in the query string. Try this out:
_fetchURL = _eUtilsURL+'/efetch.fcgi'
def fetch(db, id):
'''Fetch the sequence with ids in "ids" in "database".'''
if db not in _databases: raise UnknownDatabaseEx
query = _fetchURL + ('?db=%(db)s&id=%(id)s' % locals())
print urllib.urlopen(query).read()
The output is a bit nasty, isn't it?
But we can specify other output formats through parameters in the
query string. To get the output in XML, set the
def fetch(db, id):
if db not in _databases: raise UnknownDatabaseEx
'''Fetch the sequence with ids in "ids" in "database".'''
query = _fetchURL + ('?retmode=xml&db=%(db)s&id=%(id)s' % locals())
print urllib.urlopen(query).read()
You can further change the output format by setting the retrieval
type using the
Exercise NCBI.13: Change the
Is this interface good for
Exercise NCBI.14: Change the function to take any sequence
or iterator of IDs as argument, and return the resulting sequences
in an iterator. Hint: this can be done quite easily using
a generator (iterate over the inputs and
Exercise NCBI.15: Change your solution to NCBI.14 to buffer
the fetches, similar to
Exercise NCBI.16: Change the function to take a
Exercise NCBI.17: Is this a better design? What are the
pros and cons? How well does it work with the
SummaryWe have written functions wrapping searches on NCBI using eUtils. We can wrap these functions in a module, ncbi, add to it as more functionality is needed — while always keeping the module well factored — and use it later in the course. Exercise NCBI.18: Can you think of any improvements to the interfaces of the three functions? Remember that interfaces should be consistent, so keep in mind how changing one interface might influence the other. Also keep the usage patterns in mind — typically you use a search to find sequences you then fetch, or perhaps you use a filtering of some sort using the summaries. We have not covered all aspects of eUtils; in particular we did not cover how the use of "history" information can keep a state between search, summary, and fetch calls. For more information read Building Customized Data Pipelines Using the Entrez Programming Utilities (eUtils). For another view on a Python module for accessing NCBI, you can check out Andrew Dalke's Eutils package. This package shows a more mature wrapper around NCBI searches than what we have built here, but is basically just what we could have ended up at if we had continued developing our package through a number of iterations. |