Tool-Building in Bioinformatics

TBiB Q4/2006

BiRC / Courses / TBiB / Lecture Notes / Searching NCBI

Exercise: Searching NCBI

In this exercise, we wrap searching on the National Center for Biotechnology Information (NCBI) web site. We will write a module for searching for nucleotide sequences from a python script.

Supplementary reading: Building Customized Data Pipelines Using the Entrez Programming Utilities (eUtils).

Motivation

On NCBI we can search various databases. We want to access this functionality from our scripts so we can automatically process the search results.

Since there were a lot of exercises at the lecture this week, we will only write a few functions in this exercise. From the examples you see in this exercise, you should be able to extend the NCBI searching functionality as needed.

We will focus on searching for nucleotide sequences, and fetching sequences found in such searches.

Searching

Go to the NCBI web-site. At the top you see a search bar with a pull-down menu for selecting databases and a text-field for the search terms. Were we to search manually, this is the search bar we would use.

Exercise NCBI.1: Search the nucleotide database for e.g. Sus scrofa — consult Entrez Nucleotide Help for the search syntax for more advanced searches.

Try clicking on some of the hits and see the resulting pages.

Have a look at the various output-formats supported (the drop-down menu next to the "Display" button).

We are going to write a Python module for accessing the NCBI databases, and script that will let us perform the kind of searches in exercise NCBI.1 from Python.

One approach is to extract the information we need from the HTML pages — and I am sure that you, by inspecting the HTML pages you just accessed will be able to extract the relevant information, as this was an exercise in Programming in Bioinformatic — but for this particular web-site there is a better approach: using the Entrez Programming Utilities (eUtils).

The eUtils are a set of web-services, where queries are encoded in a URL and the query results returned through HTTP, the protocol used for the world wide web.

For searching, the URL to use is

http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi

Query Strings

We provide arguments to the service through a query string appended to this URL. A query string is started by the character ? and continues until the character # or until the end of the URL. The string must consist of a list of key=value pairs, separated by &.

Building a query string from a list of key-value pairs

'&'.join([key+'='+val for key,val in l])

or a dictionary

'&'.join([key+'='+val for key,val in d.items()])

is trivial, but best handled using the method urlencode() from the urllib module, which also "quotes" the keys and values (see below).

For our searching, however, we will not bother with building general query strings, but only send pre-determined parameters, so don't worry about urlencode() for now.

The parameters for the search are db which specifies the database to search in, and term which specifies what we search for.

We can build the query string from a template string

'db=%s&term=%s'

and then build the search URL as

_searchURL  = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def search(db,term):
    query = _searchURL + ('?db=%s&term=%s' % (db,term))
    # perform query... 

I have put the URL outside the search() function since I want it to be accessible to all functions in the module I am about to write, but prefixed it with an underscore which is Python-speak for saying that it is to be considered private to the module — in languages with a bit more of bondage design, this would actually mean that it would be inaccessible outside of the module, but in Python it can easily be accessed and modified, but by prefixing with _ the module programmer signals that doing so should be avoided, and if you do, you deserve what you get.

Anyway, back to the code...

I am not quite satisfied with the way I translate my template string into the query string. I use the % operator to insert the db and term strings, but use the "sequence" version of %, where the strings are inserted in the order they appear in the sequence after %

In this particular case it doesn't matter that much, but when you have long template strings, with many variables to fill in, depending the right order of arguments to the string-formatting is risky, so I have gotten used to using the "dictionary" version of %

This would look like:

_searchURL  = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def search(db,term):
    query = _searchURL+('?db=%(db)s&term=%(term)s'%{'db':db,'term':term})
    # perform query... 

The template now specifies the keys of the values to be inserted and the arguments are given as the values of a dictionary.

Building the dictionary for the % operation is a bit verbose, but now notice that we map keys to variables with the same name. That is, we map 'db' to db and 'term' to term. This mapping is the same as the function's namespace! In the function's namespace, db refers to the value of db and term to the value of term.

The function locals() returns the local namespace as a dictionary, so we can rewrite the code above to

_searchURL  = 'http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def search(db,term):
    query = _searchURL + ('?db=%(db)s&term=%(term)s' % locals())
    # perform query...   

which is the idiom I prefer to use for filling out templates.

As I wrote earlier, it doesn't matter much in this case — I doubt you will forget the order of arguments to % for this particular example — but rather than using one technique for small templates and another for larger templates, you might as well get used to using one technique that actually scales.

Validating Input

We have two more problems with the search() function, as written: what if db is an unknown database? and what if term contains spaces or special characters not allowed in URLs?

The first problem is easily solved by having a table of valid databases, and raising an exception if db is not in it.

_databases = {'nucleotide':1,
              # other databases
              }

class UnknownDatabaseEx(LookupError):
    pass

def search(db,term):
    if db not in _databases: raise UnknownDatabaseEx
    # ...  

Quoting Strings

The second problem is solved, not by validating the parameter, but by making sure that it is a valid string before using it in the query string. For this we can use urllib.quote_plus():

def search(db,term):
    if db not in _databases: raise UnknownDatabaseEx
    term = urllib.quote_plus(term)
    query = _searchURL + ('?db=%(db)s&term=%(term)s' % locals())
    # perform query...   

Extracting Search Results

With the URL constructed, we know how to get the resulting page.

Exercise NCBI.2: Extend the search() function to download the web-page from the URL. Examine the page to figure out how it is structured.

Can you extend your function to extract all the IDs in the result XML?

The result of the query is an XML page, and we know how to deal with those. Let us, for starters, just extract the IDs from the search.

def _getText(node):
    '''Extracts the text in the node (assuming it
    contains only single, text, child)'''
    assert node.firstChild.nodeType == node.TEXT_NODE
    return node.firstChild.data

def _getTextList(doc, tag):
    '''Extracts the text in the nodes with tagname "tag".'''
    return map(_getText, doc.getElementsByTagName(tag))

def search(db,term):
    '''Searches the database "db" for "term", returning the matching ids.'''
    if db not in _databases: raise UnknownDatabaseEx
    term = urllib.quote_plus(term)
    query = _searchURL + ('?db=%(db)s&term=%(term)s' % locals())
    doc = xml.dom.minidom.parse(urllib.urlopen(query))
    return _getTextList(doc,'Id') 

I have added two auxiliary functions to help me extract the text from the IDs, but otherwise there should be no surprises here.

Since I am writing a module that I expect to reuse in the future, I have also added documentation strings to the function — for one-shot scripts I usually don't bother, but for anything I expect to get back to again a few weeks or months down the road, I know better than to leave documentation out.

To be able to use my module both as an actual module and a script, I also add a "main part" that, when the module is run as a script, lets me search NCBI:

if __name__ == '__main__':
    database = raw_input('database> ')
    term = raw_input('search string> ')
    for id in search(database, term):
        print id 

The if __name__ == '__main__': construction ensures that this code is not run when the module is imported but only when the script is run as a stand-alone script.

Exercise NCBI.3: Try running this script with parameters you expect the search to fail for. Can you recognise warnings or exceptions in the resulting XML?

Extend the function to handle these cases by raising exceptions.

Result Objects

The XML document returned by a search contains more than just a list of IDs, and it would be nice to return the additional information also. One obvious way of doing this would be to return a tuple or list of values, but this solution is suboptimal: it is difficult to remember the order of such values, and in most cases, most of the returned values are not of interest to the caller anyway.

A better solution would be for search() to return an object who's attributes contained the search results.

We could write a class SearchResults and then instantiate it in search(). If SearchResults objects contained, say, an attribute count containing the number of search hits, an attribute ids containing a list of ids, and attributes start and stop describing the start and stop index of the ids in the list of all hits, we could then use search() like this:

result = search(database, term)
print 'Search found', result.count, 'hits.'
print 'IDs from %d to %d:' % (result.start,result.stop)
for id in result.ids:
    print '\t', id 

However, we are not likely to ever instantiate a result object outside of search(), or ever calling search() without returning a result object (except for exceptional cases where we raise an exception), so we can just as well combine the class and function in just a class, and use it exactly as above.

class search(object):
    '''A search in database "db" for "term".'''

    _template = _searchURL + '?db=%(db)s&term=%(term)s'

    def _query(self):
        '''Perform a query and return the DOM object.'''
        query = search._template % self.__dict__
        return xml.dom.minidom.parse(urllib.urlopen(query))
        
    def __init__(self, db, term):
        '''Setting up a search in database "db" for "term".'''
        if db not in _databases: raise UnknownDatabaseEx
        self.db = db
        self.term = urllib.quote_plus(term)

        dom = self._query()
        self.count = int(_getTextList(dom,'Count')[0])
        self.start = int(_getTextList(dom,'RetStart')[0])
        self.stop  = self.start + int(_getTextList(dom,'RetMax')[0])
        self.ids = _getTextList(dom,'Id') 

In this class I use a class attribute, _template, to contain the query string template, and a auxiliary method, _query() to perform the actual query.

The _query() does not use locals() to get the dictionary for the % operation. This would not work since the only local variable in the method is self. Instead it uses the objects dictionary, self.__dict__, which contains the attributes of the object, and by setting the db and term attributes in the constructor, we make sure that using self.__dict__ has the desired effect.

I liked the syntax

for id in search(database, term):
    # process id... 

though, and I am not happy with giving it up. No worries, I just add the method __iter__:

class search(object):
    # ...other methods...

    def __iter__(self):
        return iter(self.ids) 

The special method __iter__ is called when an instance is used in a for loop (or, in general, used as an iterator) and is expected to return an object that can be used to iterate over.

Here we just want to iterate over the IDs, so we return an iterator over the ID list.

This is a kind of operator overloading in Python, for those of you who knows what that means.

Getting All the IDs

So far, we have only processed the IDs returned with the default search, but those are only the first 20 of potentially thousands of hits. We want to be able to get all the hits.

We can add two parameters to the query string to get this: retstart — that tells the search service where to start the list it returns, and retmax — that tells the service how many IDs we want.

Exercise NCBI.4: Extend the _query() method to take a start and stop index of IDs to return.

If you use parameters to _query() — which I suggest as a better solution than storing start and stop in the object's attributes — this piece of code might be useful for inspiration:

_template = 'a:%(a)s, b:%(b)s, x:%(x)s y:%(y)s'
_defaults = {'a':'A', 'b':'B'}
def printXY(x,y):
    d = _defaults.copy()
    d.update(locals())
    print _template % d 

Using the new _query(), add a method, getIDs(start,stop) that extracts the IDs from start to stop.

Actually, I would prefer the syntax

result = search(database, term)
print result[0,5] 

over

result = search(database, term)
print result.getIDs(0,5) 

But that is easily handled using the special method __getitem__:

class search(object):
    # ...other methods...

    def __getitem__(self, idx):
        start, stop = idx
        return self.getIDs(start,stop) 

and now I can write code such as

for id in search(database, term)[0,5]:
    # ... process id ...

to extract the IDs of the first five hits.

It is a bit wasteful, since we perform two queries here — one in the class constructor, the other in the sub-scripting — but let's not worry about that for now.

Warning!
Properties require the new type of classes introduced with Python 2.2, which means that to use them you have to inherit from object.

Exercise NCBI.5*: If you remove the _query() from the constructor, you avoid the overhead above, but you will not have initialized the attributes such as count.

Use properties to catch access to the attributes that needs to be initialized, and initialize them with the first accessing of an attribute or any other query.

What happens now if I write:

for id in search(database, term):
    # process id... 

I still only get the first 20 (or whatever number of hits I get), and all the hits, but that is what I want!

Getting all the thousands of hits in one go, and saving them in a list, is probably not the best solution; is there a better?

Sure there is!

Remember the iterator we returned earlier in the __iter__ method? Before we returned the default iterator for lists, but we can write our own iteration, with the functionality we want, and return that.

An iterator is really just an object with a next() method, that returns the next object in the iteration, or signals that the iteration has completed by raising the exception StopIteration.

For our purpose here, we can let our iteration object store a buffer of hits and return hits from that; when the buffer is empty, it fetches more, until all hits have been fetched and returned, after which it terminates the iteration.

My solution looks like this:

class search(object):
    # ... other methods ...

    class SearchIterator(object):
        '''Iteration through search hits.'''
        def __init__(self,searchObject,start=0,bufSize=20):
            self.so      = searchObject
            self.latest  = start
            self.bufSize = bufSize
            self.hits    = []

        def next(self):
            '''Returns next hit.'''
            if len(self.hits) == 0:
                # fetch more hits
                self.hits = self.so[self.latest, self.latest+self.bufSize]

                if len(self.hits) == 0:
                    # no more hits, signal termination
                    raise StopIteration
                
                self.hits.reverse()
                self.latest += self.bufSize

            return self.hits.pop()

    def __iter__(self):
        return search.SearchIterator(self)

I have made the iterator class local to the search object — since it is tightly bound to this class — and thus I need to qualify it with search. when I instantiate it in __iter__.

I reverse the fetched hits since the pop() method returns items from the end of a list — this is more efficient in Python than taking them from the front — and I want to iterate over the hits in the same order as I get them from NCBI.

Nifty, eh?

Summarising Search Results

Of course, just getting the IDs isn't particularly interesting; they are telling us nothing about the data we have found with our search.

To get a summary of the hits, we can use the summary service. The URL to use is

http://www.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi

And the query string will again specify the database we access.

The IDs we want summarized are provided in the query string in the id parameter. The IDs are comma-separated, without spaces.

Thus, refactoring the URLs, we can easily write a function that gets the XML for a summary query:

_eUtilsURL  = 'http://www.ncbi.nlm.nih.gov/entrez/eutils'
_searchURL  = _eUtilsURL+'/esearch.fcgi'
_summaryURL = _eUtilsURL+'/esummary.fcgi'

def summaries(db,ids):
    '''Fetches the summaries for a list of ids from database "db".'''
    if db not in _databases: raise UnknownDatabaseEx
    idList = ','.join(ids)
    query = _summaryURL + ('?db=%(db)s&id=%(idList)s' % locals())
    return urllib.urlopen(query).read() 

Exercise NCBI.6: Extend the summaries script to extract the caption and title, such that, e.g.:

database = 'nucleotide'
term =     'Sus scrofa [organism]'
hits = search(database, term)[8,12]
for title, caption in summaries(database, hits):
    print title, '\n\t', caption, '\n' 

prints

AY954688
        Sus scrofa fatty acid synthase (FASN) mRNA, partial cds

AY952929
        Sus scrofa fatty acid synthase (FASN) mRNA, partial cds

AY948544
        Sus scrofa POU1F1 gene, exons 1, 2 and partial cds

AY248037
        Sus scrofa UL16 binding protein (ULBP1) mRNA, complete cds

Hint: Using XPaths, it should be fairly simple to extract the right items, based on the attribute Name.

Unfortunately, the interface of search() and summaries() do not quite fit together — calling search() returns an object from which we can get search results, but summaries() needs to be called with a database name and a list (actually an iterator or sequence) of IDs.

Exercise NCBI.7: What would happen if I did this:

summaries(database, search(database, term))

Please do not experiment here, just think about what might happen...before reading on.

Let me try to explain what happens. The summaries() method collects the hit IDs using string.join(), which iterates through all its sequence elements to create a string.

Warning!
Do not call join() using an iterator that potentially iterates over thousands of elements.

So doing the above will actually work, but be careful! If the search term generates thousands of hits, the summary query will be quite long and this is likely to either take very long time or, more likely, result in an error.

But we can try it out with a search that is guaranteed not to generate that many hits:

database = 'nucleotide'
term =     'Mikkel Schierup'
for title, caption in summaries(database, search(database, term)):
    print title, '\n\t', caption, '\n' 

See, it works!

But it is a bit silly to have to provide the database argument to both the search() and to the summaries() call, so:

Exercise NCBI.8: Extend the summaries() function to get the database from the search object and refactor the function to allow this call:

summaries(search(database, term))

How do we deal with the problem of large numbers of hits?

One possibility is to make summaries() into an iterator also, and let it fetch the summaries in batches.

Fetching a single summary at a time can be done achieved by a generator — which is a special kind of function that behaves like an iterator, that is, calling the function automatically produces an iterator. A generator looks like a function, but whenever it executes the yield statement, it stops and returns a value as the next item in the iteration. When the iteration continues, the function continues from just after the last yield.

We can use a generator for our purposes like this:

def summaries(searchObject):
    '''Fetches the summaries for a search.'''
    def getInfo(id,db=searchObject.db):
        '''Extracts caption and title of a hit.'''
        # ... get the info — see exercise NCBI.6 ...
    for id in searchObject:
        yield getInfo(id)

Unfortunately, this is a very inefficient solution. It performs a query for each ID, and since this involves accessing a web-server somewhere in the states, this can be quite time consuming.

It would be much better to download batches of summaries, as we did with batches of IDs in search().

To make it easier for ourselves to add this functionality to summaries() we can try to make it look more like search() — if we can easily see the similarities, we can easier see how to change one to achieve the same functionality as the other.

Exercise NCBI.9: Refactor summaries() into a class with a next() method, that is, an iterator. Make the next() method download the summary for the next hit in the search object.

Hint: You can translate the search object into an iterator (the SearchIterator class above) by calling the builtin iter() method on the object:

iterator = iter(searchObject)

and then get the next hit whenever this iterator's next() method is called, so you do not need to explicitly use an iterator as in a for statement.

The use-pattern from before — summaries(search(db,term)) — will not quite work yet, but if we add

class summaries(object):
    # ...other methods ...
    def __iter__(self):
        return self 

We inform Python on how to iterate over our new class, we can use it just like the generator, and the refactoring is completed.

The summaries class now looks much like the search class, so it should be fairly easy to refactor it to download summaries in batches.

Exercise NCBI.10: Refactor the next() to download summaries in batches, similar to how we did for search().

The refactoring we did by making summaries into a class was mainly to make it look like search to make it easier to add the similar functionality to summaries.

Now that we have the functionality, we should consider whether the two services really should look alike. Are they actually supposed to look similar? And if they are, can they share any functionality?

Exercise NCBI.11: Are there any similarities between the two classes that could reasonably be refactored and moved to a super-class and shared between the two?

It also makes sense to try alternative solutions...

Exercise NCBI.12: Can you achieve summaries()'s functionality using just a generator? That is, can you, in a function with yield, both do the downloading in batches and returning results of individual hit summaries?

What about search()?

Even if you can implement summaries() as a generator, it might not be the best solution, so compare the class and the generator solution and keep the one you like the most.

Fetching Sequences

Finally, we wish to be able to get the actual sequences also, of course.

For this, we use the fetch service ( More information here about fetching sequences).

The URL for the fetch service is

http://www.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi

Like the summary service, this service requires a database and a list of IDs in the query string.

Try this out:

_fetchURL   = _eUtilsURL+'/efetch.fcgi'

def fetch(db, id):
    '''Fetch the sequence with ids in "ids" in "database".'''
    if db not in _databases: raise UnknownDatabaseEx
    query = _fetchURL + ('?db=%(db)s&id=%(id)s' % locals())
    print urllib.urlopen(query).read() 

The output is a bit nasty, isn't it?

But we can specify other output formats through parameters in the query string. To get the output in XML, set the retmode parameter to xml

def fetch(db, id):
    if db not in _databases: raise UnknownDatabaseEx
    '''Fetch the sequence with ids in "ids" in "database".'''
    query = _fetchURL + ('?retmode=xml&db=%(db)s&id=%(id)s' % locals())
    print urllib.urlopen(query).read() 

You can further change the output format by setting the retrieval type using the rettype parameter. For instance, to get a smaller output in a format call TinyXML, you ask for the retrieval type fasta — why you have to ask for FASTA to get TinyXML is beyond me, but there you go...

Exercise NCBI.13: Change the fetch() function to return the sequence.

Is this interface good for fetch()?

Exercise NCBI.14: Change the function to take any sequence or iterator of IDs as argument, and return the resulting sequences in an iterator. Hint: this can be done quite easily using a generator (iterate over the inputs and yield the fetched sequences).

Exercise NCBI.15: Change your solution to NCBI.14 to buffer the fetches, similar to search() and summaries().

Exercise NCBI.16: Change the function to take a search object as argument and get the IDs from there

Exercise NCBI.17: Is this a better design? What are the pros and cons? How well does it work with the search() and summaries() functions/classes? Is it equally important that it works well with search() and summaries()?

Summary

We have written functions wrapping searches on NCBI using eUtils. We can wrap these functions in a module, ncbi, add to it as more functionality is needed — while always keeping the module well factored — and use it later in the course.

Exercise NCBI.18: Can you think of any improvements to the interfaces of the three functions? Remember that interfaces should be consistent, so keep in mind how changing one interface might influence the other. Also keep the usage patterns in mind — typically you use a search to find sequences you then fetch, or perhaps you use a filtering of some sort using the summaries.

We have not covered all aspects of eUtils; in particular we did not cover how the use of "history" information can keep a state between search, summary, and fetch calls. For more information read Building Customized Data Pipelines Using the Entrez Programming Utilities (eUtils).

For another view on a Python module for accessing NCBI, you can check out Andrew Dalke's Eutils package. This package shows a more mature wrapper around NCBI searches than what we have built here, but is basically just what we could have ended up at if we had continued developing our package through a number of iterations.