On NCBI we can search various databases. We want to access this functionality from our scripts so we can automatically process the search results.
Since there were a lot of exercises at the lecture this week, we will only write a few functions in this exercise. From the examples you see in this exercise, you should be able to extend the NCBI searching functionality as needed.
We will focus on searching for nucleotide sequences, and fetching sequences found in such searches.
Go to the NCBI web-site. At the top you see a search bar with a pull-down menu for selecting databases and a text-field for the search terms. Were we to search manually, this is the search bar we would use.
Exercise NCBI.1: Search the nucleotide database for SARS. Examine the URL you will be sent to.
The URL you are sent to consists of two parts separated by a question-mark: the address of a CGI script, http://www.ncbi.nlm.nih.gov/entrez/query.fcgi, and a list of arguments for the script, db=nucleotide&cmd=search&term=SARS, separated by ampersands.
The URL is actually only for the first page of the results, but here we will only process the first page anyway, so consider it a URL for the complete search results.
The arguments for the script are key-value pairs. It is reasonable to guess that db=nucleotide means that we search the nucleotide database, cmd=search means that we are performing a search, and term=SARS means that we are searching for SARS.
Exercise NCBI.2: Search the nucleotide database for other terms, including terms with white space, e.g., "human genome". Examine the URLs you will be sent to. What happens with white-space in the search terms?
We will use what we have discovered about the URLs to perform a search.
Exercise NCBI.3: Write a function that, given a string of search terms, fetches the HTML page containing the search results. You can use the quote_plus method in the urllib for quoting the search terms.
With the function from exercise NCBI.3 we can get an HTML page for a given set of search terms. We now need to extract the results into a more manageable form.
Exercise NCBI.4: Examine the HTML code for the search results.
Have you done exercise NCBI.4? If so, you have discovered that the data we are interested in is found in <dd>...</dd> blocks, where a description of the sequence, and the id of the sequence, are separated by a <br> tag.
<dd>Homo sapiens seryl-tRNA synthetase (SARS), mRNA <br>gi|16306547|ref|NM_006513.2|[16306547] <br><br></dd>
Exercise NCBI.5: Write a parser that reads a search result page and extracts a list of (description,id) pairs.
You can combine the parser from exercise NCBI.5 with the function from exercise NCBI.3 into a function, search that from a search string gives us a list of hits:
for h in search('pig genome'):
print h
should produce output like this:
('Sus scrofa clone RP44-418D6, WORKING DRAFT SEQUENCE, 21 ordered pieces',
'gi|37651747|gb|AC145444.3|[37651747]')
('Sus scrofa clone RP44-245K7, WORKING DRAFT SEQUENCE, 15 ordered pieces',
'gi|37651746|gb|AC145779.2|[37651746]')
...
("tac93g01.y1 Hydra EST -IV Hydra magnipapillata cDNA 5' similar
to SW:RS12_PIG P46405 40S RIBOSOMAL PROTEIN S12. ;, MRNA sequence",
'gi|37571731|gb|CF674431.1|[37571731]')
("tac93e09.y1 Hydra EST -IV Hydra magnipapillata cDNA 5' similar
to SW:CATL_PIG Q28944 CATHEPSIN L PRECURSOR ;, MRNA sequence",
'gi|37571718|gb|CF674418.1|[37571718]')
Take a look at the results page you get with a search. You have a list of hits where the first line is a link, the second a description of the hit, and the third line an ID for the sequence.
Exercise NCBI.6: Click one of the links and examine the URL.
When I clicked one of the links from my SARS search--NC_004818, SARS coronavirus complete genome, gi|30271926|ref|NC_004718.3|[30271926]--I was sent to the URL http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=nucleotide&list_uids=30271926&dopt=GenBank&term=SARS&qty=1
If I delete the last part of the URL, &term=SARS&qty=1, I notice that I am sent to the same page, so it is reasonable to assume that it is the cmd=Retrieve&db=nucleotide&list_uids=30271926&dopt=GenBank arguments to the CGI script that determines the page.
The list_uids argument is set to 30271926, which is the gi number, the number found after the first | in the id: gi|30271926|ref|NC_004718.3|[30271926].
Given the gi number (which we can get from the IDs we get from our search hits) it seems we can get the search results.
Exercise NCBI.7: Write a function that, given a gi number fetches the web page containing the sequence information.
Exercise NCBI.8: Study the HTML for this page and write a parser that extracts the relevant information. (Hint: look for the <pre>...</pre> block).
You can combine the parser from exercise NCBI.8 with the function from exercise NCBI.7 into a function, fetch that, given a gi number, gives the search result string. The code:
print fetch('30271926')
should print
LOCUS NC_004718 29751 bp ss-RNA linear VRL 11-OCT-2003
DEFINITION SARS coronavirus, complete genome.
ACCESSION NC_004718
VERSION NC_004718.3 GI:30271926
KEYWORDS .
SOURCE SARS coronavirus
...
29521 atctcacata gcaatcttta atcaatgtgt aacattaggg aggacttgaa agagccacca
29581 cattttcatc gaggccacgc ggagtacgat cgagggtaca gtgaataatg ctagggagag
29641 ctgcctatat ggaagagccc taatgtgtaa aattaatttt agtagtgcta tccccatgtg
29701 attttaatag cttcttagga gaatgacaaa aaaaaaaaaa aaaaaaaaaa a
//
Examine the result page again. At the top there is a tool-bar where you can change the display. Set it to FASTA and press the "Display" button. You should now get the data in the form of a FASTA record.
Exercise NCBI.9: Examine the URL for the FASTA display page.
The URL here uses a different CGI script, http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi, and, if you play with the arguments a bit, you will find that view=fasta&db=nucleotide&list_uids=gi-number will send you directly to the page.
Exercise NCBI.10: Write a function, fetch_fasta that fetches the FASTA sequence for a gi number.
We have written functions wrapping searches on NCBI. We can wrap these functions in a module, ncbi, add to it as more functionality is needed--while always keeping the module well factored--and use it later in the course.