Tool-Building in Bioinformatics

TBiB Q4/2006

BiRC / Courses / TBiB / Lecture Notes / Accessing Databases

Accessing Databases

In this lecture, we cover database access--how to extract data from an SQL database using MySQL and Python. We will cover creating and updating databases in the next lecture.

Supplementary reading: Python—how to program, Chapt. 17, pages 569-612.

Motivation

Databases are used for, well, storing data, obviously, but more importantly for presenting shared data to different applications in a uniform way. Databases provide a way for applications to work on data without redundancies and data-inconsistencies.

By far the most popular kind of database is the relational database, where data is represented as rows in a number of tables, and where the data is accessed through the Structured Query Language (SQL).

In this and the next lecture, we will explore how to use SQL to access data from databases and how to use this data in Python scripts.

SQL and databases is a very large topic in itself, so we will not have time to cover it all, but hopefully what we cover will be enough for most of your needs, and enough to get you started learning more.

Data Representation

The data in a database is stored physically in files — as so much else — but the application programmer need not worry about the actual data representation. One of the most important features of databases is that they present the applications with a logical or conceptual view of the data. The applications can access the data through this logical view and the database manager will map between the logical view and the physical file representation.

For relational databases, the logical view is based on relations, in the mathematical sense of relations between sets: For the sets A, B, and C, the relation AxBxC consists of all triplets with the first element from A, the second from B, and the third from C; if (a,b,c) is in the relation, we say that a, b, and c are related. Meaningful relationships between elements of A, B, and C can be represented as subsets of AxBxC.

The relations in a relational database are represented as tables, such as:

+--------+-----------+---------+-----------+--------------+------------+------------+
| chr_id | chr_start | chr_end | contig_id | contig_start | contig_end | contig_ori |
+--------+-----------+---------+-----------+--------------+------------+------------+
|      1 |   4463697 | 4463851 |         1 |            1 |        155 |          1 |
|      1 |   7791579 | 7834914 |         2 |            1 |      43336 |          1 |
+--------+-----------+---------+-----------+--------------+------------+------------+

where each row contains a tuple of elements from the different sets, associated with each column — the rows contain the data, the columns determine how to interpret the date.

In the table above, for example, the sequence from chr_start to chr_end on the chromosome identified by chr_id is related to the sequence from contig_start to contig_end of contig contig_id in the contig_ori (1 for left to right, -1 for right to left) orientation. Specifically, the sequence from 4463697 to 4463851 of chromosome 1 is related to the sequence from 1 to 155 of contig 1 in the left to right orientation.

Another example, shown below, relates the chromosome identifiers (chr_id) with their names and the length of the chromosomes.

+--------+------+----------+
| chr_id | name | length   |
+--------+------+----------+
|      1 | I    | 15080473 |
|      2 | II   | 15279303 |
|      3 | III  | 13783268 |
|      4 | IV   | 17493790 |
|      5 | V    | 20922238 |
|      6 | X    | 17714429 |
+--------+------+----------+

Above we have a relation relating parts of chromosomes with contigs and a relation relating information about the chromosomes. Storing this information in two relations rather than one avoids redundancy: if we replaced the two tables with a single table we would need to store chromosome information for each sequence relation.

We can extract information from the tables by asking for the rows satisfying a predicate, p, (SELECT (a,b,c) FROM AxBxC WHERE p(a,b,c)).

For example, if we want to know the name of the chromosome with id 1, we can select the rows in the table above where the chr_id column contains a 1 (SELECT chr_id,name,length FROM chromosome_table WHERE chr_id=1). This will give us the single row:

+--------+------+----------+
| chr_id | name | length   |
+--------+------+----------+
|      1 | I    | 15080473 |
+--------+------+----------+

The real power of this representation comes from how we can relate the relations.

When we need to relate information in two tables, we can create the set-product of the relations in the two tables (AxBxC and DxE can be joined to AxBxC x DxE), and we can restrict it with some predicate, p, so only the meaningful relations are created ((a,b,c,d,e) from AxBxC x DxE where p(a,b,c,d)).

In the example above, if we want to know the name of the chromosome the different contigs are related to, and not just the identifier, we can join the two tables with the restriction that the chr_id in the two tables should match.

Using mysql

Before we start accessing databases from Python, we will learn a bit of SQL using the mysql shell. This shell lets us write and execute SQL queries interactively, and so is a better choice for experimenting with SQL than Python.

We will use the world famous retrosearch database of human endogenous retroviruses (HERVs) for our experiments. To connect to that database, start mysql with the command mysql --user=retrosearch --host=serine.daimi.au.dk --password=retrosearch. This will log you in at the database (on the host serine.daimi.au.dk) as the "retrosearch" user.

$ mysql --host=serine.daimi.au.dk --user=retrosearch --password=retrosearch
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 957 to server version: 4.0.17

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> 

Once logged in at the database server, we can get a list of the available databases using the command SHOW DATABASES:

mysql> SHOW DATABASES;
+-------------------------+
| Database                |
+-------------------------+
| affymetrix              |
| biodyk                  |
| est_mapping             |
| herv_and_genes          |
| hervproject_1834        |
| homo_sapiens_core_18_34 |
| mysql                   |
| phpbb                   |
| phpmyadmin              |
| promotordb              |
+-------------------------+
10 rows in set (0.00 sec)

mysql> 

We then have to select a database to for our queries. The one we are after here is the hervproject, so we select that, using the USE-command:

mysql> USE hervproject_1834;
Database changed
mysql> 

Once the database is selected, we can use the SHOW command again to get the list of tables in the database:

mysql> SHOW TABLES;
+----------------------------+
| Tables_in_hervproject_1834 |
+----------------------------+
| affy_probesets             |
| blast_hits                 |
| est_matches                |
| herv_blastx                |
| herv_cluster               |
| herv_cluster_flag          |
| herv_dna                   |
| herv_orf                   |
| herv_web_blast             |
| herv_web_email             |
| herv_web_names             |
| herv_web_orfcounts         |
| herv_web_visits            |
| queries                    |
| queries_runs               |
| rmsk                       |
| rmsk_blastx                |
| rmsk_fam                   |
| rv_proteins                |
+----------------------------+
19 rows in set (0.00 sec)
 
mysql> 

We can learn about the data in the tables using the DESCRIBE command:

mysql> DESCRIBE herv_cluster;
+------------------------+------------------+------+-----+---------+----------------+
| Field                  | Type             | Null | Key | Default | Extra          |
+------------------------+------------------+------+-----+---------+----------------+
| herv_cluster_id        | int(15) unsigned |      | PRI | NULL    | auto_increment |
| chr_name               | varchar(10)      | YES  | MUL | NULL    |                |
| chr_start              | int(10) unsigned |      |     | 0       |                |
| chr_end                | int(10) unsigned |      |     | 0       |                |
| chr_strand             | tinyint(2)       |      |     | 0       |                |
| score                  | float unsigned   | YES  |     | NULL    |                |
| herv_score             | float            | YES  |     | NULL    |                |
| hits                   | int(10) unsigned | YES  |     | NULL    |                |
| best_percent_id        | float            | YES  |     | NULL    |                |
| best_percent_id_run_id | int(15)          | YES  | MUL | NULL    |                |
| best_evalue            | float            |      |     | 0       |                |
| best_evalue_run_id     | int(15)          |      | MUL | 0       |                |
| ltr_found              | varchar(10)      | YES  | MUL | NULL    |                |
| ultr_start             | int(10) unsigned |      |     | 0       |                |
| ultr_end               | int(10) unsigned |      |     | 0       |                |
| dltr_start             | int(10) unsigned |      |     | 0       |                |
| dltr_end               | int(10) unsigned |      |     | 0       |                |
| ltr_distance           | int(10) unsigned | YES  |     | NULL    |                |
| ltr_perc_ident         | float            | YES  |     | NULL    |                |
| ltr_evalue             | double           | YES  |     | NULL    |                |
| flag                   | varchar(50)      | YES  | MUL | NULL    |                |
| temp_flag              | varchar(15)      |      | MUL |         |                |
| blastxpos              | varchar(5)       |      | MUL |         |                |
+------------------------+------------------+------+-----+---------+----------------+
23 rows in set (0.00 sec)

mysql> 

Here, the most interesting columns are the Field and Type column. The Field column contains the names of the columns in the table, and the Type column their type.

For this table, which is a table that positions HERVs on the genome, the first field is the herv_cluster_id, a unique identifier for the elements in the table (you cannot see this from the type, which is an unsigned integer with 15 decimals, but from the Key column ... we will return to keys next lecture, for now just trust me on this). The chr_name field is the name of the chromosome a given HERV is located on, a string of max length 10 (varchar(10) in the Type column), and the chr_start, chr_end, and chr_strand positions the HERV on the chromosome; all three fields are integers, but not over the same range — the start and end needs to be able to address an entire chromosome, while the strand field need only enough values to be able to specify if the strand is positive or negative.

The remaining fields describe the confidence (score) with which the HERV is positioned at the given locus (using BLAST), and locations for various sub-structures of the virus.

EXERCISE DB1.1: Get the description of the tables herv_dna and herv_orf.

The actual data in the tables can be accessed through the SELECT command:

mysql> SELECT herv_cluster_id, chr_name FROM herv_cluster;
+-----------------+----------+
| herv_cluster_id | chr_name |
+-----------------+----------+
|               1 | DR51     |
|               2 | DR51     |
|               3 | DR51     |
|               4 | DR51     |
|               5 | DR51     |
|               6 | DR51     |
|               7 | X        |
|               8 | X        |
|               9 | X        |
|              10 | X        |

  ..........................
 
mysql> 

The command above extracts the columns herv_cluster_id and chr_name from all the rows from the table herv_cluster. In general, between SELECT and FROM you specify the list of columns to extract. A short cut exists for extracting all columns, *, but in most cases you are not interested in all the columns but only a few.

Some of the tables in the database are rather large, so while we are just exploring the database it makes sense to only look at the first few rows in the tables we examine. To do that, we can add a modifier to the SELECT command, LIMIT n, that specifies that we only want to look at the first n rows:

mysql> SELECT herv_cluster_id, chr_name FROM herv_cluster LIMIT 10;
+-----------------+----------+
| herv_cluster_id | chr_name |
+-----------------+----------+
|               1 | DR51     |
|               2 | DR51     |
|               3 | DR51     |
|               4 | DR51     |
|               5 | DR51     |
|               6 | DR51     |
|               7 | X        |
|               8 | X        |
|               9 | X        |
|              10 | X        |
+-----------------+----------+
10 rows in set (0.00 sec)
 
mysql> 

To get a sub-set of the rows, we can provide a predicate that the rows must satisfy. Only rows satisfying the predicate will be selected, the remaining won't.

For instance, to get only HERVs on chromosome 21, we can use the command:

mysql> SELECT herv_cluster_id, chr_start, chr_end, chr_strand 
    -> FROM herv_cluster 
    -> WHERE chr_name = '21'
    -> LIMIT 10;
+-----------------+-----------+----------+------------+
| herv_cluster_id | chr_start | chr_end  | chr_strand |
+-----------------+-----------+----------+------------+
|           44886 |  10139266 | 10140106 |          1 |
|           44887 |  10162966 | 10165343 |          1 |
|           44888 |  10208699 | 10209911 |          1 |
|           44889 |  13346009 | 13354175 |          1 |
|           44890 |  13402024 | 13402996 |          1 |
|           44891 |  13762557 | 13763910 |          1 |
|           44892 |  13800122 | 13801836 |          1 |
|           44893 |  14371013 | 14371587 |          1 |
|           44894 |  14382164 | 14383058 |          1 |
|           44895 |  14409725 | 14418500 |          1 |
+-----------------+-----------+----------+------------+
10 rows in set (0.00 sec)

 
mysql> 

The table and columns are specified as before, but the predicate after WHERE restricts the set of rows.

In the example, you will notice, we extract 10 (using LIMIT) rows of columns (using the SELECT column list) herv_cluster_id, chr_start, chr_end, and chr_strand, restricted to chromosome 21 (the WHERE predicate).

EXERCISE DB1.2: Select (some) HERVs on the positive strand from chromosome 11 (to SELECT on more than one predicate, use AND).

EXERCISE DB1.3: Select HERVs more than 1000bp from the start.

EXERCISE DB1.4: Select HERV DNA sequences (from herv_dna) that are shorter than 3000bp; use the length(seq) function to get the length of a sequence. start.

We can relate the different tables by joining them. In the simplest form, we just list the tables we wish to join after the FROM in the select command:

mysql>  SELECT * FROM herv_cluster, herv_dna limit 10;
  .............................................

mysql> 

With this SELECT command, we can have several columns with the same name. Above, for example, herv_cluster_id appears twice. To identify the different columns, we can prefix them with their table name, as this:

mysql> SELECT herv_cluster.herv_cluster_id,herv_dna.herv_cluster_id
    -> FROM herv_cluster, herv_dna 
    -> LIMIT 10;
+-----------------+-----------------+
| herv_cluster_id | herv_cluster_id |
+-----------------+-----------------+
|               1 |               1 |
|               2 |               1 |
|               3 |               1 |
|               4 |               1 |
|               5 |               1 |
|               6 |               1 |
|               7 |               1 |
|               8 |               1 |
|               9 |               1 |
|              10 |               1 |
+-----------------+-----------------+
10 rows in set (0.03 sec)

mysql>

This kind of join creates the complete product of the tables, which is usually not what we want... Usually, we want to restrict the join such that one id in one table matches an id in another table. Above, for example, we probably want to restrict the join such that the HERV id in the two tables are the same..

We can restrict the rows using a WHERE predicate as earlier:

mysql> SELECT herv_cluster.herv_cluster_id, length(seq)
    -> FROM herv_cluster, herv_dna 
    -> WHERE herv_cluster.herv_cluster_id = herv_dna.herv_cluster_id
    -> LIMIT 10;
+-----------------+-------------+
| herv_cluster_id | length(seq) |
+-----------------+-------------+
|               1 |        3073 |
|               2 |        6924 |
|               3 |        3333 |
|               4 |        6165 |
|               5 |        4938 |
|               6 |        8924 |
|               7 |        2352 |
|               8 |        3077 |
|               9 |        2825 |
|              10 |        3623 |
+-----------------+-------------+
10 rows in set (0.00 sec)

mysql>

The WHERE herv_cluster.herv_cluster_id = herv_dna.herv_cluster_id ensures that we select the DNA sequences matching the HERV in the cluster table.

An alternative syntax for the join above is to use the JOIN command explicitly:

SELECT herv_cluster.herv_cluster_id,herv_dna.herv_cluster_id
FROM herv_cluster INNER JOIN herv_dna USING (herv_cluster_id)
LIMIT 10; 

This explicitly states that we are interested in an INNER JOIN (don't worry about the "INNER") between herv_cluster and herv_dna where the herv_cluster_id columns must match.

As yet another alternative, we can use the NATURAL JOIN; this is equivalent to using INNER JOIN, but where the USING columns are implicitly given to be all the columns in common between the two tables. That is, if table t1 has columns: a,b,c and table t2 has columns c,d,e, then

SELECT * FROM t1 NATURAL JOIN t2; 

is equivalent to

SELECT * FROM t1 INNER JOIN t2 USING (c); 

If table t1 has columns: a,b,c and table t2 has columns b,c,d, then

SELECT * FROM t1 NATURAL JOIN t2; 

is equivalent to

SELECT * FROM t1 INNER JOIN t2 USING (b,c); 

but different from

SELECT * FROM t1 INNER JOIN t2 USING (b); 

Joining tables can also be used when we want to extract rows from more than two tables. You can, for instance, join the three tables herv_cluster, herv_dna, and herv_orf to get the open reading frames (ORFs) start and stop index, relative to the HERVs start index, for the HERVs on chromosome 12, where the DNA sequence is shorter than 3000bp:

mysql> SELECT herv_cluster.herv_cluster_id, orf_start+chr_start, orf_end+chr_start
    -> FROM herv_cluster NATURAL JOIN herv_dna NATURAL JOIN herv_orf
    -> WHERE chr_name = '12' AND length(seq) < 3000
    -> LIMIT 10;
+-----------------+---------------------+-------------------+
| herv_cluster_id | orf_start+chr_start | orf_end+chr_start |
+-----------------+---------------------+-------------------+
|           35162 |             2551606 |           2552611 |
|           35162 |             2551685 |           2552540 |
|           35162 |             2551719 |           2552520 |
|           35162 |             2551299 |           2551506 |
|           35162 |             2552549 |           2552756 |
|           35162 |             2551142 |           2551343 |
|           35162 |             2553005 |           2553200 |
|           35168 |             4152575 |           4152965 |
|           35168 |             4152510 |           4152762 |
|           35168 |             4150759 |           4150987 |
+-----------------+---------------------+-------------------+
10 rows in set (0.00 sec)
 
mysql> 

Interacting with MySQL from Python

Warning!
The MySQLdb Python module is only installed on the Fedora Linux machines, so log onto a horse for these exercises.

Through the MySQLdb module, we can access a MySQL database from Python. In effect, this means that we can send SQL queries to the database and get data return.

As when using the mysql shell, we first need to connect to the database server and select the database. This is done through the MySQLdb.connect() method:

import MySQLdb
connection = MySQLdb.connect(host='serine.daimi.au.dk',
                             user='retrosearch',
                             passwd='retrosearch',
                             db='hervproject_1834')

This creates a connection object through which we can interact with the database.

To actually send queries to the database, we need a cursor object:

cursor = connection.cursor()

The cursors are used for executing the queries and for extracting data; you can have several cursors per connection, that can be used to extract information from several queries at a time.

To query the database, we use the execute() method on the cursor object:

cursor.execute('SELECT herv_cluster_id FROM herv_cluster)

and to extract the result we use on of the methods fetchall(), fetchone(), or fetchmany().

The fetchall() method return the rows resulting from the query as a tuples of tuples; one tuple per row, so

import MySQLdb
connection = MySQLdb.connect(host='serine.daimi.au.dk',
                             user='retrosearch',
                             passwd='retrosearch',
                             db='hervproject_1834')
cursor = connection.cursor()

cursor.execute('''SELECT herv_cluster.herv_cluster_id, length(seq) 
                  FROM herv_cluster NATURAL JOIN herv_dna
                  LIMIT 5''')
print cursor.fetchall() 

will print

((1L, 3073L), (2L, 6924L), (3L, 3333L), (4L, 6165L), (5L, 4938L))

The fetchone() method returns a single row — the next row in the list of rows resulting from the query — or None when all rows have been returned:

row = cursor.fetchone()
while row:
     print row
     row = cursor.fetchone()

The fetchmany() method does something in between: it takes as argument the number of rows it should try to fetch, and return a tuple of at most that many rows (as many as there are left, but no more than the specified number).

rows = cursor.fetchmany(4)
while rows:
    for row in rows:
        print row
    rows = cursor.fetchmany(4) 

Why the different ways of extracting data? The fetchall() method extracts all the rows, which can be a very large set, and might therefore be very slow or take up too much memory. For guaranteed small results, it is very appropriate. The fetchone() will never return too large objects, but by extracting a single row at a time there is a larger communication overhead. The fetchmany() method is a trade-off between the two.

EXERCISE DB1.5: Redo exercises DB1.2 and DB1.4 through the Python interface.

Summary

We have seen how to write simple SQL queries and how to execute these from Python. This covers most of the uses of databases we will need.

There are no exercises particularly for this lecture, instead you will get rich opportunity to experiment with SQL in the exercises for the next lecture.