Databases 1

In this lecture, we cover database access--how to extract data from an SQL database using MySQL and Python. We will cover creating and updating databases in the next lecture.
Supplementary reading: Python--how to program, Chapt. 17, pages 569-612.

Motivation

Databases are used for, well, storing data, obviously, but more importantly for presenting shared data to different applications in a uniform way. Databases provide a way for applications to work on data without redundancies and data-inconsistencies.

By far the most popular kind of database is the relational database, where data is represented as rows in a number of tables, and where the data is accessed through the Structured Query Language (SQL).

In this and the next lecture, we will explore how to use SQL to access data from databases and how to use this data in Python scripts.

SQL and databases is a very large topic in itself, so we will not have time to cover it all, but hopefully what we cover will be enough for most of your needs, and enough to get you started learning more.

Data Representation

The data in a database is stored physically in files--as so much else--but the application programmer need not worry about the actual data representation. One of the most important features of databases is that they present the applications with a logical or conceptual view of the data. The applications can access the data through this logical view and the database manager will map between the logical view and the physical file representation.

For relational databases, the logical view is based on relations, in the mathematical sense of relations between sets: For the sets A, B, and C, the relation AxBxC consists of all triplets with the first element from A, the second from B, and the third from C; if (a,b,c) is in the relation, we say that a, b, and c are related. Meaningful relationships between elements of A, B, and C can be represented as subsets of AxBxC.

The relations in a relational database are represented as tables, such as:

+---------------+-----------+---------+-----------+--------------+------------+------------+
| chromosome_id | chr_start | chr_end | contig_id | contig_start | contig_end | contig_ori |
+---------------+-----------+---------+-----------+--------------+------------+------------+
|             1 |   4463697 | 4463851 |         1 |            1 |        155 |          1 |
|             1 |   7791579 | 7834914 |         2 |            1 |      43336 |          1 |
+---------------+-----------+---------+-----------+--------------+------------+------------+

where each row contains a tuple of elements from the different sets, associated with each column--the rows contain the data, the columns determine how to interpret the date.

In the table above, for example, the sequence from chr_start to chr_end on the chromosome identified by chromosome_id is related to the sequence from contig_start to contig_end of contig contig_id in the contig_ori (1 for left to right, -1 for right to left) orientation. Specifically, the sequence from 4463697 to 4463851 of chromosome 1 is related to the sequence from 1 to 155 of contig 1 in the left to right orientation.

Another example, shown below, relates the chromosome identifiers (chromosome_id) with their names and the length of the chromosomes.

+---------------+------+----------+
| chromosome_id | name | length   |
+---------------+------+----------+
|             1 | I    | 15080473 |
|             2 | II   | 15279303 |
|             3 | III  | 13783268 |
|             4 | IV   | 17493790 |
|             5 | V    | 20922238 |
|             6 | X    | 17714429 |
+---------------+------+----------+

Above we have a relation relating parts of chromosomes with contigs and a relation relating information about the chromosomes. Storing this information in two relations rather than one avoids redundancy: if we replaced the two tables with a single table we would need to store chromosome information for each sequence relation.

We can extract information from the tables by asking for the rows satisfying a predicate, p, (SELECT (a,b,c) FROM AxBxC WHERE p(a,b,c)).

For example, if we want to know the name of the chromosome with id 1, we can select the rows in the table above where the chromosome_id column contains a 1 (SELECT chromosome_id,name,length FROM chromosome_table WHERE chromosome_id=1). This will give us the single row:

+---------------+------+----------+
| chromosome_id | name | length   |
+---------------+------+----------+
|             1 | I    | 15080473 |
+---------------+------+----------+

The real power of this representation comes from how we can relate the relations.

When we need to relate information in two tables, we can create the set-product of the relations in the two tables (AxBxC and DxE can be joined to AxBxC x DxE), and we can restrict it with some predicate, p, so only the meaningful relations are created ((a,b,c,d,e) from AxBxC x DxE where p(a,b,c,d)).

In the example above, if we want to know the name of the chromosome the different contigs are related to, and not just the identifier, we can join the two tables with the restriction that the chromosome_id in the two tables should match.

Using mysql

Before we start accessing databases from Python, we will learn a bit of SQL using the mysql shell. This shell lets us write and execute SQL queries interactively, and so is a better choice for experimenting with SQL than Python.

We will use the ensembl database at the Sanger Institute for our experiments. To connect to that database, start mysql with the command mysql -u anonymous -h kaka.sanger.ac.uk. This will log you in at the database (on the host kaka.sanger.ac.uk) as the "anonymous" user.

$ mysql -u anonymous -h kaka.sanger.ac.uk
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1059220 to server version: 3.23.54-log
 
Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
 
mysql> 

Once logged in at the database server, we can get a list of the available databases using the command SHOW DATABASES:

mysql> SHOW DATABASES;
+---------------------------------------+
| Database                              |
+---------------------------------------+
| anopheles_gambiae_core_15_2           |
| anopheles_gambiae_core_16_2           |
| anopheles_gambiae_core_17_2a          |
| anopheles_gambiae_est_15_2            |
| anopheles_gambiae_est_16_2            |
| anopheles_gambiae_estgene_15_2        |
| anopheles_gambiae_estgene_16_2        |
          ....
| rattus_norvegicus_lite_15_2           |
| rattus_norvegicus_lite_16_2           |
| rattus_norvegicus_lite_17_2           |
| rattus_norvegicus_snp_15_2            |
| rattus_norvegicus_snp_16_2            |
| rattus_norvegicus_snp_17_2            |
+---------------------------------------+
129 rows in set (0.12 sec)
 
mysql> 

We then have to select a database to for our queries. To access the information on our favourite nematode, Caenorhabditis elegans, we select it using the USE-command:

mysql> USE caenorhabditis_elegans_core_17_102;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
 
Database changed
mysql> 

Once the database is selected, we can use the SHOW command again to get the list of tables in the database:

mysql> SHOW TABLES;
+----------------------------------------------+
| Tables_in_caenorhabditis_elegans_core_17_102 |
+----------------------------------------------+
| analysis                                     |
| assembly                                     |
| chromosome                                   |
| clone                                        |
| contig                                       |
| dna                                          |
    ....
| transcript                                   |
| transcript_stable_id                         |
| translation                                  |
| translation_stable_id                        |
| xref                                         |
+----------------------------------------------+
53 rows in set (0.15 sec)
 
mysql> 

We can learn about the data in the tables using the DESCRIBE command:

mysql> DESCRIBE chromosome;
+---------------+------------------+------+-----+---------+----------------+
| Field         | Type             | Null | Key | Default | Extra          |
+---------------+------------------+------+-----+---------+----------------+
| chromosome_id | int(10) unsigned |      | PRI | NULL    | auto_increment |
| name          | varchar(40)      |      | UNI |         |                |
| length        | int(11)          | YES  |     | NULL    |                |
+---------------+------------------+------+-----+---------+----------------+
3 rows in set (0.06 sec)
 
mysql> 

Here, the most interesting columns are the Field and Type column. The Field column contains the names of the columns in the table, and the Type column their type.

EXERCISE DB1.1: Get the description of the tables assembly, contig, and dna.

The actual data in the tables can be accessed through the SELECT command:

mysql> SELECT * FROM chromosome;
+---------------+------+----------+
| chromosome_id | name | length   |
+---------------+------+----------+
|             1 | I    | 15080473 |
|             2 | II   | 15279303 |
|             3 | III  | 13783268 |
|             4 | IV   | 17493790 |
|             5 | V    | 20922238 |
|             6 | X    | 17714429 |
+---------------+------+----------+
6 rows in set (0.07 sec)
 
mysql> 

The command above extracts all the rows and all the columns from the table chromosome.

Some of the tables in the database are rather large, so while we are just exploring the database it makes sense to only look at the first few rows in the tables we examine. To do that, we can add a modifier to the SELECT command, LIMIT n, that specifies that we only want to look at the first n rows:

mysql> SELECT * FROM contig LIMIT 3;
+-----------+-------------------+----------+--------+-------------+--------+
| contig_id | name              | clone_id | length | embl_offset | dna_id |
+-----------+-------------------+----------+--------+-------------+--------+
|         1 | AC006711.1.1.1055 |        1 |   1055 |        NULL |      1 |
|         2 | Z79752.2.1.43440  |        2 |  43440 |        NULL |      2 |
|         3 | AL023833.1.1.699  |        3 |    699 |        NULL |      3 |
+-----------+-------------------+----------+--------+-------------+--------+
3 rows in set (0.07 sec)
 
mysql> 

The * in the SELECT commands above tells the database that we are interested in all the columns in the database. Alternatively we can specify a subset of the columns, by listing them explicitly:

mysql> SELECT contig_id,name,dna_id FROM contig LIMIT 3;
+-----------+-------------------+--------+
| contig_id | name              | dna_id |
+-----------+-------------------+--------+
|         1 | AC006711.1.1.1055 |      1 |
|         2 | Z79752.2.1.43440  |      2 |
|         3 | AL023833.1.1.699  |      3 |
+-----------+-------------------+--------+
3 rows in set (0.04 sec)
 
mysql> 

To get a sub-set of the rows, we can provide a predicate that the rows must satisfy. Only rows satisfying the predicate will be selected, the remaining won't.

For instance, to get only contigs more than 1000 bases long, we can use the command:

mysql> SELECT * FROM contig WHERE length > 1000 LIMIT 3;
+-----------+-------------------+----------+--------+-------------+--------+
| contig_id | name              | clone_id | length | embl_offset | dna_id |
+-----------+-------------------+----------+--------+-------------+--------+
|         1 | AC006711.1.1.1055 |        1 |   1055 |        NULL |      1 |
|         2 | Z79752.2.1.43440  |        2 |  43440 |        NULL |      2 |
|         4 | Z82072.1.1.9614   |        4 |   9614 |        NULL |      4 |
+-----------+-------------------+----------+--------+-------------+--------+
3 rows in set (0.10 sec)
 
mysql> 

The table and columns are specified as before, but the predicate after WHERE restricts the set of rows.

EXERCISE DB1.2: Select the assemblies related to the chromosome with id 1. Select the assemblies on chromosome 1 in certain ranges, e.g. from index 12345 to 12543.

EXERCISE DB1.3: Select the chromosome_id, chr_start, chr_end, contig_id, contig_start, contig_end, and contig_ori from the assembly table.

We can relate the different tables by joining them. In the simplest form, we just list the tables we wish to join after the FROM in the select command:

mysql> SELECT * FROM chromosome,assembly LIMIT 3;
+---------------+------+----------+---------------+-----------+---------+---------------+----------------+--------------+--------------+-----------+--------------+------------+------------+--------+
| chromosome_id | name | length   | chromosome_id | chr_start | chr_end | superctg_name | superctg_start | superctg_end | superctg_ori | contig_id | contig_start | contig_end | contig_ori | type   |
+---------------+------+----------+---------------+-----------+---------+---------------+----------------+--------------+--------------+-----------+--------------+------------+------------+--------+
|             1 | I    | 15080473 |             1 |   4463697 | 4463851 | I             |        4463697 |      4463851 |            1 |         1 |  1 |        155 |          1 | CEL102 |
|             2 | II   | 15279303 |             1 |   4463697 | 4463851 | I             |        4463697 |      4463851 |            1 |         1 |  1 |        155 |          1 | CEL102 |
|             3 | III  | 13783268 |             1 |   4463697 | 4463851 | I             |        4463697 |      4463851 |            1 |         1 |  1 |        155 |          1 | CEL102 |
+---------------+------+----------+---------------+-----------+---------+---------------+----------------+--------------+--------------+-----------+--------------+------------+------------+--------+
3 rows in set (0.08 sec)
                                                                                                                                                               
mysql> 

With this SELECT command, we can have several columns with the same name. Above, for example, chromosome_id appears twice. To identify the different columns, we can prefix them with their table name, as this:

mysql> SELECT chromosome.chromosome_id, assembly.chromosome_id FROM chromosome,assembly LIMIT 3;
+---------------+---------------+
| chromosome_id | chromosome_id |
+---------------+---------------+
|             1 |             1 |
|             2 |             1 |
|             3 |             1 |
+---------------+---------------+
3 rows in set (0.15 sec)
                                                                                                                                                               
mysql>

This kind of join creates the complete product of the tables, which is usually not what we want... Usually, we want to restrict the join such that one id in one table matches an id in another table. Above, for example, we probably want to restrict the join such that the chromosome id in the chromosome table matches the chromosome id in the assembly table.

We can restrict the rows using a WHERE predicate as earlier:

mysql> SELECT chromosome.name, assembly.*
    -> FROM chromosome, assembly
    -> WHERE chromosome.chromosome_id=assembly.chromosome_id
    -> LIMIT 3;
+------+---------------+-----------+---------+---------------+----------------+--------------+--------------+-----------+--------------+------------+------------+--------+
| name | chromosome_id | chr_start | chr_end | superctg_name | superctg_start | superctg_end | superctg_ori | contig_id | contig_start | contig_end | contig_ori | type   |
+------+---------------+-----------+---------+---------------+----------------+--------------+--------------+-----------+--------------+------------+------------+--------+
| I    |             1 |         1 |      60 | I             |              1 |           60 |            1 |       383 |            1 |         60 |          1 | CEL102 |
| I    |             1 |        61 |   47489 | I             |             61 |        47489 |            1 |        78 |            1 |      47429 |          1 | CEL102 |
| I    |             1 |     47490 |  107680 | I             |          47490 |       107680 |            1 |       379 |            1 |      60191 |          1 | CEL102 |
+------+---------------+-----------+---------+---------------+----------------+--------------+--------------+-----------+--------------+------------+------------+--------+
3 rows in set (0.06 sec)
                                                                                                                                                               
mysql>

In the example above, we select the chromosome name from the chromosome table and pair it with all the columns in the assembly table (selected using the syntax assembly.*). The WHERE chromosome.chromosome_id=assembly.chromosome_id ensures that the name we select matches the chromosome referred to in the assembly table.

An alternative syntax for the join above is the JOIN command explicitly:

SELECT chromosome.name, assembly.*
FROM chromosome INNER JOIN assembly USING (chromosome_id)
LIMIT 3; 

This explicitly states that we are interested in an INNER JOIN (don't worry about the "INNER") between chromosome and assembly, where the chromosome_id columns must match.

As yet another alternative, we can use the NATURAL JOIN; this is equivalent to using INNER JOIN, but where the USING columns are implicitly given to be all the columns in common between the two tables. That is, if table t1 has columns: a,b,c and table t2 has columns c,d,e, then

SELECT * FROM t1 NATURAL JOIN t2; 

is equivalent to

SELECT * FROM t1 INNER JOIN t2 USING (c); 

If table t1 has columns: a,b,c and table t2 has columns b,c,d, then

SELECT * FROM t1 NATURAL JOIN t2; 

is equivalent to

SELECT * FROM t1 INNER JOIN t2 USING (b,c); 

but different from

SELECT * FROM t1 INNER JOIN t2 USING (b); 

Joining tables can also be used when we want to extract rows from more than two tables. You can, for instance, join the three tables assembly, contig, and dna using a SELECT * FROM assembly,contig,dna or by explicitly joining them:

mysql> SELECT assembly.chromosome_id, assembly.chr_start, assembly.chr_end, dna.dna_id
    -> FROM assembly NATURAL JOIN contig NATURAL JOIN dna
    -> LIMIT 3;
+---------------+-----------+---------+--------+
| chromosome_id | chr_start | chr_end | dna_id |
+---------------+-----------+---------+--------+
|             1 |   4463697 | 4463851 |      1 |
|             1 |   7791579 | 7834914 |      2 |
|             1 |   9627623 | 9628221 |      3 |
+---------------+-----------+---------+--------+
3 rows in set (0.07 sec)
 
mysql> 

EXERCISE DB1.4: Use joins to extract the dna sequences corresponding to the pieces of the assembly. Notice that the assembly pieces are related to contigs and the contigs to dna sequences.

Interacting with MySQL from Python

Warning!
The MySQLdb Python module is not globally installed, so read these instructions on how to use it.

Through the MySQLdb module, we can access a MySQL database from Python. In effect, this means that we can send SQL queries to the database and get data return.

As when using the mysql shell, we first need to connect to the database server and select the database. This is done through the MySQLdb.connect() method:

import MySQLdb
connection = MySQLdb.connect(host='kaka.sanger.ac.uk',
                             user='anonymous',
                             db='caenorhabditis_elegans_core_17_102')

This creates a connection object through which we can interact with the database.

To actually send queries to the database, we need a cursor object:

cursor = connection.cursor()

The cursors are used for executing the queries and for extracting data; you can have several cursors per connection, that can be used to extract information from several queries at a time.

To query the database, we use the execute() method on the cursor object:

cursor.execute('SELECT chromosome_id,name,length FROM chromosome')

and to extract the result we use on of the methods fetchall(), fetchone(), or fetchmany().

The fetchall() method return the rows resulting from the query as a tuples of tuples; one tuple per row, so

import MySQLdb
connection = MySQLdb.connect(host='kaka.sanger.ac.uk',
                             user='anonymous',
                             db='caenorhabditis_elegans_core_17_102')
cursor = connection.cursor()

cursor.execute('SELECT chromosome_id,name,length FROM chromosome')
print cursor.fetchall() 

will print

((1L, 'I', 15080473L), (2L, 'II', 15279303L), (3L, 'III', 13783268L),
 (4L, 'IV', 17493790L), (5L, 'V', 20922238L), (6L, 'X', 17714429L))

The fetchone() method returns a single row--the next row in the list of rows resulting from the query--or None when all rows have been returned:

row = cursor.fetchone()
while row:
     print row
     row = cursor.fetchone()

The fetchmany() method does something in between: it takes as argument the number of rows it should try to fetch, and return a tuple of at most that many rows (as many as there are left, but no more than the specified number).

rows = cursor.fetchmany(4)
while rows:
    for row in rows:
        print row
    rows = cursor.fetchmany(4) 

Why the different ways of extracting data? The fetchall() method extracts all the rows, which can be a very large set, and might therefore be very slow or take up too much memory. For guaranteed small results, it is very appropriate. The fetchone() will never return too large objects, but by extracting a single row at a time there is a larger communication overhead. The fetchmany() method is a trade-off between the two.

EXERCISE DB1.5: Redo exercises DB1.2 and DB1.3 through the Python interface.

EXERCISE DB1.6: Write a function that, given a chromosome name, extracts (some of) the assembly entries for that chromosome. Notice that the function should take the name, not the id! You need to consider both the chromosome and the assembly table.

EXERCISE DB1.7: Write a function that, given a contig name, extracts the corresponding dna sequence. You need to consider both the contig and the dna table.

Summary

We have seen how to write simple SQL queries and how to execute these from Python. This covers most of the uses of databases we will need.

We can now move on to this weeks exercises.

Valid XHTML 1.0! Valid CSS! Time-stamp: "2003-11-28 15:02:58 mailund"