Tool-Building in Bioinformatics

TBiB Q4/2006

BiRC / Courses / TBiB / Lecture Notes / Python Object Persistence

Python Object Persistence — pickle and shelve

In this lecture we cover how to save python objects to disk using the pickle and shelve modules.

Motivation

Object persistence means that you can save the objects in your program between uses of the program. A simple way of saving the relevant data from a program would be to just make our own file format for how to store our data in a flat file. However in python there are two modules called pickle and shelve that makes this task much easier. In this lecture we take a look at these two modules.

Pickle

The pickle module can turn most Python data types into a stream of bytes and then recreate the objects from the bytes. This means that instead of writing our own functions for serializing objects we can do it easier. The pickle module provides the following function pair: dumps(object) returns a string containing an object in pickle format and loads(string) returns the object contained in the pickle string. The following example makes a string representation of a dictionary object and then creates a new identical object from the string:

>>> import pickle
>>>
>>> t1 = {'svar' : 42, 'liste' : [1,2,3]}
>>> s1 = pickle.dumps(t1)
>>>
>>> t2 = pickle.loads(s1)
>>> print t2
{'liste': [1, 2, 3], 'svar': 42}

The pickle module also contains a pair of functions for writing objects directly to a file stream. The dump(object, file) writes an object to a file and load(file) returns an object from a pickle file. The pickle module can not only be used to serialize built-in types such as tuples and lists but it also works for most user defined types.

Exercise PERSISTENCE.1: Use the dump and load functions to make a program that writes an instance of the following sequence class to a file and then recreates it.

class sequence(object): 
    def __init__(self, identifier, sequence, species, is_dna=True): 
	self.identifier = identifier 
	self.sequence = sequence
	self.species = species
	self.is_dna = is_dna 
	self.is_protein = not is_dna
example = sequence('AB123', 'cctcatcggcggtgtggacatggtgac', 'Sus Scrofa', True)

Exercise PERSISTENCE.2: Change the above program to two separate python programs; one that dumps a sequence to a file and another program that loads the sequence from the file. To make this work both programs needs to know the sequence class, you can achieve this by putting the sequence class definition in a separate module that can be imported from both programs.

By default dump() create pickles using a printable ASCII representation but the function have a final, optional argument that, if True, specifies that pickles will be created using a faster and smaller binary representation. The loads() and load() functions automatically detect whether a pickle is in the binary or text format. In most cases we can use a faster version of pickle called cPickle. Apart from some extension possibilities cPickle contains the same functionality as the pickle module. If you want to use the cPickle module instead of the to pickle module without changing your code you can import it in the following way:

import cPickle as pickle

Shelve

If we want to store more than one data structure in a file we can use the shelve module. A shelve object works like a dictionary, you add, access and remove items by using a string key. A value in a shelve object can be of any data type as long it can be handled by the pickle module. To gain access to a shelve file, import the module and open your file:
import shelve
shelf = shelve.open('filename')
If no file called 'filename' exists it is created. We can access the shelve object as a normal dictionary:
shelf['key'] = object        # create or change the entry for 'key'

object = shelf['key']        # load the entry for 'key'

count = len(shelf)           # return the number of entries

list = shelf.keys()          # get the list of stored keys

found = shelf.has_key('key') # check if there's an entry for 'key'

del shelf['key']             # remove the entry for 'key'

Using a shelve is not the same as just storing a large dictionary in a pickled file and then loading it, because in that case the entire dictionary would be loaded all at once whereas shelves just load the entry that we are interested in.

Exercise PERSISTENCE.3: Create a student class that can store name, address and grade for a student. Then create a persistent dictionary (shelve database) for storing students that uses student id numbers as keys. Put three student objects into the shelve object.

Exercise PERSISTENCE.4: Make a program that access the shelve file created in the previous exercise and prints the names of all students sorted based on their grades.

Updating shelved objects.

Because objects fetched from a persistent dictionary normally does not know that they came from a shelve, changes to a object that came from a shelve only affect the in-memory copy, not the stored entry in the shelve. The affect of the append function in the following example is for instance not saved :

>>> import shelve
>>> shelf = shelve.open('shelve_test')
>>>
>>> shelf['key'] = [1,2,3]
>>> print shelf['key']
[1, 2, 3]
>>>
>>> shelf['key'].append(4)
>>> print shelf['key'] 
[1, 2, 3]

There are two ways of make updating of stored values work. The first possibility is that we first fetch the stored object into memory, then change the object, and finally store it in the shelve again as a whole:

>>> import shelve
>>> shelf = shelve.open('shelve_test')
>>>
>>> shelf['key'] = [1,2,3]
>>> print shelf['key']
[1, 2, 3]
>>>
>>> list = shelf['key']
>>> list.append(4)
>>> shelf['key'] = list
>>> print shelf['key']
[1, 2, 3, 4]
The second option is to use the writeback option when creating the shelve object. If the optional writeback parameter is set to True, all entries accessed are cached in memory, and written back at close time; this can make it handier to mutate mutable entries in the persistent dictionary, but, if many entries are accessed, it can consume vast amounts of memory for the cache, and it can make the close operation very slow since all accessed entries are written back (there is no way to determine which accessed entries are mutable, nor which ones were actually mutated). The cached changes are not written to the file before either the sync or close method is called.
>>> import shelve
>>>
>>> shelf = shelve.open('shelve_test', writeback=True)
>>>
>>> shelf['key'] = [1,2,3]
>>> print shelf['key']
[1, 2, 3]
>>>
>>> shelf['key'].append(4)
>>> shelf.sync()
>>> print shelf['key']
[1, 2, 3, 4]

Exercise PERSISTENCE.5: Make a program that adds one to the grade of all students in the persistent dictionary that have an uneven student id number.

Exercise PERSISTENCE.6: Make a persistent dictionary that hold instances of the sequence class from Exercise 1, indexed by their identifiers. And make a function that given a list of sequence identifiers check that the sequences are in the shelve dictionary. If an identifier do not have an entry the sequence should be automatically downloaded from NCBI and added to the persistent dictionary.