SNPFile is a library and API for manipulating large SNP datasets with associated metadata, such as marker names, marker locations and individuals' phenotypes, in an I/O-efficient binary file format.
At its core, SNPFile assumes very little about the metadata associated with markers and individuals, and leaves this up to application program protocols.
SNPFile is released under the GNU General Public License.
To cite SNPFile, please use:
SNPFile — A software library and file format for large scale association mapping and population genetics studies. J. Nielsen and T. Mailund (2008). BMC Bioinformatics 9(526).
SNPFile is written in C++ and is available as source code (under the GNU General Public License, GPL) and as binary packages in Linux RPM and Debian formats. The source code has been successfully compiled on various Linux and UNIX systems. As I have only limited access to different machines, I cannot make binary distributions for all platforms, but if anyone is willing to build the distributions I will be more than happy to put them on this site.
SNPFile requires the Boost Library to be installed. Boost can be obtained from http://www.boost.org.
The most recent versions of SNPFile can be downloaded below; older versions are available from here.
The rpm-files were built on Linux Fedora Core 5 or 6. The deb-files were built on Ubuntu Feisty Fawn or Gutsy Gibbon. If you have problems installing them on other RPM or Debian based systems, please let me know.
To build the source files, first uncompress and untar the file, then run 'configure' and finally 'make'. To test that the build was successful, run 'make check'. To install the program, run 'make install'.
$ tar zxf snpfile-version.tar.gz
$ cd snpfile-version
$ ./configure
$ make
$ make check
$ make install
SNPFile consists of both a library, libsnpfile.a, with an API for tool development, and a few convenience programs for manipulating SNPFile files.
The API reference manual is available here, and we hope to have a user's manual available soon.
The convenience programs consist of:
The input data for text2snpfile consists of two files: a positions file (a list of ordered, space-separated integers) and a genotypes file with one or two lines per individual (depending on whether the data is phased or unphased), where each line is a list of space-separated alleles: 0 and 1 for homozygotes and 2 for heterozygotes (with 2 only allowed for unphased data). The first column is a 'pseudo'-allele used for the case/control dichotomy: a 0 in the first column means that the individual is a control, and a 1 means that the individual is a case.
Run text2snpfile --help for details.
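The text input format described above can be checked before conversion with a small stand-alone script. The following is only a sketch in plain Python; it does not use the SNPFile library or its Python module, and the function names parse_positions and parse_genotypes are hypothetical. It assumes that "ordered" means ascending positions, that every genotype line has the case/control column followed by one entry per position, and that phased data uses two lines per individual.

```python
def parse_positions(text):
    """Parse a positions file: ordered, whitespace-separated integers."""
    positions = [int(tok) for tok in text.split()]
    # Assumption: "ordered" means ascending (equal neighbours are tolerated).
    if any(a > b for a, b in zip(positions, positions[1:])):
        raise ValueError("positions must be in ascending order")
    return positions


def parse_genotypes(text, n_markers, phased):
    """Parse genotype lines into (is_case, alleles) tuples, one per line."""
    # Code 2 (heterozygote) is only allowed for unphased data.
    allowed = {0, 1} if phased else {0, 1, 2}
    rows = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        values = [int(tok) for tok in line.split()]
        status, alleles = values[0], values[1:]
        # First column is the case/control pseudo-allele: 0 = control, 1 = case.
        if status not in (0, 1):
            raise ValueError(f"line {lineno}: first column must be 0 or 1")
        # Assumption: one allele entry per position in the positions file.
        if len(alleles) != n_markers:
            raise ValueError(f"line {lineno}: expected {n_markers} alleles")
        if any(a not in allowed for a in alleles):
            raise ValueError(f"line {lineno}: allele code not allowed")
        rows.append((status == 1, alleles))
    # Assumption: phased data has two lines (haplotypes) per individual.
    if phased and len(rows) % 2 != 0:
        raise ValueError("phased data needs two lines per individual")
    return rows


positions = parse_positions("100 250 900")
unphased = parse_genotypes("0 0 2 1\n1 1 1 2\n", len(positions), phased=False)
print(unphased)
```

Here the first individual is a control and the second a case, each with three unphased genotypes. The same lines would be rejected with phased=True, because the code 2 is not allowed for phased data.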
Run snpfile2text --help for details.
Exporting the data — or usually just selected regions of it — to Haploview can be very useful for visualising the LD structure in the data.
Run snpfile2haploview --help for details.
Exporting to fastPHASE is useful for inferring the phase for genotype data. The data can be converted back to SNPFile format using fastPHASE2snpfile.
The output does not contain individual IDs, so use option -n when running fastPHASE.
Run snpfile2fastPHASE --help for details.
The fastPHASE tool is useful for inferring unknown phase, and this tool imports the imputed phase from the output of fastPHASE.
The importer assumes that the input is in fastPHASE's "simple" output format, so to use the importer, fastPHASE must be run with the -Z option.
Run fastPHASE2snpfile --help for details.
Run snpfile_phenotypes --help for details.
Run snpfile_markers --help for details.
Run snpfile_genotype_count --help for details.
We also provide a Python interface in the form of a Python extension module. We hope to provide a user's manual for the Python module soon.
For bug-reports or feature requests, please use our bug-tracking software.
For comments or questions, please contact Thomas Mailund <mailund@birc.au.dk>, Bioinformatics Research Center (BiRC), University of Aarhus, Høegh-Guldbergsgade 10, DK-8000 Århus C.