I remember reading about the Human Genome Project years ago, but I haven’t paid attention to it for some time. Turns out the field is exploding with all sorts of interesting projects and initiatives: 23andme, Ancestry, GEDmatch and The Personal Genome Project to name a few. Some people have even made their raw data publicly available.

The exported files from 23andme use tab-separated values (TSV) with four columns. Ancestry files are similar but also include a tab between the two alleles.

rsid		chromosome	position	genotype

rs4477212	1 		72017 		AA
rs3094315	1 		742429 		AA
rs3131972	1 		742584 		GG
rs12124819	1 		766409 		AA

I suppose this is a simple and easy to understand format that makes sending and interpreting files easy, but it’s hardly practical to work with. So, I’ve written a small command line tool that reads 23andme and Ancestry raw data files and generates an sqlite database file from their content.

Feel free to modify the program and use it for any purpose. If you find it useful it would be nice if you let me know, although it’s absolutely not a requirement. Also feel free to contact me if you need help using it or if you think it could be improved. Source and binary on GitHub.

Importing files

To import 23andme files into an sqlite database using this tool:

    1. Place the application in a directory of your choice.
    2. Place the raw data files your want to import in a folder.
    3. Run the tool and indicate where the folder with the raw data files is located, what the name of the database should be, and weather or not the files should be tagged.
      Example:
      java -jar sqlite-snp-import.jar --input /raw/data/files/ --output sqlite.db --tag coffee_drinker

This will generate an sqlite database file called sqlite.db that contains the information from the 23andme and Ancestry raw data file(s). Files can also be imported into an already existing database. If the tool runs on the same files, they will not be imported again, but if a new tag is indicated, this tag will be added.

Once a database is created, we can work with the data by connecting to the sqlite database using an appropriate tool. Personally I like to use the database tool in IntelliJ. We can also just use the command line tool by typing sqlite3 db.sqlite.

Working with SQLite, a few examples:

What are the total number of entries in the database?

select count(*) from snp;

Returns: 1492944.

Do we have any rows with a particular rsid and genotype?

select * from snp where rsid = "rs4970383" and genotype = "AA";

Returns:

hash	rsid		chromosome	position	genotype
...	rs4970383	1		828418		AA

Every entry in the database contains the data from the raw data file as well as a hash. The hash identifies the file the entry belongs to and is used to avoid including the same file multiple times. If we run the tool again in the same folder, we get:

Found 2 files in folder default
	File 2.txt already inserted into db
	File 1.txt already inserted into db

Include tags in queries

select rsid, tag
from snp
       join hash_tag on snp.hash = hash_tag.hash

Returns:

rs12564807	coffee_drinker
...

Create an index

To speed up more complicated queries, we probably want to create an index. If possible, it’s a good idea to generate indexes after files have been inserted, as an index will greatly slow down the insertion process. If we’re not sure what indexes to create, we can add explain query plan  before a query to see how sqlite understands the query and weather or not it’s able to use any existing indexes. Then we create the index, example:

create index i_rsid on snp (rsid);

How are genomes distributed among rsid:s?

select s.rsid,
       s.genotype,
       count(*) as count,
       (select count(*) 
       		from snp 
       		where rsid = s.rsid) as total,
       round(cast(count(*) as real) / (select count(*) as total 
       		from snp 
       		where rsid = s.rsid) * 100,0) as percentage
from snp s
group by genotype, rsid
order by rsid

Returns:

rsid		genotype	count	total	percentage

i3002968	CC		2	2	100
i3002982	AA		2	2	100
i3002983	CT		1	2	50
i3002983	TT		1	2	50
...
Categories: java