I remember reading about the Human Genome Project years ago, but I haven’t paid attention to it for some time. Turns out the field is exploding with all sorts of interesting projects and initiatives: 23andme, Ancestry, GEDmatch and The Personal Genome Project to name a few. Some people have even made their raw data publicly available.
The exported files from 23andme use tab-separated values (TSV) with four columns. Ancestry files are similar but also include a tab between the two alleles.
rsid chromosome position genotype rs4477212 1 72017 AA rs3094315 1 742429 AA rs3131972 1 742584 GG rs12124819 1 766409 AA
I suppose this is a simple and easy to understand format that makes sending and interpreting files easy, but it’s hardly practical to work with. So, I’ve written a small command line tool that reads 23andme and Ancestry raw data files and generates an sqlite database file from their content.
Feel free to modify the program and use it for any purpose. If you find it useful it would be nice if you let me know, although it’s absolutely not a requirement. Also feel free to contact me if you need help using it or if you think it could be improved. Source and binary on GitHub.
To import 23andme files into an sqlite database using this tool:
- Place the application in a directory of your choice.
- Place the raw data files your want to import in a folder.
- Run the tool and indicate where the folder with the raw data files is located, what the name of the database should be, and weather or not the files should be tagged.
Example: java -jar sqlite-snp-import.jar --input /raw/data/files/ --output sqlite.db --tag coffee_drinker
This will generate an sqlite database file called sqlite.db that contains the information from the 23andme and Ancestry raw data file(s). Files can also be imported into an already existing database. If the tool runs on the same files, they will not be imported again, but if a new tag is indicated, this tag will be added.
Once a database is created, we can work with the data by connecting to the sqlite database using an appropriate tool. Personally I like to use the database tool in IntelliJ. We can also just use the command line tool by typing sqlite3 db.sqlite.
Working with SQLite, a few examples:
What are the total number of entries in the database?
select count(*) from snp;
Do we have any rows with a particular rsid and genotype?
select * from snp where rsid = "rs4970383" and genotype = "AA";
hash rsid chromosome position genotype ... rs4970383 1 828418 AA
Every entry in the database contains the data from the raw data file as well as a hash. The hash identifies the file the entry belongs to and is used to avoid including the same file multiple times. If we run the tool again in the same folder, we get:
Found 2 files in folder default File 2.txt already inserted into db File 1.txt already inserted into db
Include tags in queries
select rsid, tag from snp join hash_tag on snp.hash = hash_tag.hash
rs12564807 coffee_drinker ...
Create an index
To speed up more complicated queries, we probably want to create an index. If possible, it’s a good idea to generate indexes after files have been inserted, as an index will greatly slow down the insertion process. If we’re not sure what indexes to create, we can add explain query plan before a query to see how sqlite understands the query and weather or not it’s able to use any existing indexes. Then we create the index, example:
create index i_rsid on snp (rsid);
How are genomes distributed among rsid:s?
select s.rsid, s.genotype, count(*) as count, (select count(*) from snp where rsid = s.rsid) as total, round(cast(count(*) as real) / (select count(*) as total from snp where rsid = s.rsid) * 100,0) as percentage from snp s group by genotype, rsid order by rsid
rsid genotype count total percentage i3002968 CC 2 2 100 i3002982 AA 2 2 100 i3002983 CT 1 2 50 i3002983 TT 1 2 50 ...