Analysis of 243 Genomes – My First Report [Nov. 2016]


About two weeks ago I learned about OpenSNP, a website where people can share their genetic information and more. It is similar to 1000genomes, but I think it is much more interesting to work with because, aside from genetic information (SNP sequencing, exome, etc.), most users also share phenotype data, and the data is not anonymized. This is what sparked my interest.

With phenotype data and users’ genetic mutations (SNPs, or other relevant genetic information), I could run analyses and find possible correlations. This is applied big data.

In this post, I’ll explain how I conducted my first analysis. I want to provide an outline with enough relevant detail to serve as a reference point and make future analyses easier. Of course, I could simply do this in private, but I’d rather post it on the blog so that others who are interested in running similar analyses have a starting point.

This involves: knowledge of genomics, genomics related software and raw data formats, programming, and a lot of patience.

RS1051730 and Nicotine Dependence

The OpenSNP database currently includes ~4,000 users who share their genetic data (different file formats). Many of them also report phenotypic data such as eye color, lactose tolerance status, ability to tan, asthma, taste of broccoli, T2D, dyslexia, penis length (!!!), and more.

One of the most frequently reported phenotypes is nicotine dependence. More than 400 users have provided their status, so I decided to look into this. Here’s an outline of what I did:

  1. I went to the page listing users who share data about this phenotype. I copied the table to a csv file, which includes each username (hyperlinked to the user’s profile page) and their nicotine dependence status. I wrote a python script and then hand-cleaned the results to get a list of links to user profiles.

Note: Not all users who share phenotype data have uploaded a file with their genetic data. There were approximately 300 users who remained in this intermediate list.

I coded two python scripts to run through the table, go to each user’s profile page, and see if they have a 23andme SNP (single nucleotide polymorphism) report. The second script would output a file that contains download links for users who share their genetic data:

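import os
html_dir = 'e:\\phen\\fileshtml'
list23 = []
for file in os.listdir(html_dir):
    # os.listdir() returns bare names, so join them with the folder
    with open(os.path.join(html_dir, file), 'r') as f:
        for line in f:
            if 'Download this set (23andme)' in line:
                # the body of this branch was lost from the post;
                # keeping the whole matching line is an assumption
                list23.append(line.strip())
with open('23andme.txt', 'wt') as q:
    for i in list23:
        # also assumed: one entry per output line
        q.write(i + '\n')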
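
The step that turns a matching line into a download link can be sketched as follows. This is only an illustration: it assumes the marker text sits on the same line as its `<a href="...">` tag, and the names `extract_download_links`, `HREF_RE`, and the sample HTML are made up for the example, not taken from the OpenSNP pages.

```python
import re

# Assumption: the marker text is on the same line as its <a href="..."> tag.
HREF_RE = re.compile(r'href="([^"]+)"')
MARKER = 'Download this set (23andme)'


def extract_download_links(html_lines):
    """Return the href value from every line that carries the 23andme marker."""
    links = []
    for line in html_lines:
        if MARKER in line:
            match = HREF_RE.search(line)
            if match:
                links.append(match.group(1))
    return links


# Hypothetical profile-page fragment, made up for illustration.
sample = [
    '<li><a href="/data/1.23andme.9">Download this set (23andme)</a></li>',
    '<li><a href="/data/1.ftdna.3">Download this set (ftdna)</a></li>',
]
print(extract_download_links(sample))
```

If the links on the page are relative, as in this made-up sample, they would still need the site address prepended before downloading.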

  2. Instead of opening each individual download link and saving it to my computer, I wrote another short script that would do it like there’s no tomorrow…

I ended up with about 5 GB worth of 23andme raw genetic data an hour later.

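import os
import wget
with open('23andme.txt', 'r') as f:
    for url in f:
        # the loop body was lost from the post; wget.download() is the
        # likely call, assuming each line holds one bare download URL
        wget.download(url.strip())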

Actually, I started the script and went to bed. When I returned in the morning I saw the data and the time it took to run the script.
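
An overnight job like this can get interrupted, so it helps if a rerun skips what is already on disk. Here is a minimal sketch of that idea using only the standard library instead of `wget`. The function name `download_all`, the `fetch` parameter, and the rule of naming each file after the last part of its URL are my own assumptions for illustration.

```python
import os
import urllib.request


def download_all(urls, dest_dir, fetch=urllib.request.urlretrieve):
    """Download each URL into dest_dir, skipping files that already exist."""
    os.makedirs(dest_dir, exist_ok=True)
    downloaded = []
    for url in urls:
        url = url.strip()
        if not url:
            continue  # ignore blank lines in the link file
        # Assumption: the last path segment of the URL is a usable filename.
        target = os.path.join(dest_dir, url.rsplit('/', 1)[-1])
        if os.path.exists(target):
            continue  # already fetched on an earlier run
        fetch(url, target)
        downloaded.append(target)
    return downloaded
```

With the link file from the previous step, this could be called as `download_all(open('23andme.txt'), 'rawdata')`, where `rawdata` is just a placeholder folder name.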

  3. And now for the fun part…

I decided to look at the rs1051730 genotype for each user (~300 users at this time). A 23andme raw data report is a text file approximately 900k (yes, nine hundred thousand) lines long. Time to unleash ‘the’ python again…

import os
import csv
dictGNTP = {}
for file in os.listdir():
    with open(file, 'r') as file_op:
        for line in file_op:
            if 'rs1051730\t15' in line:
                # raw-data columns: rsid, chromosome, position, genotype
                dictGNTP[file] = line.strip().split('\t')[-1]
with open('23andmegntp.csv', 'w', newline='') as writefile1:
    writer1 = csv.writer(writefile1)
    for filename, genotype in dictGNTP.items():
        writer1.writerow([filename, genotype])

This script would go into each genetic file and look for rs1051730 on chromosome 15. It would finish with a csv report in which each row would display:

filename(of raw genetic 23andme data) and the rs1051730 genotype

This would help me keep track of users. This csv data would then be included in a final spreadsheet.
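
The lookup described above can also be done by splitting each line into its columns rather than searching for a substring, which avoids matching a longer rsid that merely starts with the same digits. The sketch below assumes the usual 23andme raw-data layout (comment lines starting with `#`, then tab-separated rsid, chromosome, position, genotype). The function name `find_genotype` and the sample lines, including the positions, are made up for illustration.

```python
def find_genotype(lines, rsid):
    """Return the genotype for rsid, or None if the SNP is not in the file."""
    for line in lines:
        if line.startswith('#'):
            continue  # header and comment lines
        fields = line.rstrip('\n').split('\t')
        # Expect exactly four columns: rsid, chromosome, position, genotype.
        if len(fields) == 4 and fields[0] == rsid:
            return fields[3]
    return None


# Made-up sample lines; the positions are not real coordinates.
sample = [
    '# rsid\tchromosome\tposition\tgenotype\n',
    'rs0000001\t1\t12345\tCT\n',
    'rs1051730\t15\t67890\tAG\n',
]
print(find_genotype(sample, 'rs1051730'))
```

Because `lines` can be any iterable of strings, an open file object can be passed in directly, so the 900k-line report is read one line at a time.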

  4. The final spreadsheet contains the following columns:

#username, #phenotype (nicotine dependence status), #rs1051730 genotype, #hyperlink (to user profile)

I had a lot of cleaning to do on this file: repetitive and unpleasant work. I hope to find an algorithm to put this step on autopilot too. It would make everything faster and it would minimize human error.
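
One way this merge might be automated is to join the two tables on a shared user ID. The sketch below is hypothetical: it assumes each raw-data filename starts with `user<ID>_` and that the phenotype table carries the same numeric ID, neither of which is confirmed in this post. The name `merge_tables` and the sample records are invented for the example.

```python
import re

# Assumption: each raw-data filename starts with "user<ID>_", and the
# phenotype table is keyed by the same numeric ID. Both are hypothetical.
USER_ID_RE = re.compile(r'^user(\d+)_')


def merge_tables(phenotypes, genotypes):
    """Join {user_id: (username, status)} with {filename: genotype}."""
    rows = []
    for filename, genotype in sorted(genotypes.items()):
        match = USER_ID_RE.match(filename)
        if match is None:
            continue  # filename does not follow the assumed pattern
        user_id = match.group(1)
        if user_id in phenotypes:
            username, status = phenotypes[user_id]
            rows.append([username, status, genotype])
    return rows


# Made-up records for illustration only.
phenotypes = {'12': ('alice', 'ex-smoker'), '34': ('bob', 'never smoked')}
genotypes = {'user12_file7.23andme.txt': 'AG', 'user99_file1.23andme.txt': 'GG'}
print(merge_tables(phenotypes, genotypes))
```

Users with a phenotype but no genetic file, or the reverse, simply drop out of the result, which is the same filtering that had to be done by hand.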

The number of users (sample size) that made it into the analysis is: n=243.


Finally, I created a plot that shows each genotype (AA, AG, GG) with a count for each phenotypic variation.
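
The counts behind such a plot can be tallied with a few lines of Python. This is a minimal sketch, not the exact code I used: `count_by_genotype` and the sample rows are made up, and the phenotype labels are placeholders rather than the real OpenSNP categories.

```python
from collections import Counter


def count_by_genotype(rows):
    """Count (genotype, phenotype) pairs, treating 'GA' and 'AG' as the same."""
    counts = Counter()
    for genotype, phenotype in rows:
        # Sort the two letters so the allele order in the file does not matter.
        counts[(''.join(sorted(genotype)), phenotype)] += 1
    return counts


# Made-up rows for illustration only.
rows = [('AG', 'smoker'), ('GA', 'smoker'), ('GG', 'never smoked')]
for key, n in sorted(count_by_genotype(rows).items()):
    print(key, n)
```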

The Plot and Some Thoughts


My purpose with this post is not to make interpretations of the results but to create a precedent. However, I will make a few remarks:

rs1051730 is a mutation in CHRNA3, the gene for the nicotinic acetylcholine receptor alpha 3 subunit.

The risk allele is “T” and the risk genotype is “TT”, as presented in dbSNP. However, publications and studies report on the opposite strand, with “A” being the risk allele and “AA” the risk genotype. My report follows the studies.
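
Switching between the two strands is just base complementing: A pairs with T and C pairs with G, so “TT” on one strand reads as “AA” on the other. A small sketch, with `flip_strand` being a name I made up for the example:

```python
# Watson-Crick base pairing: A<->T and C<->G.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def flip_strand(genotype):
    """Return the genotype as it would read on the opposite strand."""
    return genotype.translate(COMPLEMENT)


print(flip_strand('TT'))
print(flip_strand('CT'))
```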

– 2 studies (PMID 18385738, PMID 18385676) from 2008 of 6,000 lung cancer patients of European ancestry:

“A” allele associated with increased risk:

“AG”, “GA” – 1.3x risk for lung cancer
“AA” – 1.8x risk for lung cancer

– those who are “AA” may find it more difficult to quit smoking
– rs1051730 has been linked to alcohol use/abuse, nicotine dependence, and lung cancer.

Read more about this SNP and the dozens of studies linked to it on SNPedia.

In my analysis, as you can see in the plot:

– most users carry either “AG” or “GG” (n=224), while only a few are “AA” (n=19).
– if you are “GG” and you smoke, you may have a lower risk of developing lung cancer. I wouldn’t use this as a pro-smoking reason though.
– if you are “AA” and you smoke, the bad news is plural: higher risk of developing lung cancer (1.8x) and finding it more difficult to quit.

I suspect I carry GG as it was not difficult for me to quit smoking 6.5 years ago (after smoking for about 9 years). However, I cannot know for sure unless I sequence my genome.

If you are really interested in interpreting these genetic mutations, as well as others associated with smoking, alcohol, and addictive behavior, once again, read the studies referenced by SNPedia. I only scratched the surface with my superficial interpretation, as it was not the point of the post.

Photo 1: OpenSNP

One Response to Analysis of 243 Genomes – My First Report [Nov. 2016]

  1. Peter Wojciechowski says:

    Great post Cristi! You might also be interested in this recent study “Mutational signatures associated with tobacco smoking in human cancer”.
