This is my third analysis of genotype and phenotype data from OpenSNP, which is a platform where people share their genetic data.
The first analysis was about smoking and the second about diabetes. I took a few genetic mutations (SNPs) associated with these conditions and looked into the genetic and phenotype data provided by the users of the platform.
Genetics and Celiac Disease
In my third analysis (this one) I took 3 single nucleotide polymorphisms (SNPs) associated with Celiac Disease and looked into the genotypes of the users who reported a certain phenotype.
On OpenSNP, users report physician-diagnosed Celiac Disease as one of two phenotype values:
Yes (diagnosed with Celiac Disease)
No (not diagnosed with Celiac Disease)
There were 126 users who reported phenotype data, but only 80 of them made it into the analysis (those who reported both phenotype and genotype data). The genotype data comes from 23andMe chip analyses.
It took an entire month to work on this analysis. I didn’t work on it every day, but for a few evenings every week, 2-3 hours at a time. I used Python to make my way through retrieving, scraping and cleaning the data. I wrote a short outline of the process in one of the previous posts.
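The filtering step can be pictured as a set intersection. This is only a minimal sketch, not the code used for the analysis: the user IDs are made up, and `users_with_both` is a hypothetical helper name.

```python
# Hypothetical user IDs; the real ones would come from the OpenSNP data.
phenotype_users = {101, 102, 103, 104, 105}  # reported a Celiac Disease phenotype
genotype_users = {102, 104, 105, 200}        # uploaded a genotype file

def users_with_both(pheno_ids, geno_ids):
    """Return the sorted IDs of users present in both collections."""
    return sorted(set(pheno_ids) & set(geno_ids))

eligible = users_with_both(phenotype_users, genotype_users)
print(len(eligible), "of", len(phenotype_users), "phenotype reporters also have genotype data")
```

Only the users in `eligible` can be analysed, which is how 126 phenotype reports shrink to 80 usable records.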
According to SNPedia:
“Celiac disease, also known as gluten intolerance, is an autoimmune disorder of genetically predisposed individuals, provoked by gluten proteins in wheat and related foods.”
Here are the mutations that I looked into:
– mutation in the HLA-DQA1 gene
– CC genotype is the wildtype, carrying normal/lower/no risk for developing Celiac Disease, unless the person also carries a mutation for HLA-DQ8
– the risk genotypes are: CT and TT
– mutation in the HLA-DQ8 gene
– TT wildtype, normal risk
– the risk genotypes are: CC and CT
– mutation in the SH2B3 gene. According to NLM:
“This gene encodes a member of the SH2B adaptor family of proteins, which are involved in a range of signaling activities by growth factor and cytokine receptors. The encoded protein is a key negative regulator of cytokine signaling and plays a critical role in hematopoiesis. Mutations in this gene have been associated with susceptibility to celiac disease type 13 and susceptibility to insulin-dependent diabetes mellitus.”
– CC normal genotype
– the risk genotypes are: CT, TT
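The genotype lists above can be written down as a small lookup table. This is a sketch under a few assumptions: the gene labels stand in for the actual SNP identifiers, `normalize` and `is_risk` are hypothetical helper names, and the sorting step only guards against a raw file writing the same genotype as "TC" instead of "CT".

```python
# Risk genotypes as listed in this post, keyed by gene label
# (a stand-in for the real SNP identifiers).
RISK_GENOTYPES = {
    "HLA-DQA1": {"CT", "TT"},
    "HLA-DQ8": {"CC", "CT"},
    "SH2B3": {"CT", "TT"},
}

def normalize(genotype):
    """Upper-case the genotype and sort its letters, so 'tc' becomes 'CT'."""
    return "".join(sorted(genotype.strip().upper()))

def is_risk(gene, genotype):
    """True if the genotype is one of the risk genotypes listed for the gene."""
    return normalize(genotype) in RISK_GENOTYPES[gene]

print(is_risk("HLA-DQA1", "TC"))
print(is_risk("HLA-DQ8", "TT"))
```

With a table like this, each user's three genotypes can be turned into three yes/no risk flags and compared against the reported phenotype.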
What do these mean? Well…
“About 90-95% of CD patients carry DQ2.5 heterodimers, encoded by DQA1*05 and DQB1*02 alleles both in cis or in trans configuration, and DQ8 molecules, encoded by DQB1*03:02 generally in combination with DQA1*03 variant. Less frequently, CD occurs in individuals positive for the DQ2.x heterodimers (DQA1≠*05 and DQB1*02) and very rarely in patients negative for these DQ predisposing markers.” [ref]
Simply put, if you have Celiac Disease, it’s very likely you carry one or both of these mutations, aside from other mutations that can make you more or less susceptible to developing the disease.
But, the Main Point
Unfortunately, nothing came out of the data. I realized that I was on a road that leads nowhere when I plugged the data into a worksheet (near the end of the analysis). However, in good scientific conduct, one has to report all types of results, not only those that are favorable, positive, or meaningful.
So what was wrong here?
In retrospect, I should have known better. I got hints that this may happen in my previous analysis on diabetes…
In this third analysis, the sample was very unbalanced. Out of the 80 users who entered the analysis, only 3 reported a Yes (physician-diagnosed) for Celiac Disease. As you may imagine, I cannot take anything meaningful from this data at the moment.
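One way to see why 3 cases are too few is to ask what the best possible result would look like. The sketch below assumes a one-sided Fisher exact test and computes the p-value for the most extreme outcome, where every diagnosed user carries the risk genotype. The carrier counts are hypothetical (they are not taken from the data), and `most_extreme_p` is an illustrative name.

```python
from math import comb

def most_extreme_p(n_total, n_cases, n_carriers):
    """One-sided Fisher exact p-value when all cases carry the risk genotype.

    With fixed margins, this is the hypergeometric probability that all
    n_cases fall among the n_carriers out of n_total users.
    """
    return comb(n_carriers, n_cases) / comb(n_total, n_cases)

# 80 users and 3 diagnosed cases come from this post; carrier counts are hypothetical.
for carriers in (20, 40, 60):
    p = most_extreme_p(80, 3, carriers)
    print(f"{carriers} carriers out of 80: smallest possible p = {p:.3f}")
```

If, say, half of the 80 users carried a risk genotype, even a perfect match between carriers and diagnosed users would give a p-value of about 0.12, so the sample could not show a convincing association in that scenario.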
But, as always, the devil is not as black as he is painted.
First, I am satisfied that I went through the entire process one more time; practice is invaluable. Actually, I outlined the process (wrote it down) to make similar future analyses easier. It involves about 30 steps and I may talk about it in a future video or blog post.
Second, since new users register on OpenSNP daily, the data is likely to grow. So, it is possible to re-run the analysis at a certain point in the future when the sample becomes more balanced.
Finally, in my next analysis I will make sure to check the characteristics of the sample at the very beginning, or even before deciding to do the analysis, to avoid a situation like this one.
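Such an upfront check could be as small as counting the phenotype answers before doing anything else. This is a sketch, not part of the original workflow: the function names are hypothetical, the threshold of 10 users per group is an arbitrary assumption, and the labels assume the 77 users without a Yes all reported No.

```python
from collections import Counter

def sample_summary(labels):
    """Map each phenotype label to (count, share of the sample)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

def is_balanced_enough(labels, min_per_class=10):
    """True if at least two groups exist and each has min_per_class users.

    The default threshold is an arbitrary assumption, not a statistical rule.
    """
    counts = Counter(labels)
    return len(counts) >= 2 and min(counts.values()) >= min_per_class

# Counts from this post: 3 "Yes" among the 80 users in the analysis.
labels = ["Yes"] * 3 + ["No"] * 77

for label, (n, share) in sample_summary(labels).items():
    print(f"{label}: {n} users ({share:.1%})")
print("Worth analysing further:", is_balanced_enough(labels))
```

Running a check like this right after downloading the phenotype data takes minutes, compared to the month the full analysis took.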
Photo 1: Adapted from OpenSNP