That's, in a way, very depressing. The field has been investigated so much that the chance of a serendipitous discovery that would explain everything is extremely low.
The Human Genome Project disappointment.
A few years after that book was printed, the Human Genome Project started (1990-2003). Hopes were sky high: cancer and diabetes were the main targets. On those fronts, the project ended up being a huge disappointment. The somewhat simplistic view that, once we knew our genes, we would know all we needed to know to cure many diseases turned out to be false. Protein folding, gene expression, control mechanisms, interactions with the microbiome, etc. "Minor" complications abound. The main benefit of the HGP is that it opened the doors leading to many new labyrinths that will keep us busy for the next century.
What do we know about the genetics of T1D?
We know that the major genetic risk factors for Type 1 Diabetes sit in the Major Histocompatibility Complex, a group of genes coding in part for proteins called Human Leukocyte Antigens (HLAs). The terminology in science papers can be a bit confusing because our knowledge of HLAs predates our knowledge of the genes coding for them. Researchers typically use the same terms (such as HLA-A) to refer to either the antigen or the gene, depending on their field of investigation (transplant rejection or genetics, for example).
A small vocabulary refresher now. Broadly speaking...
- amino-acids are the bricks out of which proteins are made.
- a "base" or "nucleotide" is a molecule used to encode genetic information. There are four different bases in our DNA (we'll call them by their abbreviations A, T, G, C). In groups of 3, they form a redundant (fault tolerant) code for amino-acids. The two DNA chains are complementary: G will always pair with C and A will always pair with T. RNA, which uses the same complementary code with its own set of bases (3 identical to DNA, one different), transfers information from our DNA to ribosomes to create proteins.
- a gene is a string of bases that codes for a protein.
- an antigen is a molecule that triggers an immune response. It can be a protein... or almost anything else.
- an antibody is a Y shaped molecule that locks on to antigens during the immune reaction.
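To make the vocabulary above a bit more concrete, here is a minimal Python sketch of base pairing, the DNA to RNA copy and the reading of bases in groups of 3. The function names and the sample sequence are mine, chosen for illustration, and the codon table is only a tiny excerpt of the standard genetic code. Real transcription and translation involve far more machinery than a dictionary lookup.

```python
# Base pairing rules between the two DNA chains: A<->T and G<->C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Small excerpt of the standard genetic code, written with DNA letters.
# Note the redundancy: four different codons all mean glycine (Gly).
CODON_TABLE = {
    "ATG": "Met",
    "GGT": "Gly", "GGC": "Gly", "GGA": "Gly", "GGG": "Gly",
    "GAT": "Asp", "GAC": "Asp",
    "TAA": "Stop", "TAG": "Stop", "TGA": "Stop",
}


def complement(strand):
    """Return the base facing each base of `strand` on the other chain."""
    return "".join(COMPLEMENT[base] for base in strand)


def transcribe(coding_strand):
    """Messenger RNA carries the same message with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(coding_strand):
    """Read the bases three at a time and look up each amino acid."""
    codons = [coding_strand[i:i + 3]
              for i in range(0, len(coding_strand) - 2, 3)]
    return [CODON_TABLE[codon] for codon in codons]


sample = "ATGGGTGAC"  # made-up 9-base sequence, for illustration only
print(complement(sample))
print(transcribe(sample))
print(translate(sample))
```

The sketch ignores the fact that the two chains are read in opposite directions; it only shows which base faces which.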
Now that I have said this, I'll need to explain "SNP" and "associated".
SNP stands for single nucleotide polymorphism. Our DNA is made of two complementary chains of nucleotides, for example...
We have a single nucleotide polymorphism when, in a population, we have different "options" for the same location, for example...
in 80% of the population and
in 20% of the population.
It is called "single" because even though two nucleotides change in our DNA, the change on the complementary chain is automatic. On top of that, since our DNA has a preferential direction in which it must be interpreted, only one change matters.
SNPs can be anything: they can be in a coding gene, in a regulatory region, in a "junk DNA" region.
SNPs can lead to the generation of a different protein (when they change the amino acid they code for) or can change nothing (because of the redundant nature of the genetic code).
SNPs can reflect a functional change in a function we haven't discovered yet.
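The difference between a SNP that changes the protein and one that changes nothing can be sketched in a few lines of Python. The helper name `snp_effect` and the three-codon table are assumptions of mine for illustration. The codon meanings themselves (GGT and GGC for glycine, GAT for aspartic acid) come from the standard genetic code.

```python
# Tiny excerpt of the standard genetic code, enough for two examples.
CODONS = {"GGT": "Gly", "GGC": "Gly", "GAT": "Asp"}


def snp_effect(codon, position, new_base):
    """Swap one base of a codon and report whether the amino acid changes.

    `position` is 0, 1 or 2 (index of the base inside the codon).
    """
    mutated = codon[:position] + new_base + codon[position + 1:]
    before = CODONS[codon]
    after = CODONS[mutated]
    kind = "synonymous" if before == after else "non-synonymous"
    return mutated, before, after, kind


# Third base T -> C: still glycine, the protein is unchanged.
print(snp_effect("GGT", 2, "C"))
# Second base G -> A: glycine becomes aspartic acid, the protein changes.
print(snp_effect("GGT", 1, "A"))
```

A SNP outside a coding region would not fit this picture at all, which is exactly why the other cases are harder to interpret.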
The initial reaction is "What a mess! What can SNPs be useful for?"
When they are in the coding region of a gene, the answer is pretty straightforward: the SNP can directly tag the gene variant you carry. In other cases, the SNP can be part of a piece of your DNA that moves around with a certain gene and be a good proxy for that gene. In yet other cases, an "apparently random" SNP we can't connect to anything can be correlated/associated with certain diseases in statistical studies called GWAS (Genome Wide Association Studies) - I'll probably go back to those in a later post.
But, as the saying goes, correlation does not mean causation. This is something everyone in the field is fully aware of. When one says that a certain SNP is associated with Type 1 Diabetes, one basically means that this SNP is slightly more frequent in the T1D population than in the non-T1D population. Again, in some cases and for some diseases, the SNP may happen to tag the exact coding location that modifies the protein and is directly linked to the disease. But this is not the rule.
If a SNP is more frequent in the T1D population, it can - in theory - be used to calculate a relative risk increase. In extreme cases, you could read that the presence of SNPxxxxxxx in your DNA indicates that you are 200 times more likely to develop a specific disease. In other cases, the association is weak and the uncertainties so large that the SNP is useless by itself.
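As a rough illustration of how "slightly more frequent" becomes a number, here is a minimal sketch that computes an odds ratio and an approximate 95% confidence interval from carrier counts. The counts are invented for the example, and the interval uses the common log-based (Woolf) approximation, which is one textbook option among several and not necessarily what any given study uses. Keep in mind that an odds ratio only approximates a relative risk when the disease is rare.

```python
import math


def odds_ratio(case_carriers, case_noncarriers,
               control_carriers, control_noncarriers, z=1.96):
    """Odds ratio of carrying the SNP in cases vs controls, with a CI.

    The interval is computed on the log scale (Woolf approximation);
    z=1.96 corresponds to a 95% confidence level.
    """
    ratio = (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers)
    std_err = math.sqrt(1 / case_carriers + 1 / case_noncarriers
                        + 1 / control_carriers + 1 / control_noncarriers)
    low = math.exp(math.log(ratio) - z * std_err)
    high = math.exp(math.log(ratio) + z * std_err)
    return ratio, low, high


# Hypothetical small study: 30 of 100 T1D patients carry the SNP,
# against 20 of 100 people without T1D.
ratio, low, high = odds_ratio(30, 70, 20, 80)
print(f"OR = {ratio:.2f}, 95% CI = {low:.2f} to {high:.2f}")
```

With these made-up counts the odds ratio is about 1.7, but the interval stretches below 1, so such a small study could not even tell whether the SNP raises the risk at all. This is the "useless by itself" situation.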
So why is there so much research around SNPs? For several reasons:
- getting SNP coverage is much cheaper than full DNA sequencing.
- when SNPs match coding genes, the benefit is immediate.
- some specific SNPs or combinations of SNPs can replace more expensive tests.
- SNPs can be used to statistically rebuild an approximate "full" genome (the process is called imputation).
- combining multiple significant SNPs may make it possible to discriminate between otherwise outwardly similar patients.
Let's now look at a sample abstract
In this example, the rs763361 SNP belongs to a coding gene, CD226, and has a direct effect on the coded protein. That protein happens to be a glycoprotein involved in immunity (mostly of interest here are the NK and cytotoxic lymphocytes for which it could promote adhesion to the target). The coding here could be CC, CT or TT, each of them with a different association level with T1D. The CC genotype would be the "normal" population. The TT genotype would be at some level of risk and the CT genotype at yet another level. Risk is provided as an odds ratio, with a wide confidence interval and a P-Value.
This example also shows how things can become tricky extremely quickly. Firstly, CC, CT and TT are three different genotypes at the same location: the single nucleotide polymorphism is the C/T variation itself, and a CT carrier has one copy of each version. Secondly, the 95% confidence interval for the increased risk covers the 1.25 to 4.18 range, from a mild increase in risk to a very significant one. Thirdly, the study would probably not stand alone but is considered in the wider context of other studies.
Not that convincing or useful on its own, but it fits in the bigger picture.
In this very recent paper, "A Type 1 Diabetes Genetic Risk Score Can Aid Discrimination Between Type 1 and Type 2 Diabetes in Young Adults", published in Diabetes Care, Oram et al. exploit cheap SNPs to discriminate between Type 1 and Type 2 diabetes in young adults.
The obesity epidemic has unsettled the old stereotypes: a young adult or late adolescent with diabetes is no longer almost automatically a Type 1 diabetic. The Type 2 Diabetes epidemic has reached such a level that confusion is possible, especially with antibody-negative T1Ds.
The paper exploits two "features" of SNPs
- the ability to act as proxies for actual genes allows the authors to obtain a HLA typing without doing expensive tests.
- the combination of multiple "risk SNPs" results in a strong global risk assessment which, in the presence of diabetes, confirms the Type 1 diagnosis.
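One common way to combine several "risk SNPs" into a single number is a weighted score: for each SNP, count the copies of the risk allele (0, 1 or 2) and multiply by the logarithm of that SNP's odds ratio, then add everything up. The sketch below assumes that approach. The SNP names, risk alleles and odds ratios are placeholders I made up, not values from the Oram paper, and the paper's exact procedure (in particular how it handles HLA) may well differ.

```python
import math

# Placeholder SNPs: (risk allele, odds ratio per copy). All values invented.
RISK_SNPS = {
    "rs_example_1": ("T", 1.5),
    "rs_example_2": ("A", 1.2),
    "rs_example_3": ("G", 2.0),
}


def genetic_risk_score(genotypes):
    """Sum of (number of risk alleles) x ln(odds ratio) over known SNPs.

    `genotypes` maps a SNP id to a two-letter genotype such as "CT".
    SNPs missing from `genotypes` are skipped and contribute nothing.
    """
    score = 0.0
    for rsid, (risk_allele, odds) in RISK_SNPS.items():
        genotype = genotypes.get(rsid)
        if genotype is None:
            continue
        score += genotype.count(risk_allele) * math.log(odds)
    return score


# Two copies of the first risk allele, one of the second, none of the third.
person = {"rs_example_1": "TT", "rs_example_2": "AG", "rs_example_3": "CC"}
print(round(genetic_risk_score(person), 3))
```

A score like this only ranks people relative to each other; turning it into a diagnosis aid requires comparing it with the score distributions seen in known Type 1 and Type 2 patients.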
The "in the presence of diabetes" part matters, and the reason is intuitively easy to understand: if your relative risk of contracting a rare disease, computed from SNP odds ratios, is very high, it does not mean that your absolute risk is also very high. This is a topic that begs for a concrete example - and possibly another blog post - but it was the main reason why the FDA hit personal genomics sites very hard a couple of years ago.
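Here is a small worked example of the relative versus absolute risk gap, as a hedged sketch. The baseline risk of 0.4% and the odds ratio of 3 are numbers I picked purely for illustration; they are not estimates for T1D or for any real SNP.

```python
def absolute_risk(baseline_risk, odds_ratio):
    """Convert a baseline probability and an odds ratio into a probability.

    Steps: probability -> odds, multiply by the odds ratio, odds -> probability.
    """
    baseline_odds = baseline_risk / (1 - baseline_risk)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


baseline = 0.004  # hypothetical: 0.4% of the population gets the disease
risk = absolute_risk(baseline, 3.0)  # hypothetical odds ratio of 3
print(f"baseline {baseline:.1%} -> carrier {risk:.1%}")
```

With these assumed numbers, a carrier goes from 0.4% to roughly 1.2%: three times the odds sounds alarming, yet almost 99 carriers out of 100 would still never develop the disease.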
If you have made it this far, the time has come for your bonuses (yeah, you get two!)
- my very own 23andme raw SNP data
- an Excel spreadsheet with a lot of the currently identified SNPs associated with diabetes and their OR (all those included in the Oram paper cited above and those used by the Stanford Interpretome). Please note that the OR, P-Values and even, in some cases, the risk allele are somewhat vague or uncertain.
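If you want to look up a few SNPs in a raw data file yourself, a minimal Python sketch could look like the following. It assumes the usual layout of a 23andme raw export: comment lines starting with "#", then tab-separated columns for SNP id, chromosome, position and genotype. Check your own file before trusting it. The function name is mine and the sample rows, including the positions and genotypes, are made up.

```python
import csv
import io


def read_genotypes(lines, wanted):
    """Return {snp_id: genotype} for the SNP ids listed in `wanted`.

    `lines` is any iterable of text lines, for example an open file.
    Comment lines (starting with "#") and blank lines are skipped.
    """
    found = {}
    data_lines = (line for line in lines
                  if line.strip() and not line.startswith("#"))
    for row in csv.reader(data_lines, delimiter="\t"):
        if len(row) < 4:
            continue  # ignore malformed rows
        rsid, _chromosome, _position, genotype = row[:4]
        if rsid in wanted:
            found[rsid] = genotype
    return found


# Made-up sample in the assumed layout (positions are not real).
SAMPLE = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs_example_1\t1\t1000\tAG\n"
    "rs763361\t18\t2000\tCT\n"
)

print(read_genotypes(io.StringIO(SAMPLE), {"rs763361"}))
```

For a real file, replace `io.StringIO(SAMPLE)` with `open("your_file.txt")`, where the file name is whatever your export is called.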