For humans, the history of DNA began just over 150 years ago when Friedrich Miescher accidentally discovered nuclein, a phosphate-rich molecule that did not behave like a protein and that, over the years, would end up called DNA. But in this short story there are two other great milestones. The first, when almost seventy years ago Rosalind Franklin, and later James Watson and Francis Crick, discovered through X-ray images, that DNA was structured as a double helix. The second milestone came just 20 years ago, when two teams of scientists from different countries published the first sequence of the human genome.
The project to sequence the human genome was extremely long and expensive. But the nearly $ 3 billion invested were justified by the possibility to better understand the diseases of genetic origin, which at that time could be counted with the fingers of one hand. In the words of Roderic Guigó, head of the computational biology group of RNA processing, coordinator of the bioinformatics program at the Centre for Genomic Regulation (CRG) and professor at Pompeu Fabra University, “knowledge of the human genome was a crucial contribution to our understanding of human biology“.
Computational biology and technology
This titanic project would not have been possible without computational biology, the science that applies theoretical models to analyzing biological data. And the advancement of computational biology has been tied to that of technology.
“In the first programming courses I took in the calculation lab of the University of Barcelona we used perforated cards. At that time, computers did not have screens, and when you wanted to communicate with the computer, you put the card on it and the result was printed on a sheet”, explains Guigó when he remembers how was his degree in the mid-eighties.
Roderic recalls that years later at the University, they had the first computer with three screens connected. But computer science was not yet available to everyone. “During the thesis I already had an e-mail address. The problem was that since no one else had it… it was not useful!”. This is why, for Roderic, the actual technological revolution was to have a personal computer and be able to work from home.
A revolution that David Comas, leader of the human genome diversity group at the Institute for Evolutionary Biology (IBE: CSIC-UPF) and current director of the Department of Experimental and Health Sciences, Pompeu Fabra University (DCEXS-UPF) lived in a similar way: “With the knowledge and technique we have today, I would have done my PhD thesis in a fortnight instead of three or four years! It would take a week to produce and genotype the data, and the analysis would be done in four days”.
Identifying and visualizing genes
Roderic has always been motivated by technical challenges. “I wanted to develop programs to find out where the genes were in the DNA sequence. I wanted to understand what the signals and letter combinations were that the cellular machinery recognized and used to make proteins. So he went to Boston to do a postdoctoral fellowship at the Smith Temple Laboratory, one of the first bioinformatics to develop algorithms to compare and align DNA sequences.
“I wanted to develop programs to find out where the genes were in the DNA sequence”
Roderic Guigó – CRG
In 1994, the Catalan bioinformatician returned to Barcelona and established his own group at the Hospital del Mar Medical Research Institute (IMIM), where he continued to identify these tiny regions of the genome (2%) that correspond to the genes. “At the end, we were well-known in the field because of our experience identifying genes and also for our software that allowed us to visualize the different annotated regions of the genome“, says Guigó.
The human genome project
The public project to sequence the human genome began in 1990, but the early years were not very successful. In 1998, the american biochemist Craig Venter questioned the operation of the public project. It was expensive, costly, and they used an inefficient technology. And his company, Celera Genomics, joined the race by applying ‘Shotgun’ technology to sequence multiple DNA fragments simultaneously.
“Celera was a very open-minded company in the sense that they immediately had the participation of experts in the field, and they contacted us to use our software to visualize the genome of the fruit fly“, says Guigó. This first work meant that a few months before publishing the sequence of the human genome, Celera contacted them again to visualize the sequence of the human genome.
“When we were developing the software, here in Barcelona, I thought I was working on something that was important, but just as many scientists consider that their research is important. However, when I went to Celera Genomics with my PhD student Josep Abril, to work on the first draft of the human genome sequence, we were aware we were participating in a project that would go down in history”.
Finally, on February 12, 2001, Nature and Science announced the publication of the human genome sequence, with the results of public and private initiatives.
The Genome Encyclopedia
Beyond the historical milestone, having the human genome sequence did not cause any revolution. “After 2001 genomics advanced in two fields. On one hand, scientists wanted to unravel the differences in genome sequence between different individuals and populations, and on the other hand, it was necessary to know which regions of the genome were functional“. To meet this latest challenge, the project to generate the encyclopedia of functional DNA elements (ENCODE) was launched in 2003; a project that is still underway and in which Roderic’s group participated from the very beginning.
“We thought that much of the genome had no function. That it was junk DNA resulting from viral transposons”, says Guigó. Now we know that genes represent just 2% of the genome sequence, but other functional regions involved in cellular regulation or gene expression have been identified. The biologist David Comas confirms that “the concept junk DNA is highly questioned because we still have a lot to learn about the functions of many DNA regions and their variants.”
Studying diversity and evidence of bias
Once having the sequence, scientists realized that 99.9% of the genome is common to all humans. But do we know the remaining 0.1%? In 2008, the 1000 Genomes Project began to sequence the genomes of people from different regions of the planet. However, in the opinion of Comas, expert on the diversity of the human genome, “we know the most common variants of our genome, but we have forgotten some small populations that are less represented.”
“We know the most common variants of our genome, but we have forgotten some small populations that are less represented”
David Comas – UPF-DCEXS and IBE
To better discover genome diversity, over the years, “the genome of specific populations that had some associated diseases has been sequenced“. Besides, national initiatives have also emerged, to sequence the genome of the entire population of Iceland or 100,000 genomes of people in the United Kingdom. Projects that have a more local scope but “also consider the phenotypic data of the individuals studied, enabling us to correlate what we see with the genetic variants”, explains Comas.
Despite these efforts, we have a biased knowledge of genome diversity. And that has consequences. In Comas’s words, “infrequent variants, which are specific of small, underrepresented populations, are often involved in biomedical aspects“. And not knowing this diversity or having a low representation of it means that when the data is analyzed, “the less frequent variants are often removed from the analyses because the algorithms take them as sequence errors“. And not only that; there are also technical biases as “it is much easier to study variations that affect a single nucleotide than structural variants that generate large changes.”
Leading into the future of human genomics
The future of genomics lies in understanding the functional implications of changes in sequence. “Historically, we have seen the genome as something very cold and static, as a linear sequence. And we still don’t know how different regions of the same sequence interact, “explains Comas.
This future has already begun with genomic data repositories such as the EGA (European Genome-Phenome Archive) – with one of the two nodes in the Barcelona Biomedical Research Park (PRBB), coordinated by the team led by Arcadi Navarro – and with projects such as genotype-tissue expression (GTEx), in which Guigó’s lab participates. The latter project attempts to understand the relationships between genome sequence variation between individuals and gene expression in different tissues. Although we still need to be able to explain “how the genomic variation of people has an impact on their function and results in differences such as, for example, the predisposition to have certain diseases,” concludes Guigó.