A software program that has been successfully annotating the genes of common bacteria since 1992 is now capable of finding genes in higher organisms. It is particularly useful for finding human genes in anonymous human DNA sequences.
Understanding the genomes of key microorganisms may improve understanding of human genetics because lower organisms have some genes that correspond to human genes. In addition, scientists can design new drugs based on knowledge of disease-causing bacteria.
The original software program, called GeneMark, uses probabilistic mathematical models to predict the locations of genes on a strand of DNA. GeneMark was developed by Dr. Mark Borodovsky, a professor of biology at the Georgia Institute of Technology. It has become the world's most-used software program for deciphering bacterial DNA and has proven itself 98 percent accurate.
Borodovsky's latest development uses GeneMark.hmm, a refined version of the original program, as its base to make more sophisticated predictions for the genomes of eukaryotic, or higher, organisms.
"Deciphering bacterial DNA is simpler than deciphering human DNA since its genes run continuously, without gaps," Borodovsky explained. "The genes of human DNA may be divided into pieces, called exons, with non-coding genetic material between the exons. These spacers in the genes, called introns, were hard to detect by a computer algorithm. Also, eukaryotic DNA is much longer, with an average gene size of 10,000 nucleotides."
Therefore, the predictions of where eukaryotic genes lie on a strand of DNA must include predictions of the boundaries between the exons, which contain the genetic information, and introns, which are the non-coding regions.
To create a computer program to achieve this, Borodovsky employed a class of probabilistic mathematical models called hidden Markov models, or HMMs. A recent grant from the National Institutes of Health (NIH) is funding the incorporation of HMMs into GeneMark, making the program responsive to the boundaries between exons and introns.
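As a rough illustration of how a hidden Markov model can place exon-intron boundaries, the following Python sketch labels each base of a sequence as "exon" or "intron" using the standard Viterbi algorithm. It is a toy under stated assumptions, not GeneMark.hmm itself: the two-state layout, every probability in the tables, and the function name `viterbi` are invented for illustration, and the article does not describe the actual program's model.

```python
import math

# Hypothetical two-state model; all numbers below are made up for illustration.
STATES = ("exon", "intron")
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon": {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},    # assumed GC-rich
    "intron": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},  # assumed AT-rich
}


def viterbi(seq):
    """Return the most probable state label for each base of a non-empty sequence."""
    # Log-probability of the best path ending in each state after the first base.
    best = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        pointers, new_best = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + math.log(TRANS[p][s]))
            pointers[s] = prev
            new_best[s] = (best[prev] + math.log(TRANS[prev][s])
                           + math.log(EMIT[s][base]))
        back.append(pointers)
        best = new_best
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=best.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


labels = viterbi("GCGCGCATATATAT")
boundary = labels.index("intron")  # first position labeled as intron
```

In this toy, the boundary falls where the base composition changes, because switching states once costs less than emitting many unlikely bases from the wrong state. That trade-off between transition and emission probabilities is the general mechanism an HMM uses to locate boundaries.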
Borodovsky developed GeneMark.hmm with Dr. Alexander Lukashin, a researcher who works in the lab. A test of the program demonstrated its "state-of-the-art accuracy," said Borodovsky, meaning that, when tested against current means of finding eukaryotic genes, GeneMark.hmm performed at least as well as the best current methods.
"GeneMark.hmm is more than a static software program or product," Borodovsky noted. "It is rather an approach for DNA sequence analysis that is under continuous development."
It is already being used to annotate parts of the genomes of five eukaryotic organisms, including humans, nematodes, fruit flies and a plant in the mustard family.
Borodovsky will present his latest results at the International Workshop on Genomic Sequence Analysis on Dec. 1-4 at the Isaac Newton Institute for Mathematical Sciences at the University of Cambridge in England.
GeneMark.hmm will fill a need, as evidenced by early demand from scientists, Borodovsky said. Even before information about GeneMark.hmm had been published in a scientific journal, almost 30 researchers expressed interest to one of Borodovsky's graduate students, John Besemer, who gave a poster presentation on GeneMark.hmm at a recent conference on the eukaryotic organism Chlamydomonas reinhardtii.
Meanwhile, Borodovsky has recorded his research in predicting gene coding regions in a chapter of the new book "Organization of the Prokaryotic Genome," soon to be published by the American Society for Microbiology. The chapter is called "Statistical Predictions in Genuine Coding Regions."
Borodovsky, a Russian emigre, conceived the idea for GeneMark while still living in Russia in the 1980s. He envisioned a software program based on Markov models to manage the vast amounts of genetic information scientists were churning out.
The Russian mathematician Andrey Markov introduced his models early in the 20th century. Borodovsky believed Markov models could characterize genes by the frequency of certain combinations of bases in known genes, as opposed to non-genes. These probabilistic models could therefore be applied to DNA sequences to predict where genes lie.
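The idea of telling genes from non-genes by how often certain base combinations occur can be sketched in a few lines of Python. The example below is a minimal illustration, not GeneMark's actual method: it trains two first-order Markov chains, one on "coding" examples and one on "non-coding" examples, and scores a new sequence by the log of the ratio of the two probabilities. The training sequences, the pseudocount, and the names `train_markov` and `log_odds` are all assumptions made up for this sketch.

```python
import math
from collections import defaultdict

BASES = "ACGT"


def train_markov(sequences, order=1, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from example sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + pseudocount * len(BASES)
        # The pseudocount keeps unseen combinations from getting probability zero.
        return (row.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding_prob, noncoding_prob, order=1):
    """Positive scores favor the coding model, negative the non-coding model."""
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        score += math.log(coding_prob(context, base) / noncoding_prob(context, base))
    return score


# Invented toy training data; real models are trained on known genes.
coding = train_markov(["ATGGCGGCGGCGGCG", "ATGGCCGCGGCGGCC"])
noncoding = train_markov(["ATATATTTATATAAT", "TTTAAATATATATTA"])

gene_like = log_odds("GCGGCGGCG", coding, noncoding)
background_like = log_odds("ATATATATA", coding, noncoding)
```

With this toy data, `gene_like` comes out positive and `background_like` negative, which is the kind of contrast a gene finder can use to flag likely coding regions.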
When scientists sequence DNA, they are left with strings of nucleotides. To make sense of these strings, researchers must separate them into genes and non-coding regions and then translate the genes into proteins.
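For readers who want to see what separating and translating means in practice, here is a small Python sketch that scans the three forward reading frames of a DNA string for stretches running from a start codon (ATG) to a stop codon and translates each one with the standard genetic code. It is a deliberately simple illustration that assumes continuous, bacteria-style genes on one strand. The function names `translate` and `find_orfs` and the example sequence are hypothetical, and the sketch uses no statistical model of the kind GeneMark relies on.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_orfs(dna, min_codons=2):
    """Return (start, end, protein) for ATG...stop stretches on the forward strand."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # remember where this candidate gene begins
            elif start is not None and CODON_TABLE.get(codon) == "*":
                protein = translate(dna[start:i])
                if len(protein) >= min_codons:
                    orfs.append((start, i + 3, protein))
                start = None
    return orfs


orfs = find_orfs("CCATGGCTAAATGACC")
```

Here `orfs` holds one candidate, spanning positions 2 to 14 and translating to the protein "MAK". A start codon in another frame is ignored because no stop codon follows it.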
Since 1992, researchers from around the world have sent their sequenced DNA fragments via e-mail to Georgia Tech's GeneMark e-mail server, which predicts the locations of genes. After mapping gene locations, the computer program compares each newly predicted protein sequence to known ones in a database, which helps determine the protein's function. The protein analysis is done in collaboration with the National Center for Biotechnology Information at the NIH.
GeneMark has proven itself a powerful tool for finding bacterial genes in particular. Researchers at The Institute for Genomic Research have used GeneMark to find genes in the complete genomes of numerous common bacteria.
GeneMark Genesis, the refined version of GeneMark, which Borodovsky developed with graduate student William Hayes, was used to find genes in the genomes of the bacteria Methanococcus jannaschii and Helicobacter pylori. There were no experimentally studied segments of M. jannaschii available to train the Markov models on which gene prediction in GeneMark is based. So the new program "learned Markov models from anonymous sequences based on the grammar of the genetic code," Borodovsky said.
Borodovsky's work is at the forefront of a new interdisciplinary field called bioinformatics, which uses mathematical methods and computers to answer many important biological questions. Bioinformatics can also help discover genes and design new drugs. Borodovsky is spearheading development of Georgia Tech's new master's degree program in bioinformatics, the first such program in the United States.
The above story is based on materials provided by Georgia Institute Of Technology. Note: Materials may be edited for content and length.