Millions Of New Genes, Thousands Of New Protein Families Found In Ocean Sampling Expedition

March 14, 2007
J. Craig Venter Institute
Researchers from the J. Craig Venter Institute announced the discovery of millions of new genes, thousands of new protein families and specifically the characterization of thousands of new protein kinases from ocean microbes using whole environment shotgun sequencing and new computational tools. Researchers believe these data will lead to better understanding of key biological processes which could eventually offer new ideas for alternative energy production and could offer solutions to deal with climate change and other environmental issues.

The Sorcerer II at anchor in the Marquesas Islands in French Polynesia.
Credit: J. Craig Venter Institute

Researchers from the J. Craig Venter Institute (JCVI) today announced the publication of several studies from the Sorcerer II Global Ocean Sampling Expedition (GOS) in PLoS Biology detailing the discovery of millions of new genes, thousands of new protein families and specifically the characterization of thousands of new protein kinases from ocean microbes using whole environment shotgun sequencing and new computational tools.

Researchers believe these data will lead to better understanding of key biological processes which could eventually offer new ideas for alternative energy production and could offer solutions to deal with climate change and other environmental issues.

The GOS dataset is 90-fold larger than other marine metagenomic datasets, making it the largest ever released in the public domain. The GOS analysis also nearly doubles the number of previously known proteins. This enormous amount of data allowed the researchers to better understand the genomic structure and evolution of microorganisms, as well as the function of important protein families such as protein kinases, which are key regulators of cellular function in all organisms. Although invisible to the naked eye, microbes make up the vast majority of life on the planet and are responsible for the creation and maintenance of Earth’s atmosphere. Understanding the role and function of these organisms is therefore important to ensuring the survival of the planet and human life on it.

“This publication is not only providing an unprecedented level of new genes and protein family discoveries, but is also pivotal in that we have provided compelling analysis of evolution and function of these genes and proteins within the larger context of organisms interacting with their environment,” said J. Craig Venter, Ph.D., founder and chairman, the J. Craig Venter Institute. “Given the findings, it’s clear that we’ve only begun to scratch the surface of understanding the microbial world around us.”

The Sorcerer II Expedition began with a pilot project in 2003 in the Sargasso Sea near Bermuda, in which more than one million new genes and hundreds of new photoreceptors were discovered in what was thought to be an area of low diversity. The GOS publication today is a result of ocean water sampling conducted from Halifax, Nova Scotia to the Eastern Tropical Pacific during the two-year circumnavigation by the Sorcerer II Expedition. The Gordon and Betty Moore Foundation and the United States Department of Energy, Office of Science, funded the sequencing and analysis of the Expedition. The JCVI funded the operation of the vessel.

The group also announced today the launch of a new online database and high-speed computational resource, Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis (CAMERA). Funded by a $24.5 million, seven-year grant from the Moore Foundation, CAMERA was developed by the UC San Diego Division of the California Institute for Telecommunications and Information Technology (Calit2) in partnership with JCVI and UCSD’s Center for Earth Observations and Applications (CEOA) at Scripps Institution of Oceanography.

"The scale and complexity of the GOS data required Calit2 to architect a powerful new cyberinfrastructure to enable both interactive access as well as high performance computation on the data by the global metagenomic community," said Larry Smarr, Calit2 director and principal investigator on CAMERA.

CAMERA houses metagenomic data and provides the advanced software tools and computer hardware to analyze these data. Using dedicated optical circuits, CAMERA permits scientists to connect their local laboratory computers directly to the CAMERA database and tools using the National LambdaRail or international optical circuits, resulting in up to a hundred-fold increase in bandwidth over current standards. CAMERA has been in beta testing since January 2007 and today is available to researchers worldwide. In addition to the CAMERA database, the GOS data is also being deposited in the U.S. National Institutes of Health’s public database, GenBank.

The GOS publication was a result of intensive analysis of these data by scientists from the JCVI along with collaborators at four University of California campuses (San Diego, Los Angeles, Berkeley and Davis), University of Southern California, Salk Institute for Biological Studies, Burnham Institute, University of Hawaii, Brown University, Universidad Nacional Autonoma de Mexico, Universidad de Costa Rica, Universidad de Concepcion, Bedford Institute of Oceanography, Smithsonian Tropical Research Institute, and Rutgers University.

The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Pacific

Rusch et al. describe the results of metagenomic analysis of 37 samples taken aboard Sorcerer II during its voyage between Halifax, Nova Scotia and French Polynesia in 2003 to 2004, combined with seven samples collected during the pilot study in the Sargasso Sea. To capture the DNA, scientists onboard the Sorcerer II collected water every 200 nautical miles and then filtered it through progressively smaller filters to collect bacteria and then viruses. The DNA extracted for these publications was from the filter that collects mostly bacteria.

The group analyzed a massive dataset consisting of 7.7 million DNA sequences totaling 6.3 billion base pairs. Following from the Sargasso Sea pilot study, they continued to find a great degree of diversity both within and across the sampling sites. Researchers identified 60 highly abundant ribotypes (roughly equivalent to species); however, the inter-species variation and the variation of organisms within the same environment suggest that while the microbes might be similar at an rRNA level, they can differ greatly at a biochemical and genomic level.
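To put the dataset size in perspective, the two figures quoted above imply an average sequence length. The short calculation below is only back-of-the-envelope arithmetic on the article's rounded numbers; the variable and function names are illustrative and do not come from the study.

```python
# Rounded figures quoted in the article.
num_sequences = 7.7e6      # DNA sequences in the GOS dataset
total_base_pairs = 6.3e9   # base pairs summed over all sequences


def mean_read_length(total_bp: float, n_reads: float) -> float:
    """Average number of base pairs per sequence."""
    return total_bp / n_reads


avg_len = mean_read_length(total_base_pairs, num_sequences)
print(f"Average sequence length: roughly {avg_len:.0f} base pairs")
```

Because both inputs are rounded, the result (a little over 800 base pairs per sequence) should be read as an approximation rather than a reported statistic.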

While variation is known to be closely linked to environment, this all-encompassing genetic survey has identified new and unexpected links between variation and the environment. For example, the class of proteins known as proteorhodopsins absorbs either blue or green light. This study revealed that the blue and green variants are found in different environments, with blue light preferred in the open ocean and green light preferred in coastal environments. Identifying these associations should greatly enhance our understanding of marine systems and the environmental factors upon which they depend.

To handle the enormous volume of data generated from this phase of the Expedition, the team developed new computational methods to assemble and analyze these data. One comparative genomic method, termed “fragment recruitment,” allowed researchers to look at genome structure, microbial evolution, and diversity on many levels. Another, “extreme assembly,” as the name implies, enabled researchers to assemble very large segments of DNA from the abundant but previously hard-to-analyze genomes of organisms. Finally, they developed a tool to assess the similarity between whole metagenome databases.
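The article does not describe how fragment recruitment works internally. As a rough illustration of the general idea of matching sequenced fragments against a reference genome and recording where, and how closely, each one matches, here is a toy Python sketch. It assumes ungapped matching and an arbitrary 70 percent identity cutoff; the function names, the threshold and the example sequences are all hypothetical, and this is not the JCVI team's implementation.

```python
from typing import Optional, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length strings agree."""
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def recruit(read: str, reference: str,
            min_identity: float = 70.0) -> Optional[Tuple[int, float]]:
    """Slide the read along the reference without gaps.

    Returns (start position, percent identity) of the best placement,
    or None if the best placement falls below min_identity.
    The 70% default is an assumption made for this illustration.
    """
    best: Optional[Tuple[int, float]] = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        ident = percent_identity(read, window)
        if best is None or ident > best[1]:
            best = (start, ident)
    if best is not None and best[1] >= min_identity:
        return best
    return None


# Made-up reference and reads, for demonstration only.
reference = "ATGGCGTACGTTAGCCTAGGATCCGATTACA"
reads = ["GCGTACGTTA", "CCTAGGTTCC", "TTTTTTTTTT"]

for read in reads:
    print(read, recruit(read, reference))
```

Plotting each recruited fragment by its position on the reference and its percent identity would then give a picture of how closely the organisms in a sample resemble the reference genome. Real analyses rely on gapped alignment tools and far longer sequences.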

According to lead author Doug Rusch, Ph.D., computational scientist at the JCVI, “We know so little about the organisms in our environment mostly because we have lacked the genomic and computational tools for understanding and examining these organisms. We believe that this publication and the new tools we developed will help to unleash a new era of enhanced knowledge of the biological processes of microbial communities and this new understanding will begin to unlock the mysteries of unseen life.”

The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families

Characterization of microbial communities has been limited in the past by the difficulty in culturing organisms in the laboratory. With whole environment shotgun sequencing techniques, environments such as ocean microbial communities can now be better understood at the DNA and protein level.

Yooseph et al. report on the 6.12 million new proteins uncovered from 7.7 million GOS sequences by using a novel sequence clustering approach. This nearly doubles the number of known proteins. The researchers found that the GOS dataset covered almost all of the known prokaryote (bacterial and archaeal) protein families and that there were 1,700 totally unique large protein families in the GOS dataset, not matching any known families. A surprising number of the new protein families discovered are in viruses. Researchers were also able to match 6,000 previously unmatched sequences in current protein databases to proteins found in the GOS dataset.
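The article says only that a "novel sequence clustering approach" was used and gives no details. To show what clustering protein sequences into families can mean in the simplest case, the sketch below groups sequences greedily by the overlap of their 3-letter substrings (k-mers). This is a generic textbook-style technique chosen for illustration, not the method of Yooseph et al.; the 0.4 threshold, the function names and the example sequences are assumptions.

```python
from typing import List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Shared items divided by total distinct items (1.0 if both empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs: List[str], threshold: float = 0.4,
                   k: int = 3) -> List[List[str]]:
    """Assign each sequence to the first cluster whose representative
    (its first member) is similar enough; otherwise start a new cluster."""
    clusters: List[List[str]] = []
    for seq in seqs:
        for cluster in clusters:
            if jaccard(kmers(seq, k), kmers(cluster[0], k)) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters


# Made-up protein fragments, for demonstration only.
proteins = ["MKTAYIAKQR", "MKTAYIAKQK", "GSHMLEDPVQ", "MKTAYLAKQR"]
families = greedy_cluster(proteins)
for family in families:
    print(family)
```

In this toy run the three similar fragments fall into one group and the unrelated fragment forms its own. A sequence that matches no existing group starts a new one, which loosely mirrors how previously unknown families can emerge from such an analysis.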

Given the extraordinary rate of discovery of new proteins and protein families in this first phase of the GOS Expedition, the researchers conclude that there are likely many more protein families to be discovered in both microbes and viruses. The data also suggest that there is much more yet to be discovered about the biological diversity of microbes.

The team also found that several protein domains (the conserved structural units in proteins) that were previously thought to exist only in one of the four kingdoms of life (bacteria, archaea, eukaryotes, viruses) have GOS examples in another kingdom. These kingdom-crossing families may be proteins whose lineages are more ancient than previously assumed or they may have arisen due to lateral gene transfers.

To assess the impact of the GOS data on known protein families, the team also investigated several protein families in detail. In addition to increasing substantially the size and diversity of these families, the GOS sequences increased the understanding of the evolution and function of these proteins.

One example is the photolyases, proteins that repair DNA damage caused by UV light. Sunlight benefits microbes, as it does humans, but it can also harm cells exposed to it. The team discovered many new proteins that protect these organisms from UV damage and some that are involved in repairing it. These proteins were found in all organisms in the dataset, even in viruses.

Another example is glutamine synthetase (GS), the protein that plays a key role in nitrogen metabolism. More than 9,000 GS or GS-like sequences were uncovered, with a large number of sequences of type II GS (one of the three GS types). This was unexpected because type II GS is associated more with eukaryotes, not bacteria and viruses, and not many eukaryotes are expected in the filters that were analyzed. The researchers theorize that this could be due to lateral gene transfer from eukaryotes, or more likely due to gene duplication before prokaryotes and eukaryotes diverged into two branches of life.

Shibu Yooseph, Ph.D., lead author and computational scientist at JCVI, said, “The analysis we have done so far with this publication shows a tremendous diversity of organisms at the protein level and going forward, I think we will continue to see this tremendous amount of diversity. These data open up a whole new set of research efforts from a computational perspective in designing better tools to be able to deal with this sort of data, as well as making observations on evolution and how functions evolved for these protein families.”

Structural and Functional Diversity of the Microbial Kinome

The availability of the GOS metagenomic data along with other large microbial genome data sets is enabling more research into specific kinds of protein families. Of particular interest to a wide variety of researchers are kinase families. Protein kinases are protein enzymes that regulate many of the most basic cellular functions in humans and other eukaryotes. They are key targets for cancer and other disease drug development.

Previously, it was thought that different families of kinases were responsible for these types of cell regulation in prokaryotes (bacteria) versus eukaryotes (animals and other non-bacteria): eukaryote protein kinases (ePKs) were most common in eukaryotes, and histidine kinases in bacteria. However, in their PLoS Biology publication, Kannan et al. use the scope and diversity of the GOS data to show that ePK-like kinases (ELKs) are in fact very prevalent in bacteria, more so than histidine kinases. This finding is even shedding some light on human kinases.

The research team has shown that the ePK is just one family in a diverse superfamily of enzymes that all share a common protein kinase-like (PKL) fold (shape). Using sensitive profile methods, the researchers discovered more than 45,000 kinase sequences from the GOS and other public data sources and grouped these into 20 diverse families, of which ePKs were just one. The GOS data doubles the size of most PKL families and triples the number of known ePK-like kinases (ELKs). Many of these families exhibited eukaryote-like structure and function, and the researchers therefore conclude that several of these protein families existed before the divergence of the three domains of life.

The authors concluded this work shows the power of metagenomic data in allowing better understanding of any gene family and has opened the door to further research into the mechanisms of protein families and their function, structure and evolution.

Reference: PLoS Biology (

Story Source:

Materials provided by J. Craig Venter Institute. Note: Content may be edited for style and length.

Cite This Page:

J. Craig Venter Institute. "Millions Of New Genes, Thousands Of New Protein Families Found In Ocean Sampling Expedition." ScienceDaily, 14 March 2007.