Identifying problems
There are a number of reasons a researcher might assign a sequence to the wrong organism, including microbial contamination in samples, chimerism (when the genomes of two organisms combine during the DNA amplification process), poor taxonomic identification, or even simple human mix-ups during sample preparation.
The extent of the mislabeled sequence problem remains a matter of speculation, but a few studies have offered some insight. Earlier this year, for instance, Langdon searched a subset of data from the 1000 Genomes Project for possible contamination. “About 7 percent of samples have Mycoplasma contamination,” he said.
Another study this year identified Bradyrhizobium as a common contaminant in eukaryotic sequences. For instance, sequences assigned to taxa as diverse as a Tibetan antelope, a fungus, a protozoan, and Homo sapiens are all actually Bradyrhizobium sequences. “The problem is much more extensive,” Martin Laurence, the founder of ShipShaw labs who led the study, told The Scientist in an e-mail. “I have a long, unpublished list of contaminated sequences, since the DNA extraction kits I use are also contaminated, so I end up seeing a zoo of animals in my human clinical species (parrot sequences are particularly popular),” he continued. “Obviously, there were no parrots or Tibetan antelopes anywhere near my samples.”
Evolutionary biologist Stephen Smith at the University of Michigan builds large phylogenetic trees of plants. In one project, on a group of plants including cacti and carnivorous species, Smith analyzed about 4,000 organisms that had enough overlapping sequences in GenBank to make a tree. “Something on the order of 1 to 2 percent of what I used to build this tree is mislabeled,” he said. “It’s not a big number, but if you care where species fall within the phylogeny, it does make it a big deal.”
Even when it is apparent that a sequence in GenBank is mislabeled, only the person who submitted the errant entry can correct it. Although there are procedures to alert the database administrators to problems, contacting the submitters and investigating each case is laborious for them. Mislabeled submissions are sometimes corrected, but often they remain in the database.