Number 202 (Story #1), November 9, 1994 by Phillip F. Schewe and Ben Stein
SO-CALLED "JUNK" DNA, regions of genetic material (accounting for 97% of the human genome) that do not provide blueprints for proteins and therefore have no apparent purpose, has puzzled scientists. Now a new study shows that these non-coding sequences seem to possess structural similarities to natural languages. This suggests that these "silent" DNA regions may carry biological information, according to a statistical analysis of DNA fragments by researchers at Boston University and Harvard Medical School (contact H.E. Stanley of Boston University, 617-353-2617). Studying DNA sequences from humans, viruses, bacteria, yeast, and other organisms, the researchers performed statistical linguistics tests on the 37 known DNA sequences that each contain at least 50,000 "base pairs," or "letters," of DNA code. The researchers first performed a variation of a test known as Zipf analysis, in which the words from a text are arranged along an x-axis from most frequently occurring to least frequently occurring; plotted against each word's rank is the number of times that word occurs in the text. For natural languages one invariably gets a straight line (on a graph using logarithmic axes) whose slope is about -1. The non-coding DNA sequences also produced straight lines when base pairs were grouped into genetic "words" consisting of 3, 6, 7, or 8 base pairs. Interestingly, the slope values for non-coding sequences were closer to -1 than those for coding DNA, supporting a hypothesis that protein-coding DNA may be more like a compressed computer file than a natural language. (R.N. Mantegna et al., upcoming article in Physical Review Letters.)
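The Zipf test described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the researchers' procedure: the item says they used "a variation" of Zipf analysis without giving details, so the overlapping sliding window used to form n-letter "words," the plain least-squares fit, and the function names `rank_frequency` and `zipf_slope` are all assumptions made for the example.

```python
import math
from collections import Counter


def rank_frequency(sequence, word_len):
    """Count every n-letter "word" and return the counts, most frequent first.

    Assumption: words are taken from an overlapping sliding window.
    """
    words = (sequence[i:i + word_len]
             for i in range(len(sequence) - word_len + 1))
    return sorted(Counter(words).values(), reverse=True)


def zipf_slope(counts):
    """Least-squares slope of log(count) versus log(rank)."""
    if len(counts) < 2:
        raise ValueError("need at least two ranked words to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Sanity check on an idealized Zipf distribution (count proportional to
# 1/rank): the fitted slope is approximately -1 by construction.
ideal_counts = [1000 / rank for rank in range(1, 51)]
print(zipf_slope(ideal_counts))
```

For a real sequence stored as a string of A, C, G, and T, the slope for 6-letter words would be `zipf_slope(rank_frequency(sequence, 6))`; comparing that value for coding and non-coding stretches mirrors the comparison reported above.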