Day 10 Evening Lecture Notes
Steve Williams, Smith College
June 15, 2004
Given 600-800 bp sequences from a single gel, how do we perform additional reassembly? We need appropriate primers to generate many overlapping sequences.
Note that restriction digests cannot be reassembled because they have no overlaps. This problem can be overcome by using more than one restriction enzyme on the original uncut DNA. With luck, all junctions will be crossed by sequences in the other digest. The enzymes used should have short recognition sequences so as to get many small fragments in the digest. There are special kits which allow the production of large (up to 30 kbp) PCR products to fill in gaps in a sequence by making primers from segments that were already sequenced.
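A minimal Python sketch of the two-digest idea, under simplifying assumptions: each enzyme is taken to cut at the first base of its recognition site (real enzymes cut at enzyme-specific offsets), the sequence is a made-up toy, and the helper names are hypothetical. It lists the junctions of one digest that are not spanned by any fragment of the other digest.

```python
def cut_positions(seq, site):
    """Indices where an enzyme recognizing `site` cuts `seq` (simplified: at the site start)."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest(seq, site):
    """Fragments produced by cutting `seq` at every occurrence of `site`."""
    cuts = [0] + cut_positions(seq, site) + [len(seq)]
    return [seq[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]


def uncrossed_junctions(seq, site_a, site_b, min_flank=3):
    """Cuts of enzyme A that no enzyme-B fragment spans with min_flank bases on each side."""
    bounds_b = [0] + cut_positions(seq, site_b) + [len(seq)]
    missing = []
    for cut in cut_positions(seq, site_a):
        spanned = any(lo + min_flank <= cut <= hi - min_flank
                      for lo, hi in zip(bounds_b, bounds_b[1:]))
        if not spanned:
            missing.append(cut)
    return missing


# Toy sequence with two GATC sites and two AGCT sites (both 4-base recognition sequences).
seq = ("AAAA" + "GATC" + "TTTTT" + "AGCT" + "CCCCC"
       + "GATC" + "GGGGG" + "AGCT" + "AAAA")

print(digest(seq, "GATC"))
print(uncrossed_junctions(seq, "GATC", "AGCT"))  # junctions the second digest fails to cross
print(uncrossed_junctions(seq, "GATC", "GATC"))  # one digest never crosses its own junctions
```

In this toy case every GATC junction lies inside an AGCT fragment, so the second digest supplies the overlaps that the first one lacks.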
Primers are only about $1/bp. The first primer can always be made from the end of the vector cloning site; then make the second primer from the first sequence, the third primer from the second sequence, etc. To overcome accuracy problems in areas with compressions, it's customary to sequence both strands as a check. Most genome projects use 7x oversampling. Note that the new primers determined from the previous sequence are the same as the sequence, not the complement.
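The walking procedure can be sketched as a small simulation. This is only an illustration under stated assumptions: the template is random made-up sequence, the vector primer site is a hypothetical string, reads are error-free and 700 bases long, and each new primer is a 20-mer taken 50 bases back from the end of the previous read (the setback reflects the poor data quality at read ends). The function names `next_primer` and `walk` are invented for this sketch.

```python
import random


def next_primer(read, primer_len=20, setback=50):
    """Take the next primer from near the end of `read`.

    The primer is the read sequence itself, not its complement.
    """
    end = len(read) - setback
    start = end - primer_len
    if start < 0:
        raise ValueError("read too short for this primer_len and setback")
    return read[start:end]


def walk(template, first_primer, read_len=700, primer_len=20, setback=50):
    """Simulate primer walking along `template`; return the reads in order."""
    reads = []
    primer, pos = first_primer, 0
    while True:
        site = template.find(primer, pos)  # where the primer anneals
        if site == -1:
            raise ValueError("primer does not bind the template")
        start = site + len(primer)
        read = template[start:start + read_len]
        if not read:
            break
        reads.append(read)
        if start + read_len >= len(template):
            break  # reached the end of the insert
        primer = next_primer(read, primer_len, setback)
        pos = start
    return reads


rng = random.Random(0)
vector_site = "ACGTTGCAACGTTGCAAGCT"  # hypothetical vector primer site
insert = "".join(rng.choice("ACGT") for _ in range(3000))
template = vector_site + insert

reads = walk(template, first_primer=vector_site)
print(len(reads), "serial rounds to walk a 3000-base insert")
```

Each round depends on the read from the round before it, which is why the rounds cannot be run in parallel.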
The problem with walking is that it's inherently serial and therefore slow. Note that designing both forward and reverse primers at the same time will give about a factor of 2 speed-up.
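The factor of 2 is simple arithmetic, sketched here with assumed numbers: a 10,000-base insert and 650 new bases gained per round in each direction (a figure chosen to fit the 600-800 bp read length; it is not from the lecture). The function name is hypothetical.

```python
import math


def walking_rounds(insert_len, step=650, directions=1):
    """Serial sequencing rounds needed when each round adds `step` new bases per direction."""
    return math.ceil(insert_len / (step * directions))


print(walking_rounds(10_000, directions=1))  # walking from one end only
print(walking_rounds(10_000, directions=2))  # forward and reverse primers together
```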
Randomly sheared DNA can be blunt-cloned into a vector and used with vector primers. Randomly sheared DNA can be hard to reassemble if there are repeated sequences so that more than one solution exists.
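A toy Python sketch of why repeats make reassembly ambiguous. It assumes error-free reads and a simple suffix-prefix overlap rule with a minimum overlap of 4 bases; the reads and the helper names are invented for illustration, and real assemblers are far more elaborate.

```python
def overlap(a, b, min_len=4):
    """Length of the longest suffix of `a` that is a prefix of `b` (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def successors(read, reads, min_len=4):
    """Reads that could follow `read` in an assembly, mapped to their overlap lengths."""
    found = {}
    for other in reads:
        if other != read:
            k = overlap(read, other, min_len)
            if k > 0:
                found[other] = k
    return found


# Unique sequence: exactly one read can follow the first one.
unique_reads = ["ACGTTGCA", "TGCAGGAT", "GGATCCAA"]
print(successors("ACGTTGCA", unique_reads))

# Repeated sequence: two reads start with the same repeat (TAGG),
# so two different assemblies are equally consistent with the data.
repeat_reads = ["ACGTTAGG", "TAGGCCCC", "TAGGAAAA"]
print(successors("ACGTTAGG", repeat_reads))
```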
Use nucleases to move the vector primer site through the insert in steps via a kit from a company like Exo-Size.
Use transposases to randomly pop primers into an interesting sequence at different places. This only works because there will be only one transposition per vector.
STRs are very hard to sequence as primers will find multiple binding sites. Centromeres, telomeres and LINE1 all have this problem. Two LINE1 repeats are typically about 90% the same. Should this expensive sequencing even be performed?
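The multiple-binding-site problem can be illustrated with a toy Python sketch. The assumptions are loud ones: binding is modeled as "at most N mismatches over the primer length", which ignores real hybridization behavior, and the two 20-base repeat copies are made up so that they are 90% identical. The function names are hypothetical.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def binding_sites(template, primer, max_mismatches=2):
    """Start positions where `primer` matches `template` with at most max_mismatches."""
    n = len(primer)
    return [i for i in range(len(template) - n + 1)
            if mismatches(template[i:i + n], primer) <= max_mismatches]


copy1 = "ACGTACGGTCAATGCCTAGG"
copy2 = "ACGAACGGTCAACGCCTAGG"  # 18 of 20 bases identical to copy1 (90%)
template = "TTTTT" + copy1 + "CCCCCCCCCC" + copy2 + "AAAAA"

primer = copy1
print(binding_sites(template, primer, max_mismatches=0))  # perfect matches only
print(binding_sites(template, primer, max_mismatches=2))  # tolerant binding also hits the second copy
```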
cDNA library sequences are often put into databases after 1 pass of sequencing and are often unreliable. For new sequences GenBank will require sequencing of both strands.
Primers shouldn't be designed to hybridize to the very end of an unknown sequence because the data quality there is always very poor.
Most sequences nowadays are run with thermal cycling, base analogues and denaturing gels, all the tricks at once. Some companies now will even prepare a sequence from a colony on a gel without any further preparation by the researcher. Such facilities usually ask to see a quantitation gel photo first. A general rule in molecular biology is to perform lots of controls and checks all the time as doing so saves effort in the long run.
How did Venter's group obtain a species count from such a mishmash of DNA fragments from so many species? By analyzing ribosomal DNA sequences that are highly conserved over evolution but slightly different from species to species. Primer design and reassembly are therefore relatively straightforward for these species. From the ribosomal sequences some further walking backward and forward on the genomes could be done, but sequencing the whole genomes of the organisms was not the point.
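One way to turn "slightly different from species to species" into a count is to group ribosomal sequences by percent identity. The Python sketch below is an assumption-laden illustration, not a description of what Venter's group actually did: it assumes the sequences are already aligned and of equal length, uses a greedy grouping rule, and the 90% threshold and the toy sequences are chosen only to keep the example small.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def count_species(seqs, threshold=0.9):
    """Greedy count: a sequence starts a new group unless it matches an existing representative."""
    representatives = []
    for s in seqs:
        if not any(identity(s, r) >= threshold for r in representatives):
            representatives.append(s)
    return len(representatives)


a = "ACGTACGTACGTACGTACGT"
a_variant = "ACGTACGTACGTACGTACGA"  # one difference from a (95% identical)
b = "TCGTTCGTTCGTTCGTTCGT"          # five differences from a (75% identical)

print(count_species([a, a_variant, b]))  # a and a_variant are grouped together, b stands alone
```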
Molecular biological ecosystem surveys are becoming more common. Some companies like Diversa specialize in looking for new restriction enzymes in the field. People are trampling all over Yellowstone and the Antarctic looking for new extremophiles. A colleague of Bart's sequenced his own septic system and found two new restriction enzymes.