Computing: Website and Database Applications


Biology: DNA molecular weight calculator.


1. DNA sequences introduction.
 
DNA is a polymer built from four kinds of molecules, called nucleobases, nitrogenous bases, nucleic acid bases, or simply bases. Their names and one-letter abbreviations are adenine (A), cytosine (C), guanine (G), and thymine (T). The bases are joined end to end to form a single strand of DNA (single-stranded DNA). In the cell, DNA usually appears in a double-stranded form, with two strands wrapped around each other in a double helix. The two strands of the double helix have matching bases, known as base pairs. In the DNA double helix, an A on one strand is always opposite a T on the other strand, and a G is always paired with a C.
If you add a sugar (2-deoxyribose in DNA) to the bases, you get the corresponding nucleosides: adenosine, guanosine, cytidine, thymidine (for DNA, more correctly: deoxyadenosine...). You can further add a phosphate and get the corresponding nucleotide (or nucleoside monophosphate): adenosine monophosphate (adenylic acid), guanosine monophosphate (guanylic acid), cytidine monophosphate (cytidylic acid), thymidine monophosphate (thymidylic acid). The polymer of these nucleotides forms a deoxyribonucleic acid (DNA). Have a look at a biochemistry book for details.
The strands also have an orientation. One end of a nucleotide is called the 5' (five prime) end, and the other is called the 3' (three prime) end. When nucleotides join to make a single strand of DNA, they always connect the 5' end of one to the 3' end of the other. Furthermore, when the cell uses the DNA, as in transcribing it to RNA, the new strand is built base by base in the 5' to 3' direction. Thus, DNA is written left to right on the page, corresponding to the 5' to 3' orientation of the bases. When two strands are joined in a double helix, they have opposite orientations. That is, the 5' to 3' orientation of one strand runs in the opposite direction to the 5' to 3' orientation of the other strand: at each end of the double helix, one strand has a 3' end and the other has a 5' end. Because the base pairs are always matched A-T and C-G and the orientations of the two strands are the reverse of each other, the term reverse complement describes the relationship between the bases of the two strands. It's "reverse" because the orientations are reversed, and "complement" because the bases always pair with their complementary bases, A with T and C with G.
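The reverse complement described above can be sketched in a few lines of Python. This is only an illustration, not the calculator's Perl code; the function name `reverse_complement` is a hypothetical choice, and the sketch assumes the input contains only the four bases A, C, G, and T.

```python
# Translation table for complementing each base (A<->T, C<->G),
# keeping the case of the input.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string written 5' to 3'."""
    # Complement every base, then reverse so the result also reads 5' to 3'.
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGCCG"))  # CGGCAT
```

Applying the function twice gives back the original sequence, which is a quick sanity check for any implementation.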
As DNA is essentially a polymer made from 4 building blocks, the nucleotides, attached end to end, it's possible to summarize the structure of a DNA molecule by simply giving the sequence of its nucleotides (or bases). Thus, a DNA sequence may be represented by a string (sequence of characters) composed of the letters A, C, G, and T, representing the 4 DNA bases (normally these codes are written in uppercase; however, many bioinformatics applications also accept lowercase base codes).
As not all bases of a sequence are always known, there are also single-letter codes for all possible groups of two, three, or four bases. Here's the complete table of standard IUB/IUPAC nucleic acid codes (note that U = uracil is a base present in RNA sequences, corresponding to T in DNA).
Code  Base(s)
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
U     Uracil
M     A or C (amino)
R     A or G (purine)
W     A or T (weak)
S     C or G (strong)
Y     C or T (pyrimidine)
K     G or T (keto)
V     A or C or G
H     A or C or T
D     A or G or T
B     C or G or T
N     A or G or C or T (any)
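In a program, the table above is conveniently stored as a lookup from each code to the bases it may stand for. The following Python sketch shows one possible layout; the names `IUPAC_DNA` and `possible_bases` are hypothetical, and U is left out on the assumption that only DNA sequences are handled.

```python
# Each IUPAC code mapped to the DNA bases it can represent.
# U (uracil) is omitted because this sketch covers DNA only.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def possible_bases(code: str) -> str:
    """Return the bases a single IUPAC code may stand for (case-insensitive)."""
    return IUPAC_DNA[code.upper()]


print(possible_bases("D"))  # AGT
```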
Sequence formatting: The FASTA format.
The simplest way to create a DNA sequence file is to use raw data, i.e. to save it as a text file containing one or several lines of base codes. However, apart from the bases themselves, this tells us nothing about the sequence; in particular, there is no indication of what molecule it is or where it comes from. Over time, biologists and programmers have invented several ways to format sequence data in computer files, in order to create standard representations of sequences that include all kinds of information about the sequence. There are lots of such formats; the most widely used are FASTA and GenBank.
FASTA format is basically just lines of sequence data with newlines at the end, so it can easily be printed on a page or displayed on a computer screen. The length of the lines isn't specified, but for compatibility, it's best to limit them to 80 characters. There is also a FASTA header: one (or several) line(s) at the beginning of the file, starting with the greater-than (>) character. The FASTA header can contain any text whatsoever (or no text). Typically, a header line contains the name of the DNA or the gene it comes from, often followed by fields separated by a vertical bar (|) that give additional information about the sequence, the experiment that produced it, or other non-sequence information of that nature. Most FASTA-aware software insists that there must be only one header line. The addition of comments (starting with a # character) is not officially supported.
If you add several FASTA-formatted sequences to the same file, you get a multiple-sequence FASTA file. These are not part of the FASTA format, but several bioinformatics applications and web-based user interfaces accept such files.
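To make the format concrete, here is a minimal Python sketch of a reader for raw, single-sequence, and multiple-sequence FASTA text. It is not the calculator's Perl code; the function name `parse_fasta` is hypothetical, and the sketch assumes exactly one header line per sequence, no comment lines, and that blank lines can be skipped.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Split FASTA (or raw) text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # assumption: blank lines carry no information
        if line.startswith(">"):
            # A new header closes the record collected so far.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        else:
            if header is None:
                header = ""  # raw data: sequence lines without a header
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = ">seq1|example\nATGC\nGGTA\n>seq2\nTTAA\n"
for name, seq in parse_fasta(example):
    print(name, seq)
```

For the example text, the loop prints `seq1|example ATGCGGTA` and `seq2 TTAA`. Raw data without any header comes back as a single record with an empty header string.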
2. DNA molecular weight online calculator.
 
"DNA molecular weight online calculator" is a web-application, that may be used to calculate the molecular weight of one or more DNA sequences. The base codes may be uppercase or lowercase (transformation to uppercase in this case). U (uracil) will be rejected as an invalid DNA base. Spaces, end-of-line and tab characters are removed, before the sequence is validated. The sequences may be entered either as raw data, either in FASTA format. The FASTA headers must be one single line; comments are not permitted. Multiple-sequence FASTA is supported.
You can enter the sequence manually (for example, using Copy/Paste) or upload a file, stored on your computer. To use a file, just select the corresponding checkbox. Please, note, that the filesize is limited to 25kB and that the filename must be all letters, numbers, spaces, underscores or hyphens.
You can choose, if you want to do the molecular weight calculation for single-stranded DNA (the sequence, that you entered) or for double-stranded DNA (the entered sequence plus its reversed complement). You can also choose the DNA sequence topology: linear or circular. Linear sequences are assumed to have a 5' phosphate.
Molecular weight is calculated by adding the molecular weights of the sequence's bases (resp. the sequence's and reverse complement's bases):
    A = 313.21; C = 289.18; G = 329.21; T = 304.2.
For linear sequences, 17.01 is added, corresponding to the 1 additional oxygen and 1 additional hydrogen atom that are part of the "free" 5' phosphate.
If the sequence contains extended base codes, two molecular weights are calculated: the minimum molecular weight, where every uncertain base is taken to be the candidate with the smallest molecular weight, and the maximum molecular weight, where every uncertain base is taken to be the candidate with the highest molecular weight. For example, for the base code D (which may be A, G, or T), the minimum molecular weight is 304.2 (that of T) and the maximum molecular weight is 329.21 (that of G).
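As a worked illustration of this calculation, here is a short Python sketch for sequences made of A, C, G, and T only. The function name `molecular_weight` and its parameters are hypothetical. One point is an assumption and not stated in the text: for linear double-stranded DNA, the 17.01 term is added once per strand.

```python
# Per-base weights and the linear-end term, as given in the text.
WEIGHTS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.2}
END_TERM = 17.01
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def molecular_weight(seq, double_stranded=False, circular=False):
    """Molecular weight of an unambiguous DNA sequence, rounded to 2 decimals."""
    seq = seq.upper()
    total = sum(WEIGHTS[base] for base in seq)
    strands = 1
    if double_stranded:
        # The second strand is the reverse complement; reversing the
        # order of the bases does not change the sum of their weights.
        total += sum(WEIGHTS[COMPLEMENT[base]] for base in seq)
        strands = 2
    if not circular:
        # ASSUMPTION: one end term per strand for linear molecules.
        total += END_TERM * strands
    return round(total, 2)


print(molecular_weight("ATGC"))                 # 1252.81
print(molecular_weight("ATGC", circular=True))  # 1235.8
```

For the linear single strand ATGC, the sum is 313.21 + 304.2 + 329.21 + 289.18 = 1235.80, plus 17.01 for the free end, giving 1252.81.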
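The minimum/maximum rule can be sketched in Python as follows. The sketch covers single-stranded sequences only, since the text does not spell out how uncertain bases are handled for the second strand; the name `weight_range` is hypothetical, and the lookup table repeats the IUPAC codes listed earlier (without U).

```python
WEIGHTS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.2}
END_TERM = 17.01  # added for linear sequences, as given in the text

# IUPAC codes mapped to the bases they may stand for (DNA only).
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT",
    "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}


def weight_range(seq, circular=False):
    """Return (minimum, maximum) weight of a single-stranded sequence."""
    low = high = 0.0
    for code in seq.upper():
        candidates = [WEIGHTS[base] for base in IUPAC_DNA[code]]
        low += min(candidates)    # lightest base the code allows
        high += max(candidates)   # heaviest base the code allows
    if not circular:
        low += END_TERM
        high += END_TERM
    return round(low, 2), round(high, 2)


print(weight_range("AD"))  # (634.42, 659.43)
```

For the linear sequence AD, the minimum is 313.21 + 304.2 + 17.01 = 634.42 (D read as T) and the maximum is 313.21 + 329.21 + 17.01 = 659.43 (D read as G). For a sequence without extended codes, both values coincide.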
Use the following link to start the online application. For details about the Perl source code, see below...
3. DNA molecular weight calculator Perl script.
 
Download not yet ready...