Computing: Website and Database Programming

Biology: DNA molecular weight calculator.


1. DNA sequences introduction.
 
Deoxyribonucleic acid (DNA) is a polymer whose sequence is defined by four molecules, called nucleobases, nitrogenous bases, nucleic acid bases, or simply bases. Their names and one-letter abbreviations are adenine (A), cytosine (C), guanine (G), and thymine (T). The bases are joined end to end to form a single strand of DNA (single-stranded DNA). In the cell, DNA usually appears in a double-stranded form, with two strands wrapped around each other in a double helix shape. The two strands of the double helix have matching bases, known as base pairs. In the DNA double helix, an A on one strand is always opposite a T on the other strand, and a G is always paired with a C.
If you add a sugar (2-deoxyribose in DNA) to the bases, you get the corresponding nucleosides: adenosine, guanosine, cytidine, thymidine (for DNA, more correctly: deoxyadenosine...). You can further add a phosphate and get the corresponding nucleotide (or nucleoside monophosphate): adenosine monophosphate (adenylic acid), guanosine monophosphate (guanylic acid), cytidine monophosphate (cytidylic acid), thymidine monophosphate (thymidylic acid). The polymer of these nucleotides forms a deoxyribonucleic acid (DNA). Have a look at a biochemistry book for details.
There is also an orientation (or directionality) to the strands. One end of a nucleotide is called the 5' (five prime) end, and the other is called the 3' (three prime) end. When nucleotides join to make a single strand of DNA, they always connect the 5' end of one to the 3' end of the other. Furthermore, when the cell uses the DNA, as in transcribing it to RNA, it does so base by base from the 5' end to the 3' end of the molecule. Thus, when DNA is written, it is written left to right on the page, corresponding to the 5' to 3' orientation of the bases. When two strands are joined in a double helix, the two strands have opposite orientations. That is, the 5' to 3' orientation of one strand runs in the opposite direction to the 5' to 3' orientation of the other strand: at each end of the double helix, one strand has a 3' end and the other has a 5' end. Because the base pairs are always matched A-T and C-G and the orientations of the strands are the reverse of each other, the expression reverse complement describes the relationship of the bases of the two strands. It's "reverse" because the orientations are reversed, and "complement" because the bases always pair with their complementary bases, A with T and C with G.
As DNA is essentially a polymer made from 4 building blocks, the nucleotides, attached end to end, it's possible to summarize the structure of a DNA molecule by simply giving the sequence of the nucleotides (sequence of the bases). Thus, a DNA sequence may be represented by a string (sequence of characters) composed of the letters A, C, G, and T, representing the 4 DNA bases (normally these codes are written in uppercase; however, lots of bioinformatics applications also accept lowercase base codes).
As not all bases of a sequence are always known, there are also single-letter codes for all possible groups of two, three, or four bases. Here's the complete table of standard IUB/IUPAC nucleic acid codes (note that U = uracil is a base present in RNA sequences, corresponding to T in DNA).
Code  Nucleobase
A     Adenine
C     Cytosine
G     Guanine
T     Thymine
U     Uracil
 
Code  Nucleobases
M     A or C (amino)
R     A or G (purine)
W     A or T (weak)
S     C or G (strong)
Y     C or T (pyrimidine)
K     G or T (keto)
 
Code  Nucleobases
V     A or C or G
H     A or C or T
D     A or G or T
B     C or G or T
N     A or G or C or T (any)
Sequence formatting: The FASTA format.
The simplest way to create a DNA sequence file is to use raw data, i.e. to save it as a text file containing one or several lines of base codes. However, apart from the bases it contains, this tells us nothing about the sequence; in particular, there is no indication of what molecule it is and where it comes from. Over time, biologists and programmers have invented several ways to format sequence data in computer files, in order to create standard representations of sequences that include all kinds of information about the sequence. There are lots of such formats; the most widely used are FASTA and GenBank.
FASTA format is basically just lines of sequence data with newlines at the end, so it can easily be printed on a page or displayed on a computer screen. The length of the lines isn't specified, but for compatibility, it's best to limit them to 80 characters. There is also a FASTA header: one (or several) line(s) at the beginning of the file, starting with the greater-than (>) character. The FASTA header can contain any text whatsoever (or no text). Typically, a header line contains the name of the DNA or the gene it comes from, often followed by further fields, separated by vertical bars (|), that give additional information about the sequence, the experiment that produced it, or other non-sequence information of that nature. Most FASTA-aware software insists that there be only one header line. The addition of comments (starting with a # character) is not officially supported.
If you add several FASTA-formatted sequences to the same file, you get a multiple sequence FASTA file. These are not part of the FASTA format, but several bioinformatics applications and web-based user-interfaces accept such files.
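To make the format more concrete, here is a minimal Perl sketch of how raw or (multiple-sequence) FASTA text could be split into header/sequence pairs. The sub name parse_fasta and the sample data are assumptions made for this illustration; this is not the code of the calculator script.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): split raw or FASTA
# text into a list of [header, sequence] pairs
sub parse_fasta {
    my ($text) = @_;
    my @records;
    for my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*$/;           # skip empty lines
        if ($line =~ /^>(.*)$/) {
            # A header line starts a new record
            push @records, [$1, ''];
        }
        else {
            # Raw data without any header gets an empty header
            push @records, ['', ''] unless @records;
            $line =~ s/\s+//g;              # drop spaces and tabs
            $records[-1][1] .= uc $line;    # append, converted to uppercase
        }
    }
    return @records;
}

my $fasta = ">seq1|example\nACGT\nacgt\n>seq2\nGGCC\n";
for my $rec (parse_fasta($fasta)) {
    my ($header, $seq) = @$rec;
    print "$header: $seq\n";
}
```

With the sample data, the sketch prints one line per sequence: "seq1|example: ACGTACGT" and "seq2: GGCC".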
2. DNA molecular weight online calculator.
 
"DNA molecular weight online calculator" is a web application that may be used to calculate the molecular weight of one or more DNA sequences. The base codes may be uppercase or lowercase (lowercase codes are converted to uppercase). U (uracil) will be rejected as an invalid DNA base. Spaces, end-of-line and tab characters are removed before the sequence is validated. The sequences may be entered either as raw data or in FASTA format. A FASTA header must be one single line; comments are not permitted. Multiple-sequence FASTA is supported.
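The input rules above can be sketched in a few lines of Perl. The sub name normalize_sequence is hypothetical, and the sketch only illustrates the described behavior (whitespace removal, uppercase conversion, rejection of U); it is not taken from the script.

```perl
use strict;
use warnings;

# Hypothetical helper: clean up one sequence as described in the text
sub normalize_sequence {
    my ($seq) = @_;
    $seq =~ s/\s+//g;    # remove spaces, tabs and end-of-line characters
    $seq = uc $seq;      # lowercase base codes become uppercase
    # Accept standard and extended DNA codes only; U (uracil) is not among them
    return $seq =~ /^[ACGTMRWSYKVHDBN]+$/ ? $seq : undef;
}

for my $input ("acg t\nTTa", "ACGU") {
    my $clean = normalize_sequence($input);
    print defined $clean ? "valid: $clean\n" : "rejected\n";
}
```

The first input is cleaned to ACGTTTA; the second one is rejected because of the U.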
You can enter the sequence manually (for example, using Copy/Paste) or upload a file stored on your computer. To use a file, select the corresponding checkbox. Please note that the file size is limited to 25 kB and that filenames must consist only of letters, numbers, spaces, underscores or hyphens.
You can choose whether to do the molecular weight calculation for single-stranded DNA (the sequence that you entered) or for double-stranded DNA (the entered sequence plus its reverse complement). You can also choose the DNA sequence topology: linear or circular. Linear sequences are assumed to have a 5' phosphate.
Molecular weight is calculated by adding the molecular weights of the sequence's bases (for double-stranded DNA, those of the sequence and of its reverse complement):
    A = 313.21; C = 289.18; G = 329.21; T = 304.2.
For linear sequences, 17.01, corresponding to 1 supplementary oxygen and 1 supplementary hydrogen atom as parts of the "free" 5' phosphate, is added.
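As a small worked example, the following Perl snippet applies these values to the (arbitrarily chosen) sequence ATGC. The variable names are only illustrative.

```perl
use strict;
use warnings;

# Weights as given in the text above
my %weight = (A => 313.21, C => 289.18, G => 329.21, T => 304.20);
my $oh = 17.01;    # added once for a linear strand

my $seq = 'ATGC';
my $mw  = 0;
$mw += $weight{$_} for split //, $seq;    # 313.21 + 304.20 + 329.21 + 289.18

printf "circular: %.2f\n", $mw;          # 1235.80
printf "linear:   %.2f\n", $mw + $oh;    # 1252.81
```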
If the sequence contains extended base codes, two molecular weights are calculated. The minimum molecular weight is the weight of the sequence if every uncertain base is the candidate with the smallest molecular weight. The maximum molecular weight is the weight of the sequence if every uncertain base is the candidate with the highest molecular weight. For example, for the base code D (which may be A, G or T), the minimum molecular weight is 304.2 (that of T) and the maximum molecular weight is 329.21 (that of G).
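The minimum/maximum idea can be sketched with List::Util as follows. The sub weight_range and the reduced table of extended codes are assumptions made for this illustration; the topology correction is left out, and the script itself uses a different technique (candidate bases ordered by weight, see the routines further down).

```perl
use strict;
use warnings;
use List::Util qw(min max);

my %weight = (A => 313.21, C => 289.18, G => 329.21, T => 304.20);
# Candidate bases for a few codes (a subset of the IUPAC table)
my %candidates = (A => 'A', C => 'C', G => 'G', T => 'T',
                  R => 'AG', Y => 'CT', D => 'AGT', N => 'ACGT');

# Hypothetical helper: minimum and maximum weight of a sequence
sub weight_range {
    my ($seq) = @_;
    my ($lo, $hi) = (0, 0);
    for my $code (split //, $seq) {
        my @w = map { $weight{$_} } split //, $candidates{$code};
        $lo += min(@w);    # lightest possible base at this position
        $hi += max(@w);    # heaviest possible base at this position
    }
    return ($lo, $hi);
}

my ($lo, $hi) = weight_range('AD');
printf "min %.2f, max %.2f\n", $lo, $hi;    # min 617.41, max 642.42
```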
Use the following link to start the online application.
3. DNA molecular weight calculator Perl script.
 
Click the following link to download the DNA molecular weight calculator Perl script and all other files needed to run this application on your web server. Have a look at the ReadMe.txt file, included in the download archive, for details about the different files, and where to place them on the server.
The Perl script is rather long and I do not display the source code here. Just some remarks, concerning how the application works:
  • The script first checks from where it should read the DNA sequence.
    • If the Load DNA sequence from local file checkbox is selected, the script assumes that the user wants to upload a file containing the sequence. The user therefore has to browse for the file before pushing the Calculate button, so that the script gets the filename and can create a file handle (an error message is displayed otherwise). File upload is always a potential danger, as it could be used to hack the webserver and even the operating system. If you are not sure about this, you might want to have a look at my Uploading files using CGI and Perl tutorial (the tutorial example is my DNA molecular weight calculator application). If a filename is given and is valid, the script creates a file handle and reads the file content into a string variable (thus, no file is saved on the server here).
    • If the Load DNA sequence from local file checkbox is not selected, the sequence is read from the text box.
  • The test web page is generated by reading a template file and replacing all custom tags (template lines containing a tag start with '#' and all tags are placed between '#' symbols) with the corresponding actual values; in particular, the tag '#table#' is replaced with an HTML table showing the minimum and maximum molecular weights for the sequence(s) analyzed.
  • The main DNA molecular weight calculation routine (called calculateMolecularWeights) parses the sequence data, filters out the FASTA headers, if there are any, and, for each individual sequence (remember that the DNA input may be in multiple-sequence FASTA format), calls the sub molecularWeightSeq (which, itself, calls molecularWeightBase for each base) to calculate the molecular weight of a single strand of DNA, depending on its topology. If the DNA is double-stranded, a second call is made to the sub, in order to calculate the weight of the reverse complement of the sequence.
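The template step from the list above could look roughly like the following sketch. Only the tag name '#table#' comes from the description; the sub fill_template, the sample template and the sample value are hypothetical, and the sketch assumes that a line containing a tag begins with the tag itself.

```perl
use strict;
use warnings;

# Hypothetical template lines and values
my @template = (
    '<h1>DNA molecular weight</h1>',
    '#table#',
    '<p>Done.</p>',
);
my %values = (table => '<table><tr><td>1235.80</td></tr></table>');

sub fill_template {
    my ($lines, $vals) = @_;
    my @out;
    for my $line (@$lines) {
        my $text = $line;    # work on a copy, so the template stays untouched
        # Only lines starting with '#' are searched for tags
        if ($text =~ /^#/) {
            # Unknown tags are left as they are
            $text =~ s/#(\w+)#/exists($vals->{$1}) ? $vals->{$1} : "#$1#"/ge;
        }
        push @out, $text;
    }
    return @out;
}

print "$_\n" for fill_template(\@template, \%values);
```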
As I said, the script is too long to publish here. This is not the case, however, for some generally usable DNA-related routines.
Calculate molecular weight(s) of a DNA sequence.
    sub molecularWeightSeq {
        # $dna is a single-stranded DNA sequence
        # The sequence may contain extended base codes
        # If the sequence includes invalid characters (incl. the RNA base code U), molecular weights will be set to 0

        my ($dna, $topology) = @_; my $mw1 = 0; my $mw2 = 0;
        my $oh = 17.01;   # molecular weight of 1 oxygen + 1 hydrogen
        if (validDna($dna)) {
            # If the sequence is valid DNA, calculate molecular weight(s) as sum of the bases weights
            for (my $i=0; $i < length($dna); $i++) {
                my ($bmw1, $bmw2) = molecularWeightBase(substr($dna, $i, 1));
                $mw1 += $bmw1; $mw2 += $bmw2;
            }
            # For a linear sequence, add the molecular weight of the supplementary O and H at the "free" 5' end
            if ($topology eq 'linear') {
                $mw1 += $oh; $mw2 += $oh;
            }
        }
        return($mw1, $mw2);
    }
Calculate molecular weight(s) of a DNA base.
    sub molecularWeightBase {
        # $base is a valid (standard or extended) DNA base code
        my ($base) = @_; my $mw1 = 0; my $mw2 = 0;
        # Extended to standard base code conversion; standard bases ordered by molecular weight!
        my %dna_extended = (
            'M' => 'CA', 'R' => 'AG', 'W' => 'TA', 'S' => 'CG', 'Y' => 'CT', 'K' => 'TG',
            'V' => 'CAG', 'H' => 'CTA', 'D' => 'TAG', 'B' => 'CTG', 'N' => 'CTAG'
        );
        # Molecular weights of standard DNA bases
        my %baseWeights = (
            'A' => 313.21, 'C' => 289.18, 'G' => 329.21, 'T' => 304.20
        );
        # Base molecular weight calculation
        if ($base =~ /^([ACGT])$/) {
            # Standard base code (A, C, G, T)
            $mw1 = $baseWeights{$base}; $mw2 = $mw1;
        }
        else {
            # Extended base code
            my $bases_extended = $dna_extended{$base};
            $mw1 = $baseWeights{substr($bases_extended, 0, 1)}; $mw2 = $baseWeights{substr($bases_extended, -1, 1)};
        }
        return($mw1, $mw2);
    }
Calculate reverse complement of a DNA sequence.
    sub reverseComp {
        # $dna is a valid DNA sequence (standard or extended base codes)
        my ($dna) = @_;
        # For each base of the sequence, determine the complementary base
        for (my $i = 0; $i < length($dna); $i++) {
            substr($dna, $i, 1, baseComp(substr($dna, $i, 1)));
        }
        # Reverse the entire sequence
        $dna = reverse($dna);
        return $dna;
    }
Calculate complement of a DNA base.
    sub baseComp {
        # $base is valid DNA base (standard or extended)
        my ($base) = @_;
        $base =~ tr/ATCGMRWSYKVHDBN/TAGCKYWSRMBDHVN/;
        return $base;
    }
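As an illustration of what the two routines above do together, here is a compact variant that reverses the string first and then complements all bases in a single tr/// pass. The sub name reverse_complement is an assumption for this sketch; the translation table is the one used in baseComp.

```perl
use strict;
use warnings;

# Compact variant (an illustration, not the routine above)
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;                         # scalar context: reverses the string
    $rc =~ tr/ATCGMRWSYKVHDBN/TAGCKYWSRMBDHVN/;    # complement every base
    return $rc;
}

print reverse_complement('ATGCR'), "\n";    # YGCAT
```

Applying the sub twice gives back the original sequence, which is a handy check for the translation table.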
Check DNA sequence validity.
    sub validDna {
        my ($dna) = @_; my $valid = 0;
        my $bases = 'ACGTMRWSYKVHDBN';
        if ($dna =~ /^([$bases]+)$/) {
            $valid = 1;
        }
        return $valid;
    }
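To show how such routines fit together for double-stranded DNA, here is a condensed, self-contained sketch. The subs strand_weight and revcomp are simplified stand-ins (standard bases only) for the routines above, and the sketch assumes that the weights of the two strands are simply added.

```perl
use strict;
use warnings;

my %weight = (A => 313.21, C => 289.18, G => 329.21, T => 304.20);

# Simplified stand-in for molecularWeightSeq (standard bases only)
sub strand_weight {
    my ($dna, $topology) = @_;
    return 0 unless $dna =~ /^[ACGT]+$/;       # invalid sequence: weight 0
    my $mw = 0;
    $mw += $weight{$_} for split //, $dna;
    $mw += 17.01 if $topology eq 'linear';     # extra O and H of a linear strand
    return $mw;
}

# Simplified stand-in for reverseComp (standard bases only)
sub revcomp {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGT/TGCA/;
    return $rc;
}

my $seq    = 'ATGC';
my $single = strand_weight($seq, 'linear');
my $double = $single + strand_weight(revcomp($seq), 'linear');
printf "single-stranded: %.2f\n", $single;    # 1252.81
printf "double-stranded: %.2f\n", $double;    # 2505.62
```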
If you want to place a link to the application on some other page, include the following into this page's HTML:
<a href="/cgi-bin/dna_molweight.pl">DNA molecular weight calculator</a>
4. Related stuff on this site.
 
If you need a more general application to determine the molecular weight of chemical molecules, you may like my Lazarus/Free Pascal GUI application MolWeight, which may be used to determine the weight of molecules entered by their chemical formula, as well as the weight of DNA, RNA and protein sequences. Click the following link to view the description of the "MolWeight" PC application.

If you find this page helpful or if you like the DNA molecular weight calculator web application, please, support me and this website by signing my guestbook.