Virtual Footprint Version 3.0

Guided Tour/Easy Example (Try this first)

Short Introduction

Virtual Footprint is a new sensitive search tool for recognizing single or composite DNA patterns. It was designed especially for analyzing transcription factor binding sites in whole bacterial genomes and their underlying regulatory networks. A pattern can consist of several subpatterns separated by variable spacer regions; however, this web version is restricted to bipartite patterns because the calculation is time-consuming. A subpattern is defined by a position weight matrix, an IUPAC consensus or a regular expression. A large library of bacterial position weight matrices is provided.
Furthermore, the program offers the possibility of analyzing the results according to their genomic context. Matches in coding regions can be excluded, the size of the upstream region (distance to the start codon) can be defined, and the pattern orientation can be selected. The result is a list of potential binding sites and corresponding genes defining the whole regulon.
All matches are hyperlinked to an interactive genome browser that visualizes the genomic region, and genes are linked to the PRODORIC database, which provides further information about the molecular networks.

System Requirement

The web pages require an HTML 4.0-compliant browser with the following features enabled: graphics display, JavaScript and Cascading Style Sheets. The upload function behaves differently in the Safari browser on Macintosh systems: there it is necessary to choose "Upload FASTA Sequence" explicitly by clicking the respective radio button.

Alphabetical Help Index

Bipartite Pattern

Complex pattern that consists of two subpatterns with a variable spacer region in between. For the bipartite pattern search mode, the option field "bipartite pattern" must be selected.
Important note: bipartite pattern searches usually take at least twice as long as single pattern searches.

Core Sensitivity/Size

When the individual weights of a position weight matrix are summed to an overall score, less conserved positions can counterbalance well conserved positions, which can lead to an overrating of matches and consequently to an accumulation of false-positive predictions. To avoid this, we implemented a core pattern, which comprises the most conserved positions of a position weight matrix. The core sensitivity and size can be defined.
Important note: A core size of 0 inactivates the inclusion of a core score.

Custom Position Weight Matrix

A user-defined position weight matrix can be created by pasting the corresponding sequences in FASTA format into the text box.

FASTA Format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.

  • The description line starts with a greater-than symbol (">")
  • The word immediately following the greater-than symbol is the "ID" (name) of the sequence; the rest of the line is the description
  • The "ID" and the description are optional
  • All lines of text should be shorter than 80 characters
  • A sequence ends where a line starting with another greater-than symbol (">") begins the next sequence

Example:

>sequence 1
ATCGATCGTGTACTAGCTAGCTGATCGATCGTGGCGCGACTACTATCGATCTACTACTGA
ACTGATCGTAGCTAGCTAGCTGGCGGGGGCGCATCGATCGATCGTAGCTAGCTACTGATC
TAGCTGATCTAGCGATGCGCGCGCGATTATATATCGATCGATCGATCTAGCGCGCATTAA
TAGCTAGCTGATCG
>sequence 2
TACGTGCGGCGGGCGCGCGATATATTATGCTACGATCGTATATATTATCGTAGCTCGATC
TATCTACGTACT
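
The format rules above can be sketched as a small parser. This is a minimal illustration in Python, not the parser used by Virtual Footprint; the function names parse_fasta and _make_record are hypothetical, and sequence lines that appear before the first ">" line are simply ignored here.

```python
def _make_record(header, chunks):
    # The first word after ">" is the ID, the rest of the line is the
    # description; both are optional.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description, "".join(chunks).upper()


def parse_fasta(text):
    """Parse FASTA text into a list of (id, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous sequence, if there is one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


example = """>sequence 1
ATCGATCG
TAGC
>sequence 2
TACG
"""
records = parse_fasta(example)
for seq_id, description, sequence in records:
    print(seq_id, description, len(sequence))
```

Sequence lines belonging to one record are concatenated, so the line length limit only affects the layout of the file, not the parsed sequence.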

Gene Name/ORF ID

    Upstream sequences can be automatically extracted from whole genome sequences present in the PRODORIC database using the gene short name or ORF ID. The upstream size defines the length of the sequence upstream of the start codon.

Genome Browser GBpro


    GBpro is a genome browser for interactive navigation through all bacterial genomes available in PRODORIC. Genes, promoters and binding sites are displayed in parallel as a graphical map and as highlighted sequence. Optionally, the GC content and stacking energy can be visualized. All results of Virtual Footprint are directly linked to GBpro and can thereby be visualized in their genomic context. Similarly, genes and transcription factor binding sites present in PRODORIC are directly linked to this genome browser.

    GBpro

Genomes and Replicons (preselected)

    Several sequenced genomes and replicons (including plasmids) present in the PRODORIC database can be selected for the analyses. The genomes of PRODORIC are updated about every 3-6 months.

Genome Position

    If the global genomic positions of a sequence of interest are known, a sub-sequence can be extracted from a genome present in the PRODORIC database.

Ignore Match Orientation (Strand)

    If this option is not set, matches and their assigned downstream genes must have the same orientation.

IUPAC Code

    The IUPAC code is an extended vocabulary of 15 letters that allows the description of ambiguous DNA code. Each letter represents one nucleotide or a combination of several nucleotides. It was defined by the Nomenclature Committee of the International Union of Biochemistry (NC-IUB) in 1984.

    Table 1: IUPAC Code of Nucleic Acid Sequences

    Character  Nucleotide  Mnemonic
    A          A           Adenine
    B          C,G,T       not A
    C          C           Cytosine
    D          A,G,T       not C
    G          G           Guanine
    H          A,C,T       not G
    K          G,T         Keto group at common position
    M          A,C         aMino group at common position
    N          A,C,G,T     aNy
    R          A,G         puRine
    S          G,C         Strong (3 H-bonds)
    T          T           Thymine
    V          A,C,G       not T
    W          A,T         Weak (2 H-bonds)
    Y          C,T         pYrimidine
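
As an illustration of how Table 1 can be used, the following Python sketch translates an IUPAC consensus into an equivalent regular expression. It is not taken from Virtual Footprint; the function name iupac_to_regex and the consensus TTGAYNNNNRTCAA are made up for this example.

```python
import re

# Table 1 as a lookup: IUPAC letter -> set of nucleotides it represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def iupac_to_regex(consensus):
    """Translate an IUPAC consensus into a regular expression string."""
    parts = []
    for letter in consensus.upper():
        bases = IUPAC[letter]
        # Unambiguous letters stay as they are, ambiguous ones become a class.
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


pattern = iupac_to_regex("TTGAYNNNNRTCAA")  # hypothetical consensus
print(pattern)
print(bool(re.fullmatch(pattern, "TTGACGGCCGTCAA")))
```

A letter that is not part of the IUPAC alphabet raises a KeyError in this sketch, which makes typing errors in a consensus visible immediately.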

Library
Match Properties
Maximum Distance to Gene

    The maximum allowed distance of a match to a downstream gene (transcriptional start) used in the regulon analysis.

Mismatches

    Number of mismatches tolerated with respect to the defined IUPAC code.
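
A mismatch search against an IUPAC consensus could look like the following Python sketch. It assumes that a mismatch is a position whose base is not covered by the IUPAC letter at that position; the function names and the test sequence are hypothetical, and only the given strand is scanned.

```python
# IUPAC letter -> nucleotides it represents (see Table 1).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def count_mismatches(consensus, window):
    """Count positions whose base is not allowed by the IUPAC letter."""
    return sum(1 for code, base in zip(consensus, window)
               if base not in IUPAC[code])


def find_matches(consensus, sequence, max_mismatches=0):
    """Return (start, matched_text, mismatches) for every accepted window."""
    consensus = consensus.upper()
    sequence = sequence.upper()
    width = len(consensus)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        mismatches = count_mismatches(consensus, window)
        if mismatches <= max_mismatches:
            hits.append((start, window, mismatches))
    return hits


exact = find_matches("TTGAY", "CCTTGACGGTTCATGG", max_mismatches=0)
relaxed = find_matches("TTGAY", "CCTTGACGGTTCATGG", max_mismatches=1)
print(exact)
print(relaxed)
```

Allowing one mismatch adds the window TTCAT to the exact hit TTGAC, which shows why a higher mismatch number quickly increases the pattern frequency.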

Non-Occurrence Penalty

    If a base never occurs at a specific position of a pattern, it is possible to penalize this base in the calculation of the score. This procedure makes a pattern search more selective and is the default setting of Virtual Footprint. A more detailed explanation can be found in the position weight matrix section.

Out Of Range Error

    If the pattern frequency exceeds a certain limit, the web version of Virtual Footprint skips the search because calculating and presenting the matches would take too long. High pattern frequencies are caused by very short patterns or by degenerate patterns with many poorly conserved positions. This can be avoided by changing the pattern definition or the matching parameters:

    • decrease the sensitivity or increase the threshold score for position weight matrices
    • decrease the number of mismatches for IUPAC sequences
    • if possible, choose a longer pattern
    • combine the pattern into a bipartite pattern

    However, the command-line version of Virtual Footprint has no limitations concerning pattern frequencies.

Paste Sequence

    Input of a DNA sequence via copy and paste (suitable for short sequences). The sequence must be in raw format, which means that only sequence letters are accepted (whitespace and numbers are filtered out).

    Important note:

    • the number of sequences is restricted to one
    • the data size is limited to 10 kB
Position Weight Matrix (PWM)

    Position weight matrices (PWMs) offer a sensitive way to represent the similarity to a degenerate DNA pattern, e.g. a transcription factor binding site. They are built from a set of aligned known sequences:

    TTGACGTGGATCAG 13.60
    TTGACCTGAATCAG 13.61
    CTGTCATGGATCAA 12.84
    TTGATACAAATCAA 13.65
    TTGACGGCCGTCAA 13.81
    TTGATCGCGGTCAA 13.51
    TTGCCGTGCGTCAA 13.85
    TTGACCGGAATCAA 14.36
    TTGATTCCTATCAA 13.76
    TTGTCTCGCGACAA 12.49
    TTGCTCTGCATCAA 13.86

    We use the widely accepted information theory approach (Schneider et al., 1986) with some modifications. First, the information vector Rsequence(l) is computed:

    formula

    f(b,l) is the frequency of each base b at position l in the aligned binding sites (Schneider et al., 1986). We account for nucleotide bias by using a linear correction of noise (Schreiber & Brown, 2002). Because of this background model, the number of matches can differ depending on whether a sequence is uploaded or chosen directly in the system, since the GC content of a genome influences the scoring of a match. For uploaded sequences in the promoter analysis, the actual GC content is not considered because those sequences are usually too short; instead a GC content of 50% is assumed. This difference in scoring can result in different matches, especially for lower scoring matches.

    The position weight matrix m(b,l) is afterwards generated by:

    m(b,l) = f(b,l) · Rsequence(l)

    This is equivalent to the individual letter height in a sequence logo (Schneider & Stephens, 1990). For the case f(b,l)=0 we additionally introduced a penalty function that depends on the sample size n instead of using pseudo-scores:

    formula

    A similarity score is calculated by applying the position weight matrix to a sequence. This is simply done by summing up the corresponding individual weights m(b,l) to an overall score (see numbers on the right of the alignment):

              1     2     3     4     5     6     7     8     9    10    11    12    13    14
    A      0.00  0.00  0.00  0.44  0.00  0.01  0.00  0.07  0.04  0.67  0.14  0.00  2.00  1.08
    C      0.14  0.00  0.00  0.13  0.67  0.02  0.13  0.21  0.05  0.00  0.00  2.00  0.00  0.00
    G      0.00  0.00  2.00  0.00  0.00  0.02  0.13  0.48  0.04  0.38  0.00  0.00  0.00  0.24
    T      1.42  2.00  0.00  0.13  0.38  0.01  0.21  0.00  0.01  0.00  1.42  0.00  0.00  0.00
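
The construction and scoring described in this section can be sketched in Python as follows. The sketch rests on several assumptions that are not stated on this page: a uniform background (so that the maximum information content is 2 bits per position), Rsequence(l) = 2 + Σ f(b,l)·log2 f(b,l) without small-sample correction, and m(b,l) = f(b,l)·Rsequence(l). The linear noise correction and the non-occurrence penalty are deliberately left out, so bases that never occur simply get a weight of 0. The function names are hypothetical.

```python
import math

# The aligned binding sites from the example alignment in this section.
SITES = [
    "TTGACGTGGATCAG", "TTGACCTGAATCAG", "CTGTCATGGATCAA", "TTGATACAAATCAA",
    "TTGACGGCCGTCAA", "TTGATCGCGGTCAA", "TTGCCGTGCGTCAA", "TTGACCGGAATCAA",
    "TTGATTCCTATCAA", "TTGTCTCGCGACAA", "TTGCTCTGCATCAA",
]


def build_pwm(sites):
    """Return a list with one {base: weight} dictionary per position."""
    n = len(sites)
    pwm = []
    for l in range(len(sites[0])):
        column = [site[l] for site in sites]
        freqs = {b: column.count(b) / n for b in "ACGT"}
        # Information content of the position (assumed uniform background).
        r_seq = 2.0 + sum(f * math.log2(f) for f in freqs.values() if f > 0)
        # Individual weight of each base: frequency times information content.
        pwm.append({b: f * r_seq for b, f in freqs.items()})
    return pwm


def score(pwm, sequence):
    """Sum the individual weights of the bases found in the sequence."""
    return sum(column[base] for column, base in zip(pwm, sequence))


pwm = build_pwm(SITES)
for site in SITES:
    print(site, round(score(pwm, site), 2))
```

Under these assumptions the weight of T at position 1 comes out at about 1.42 and the first site scores about 13.60, in line with the example values shown above. For a genome with a biased base composition the values would differ because of the background correction.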
Regular Expressions

    Regular expressions are formulas built from metacharacters and normal characters (in this case DNA letters) for matching strings that follow some pattern. They are based on three fundamental ideas:

    • Repetition: the asterisk (*) indicates 0 or more repetitions of the character just before it. For example, abc* matches any of these strings: ab, abc, abcc, abccc, abcccc, and so on. The regular expression matches an infinite number of strings.
    • Alternation: the pattern (a|b) (read: a or b) matches the string a or the string b.
    • Concatenation: the pattern ab means the character a followed by (concatenated with) the character b.

    Regular expressions offer many more syntactic features, which are beyond the scope of this tutorial; a complete description can be found in Perl textbooks. Some useful examples are shown in Table 2.

    Table 2: Important Metacharacters of Regular Expressions

    Metacharacter  Meaning                                       Example
    ...|...        alternatives                                  (A|G)(C|T): a dinucleotide consisting of a purine followed by a pyrimidine
    (...)          grouping of elements
    [...]          defines a class of characters                 [ACGT]: all bases (same as (A|C|G|T))
    .              wildcard, arbitrary character                 AT..CG
    ^              search pattern at the beginning of a string
    $              search pattern at the end of a string

    Quantifiers
    +              matches one or more occurrences
    ?              matches zero or one occurrence
    *              matches zero or more occurrences              (GC)*: defines GC repetitions
    {n,m}          n to m repetitions; {n}: exactly n; {n,}: at least n
                                                                 TA{1,3}T: defines the motifs "TAT", "TAAT" and "TAAAT"
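
The metacharacters of Table 2 can be tried out with any Perl-style regular expression engine. The following sketch uses Python's re module, which supports all metacharacters listed above; the helper find_pattern and the test sequence are made up for illustration and say nothing about how Virtual Footprint evaluates regular expressions internally.

```python
import re


def find_pattern(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in re.finditer(pattern, sequence)]


sequence = "GGTATCCTAATGGTAAATCC"

# {n,m} quantifier: T, then one to three A, then T.
hits = find_pattern(r"TA{1,3}T", sequence)
print(hits)

# Alternation and grouping: a purine followed by a pyrimidine.
purine_pyrimidine = find_pattern(r"(A|G)(C|T)", "AACC")
print(purine_pyrimidine)
```

Note that re.finditer reports non-overlapping matches only; overlapping occurrences would require a lookahead construct.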

Regulon Analysis

    In the regulon analysis a pattern is applied to a whole genome sequence. The resulting matches are analyzed with respect to their genomic locations and assigned to potentially regulated genes. The genes are hyperlinked to the PRODORIC database and to the genome browser GBpro. Furthermore, a regulog analysis can be performed for matches with assigned downstream gene(s).

Regulog Analysis

    A regulog analysis (Alkema et al., 2004) uses the downstream gene of a match to screen the upstream regions of orthologous genes in related species for the corresponding transcription factor binding site. This is done in a two-step process. In the first step, the orthologous genes in PRODORIC and their upstream sequences are searched via BLAST (Altschul et al., 1990). In the second step, Virtual Footprint is applied to these sequences. The obtained matches are listed with the respective BLAST E-values and position weight matrix scores and are linked back to the PRODORIC database and the genome browser GBpro. The relative conservation score (RCS) is the fraction of orthologs that share the same potential binding site.

    formula

    The regulog analysis can be time-consuming (up to a few minutes), especially when the server is running at full capacity.

Remove Redundant Palindromic Matches

    Palindromic matches are usually found on both strands. This option removes the lower scoring match.
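
One way to picture this filter is the following Python sketch. It assumes that two matches are redundant when they cover the same genomic coordinates on opposite strands; the dictionary layout of a match and the function names are hypothetical and not taken from Virtual Footprint.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def remove_redundant_palindromic(matches):
    """Keep only the best scoring match per (start, end) position.

    Each match is a dict with the keys 'start', 'end', 'strand' and 'score'.
    """
    best = {}
    for match in matches:
        key = (match["start"], match["end"])
        if key not in best or match["score"] > best[key]["score"]:
            best[key] = match
    return sorted(best.values(), key=lambda m: m["start"])


# A palindromic site reads the same on both strands.
print(reverse_complement("GAATTC"))

matches = [
    {"start": 10, "end": 24, "strand": "+", "score": 13.6},
    {"start": 10, "end": 24, "strand": "-", "score": 12.8},
    {"start": 50, "end": 64, "strand": "-", "score": 13.1},
]
filtered = remove_redundant_palindromic(matches)
print(filtered)
```

In this example the lower scoring match on the minus strand at position 10 to 24 is dropped, while the unrelated match at position 50 is kept.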

Restrict to Noncoding (Intergenic) Regions

    This option restricts genome-wide matches to non-coding (intergenic) regions.

Result Constraints

    The results of an individual position weight matrix in a promoter analysis can be restricted to the best x hits.

Result Sorting

    The results can be either ordered by genomic position (default) or by score.

Select Genome/Replicon

    Select a sequenced genome present in the PRODORIC database. PRODORIC contains most of the sequenced genomes and plasmids currently available.

Sensitivity/Threshold

    This value is used to adjust the accuracy of a position weight matrix search by calculating an appropriate threshold score. Sensitivity (Sn) is defined as the rate of true positives (TP) at a given threshold score (t):

    formula

    Example: a value of 0.8 means that the threshold score is chosen such that 80% of the binding sites used to build the position weight matrix are recovered.

    The regulon analyzer also allows any threshold score to be entered manually. This enables the use of "over-sensitive" thresholds as well (thresholds lower than the one defined by a sensitivity of 1.0).
    Important note: setting the threshold manually overrides the sensitivity settings.
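
The relation between sensitivity and threshold score can be illustrated with a short Python sketch. It takes the scores of the binding sites used to build the matrix (here the scores from the example alignment in the position weight matrix section) and returns the highest threshold at which the requested fraction of these sites is still recovered. How Virtual Footprint handles rounding and ties is not documented here, so the function threshold_for_sensitivity is only an assumption about one plausible procedure.

```python
import math


def threshold_for_sensitivity(site_scores, sensitivity):
    """Highest threshold t such that at least `sensitivity` of the sites
    used to build the matrix reach a score >= t."""
    ranked = sorted(site_scores, reverse=True)
    # Number of sites that must be recovered; the small epsilon guards
    # against floating point noise such as 0.7 * 10 = 7.000000000000001.
    needed = max(1, math.ceil(sensitivity * len(ranked) - 1e-9))
    return ranked[needed - 1]


scores = [13.60, 13.61, 12.84, 13.65, 13.81, 13.51,
          13.85, 14.36, 13.76, 12.49, 13.86]

t = threshold_for_sensitivity(scores, 0.8)
recovered = sum(1 for s in scores if s >= t)
print(t, recovered, len(scores))
```

With 11 sites, a sensitivity of 0.8 requires 9 recovered sites, which gives a threshold of 13.51 in this sketch. A sensitivity of 1.0 yields the lowest site score (12.49); any manually entered threshold below that value would be "over-sensitive".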

Sequence Logo

    A sequence logo is a graphical representation of a position weight matrix. The height of a stack represents the information content at a certain position (Rsequence(l)), whereas the height of a letter represents the individual weight of a base at that position (m(b,l)).

    [Figure: sequence logo]
    The sequence logo was generated with WebLogo (http://weblogo.berkeley.edu/).
Single Pattern

    Virtual Footprint can handle single patterns consisting of one subpattern and bipartite patterns.

Spacer

    Variable space between two subpatterns in the bipartite pattern search mode. The distance can be confined by setting a minimal and maximal space between the subpatterns. Default setting is 1-50 bp.
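
A bipartite search with a spacer constraint can be sketched as follows in Python. Both subpatterns are given as regular expressions here, and the spacer is counted as the number of bases between the end of the first subpattern and the start of the second. The function names, the subpatterns TTGACA and TATAAT, and the test sequence are chosen for illustration only; the sketch does not reflect the actual implementation.

```python
import re


def find_all(pattern, sequence):
    """Return (start, end) of all occurrences, including overlapping ones."""
    # The lookahead lets the search continue one base after each hit.
    return [(m.start(), m.start() + len(m.group(1)))
            for m in re.finditer("(?=(" + pattern + "))", sequence)]


def find_bipartite(left, right, sequence, min_spacer=1, max_spacer=50):
    """Pair left and right subpattern hits whose spacer is within range."""
    right_hits = find_all(right, sequence)
    hits = []
    for l_start, l_end in find_all(left, sequence):
        for r_start, r_end in right_hits:
            spacer = r_start - l_end
            if min_spacer <= spacer <= max_spacer:
                hits.append((l_start, r_end, spacer))
    return hits


# Two hexamers separated by 17 bases.
sequence = "CC" + "TTGACA" + "A" * 17 + "TATAAT" + "GG"

hits = find_bipartite("TTGACA", "TATAAT", sequence, min_spacer=15, max_spacer=19)
too_narrow = find_bipartite("TTGACA", "TATAAT", sequence, min_spacer=1, max_spacer=10)
print(hits)
print(too_narrow)
```

Every left hit is compared with every right hit, which gives an impression of why bipartite searches take longer than single pattern searches.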

Subpattern Type
Upload FASTA Sequence

    Upload a user-defined DNA sequence in FASTA format.

    Important note:

    • the number of sequences is restricted to one sequence per FASTA file
    • the file size for a regulon analysis is limited to 10 MB
    • the sequence size for a promoter analysis is limited to 10000 bp

    The matches can differ depending on whether a sequence is uploaded or chosen directly in the system. Virtual Footprint uses a background model to account for biased genomes; therefore the GC content of a genome influences the scoring of a match. For uploaded sequences in the promoter analysis, the actual GC content is not considered because those sequences are usually too short; instead a GC content of 50% is assumed. This difference in scoring can result in different matches, especially for lower scoring matches.

Upstream Size

    Size of the upstream sequence to extract for a given gene, defined by the gene name or ORF ID. The size must be between 100 and 10000 bp.

Weight Matrix Information

    Shows information about the selected position weight matrix pattern in a separate window. In detail, it provides:

    • the position weight matrix values with the respective maximum and minimum achievable scores
    • the sequences used for constructing the position weight matrix with their individual scores, together with the mean score and standard deviation
    • a sequence logo

References

    Alkema, W. B. L., Lenhard, B. & Wasserman, W. W. (2004). Regulog analysis: detection of conserved regulatory networks across bacteria: application to Staphylococcus aureus. Genome Res. 14, 1362-1373.

    Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410.

    Schneider, T. D. & Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 18, 6097-6100.

    Schneider, T. D., Stormo, G. D. & Gold, L. (1986). Information content of binding sites on nucleotide sequences. J. Mol. Biol. 188, 415-431.

    Schreiber, M. & Brown, C. (2002). Compensation for nucleotide bias in a genome by representation as a discrete channel with noise. Bioinformatics 18, 507-512.

© 2003-2008 by Richard Münch •  Institute of Microbiology • Technical University of Braunschweig •  r.muench(at)tu-bs.de