AltExtron - gb147 - Alignment data


Homo sapiens ae_gb147_human.lines.gz
Mus musculus ae_gb147_mouse.lines.gz
Rattus norvegicus ae_gb147_rat.lines.gz
Drosophila melanogaster ae_gb147_dros.lines.gz
Caenorhabditis elegans ae_gb147_elegan.lines.gz
Arabidopsis thaliana ae_gb147_arab.lines.gz
Danio rerio ae_gb147_zfish.lines.gz
Gallus gallus ae_gb147_chicken.lines.gz
Xenopus laevis ae_gb147_frog.lines.gz
Bos taurus ae_gb147_cow.lines.gz
Anopheles gambiae ae_gb147_mosquito.lines.gz

The alignment data is represented in a parseable form where the matches between a transcript sequence and a gene sequence are described in a single line of text (formatted over multiple lines here for clarity). By way of example, consider the line:

IDB1078032.est_663601   1,12848    c(1..184):uu(499..500 500)e1(1..183 183)i1(1..1 2721),
                                   c(183..305):i1(2721..2721 2721)e2(1..117 117)i2(1..2 886),
                                   c(304..356):e3(1..52 91),381
                                                                 N[1,2](2719,-2,-2),N[2,3](884,-2,-2),

This states that AltExtron gene IDB1078032 matches the transcript est_663601 (an EST with Gene Index 663601). The match is sense "1" (for the transcript as extracted), and the gene sequence is 12848 nucleotides in length. Transcript bases 1 to 184 match the gene starting with the last 2 nucleotides of the flanking upstream sequence (uu = unknown upstream), continuing through exon 1 (as annotated, the annotated exon being 183 nucleotides in length), and ending at position 1 in intron 1, which is 2721 nucleotides in length. The next match covers transcript bases 183 to 305, and runs from the last nucleotide of intron 1 to the 2nd nucleotide of intron 2. And so on. The final field in this third part of the line is the length of the transcript sequence (381 nucleotides in this case).
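As an illustration only, the record described above could be read with a few regular expressions. This is a minimal Python sketch, not the tool used to produce the data: the function name `parse_alignment`, the dictionary keys, and the assumption that the fields are separated by whitespace in the distributed files are all hypothetical. The sense value is kept as a string because only the value "1" is documented here.

```python
import re

# The example record from the text, joined back into a single line.
# ASSUMPTION: whitespace separates the fields in the real files.
LINE = (
    "IDB1078032.est_663601 1,12848 "
    "c(1..184):uu(499..500 500)e1(1..183 183)i1(1..1 2721),"
    "c(183..305):i1(2721..2721 2721)e2(1..117 117)i2(1..2 886),"
    "c(304..356):e3(1..52 91),381 "
    "N[1,2](2719,-2,-2),N[2,3](884,-2,-2),"
)

# gene.transcript, then "sense,gene_length"
HEADER_RE = re.compile(r"(\S+?)\.(\S+)\s+(-?\d+),(\d+)\s+")
# one match: c(from..to): followed by one or more gene regions
MATCH_RE = re.compile(r"c\((\d+)\.\.(\d+)\):((?:[a-z]+\d*\(\d+\.\.\d+ \d+\))+)")
# one gene region: name(from..to region_length), e.g. e1(1..183 183)
REGION_RE = re.compile(r"([a-z]+\d*)\((\d+)\.\.(\d+) (\d+)\)")


def parse_alignment(line):
    """Parse the header and match blocks of one alignment line (hypothetical helper)."""
    header = HEADER_RE.match(line)
    if header is None:
        raise ValueError("unrecognised header")
    gene_id, transcript_id, sense, gene_length = header.groups()

    matches = []
    end = header.end()
    for m in MATCH_RE.finditer(line, header.end()):
        regions = [
            (name, int(start), int(stop), int(length))
            for name, start, stop, length in REGION_RE.findall(m.group(3))
        ]
        matches.append({
            "transcript": (int(m.group(1)), int(m.group(2))),
            "regions": regions,
        })
        end = m.end()

    # The transcript length follows the last match block, after a comma.
    tail = re.match(r",(\d+)", line[end:])
    return {
        "gene_id": gene_id,
        "transcript_id": transcript_id,
        "sense": sense,
        "gene_length": int(gene_length),
        "matches": matches,
        "transcript_length": int(tail.group(1)) if tail else None,
    }


record = parse_alignment(LINE)
for match in record["matches"]:
    print(match["transcript"], [region[0] for region in match["regions"]])
```

Each region tuple holds the region name (uu, e1, i1, ...), the start and end positions within that region, and the region's annotated length.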

Note that the terminal match points may extend a small number of nucleotides beyond the exon termini, and that this is usually artefactual (we examine the splice sites in the next step of the analysis). Note, also, that the coordinates of the transcript sequence will have been inverted in the case of a complement alignment.

The final field in the example 'line' given above is:

       N[1,2](2719,-2,-2),N[2,3](884,-2,-2),

This describes the way in which each pair of consecutive matches along the transcript align in relation to each other, and to the annotated gene structure. Consider the parameters shown in the diagram.

Figure CMP - the 'cg' (cluster gap) and 'gg' (gene gap) parameters. Given annotated exons, as shown, we define the parameters 'a' and 'b' as the number of nucleotides between the terminus of the alignment and the terminus of the annotated exon. We also define the exon gap as: eg = a + b.
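The final field can be split into its records with a short Python sketch such as the one below. The names `parse_junctions` and `cluster_gaps` are hypothetical. The cluster gap formula used here (start of the next match minus end of the previous match, minus one) is not stated in the text; it is inferred from the worked example, where transcript ranges 1..184 and 183..305 overlap by two bases and the record reports cg = -2.

```python
import re

# Category letter, exon labels in square brackets, values in parentheses.
JUNCTION_RE = re.compile(r"([ANWX])\[([^\]]*)\]\(([^)]*)\)")
FIELD_NAMES = ("gg", "cg", "eg", "a", "b")  # 'a' and 'b' only appear in W and X records


def parse_junctions(field):
    """Split the final field into one dictionary per consecutive match pair."""
    records = []
    for category, exons, values in JUNCTION_RE.findall(field):
        numbers = [int(value) for value in values.split(",")]
        # Exon labels are kept as strings: the text says they are given "when possible".
        record = {"category": category, "exons": exons.split(",")}
        record.update(zip(FIELD_NAMES, numbers))
        records.append(record)
    return records


def cluster_gaps(transcript_ranges):
    """ASSUMED formula: gap (positive) or overlap (negative) between consecutive matches."""
    return [
        nxt[0] - prev[1] - 1
        for prev, nxt in zip(transcript_ranges, transcript_ranges[1:])
    ]


junctions = parse_junctions("N[1,2](2719,-2,-2),N[2,3](884,-2,-2),")
gaps = cluster_gaps([(1, 184), (183, 305), (304, 356)])
for junction, gap in zip(junctions, gaps):
    print(junction["category"], junction["exons"], "reported cg:", junction["cg"], "recomputed:", gap)
```

For the example line, the recomputed values agree with the reported cluster gaps of -2 for both junctions.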

Each consecutive match pair is assigned to one of four categories on the basis of these parameters:

The first group consists of those events where the cluster gap is either positive or the overlap is greater than ten nucleotides (cg < -10), and cases where the gene gap is not at least ten nucleotides greater than the cluster gap. These events may represent sequence errors or artefactual matches of one form or another, and are not considered further in this current work. They are described by: A[p,q](gg,cg,eg) where p and q label the exons involved in the alignment, when possible.

The second group consists of those (remaining) events that represent annotated splicing. Such cases are identified by requiring that cg == eg, and that both 'a' and 'b' are zero or negative. Such events are described by: N[p,q](gg,cg,eg).

The third group consists of those (remaining) events where the matches are aligned in such a way as to be one or two nucleotides away from being categorised as normal (N, category 2 above). Such events may be the result of sequence errors, and are represented by: W[p,q](gg,cg,eg,a,b). Analysis of the level of sequence error, the effect of such error on the alignments, and the observed level of these events leads to the conclusion that most of them can be attributed to sequence error.

The final category consists of all remaining events. These events are represented by X[p,q](gg,cg,eg,a,b). Some of these events are subsequently found to be consistent with GT-AG, GC-AG or AT-AC type splice sites, while others are not. The former events (major and minor canonical splice sites) are used to define transcript-confirmed introns (and exons), while the latter cases are ignored (many are similar to the W events above, but for non-annotated introns).
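The four categories can be read as a decision cascade, sketched below in Python. The tests for A and N follow the wording above. The test for W is an assumption: the text says only that such events are "one or two nucleotides away" from normal, so this sketch relaxes both N conditions by a hypothetical `slack` of two nucleotides. The function name `classify` is likewise illustrative, and the 'a' and 'b' values used for the example junctions are inferred from the alignment walkthrough (they are not printed in N records).

```python
def classify(gg, cg, eg, a, b, slack=2):
    """Assign a consecutive match pair to category A, N, W or X (illustrative sketch)."""
    # Group 1 (A): positive cluster gap, overlap of more than ten nucleotides,
    # or a gene gap that is not at least ten nucleotides greater than the cluster gap.
    if cg > 0 or cg < -10 or gg < cg + 10:
        return "A"
    # Group 2 (N): annotated splicing.
    if cg == eg and a <= 0 and b <= 0:
        return "N"
    # Group 3 (W): ASSUMED reading of "one or two nucleotides away from normal".
    if abs(cg - eg) <= slack and a <= slack and b <= slack:
        return "W"
    # Group 4 (X): everything else.
    return "X"


# The two junctions of the example line, with a and b inferred from the walkthrough:
# the first match ends 1 nt into intron 1 and the second starts 1 nt before exon 2.
print(classify(gg=2719, cg=-2, eg=-2, a=-1, b=-1))
print(classify(gg=884, cg=-2, eg=-2, a=-2, b=0))
```

Both example junctions satisfy the A exclusion tests and the N conditions, which agrees with the N records given for the example line. The order of the tests matters, since each later group is defined over the events remaining after the earlier ones.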