The alignment data

lines.eds.all.Z      (6.8 Mb)

The alignment data is represented in a parseable form where the matches between a transcript sequence and a gene sequence are described in a single line of text. By way of example consider the line:

IDB60003.est_663601     c(3..184):e1(1..183 183)i1(1..1 2721),c(183..305):i1(2721..2721 2721)e2(1..117 117)i2(1..2 886),c(304..356):e3(1..52 91),381   N[1,2](2719,-2,-2),N[2,3](884,-2,-2),

This states that the gene IDB60003 (the accession, position, gene structure, etc. are given as part of the gene data) matches the transcript est_663601 (an EST with Gene Index 663601). Transcript bases 3 to 184 match the gene starting at position 1 of exon 1 (as annotated, with the annotated exon being 183 nucleotides in length) and ending at position 1 of intron 1, which is 2721 nucleotides in length. The next match covers transcript bases 183 to 305, and runs from the last nucleotide of intron 1 to the second nucleotide of intron 2. And so on. The final field in this second part of the line is the length of the transcript sequence (381 nucleotides in this case). See here for a more technical description of the line format.

Note that the terminal match points may extend a small number of nucleotides beyond the exon termini, and that usually this is artifactual (we examine the splice sites in the next step of the analysis). Note, also, that the coordinates of the transcript sequence will have been inverted in the case of a complement alignment.
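As an illustration of how such a line might be read by a program, here is a minimal Python sketch. The regular expressions are inferred from the single example line given above and are not taken from the technical description of the format; the function name parse_alignment and the assumption that the gene identifier contains no '.' are ours.

```python
import re

# Patterns inferred from the example line only (an assumption, not a format spec).
# One gene segment, e.g. "e1(1..183 183)": kind, number, start, end, annotated length.
SEGMENT = re.compile(r"([ei])(\d+)\((\d+)\.\.(\d+) (\d+)\)")
# One match, e.g. "c(3..184):e1(1..183 183)i1(1..1 2721)".
MATCH = re.compile(r"c\((\d+)\.\.(\d+)\):((?:[ei]\d+\(\d+\.\.\d+ \d+\))+)")
# Transcript length: the bare integer that follows the last match.
LENGTH = re.compile(r"\),(\d+)(?:\s|$)")


def parse_alignment(line):
    """Return (gene, transcript, matches, transcript_length) for one line."""
    identifier = line.split(None, 1)[0]
    # Assumes the gene identifier itself contains no '.'.
    gene, transcript = identifier.split(".", 1)
    matches = []
    for m in MATCH.finditer(line):
        segments = [
            (kind + number, int(start), int(end), int(length))
            for kind, number, start, end, length in SEGMENT.findall(m.group(3))
        ]
        matches.append({"transcript": (int(m.group(1)), int(m.group(2))),
                        "gene": segments})
    found = LENGTH.search(line)
    transcript_length = int(found.group(1)) if found else None
    return gene, transcript, matches, transcript_length


EXAMPLE = ("IDB60003.est_663601     c(3..184):e1(1..183 183)i1(1..1 2721),"
           "c(183..305):i1(2721..2721 2721)e2(1..117 117)i2(1..2 886),"
           "c(304..356):e3(1..52 91),381   N[1,2](2719,-2,-2),N[2,3](884,-2,-2),")

gene, transcript, matches, transcript_length = parse_alignment(EXAMPLE)
```

For the example line this gives three matches, the first covering transcript bases 3 to 184 with gene segments e1 and i1. The final field of the line is ignored by this sketch.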

The final field in the example 'line' given above is:

       N[1,2](2719,-2,-2),N[2,3](884,-2,-2),

This describes the way in which each pair of consecutive matches along the transcript align in relation to each other, and to the annotated gene structure. Consider the parameters shown in the diagram.

Figure CMP - the 'cg' (cluster gap) and 'gg' (gene gap) parameters. Given annotated exons, as shown, we define the parameters 'a' and 'b' as the number of nucleotides between the terminus of the alignment and the terminus of the annotated exon. We also define the exon gap as: eg = a + b.
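The final field can be read in a similar way. The Python sketch below splits it into events and checks the cluster gap against the example. The formula used for cg (the number of transcript bases strictly between two consecutive matches, negative for an overlap) is our assumption; it reproduces the value -2 for both events in the example line, but it is not stated explicitly in this document. The names parse_events and cluster_gap are hypothetical.

```python
import re

# Pattern inferred from the examples in this document: a category letter,
# two exon labels in square brackets, then a list of signed integers.
# Labels are kept as strings because they are only given "when possible".
EVENT = re.compile(r"([ANWX])\[([^,\]]*),([^\]]*)\]\(([-\d,]+)\)")


def parse_events(field):
    """Return a list of (category, p, q, values) tuples from the final field."""
    events = []
    for category, p, q, values in EVENT.findall(field):
        numbers = tuple(int(v) for v in values.split(","))
        events.append((category, p, q, numbers))
    return events


def cluster_gap(prev_end, next_start):
    """Assumed definition of cg: transcript bases between two consecutive
    matches, counted exclusively; a negative value means the matches overlap."""
    return next_start - prev_end - 1


events = parse_events("N[1,2](2719,-2,-2),N[2,3](884,-2,-2),")
# First pair of matches in the example: c(3..184) followed by c(183..305).
first_cg = cluster_gap(184, 183)
```

Here first_cg is -2, which agrees with the second value in N[1,2](2719,-2,-2).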

Each consecutive match pair is assigned to one of four categories on the basis of these parameters. These categories are not definitive, but are merely a useful categorisation - especially for viewing of the data by humans.

The first group consists of those events where the cluster gap is positive (cg > 0), where the overlap is greater than ten nucleotides (cg < -10), or where the gene gap is not at least ten nucleotides greater than the cluster gap. These events may represent sequence errors or artefactual matches, and are not considered further in this current work. They are described by: A[p,q](gg,cg,eg) where p and q label the exons involved in the alignment - when possible.

The second group consists of those (remaining) events that represent annotated splicing. Such cases are identified by requiring that cg == eg, and that both 'a' and 'b' are zero or negative. Such events are described by: N[p,q](gg,cg,eg).

The third group consists of those (remaining) events where the matches are aligned in such a way as to be one nucleotide away from being categorised as normal (N - category 2 above). Such events may be the result of sequence errors, and are represented by: W[p,q](gg,cg,eg,a,b).

The final category consists of all remaining events. These events are represented by X[p,q](gg,cg,eg,a,b).
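The order in which the tests are applied can be sketched in Python as follows. This is an illustration of the rules as worded above, not the code used to produce the data. The precise test for "one nucleotide away from normal" is not given here, so the sketch does not try to separate W from X. The values of 'a' and 'b' in the usage lines are our reading of the example line, assuming that a negative value means the match extends beyond the annotated exon terminus.

```python
def classify(gg, cg, a, b):
    """Assign a consecutive match pair to a category.

    gg: gene gap, cg: cluster gap, a and b: distances between the
    alignment termini and the annotated exon termini (eg = a + b).
    Returns "A", "N", or "W or X" (the W/X distinction is not attempted).
    """
    eg = a + b
    # First group: positive cluster gap, overlap of more than ten
    # nucleotides, or gene gap not at least ten greater than cluster gap.
    if cg > 0 or cg < -10 or gg - cg < 10:
        return "A"
    # Second group: annotated splicing.
    if cg == eg and a <= 0 and b <= 0:
        return "N"
    # Remaining events fall into the third or fourth group.
    return "W or X"


# Events from the example line, with a and b inferred from the match
# coordinates (an assumption): N[1,2](2719,-2,-2) and N[2,3](884,-2,-2).
first = classify(gg=2719, cg=-2, a=-1, b=-1)
second = classify(gg=884, cg=-2, a=-2, b=0)
```

Both example events come out as "N", in agreement with the line as given.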