Cassette exons (cryptic and skipped exons)

  1. Cryptic and skipped exon data as list, table and flat files
  2. Cassette exons and their translations

Cassette exons are exons that are seen in some transcripts and not in others. We have divided cassette exons into 'cryptic' exons and 'skipped' exons. Conceptually, a cryptic exon is one that is absent from the normal form but occurs in an alternative form, while a skipped exon is one that is part of the normal form and is absent from some alternative form. For the practical purpose of implementing this distinction, skipped exons are defined by covering introns that incorporate entire annotated (not necessarily observed) exons, and cryptic exons are defined by being seen to occur within observed introns that do not incorporate a full annotated exon. Please see the manuscript for a discussion of this classification scheme.
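
The practical rule above can be illustrated with a minimal Python sketch. The function names, the (start, end) tuple representation and the use of inclusive coordinates are assumptions made for this example; they are not taken from the software that produced the data.

```python
def covers_annotated_exon(intron, annotated_exons):
    """Return True if the observed intron contains at least one whole annotated exon.

    intron and each annotated exon are (start, end) tuples, inclusive (assumed).
    """
    i_start, i_end = intron
    return any(i_start <= e_start and e_end <= i_end
               for e_start, e_end in annotated_exons)


def classify_covered_exons(intron, annotated_exons):
    """Label exons seen inside a covering intron as 'skipped' or 'cryptic'."""
    if covers_annotated_exon(intron, annotated_exons):
        return "skipped"
    return "cryptic"


# Observed intron 183..765 containing the annotated exon 594..638.
print(classify_covered_exons((183, 765), [(594, 638)]))
# The same intron with an illustrative annotated exon that is only partly inside it.
print(classify_covered_exons((183, 765), [(700, 900)]))
```

The first call reports 'skipped' and the second 'cryptic', since only a fully contained annotated exon counts under the rule as stated.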


1. Cryptic and skipped exon data as list, table and flat files

cryptic exons                     .list   .table   .flat
introns covering cryptic exons    .list   .table   .flat
skipped exons                     .list   .table   .flat
introns covering skipped exons    .list   .table   .flat

Each of the data sets above can be downloaded as a simple list of the intron / exon identifiers, as a table, or as a flatfile (see the pages on the intron / exon flat and table files for descriptions).


2. Cassette exons and their translations

Where possible, we have considered the translations of the cassette exons. This data is given, without distinguishing between cryptic and skipped exons, in the following flatfile:

cassette_exons.info

This file details all cases where an observed intron covers observed exons. The data is presented in a flatfile format, where each entry corresponds to a single covering intron. The tag for each entry takes the form:

>IDB60180(183..765)

where 'IDB60180' is the gene identifier, with the covering intron occupying nucleotides '183' to '765' of the gene.
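
For readers who want to process the file, a minimal Python sketch for reading an entry tag follows. The regular expression is inferred from the single example tag shown here, and the function name is hypothetical; tags of other shapes, if they exist, would need a different pattern.

```python
import re

# Pattern inferred from the example tag '>IDB60180(183..765)' (an assumption).
TAG_RE = re.compile(r"^>(?P<gene>[^\s(]+)\((?P<start>\d+)\.\.(?P<end>\d+)\)\s*$")


def parse_entry_tag(line):
    """Return (gene_id, intron_start, intron_end) from an entry tag line."""
    match = TAG_RE.match(line)
    if match is None:
        raise ValueError(f"not an entry tag: {line!r}")
    return match.group("gene"), int(match.group("start")), int(match.group("end"))


gene, start, end = parse_entry_tag(">IDB60180(183..765)")
# Inclusive coordinates are assumed when computing the intron length.
intron_length = end - start + 1
print(gene, start, end, intron_length)
```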

The next three lines take the form:

ELM:    i1(1..411 411)e2(1..45 45)i2(1..127 127)
NUMT:   14
GB:     Y07713 183..765

with the ELM field describing the covering intron in terms of the annotated gene structure. In this case the observed intron covers all of intron 1, exon 2 and intron 2 of the annotated form, with these sequences being 411, 45 and 127 nucleotides in length respectively. The NUMT field gives the number of transcripts that were observed to demonstrate the covering intron. The GB field describes the location of the intron in terms of the EMBL/GenBank data set (accession, position).
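
The three header fields can be read with a short Python sketch such as the one below. It assumes that each element of the ELM value has the layout 'kind number (first..last length)', where first..last is the covered part of the element and the trailing number is the element length; this reading is inferred from the example and the function names are hypothetical.

```python
import re

# One ELM element, e.g. 'i1(1..411 411)' or 'e2(1..45 45)' (layout assumed).
ELM_RE = re.compile(r"([ie])(\d+)\((\d+)\.\.(\d+) (\d+)\)")


def parse_elm(value):
    """Split an ELM value into (kind, number, first, last, length) tuples."""
    return [
        ("intron" if kind == "i" else "exon",
         int(number), int(first), int(last), int(length))
        for kind, number, first, last, length in ELM_RE.findall(value)
    ]


def parse_header(lines):
    """Parse the ELM, NUMT and GB lines of one entry into a dictionary."""
    fields = {}
    for line in lines:
        tag, _, value = line.partition(":")
        fields[tag.strip()] = value.strip()
    accession, span = fields["GB"].split()
    gb_start, gb_end = (int(x) for x in span.split(".."))
    return {
        "elm": parse_elm(fields["ELM"]),
        "numt": int(fields["NUMT"]),
        "gb": (accession, gb_start, gb_end),
    }


header = parse_header([
    "ELM:    i1(1..411 411)e2(1..45 45)i2(1..127 127)",
    "NUMT:   14",
    "GB:     Y07713 183..765",
])
# Total number of nucleotides covered, summed over the annotated elements.
covered_length = sum(last - first + 1 for _, _, first, last, _ in header["elm"])
print(header["numt"], header["gb"], covered_length)
```

In the example the covered elements sum to 411 + 45 + 127 = 583 nucleotides, which equals the length of the interval 183..765 counted inclusively.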

These lines are followed by one or more blocks demarcated by the ICTX tag. Each ICTX field gives a spliced context for the covering intron. Within each block there are one or more sub blocks demarcated by the TRPT tag. An example of a single intron context with a single context for 'covered exon(s)' is:

ICTX:   ~115..182,766..~810
 TRPT:  ~1..182,594..638,766..~810
  CAT:  simple, 1 c_exons, m0, pp=0
  CEX:  e:594..638 e2(1..45 45)
        numt: 2, phase: 1, mod: 0
        LPPSSTKPPALSHS

The spliced context of the covering intron is, in this case, an exon between nucleotides (approximately) 115 and 182 on the 5' side of the intron, and an exon between nucleotides 766 and (approximately) 810 on the 3' side of the intron ('~'s signify that this point is not necessarily an exon boundary, but simply the point at which the gene - transcript alignment terminated).
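
A minimal Python sketch for reading such a coordinate list, keeping track of the '~' markers, is given below. The function names and the dictionary keys are assumptions chosen for the example.

```python
def parse_segments(value):
    """Parse a value such as '~115..182,766..~810' into a list of exon segments.

    A '~' before a coordinate means that the coordinate is where the alignment
    ended, not necessarily an exon boundary.
    """
    segments = []
    for part in value.split(","):
        left, right = part.strip().split("..")
        segments.append({
            "start": int(left.lstrip("~")),
            "end": int(right.lstrip("~")),
            "start_is_boundary": not left.startswith("~"),
            "end_is_boundary": not right.startswith("~"),
        })
    return segments


def introns_between(segments):
    """Return the (start, end) intervals lying between consecutive segments."""
    return [(a["end"] + 1, b["start"] - 1)
            for a, b in zip(segments, segments[1:])]


ictx = parse_segments("~115..182,766..~810")
print(introns_between(ictx))
```

For the ICTX value of the example, the single interval between the two exon segments is the covering intron 183..765.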

The TRPT field describes an observed form of splicing that incorporates covered exon(s). In this case the exon 594..638 is covered by the intron (183..765) - see note 1 below.
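
The covered exons can be picked out of a TRPT value with a few lines of Python. This is only a sketch: the function names are hypothetical, and coordinates are treated as inclusive.

```python
def exon_spans(value):
    """Return (start, end) pairs from a TRPT or ICTX value, ignoring '~' markers."""
    spans = []
    for part in value.replace("~", "").split(","):
        start, end = part.strip().split("..")
        spans.append((int(start), int(end)))
    return spans


def covered_exons(trpt_value, intron):
    """Return the exons of a TRPT value that lie wholly inside the covering intron."""
    i_start, i_end = intron
    return [(start, end) for start, end in exon_spans(trpt_value)
            if i_start <= start and end <= i_end]


print(covered_exons("~1..182,594..638,766..~810", (183, 765)))
```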

The CAT field supplies an automated categorisation of the splicing event in four comma separated sub fields. The first of these four fields takes one of the values: 'simple', 'complex', 'insuf_cover' or 'alt_exons'. A 'simple' event is one where there are one or more cryptic exons, and the flanking exons align at the splice sites of the covering intron. A 'complex' event is one where the bounding exons overlap, but different splice sites have been used. In these cases the form of the difference is given as a 5' and/or 3' truc (truncation), extn (extension), or mod (modification) of the exons flanking the covering intron (see note 2). An event described as 'insuf_cover' has insufficient coverage to allow identification of overlapping flanking exons. An event described as 'alt_exons' demonstrates the use of alternating, or alternative cassette, exons. The 'complex' and 'insuf_cover' categories may also be tagged as being probably alternative/alternating ('prob_alt').

The second sub field of the CAT line gives the number of cassette exons covered, while the third gives the 'modularity' (length modulo three) of these cryptic exons. The final sub field, pp={0,1,2,not_det}, gives the overall difference (modulo three) in the number of exonic nucleotides within the two isoforms. That is, a value of pp=0 represents preservation of frame.
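
As a hedged illustration, the sketch below parses a CAT value with the layout of the 'simple' example, 'simple, 1 c_exons, m0, pp=0'. The exact layout of 'complex' lines, which carry additional truc/extn/mod information, is not shown on this page, so they may need different handling; the function name and dictionary keys are assumptions.

```python
def parse_cat(value):
    """Parse a CAT value of the form 'simple, 1 c_exons, m0, pp=0'.

    Only the four-field layout of the example is handled (an assumption).
    """
    fields = [field.strip() for field in value.split(",")]
    if len(fields) != 4:
        raise ValueError(f"expected four sub fields, got {len(fields)}: {value!r}")
    category, n_exons, modularity, pp = fields
    pp_value = pp.split("=", 1)[1]
    return {
        "category": category,
        "n_cassette_exons": int(n_exons.split()[0]),
        "modularity": int(modularity.lstrip("m")),
        # 'not_det' (not determined) is represented as None.
        "pp": None if pp_value == "not_det" else int(pp_value),
    }


cat = parse_cat("simple, 1 c_exons, m0, pp=0")
frame_preserved = cat["pp"] == 0
print(cat, frame_preserved)
```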

For each cryptic exon there then follows a block of three lines introduced by the CEX tag.

  CEX:  e:594..638 e2(1..45 45)
        numt: 2, phase: 1, mod: 0
        LPPSSTKPPALSHS

The first line describes the location of the cassette exon in the gene and the relationship between this location and the annotated gene structure. In the example the cassette exon observed is the annotated exon 2. The next line gives the number of transcripts demonstrating this exon, the phase of the coding sequence at the first nucleotide of the given exon (see note 3 below for both), and the modularity of the exon (0, 1 or 2 for exon lengths of (3N + 0), (3N + 1) or (3N + 2)). The final line gives the translation of the exon in the determined phase.
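
A three-line CEX block can be read, and its modularity checked against the exon length, with a sketch like the following. The regular expression and the assumption that numt, phase and mod are always integers are inferred from the example alone; the function name is hypothetical.

```python
import re

# First line of a CEX block, e.g. 'CEX:  e:594..638 e2(1..45 45)' (layout assumed).
CEX_RE = re.compile(r"CEX:\s+e:(\d+)\.\.(\d+)\s+(.*)$")


def parse_cex(lines):
    """Parse the three lines of a CEX block into a dictionary."""
    location_line, stats_line, translation = (line.strip() for line in lines)
    match = CEX_RE.match(location_line)
    if match is None:
        raise ValueError(f"not a CEX line: {location_line!r}")
    stats = {}
    for item in stats_line.split(","):
        key, _, value = item.partition(":")
        stats[key.strip()] = int(value.strip())
    return {
        "start": int(match.group(1)),
        "end": int(match.group(2)),
        "annotation": match.group(3),
        "numt": stats["numt"],
        "phase": stats["phase"],
        "mod": stats["mod"],
        "translation": translation,
    }


cex = parse_cex([
    "  CEX:  e:594..638 e2(1..45 45)",
    "        numt: 2, phase: 1, mod: 0",
    "        LPPSSTKPPALSHS",
])
exon_length = cex["end"] - cex["start"] + 1   # inclusive coordinates assumed
print(exon_length, exon_length % 3 == cex["mod"])
```

Here the exon 594..638 is 45 nucleotides long, so its length modulo three agrees with the reported mod of 0.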

note 1
The extent of the presentation of the surrounding splicing events is limited by two factors. Firstly, it is limited by the available observations, and secondly, in this file, the contexts are trimmed to display only those introns and exons that are relevant to the determination of the effect on phase/frame of the alternative events.

note 2
Truncation and extension events are defined with respect to annotation. If the flanking exons in the cryptic exon's context use annotated splice sites, then the splice sites used in the covering form are given as truncations or extensions; otherwise they are simply described as modifications.

note 3
There are some important details to note here. Firstly, the given number of transcripts is the number showing the exon, and not necessarily the number showing the exon in this particular splicing context. I will modify this one day. Secondly, the 'phase' at the beginning of an exon can take the values 0, 1 or 2 for the cases where the first nucleotide of that exon is at codon position 1, 2 or 3 respectively (in fact, it is really the phase of the preceding intron). This phase is calculated by establishing an upstream exon overlap between the two isoforms and reconstructing the coding sequence from that point.
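
The phase convention in note 3 can be made concrete with a small Python sketch. The function names are hypothetical, and the sketch only illustrates the arithmetic of the convention; it does not reproduce the overlap-based reconstruction used to generate the file.

```python
def codon_position(phase):
    """Codon position (1, 2 or 3) of the first nucleotide of an exon of this phase."""
    if phase not in (0, 1, 2):
        raise ValueError(f"phase must be 0, 1 or 2, got {phase!r}")
    return phase + 1


def complete_codons(exon_length, phase):
    """Number of codons that lie wholly inside the exon."""
    # Nucleotides at the start of the exon that finish a codon begun upstream.
    leading = (3 - phase) % 3
    return max(exon_length - leading, 0) // 3


def next_phase(phase, exon_length):
    """Phase at the first nucleotide following an exon of the given length."""
    return (phase + exon_length) % 3


# The example exon: 45 nucleotides long, phase 1.
print(codon_position(1), complete_codons(45, 1), next_phase(1, 45))
```

For the example exon, 14 codons lie wholly inside the exon, which agrees with the 14 residues of the translation LPPSSTKPPALSHS. This suggests, but does not establish, that the translation line lists only complete codons. The phase following the exon is again 1, as expected for an exon of modularity 0.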


fc - 16/8/2001