The exon flat and table files

exons.flat.Z        (2.5 Mb)
exons.table.Z      (230 Kb)

As for the introns, data on all the observed exons is given in both a flat file and a table file. These exon files have similar, but not identical, formats to the corresponding intron files. The flat file is the central file for exon information, and its format is described first. The table file uses one line per exon, gives only partial information, and is described second.

Format of the exon flat file

By way of example, consider the entry:

>IDB60061(2441..2541)
CDS:    inCDS, p0, tf0
TYPE:   GT-AG A3
ELM:    e5(1..101 101)
NUMT:   294
GB:     D88010  2466..2566
FSAI:   CTTATTAATGTTTGATAATGTTAGGTCATTTTGGGTGGTTTTCTTGAATTGCACCAAATTTTATTTTTAG
FSAE:   gataaggatgctaaattccgtctgattctaatagagagccggattcaccgt..
FSDE:   ..ttggctcgatattataagaccaagcgagtcctccctcccaattggaaata
FSDI:   GTAAGTATCAACTCTTTTGTCGTTGTTATCAAGAATAGGAGTCAGCCAGTAGTAAAAGTCCTAGTAGTAA
OVIN:   i:2185..2474,  GB(2210..2499),  i4(1..256 256)e5(1..34 101)
OVEX:   e:2475..2541,  GB(2500..2566),  e5(35..101 101)
CNTX:   ~1..11,165..213,396..474,2015..2184,2441..2541,3171..~3204
CNTX:   ~1..23,165..213,396..474,2015..2184,2441..2541,3171..~3204
SSIS:   5.64, 7.41
END

Examining the fields in turn:

>IDB60061(2441..2541)

This is the first field and gives the gene identifier and position of the observed exon within the gene.
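
As a minimal sketch, assuming the header line always has the form '>' + gene identifier + '(start..end)' as in the example above, it could be split in Python as follows. The function name parse_exon_header and the regular expression are illustrative only and are not part of the database distribution.

```python
import re

# Assumed layout: '>' + gene identifier + '(' + start + '..' + end + ')'.
HEADER_RE = re.compile(r"^>(?P<gene>[^()\s]+)\((?P<start>\d+)\.\.(?P<end>\d+)\)$")


def parse_exon_header(line):
    """Return (gene_id, start, end) from an exon flat-file header line."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("not an exon header line: %r" % line)
    return match.group("gene"), int(match.group("start")), int(match.group("end"))


gene_id, start, end = parse_exon_header(">IDB60061(2441..2541)")
# Exon length, assuming both coordinates are inclusive.
exon_length = end - start + 1
```

Under the inclusive-coordinate assumption the example exon is 101 nts long, which agrees with the length shown in its ELM field.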

CDS:    inCDS, p0, tf0

Describes whether the exon is completely (inCDS), partially (partCDS), or not (notCDS) within the annotated coding sequence. If the exon is within annotated coding sequence, the start phase of the exon (the phase of the 5' flanking intron) is given as 'p0', 'p1', 'p2' or, where the phase cannot be determined, 'p-'. The phase is determined by comparing the first context given (see below) with the annotated coding sequence. Also given, with the 'tf' prefix, are the possible start phases that lead to translation of the exon without the introduction of a stop codon within the exon.

TYPE:   GT-AG A3

This field describes which of the six donor site groups this exon has been classified as belonging to. This will be one of the groups: 'GT-AG A3', 'GT-AG G3', 'GT-AG N3', 'GT-AG weak', 'GC-AG' and 'GT-AG U12'.

ELM:    e5(1..101 101)

This field describes how the observed exon compares with the annotated introns and exons. In this case the observed exon is an annotated form.

NUMT:   294

The number of transcripts observed to confirm this exon.

GB:     D88010  2466..2566

The GenBank/EMBL/DDBJ accession and exon location. In cases where the gene is on the complement strand of the annotated sequence, this is signified with 'complement(position)'.
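
The following Python sketch shows one way the GB field could be read. It assumes that the complement form is written literally as 'complement(start..end)' after the accession; only the forward-strand form appears in the example entry, so the complement string used below is hypothetical. The name parse_gb is likewise illustrative.

```python
import re

# Assumed layout: accession, whitespace, then either 'start..end' or
# 'complement(start..end)'. The complement layout is an assumption.
GB_RE = re.compile(
    r"^(?P<acc>\S+)\s+(?P<comp>complement\()?(?P<start>\d+)\.\.(?P<end>\d+)\)?$"
)


def parse_gb(value):
    """Return (accession, start, end, on_complement) from a GB field value."""
    match = GB_RE.match(value.strip())
    if match is None:
        raise ValueError("unrecognised GB field: %r" % value)
    on_complement = match.group("comp") is not None
    return (match.group("acc"), int(match.group("start")),
            int(match.group("end")), on_complement)


forward = parse_gb("D88010  2466..2566")
# Hypothetical complement-strand value, for illustration only.
reverse = parse_gb("D88010  complement(2466..2566)")
```

In the example entry the GenBank coordinates are offset from the gene coordinates by 25 (2466 against 2441), and the OVIN and OVEX lines show the same offset.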

FSAI:   CTTATTAATGTTTGATAATGTTAGGTCATTTTGGGTGGTTTTCTTGAATTGCACCAAATTTTATTTTTAG

Up to 70 nts of flanking sequence from the acceptor/upstream intron.

FSAE:   gataaggatgctaaattccgtctgattctaatagagagccggattcaccgt..

Up to 70 nts of sequence from the 5' end of the exon.

FSDE:   ..ttggctcgatattataagaccaagcgagtcctccctcccaattggaaata

Up to 70 nts of sequence from the 3' end of the exon.

FSDI:   GTAAGTATCAACTCTTTTGTCGTTGTTATCAAGAATAGGAGTCAGCCAGTAGTAAAAGTCCTAGTAGTAA

Up to 70 nts of flanking sequence from the donor/downstream intron.

OVIN:   i:2185..2474,  GB(2210..2499),  i4(1..256 256)e5(1..34 101)
OVEX:   e:2475..2541,  GB(2500..2566),  e5(35..101 101)

Each of the overlapping intron (OVIN) and overlapping exon (OVEX) fields may occur 0 or more times, and each occurrence describes an observed intron or exon that shares sequence with the current exon.

CNTX:   ~1..11,165..213,396..474,2015..2184,2441..2541,3171..~3204
CNTX:   ~1..23,165..213,396..474,2015..2184,2441..2541,3171..~3204

The 'context' field occurs 1 or more times, and describes the context(s) in which this exon was observed. In this case, the first CNTX shows that exon 2441..2541 was seen in one or more transcripts with 4 upstream introns and 1 downstream intron. In the second CNTX, one or more transcripts show the same flanking introns and exons, apart from the use of an alternative donor site for the first exon. The use of '~' indicates that the position was determined by the termination of a gene transcript match, so that the exact position of the splice site has not been determined. Such a position may be supposed not to extend a long way past a splice site (into an intron), but may be just about anywhere within an exon.
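
The reading of a CNTX line described above can be sketched in Python. This is a minimal example that assumes the value is a comma-separated list of 'start..end' exon segments, with '~' prefixed to any position that is not an exact splice site. The function names are illustrative only.

```python
def parse_context(value):
    """Split a CNTX value into (start, end, start_inexact, end_inexact) tuples."""
    segments = []
    for part in value.strip().split(","):
        left, right = part.split("..")
        segments.append((
            int(left.lstrip("~")),
            int(right.lstrip("~")),
            left.startswith("~"),   # '~' marks a transcript end, not a splice site
            right.startswith("~"),
        ))
    return segments


def flanking_intron_counts(segments, exon_start, exon_end):
    """Return (upstream, downstream) intron counts around the given exon."""
    for index, (start, end, _, _) in enumerate(segments):
        if (start, end) == (exon_start, exon_end):
            # One intron lies between each pair of adjacent exon segments.
            return index, len(segments) - 1 - index
    raise ValueError("exon %d..%d not present in context" % (exon_start, exon_end))


segments = parse_context(
    "~1..11,165..213,396..474,2015..2184,2441..2541,3171..~3204"
)
upstream, downstream = flanking_intron_counts(segments, 2441, 2541)
```

For the first CNTX line of the example this gives 4 upstream introns and 1 downstream intron, matching the description above.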

SSIS:   5.64, 7.41

Splice Site Information Scores. The information scores of the 5' and 3' flanking splice sites, in that order. For an exon these are the acceptor site and the donor site respectively.

END

The END tag signifies the end of the entry.
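
Putting the fields together, a flat file can be read one entry at a time. The sketch below is one possible approach in Python, not a reference parser. It assumes that every entry starts with a '>' line, ends with an END line, and that all other lines have the form 'TAG: value'. Values are stored in lists because OVIN, OVEX and CNTX may repeat. The name iter_flat_entries and the 'ID' key are illustrative only.

```python
def iter_flat_entries(lines):
    """Yield one dict per flat-file entry, mapping each tag to a list of values."""
    entry = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # Header line: start a new entry; 'ID' is a key chosen for this sketch.
            entry = {"ID": line[1:].strip()}
        elif line.strip() == "END":
            if entry is not None:
                yield entry
            entry = None
        elif entry is not None and ":" in line:
            # Split on the first ':' only, since values such as the OVIN
            # line contain further colons.
            tag, _, value = line.partition(":")
            entry.setdefault(tag.strip(), []).append(value.strip())


# A subset of the lines of the example entry shown earlier.
SAMPLE = """\
>IDB60061(2441..2541)
NUMT:   294
OVIN:   i:2185..2474,  GB(2210..2499),  i4(1..256 256)e5(1..34 101)
CNTX:   ~1..11,165..213,396..474,2015..2184,2441..2541,3171..~3204
CNTX:   ~1..23,165..213,396..474,2015..2184,2441..2541,3171..~3204
END
"""

entries = list(iter_flat_entries(SAMPLE.splitlines()))
```

Because the function takes any iterable of lines, it can be passed an open file handle once the compressed file has been uncompressed.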

 

Format of the exon table file

The format of the exon table file is explained below through consideration of the example entry:

IDB60038(606..721)    GT-AG_A3    gc:0.46    p1,2   bp:-6     ppt:-25..-3   acc:12.49   don:12.13

The first field is the exon identifier.

The second field is the donor site classification.

The third field gives the G+C content of the gene.

The fourth field gives the start and end phase of the exon (as p#,#).

The fifth field gives the position of the bulge adenosine for the highest scoring candidate U2 BPS in the region -10 to -50 (relative to the acceptor site, i.e. in the 5' flanking intron) with a score of at least 5 bits. The null value is '0' (for when there is no such candidate U2 BPS).

The sixth field describes the location (with respect to the acceptor site) of the most 3' candidate PPT that extends 3' of -50 (once again, this is relative to the acceptor splice site, and hence in the 5' flanking intron). The null value is '0..0'.

The seventh field is the 'bit score' of the acceptor site.

The eighth field gives the donor site 'bit score'.
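
The eight fields can be read with a short Python function. This is a sketch under stated assumptions: fields are separated by whitespace, no field contains internal whitespace (the classification is written with an underscore, as in 'GT-AG_A3'), and the phases are kept as strings in case an undetermined phase is written as '-'. The function name and dictionary keys are illustrative only.

```python
def parse_exon_table_line(line):
    """Parse one line of the exon table file into a dict."""
    ident, site_type, gc, phase, bp, ppt, acc, don = line.split()

    def value_of(field):
        # Drop the 'name:' prefix, e.g. 'gc:0.46' -> '0.46'.
        return field.split(":", 1)[1]

    start_phase, end_phase = phase[1:].split(",")   # 'p1,2' -> '1', '2'
    bp_pos = int(value_of(bp))
    ppt_start, ppt_end = (int(x) for x in value_of(ppt).split(".."))
    return {
        "id": ident,
        "type": site_type,
        "gc": float(value_of(gc)),
        "phase": (start_phase, end_phase),
        # Null values ('0' and '0..0') are mapped to None.
        "bp": None if bp_pos == 0 else bp_pos,
        "ppt": None if (ppt_start, ppt_end) == (0, 0) else (ppt_start, ppt_end),
        "acc": float(value_of(acc)),
        "don": float(value_of(don)),
    }


row = parse_exon_table_line(
    "IDB60038(606..721)    GT-AG_A3    gc:0.46    p1,2   bp:-6     "
    "ppt:-25..-3   acc:12.49   don:12.13"
)
```

Mapping the null values to None keeps a missing branch point or PPT from being mistaken for a real position of 0.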