As for the introns, data on all the observed exons is given in both a flat file and a table file. These exon files have similar, but not identical, formats to the corresponding intron files. The flat file is the central file for exon information, and its format is described first. The table file uses one line per exon, gives only partial information, and is described second.
Format of the exon flat file
By way of example, consider the entry:
>IDB60061(2441..2541)
CDS: inCDS, p0, tf0
TYPE: GT-AG A3
ELM: e5(1..101 101)
NUMT: 294
GB: D88010 2466..2566
FSAI: CTTATTAATGTTTGATAATGTTAGGTCATTTTGGGTGGTTTTCTTGAATTGCACCAAATTTTATTTTTAG
FSAE: gataaggatgctaaattccgtctgattctaatagagagccggattcaccgt..
FSDE: ..ttggctcgatattataagaccaagcgagtcctccctcccaattggaaata
FSDI: GTAAGTATCAACTCTTTTGTCGTTGTTATCAAGAATAGGAGTCAGCCAGTAGTAAAAGTCCTAGTAGTAA
OVIN: i:2185..2474, GB(2210..2499), i4(1..256 256)e5(1..34 101)
OVEX: e:2475..2541, GB(2500..2566), e5(35..101 101)
CNTX: ~1..11,165..213,396..474,2015..2184,2441..2541,3171..~3204
CNTX: ~1..23,165..213,396..474,2015..2184,2441..2541,3171..~3204
SSIS: 5.64, 7.41
END
Examining the fields in turn:
>IDB60061(2441..2541)
This is the first field and gives the gene identifier and position of the observed exon within the gene.
CDS: inCDS, p0, tf0
Describes whether the exon is completely (inCDS), partially (partCDS), or not (notCDS) within the annotated coding sequence. If the exon is within the annotated coding sequence, the start phase of the exon (the phase of the 5' flanking intron) is given as 'p0', 'p1', 'p2' or, where the phase cannot be determined, 'p-'. The phase is determined by comparing the first context given (see below) with the annotated coding sequence. Also given, with the 'tf' prefix, are the possible start phases that lead to translation of the exon without the introduction of a stop codon within the exon.
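As a rough illustration of how the CDS field might be read programmatically, the following Python sketch splits it into its three parts. The function name `parse_cds_field` and the returned keys are hypothetical. The example entry only shows a single 'tf' phase, so how several phases are written is an assumption here: the sketch simply collects every digit 0, 1 or 2 that follows 'tf'.

```python
import re


def parse_cds_field(value):
    """Parse the text after 'CDS:', e.g. 'inCDS, p0, tf0'.

    Hypothetical helper, not part of the database distribution.
    """
    # First comma-separated token: inCDS, partCDS or notCDS.
    status = value.split(",")[0].strip()

    # Start phase: p0, p1, p2, or p- when it cannot be determined.
    phase = None
    phase_match = re.search(r"\bp([012-])", value)
    if phase_match and phase_match.group(1) != "-":
        phase = int(phase_match.group(1))

    # Translatable start phases: ASSUMPTION that every digit 0-2 after
    # 'tf' is one phase (the source shows only the single-phase 'tf0').
    tf_match = re.search(r"\btf(.*)$", value)
    translatable = []
    if tf_match:
        translatable = [int(d) for d in re.findall(r"[012]", tf_match.group(1))]

    return {"status": status, "phase": phase, "translatable_phases": translatable}


print(parse_cds_field("inCDS, p0, tf0"))
```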
TYPE: GT-AG A3
This field gives the donor site group to which this exon has been classified as belonging. It will be one of six groups: 'GT-AG A3', 'GT-AG G3', 'GT-AG N3', 'GT-AG weak', 'GC-AG' and 'GT-AG U12'.
ELM: e5(1..101 101)
This field describes how the observed exon compares with the annotated introns and exons. In this case the observed exon is an annotated form.
NUMT: 294
The number of transcripts observed to confirm this exon.
GB: D88010 2466..2566
The GenBank/EMBL/DDBJ accession and exon location. In cases where the gene is on the complement strand of the annotated sequence, this is signified with 'complement(position)'.
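The GB field can be split into accession, coordinates and strand with a short Python sketch. The helper name `parse_gb_field` is hypothetical, and the exact layout of the complement form is an assumption: the sketch expects the accession followed by 'complement(start..end)', which is not shown in the example entry.

```python
import re

# Accession, then either 'start..end' or 'complement(start..end)'.
# ASSUMPTION: the complement form wraps only the coordinate range.
GB_RE = re.compile(r"^(\S+)\s+(complement\()?(\d+)\.\.(\d+)\)?$")


def parse_gb_field(value):
    """Parse the text after 'GB:', e.g. 'D88010 2466..2566'."""
    match = GB_RE.match(value.strip())
    if match is None:
        raise ValueError("unrecognised GB field: %r" % value)
    accession, complement, start, end = match.groups()
    return {
        "accession": accession,
        "start": int(start),
        "end": int(end),
        "strand": "-" if complement else "+",
    }


print(parse_gb_field("D88010 2466..2566"))
```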
FSAI: CTTATTAATGTTTGATAATGTTAGGTCATTTTGGGTGGTTTTCTTGAATTGCACCAAATTTTATTTTTAG
Up to 70 nts of flanking sequence from the acceptor/upstream intron.
FSAE: gataaggatgctaaattccgtctgattctaatagagagccggattcaccgt..
Up to 70 nts of sequence from the 5' end of the exon.
FSDE: ..ttggctcgatattataagaccaagcgagtcctccctcccaattggaaata
Up to 70 nts of sequence from the 3' end of the exon.
FSDI: GTAAGTATCAACTCTTTTGTCGTTGTTATCAAGAATAGGAGTCAGCCAGTAGTAAAAGTCCTAGTAGTAA
Up to 70 nts of flanking sequence from the donor/downstream intron.
OVIN: i:2185..2474, GB(2210..2499), i4(1..256 256)e5(1..34 101) OVEX: e:2475..2541, GB(2500..2566), e5(35..101 101)
Each of the overlapping intron (OVIN) and overlapping exon (OVEX) fields may occur 0 or more times, and each occurrence describes an observed intron or exon that shares sequence with the current exon.
CNTX: ~1..11,165..213,396..474,2015..2184,2441..2541,3171..~3204 CNTX: ~1..23,165..213,396..474,2015..2184,2441..2541,3171..~3204
The 'context' field occurs 1 or more times, and describes the context(s) in which this exon was observed. In this case, the first CNTX shows that exon 2441..2541 was seen in one or more transcripts with 4 upstream introns and 1 downstream intron. In the second CNTX, one or more transcripts show the same flanking introns and exons, apart from the use of an alternative donor site for the first exon (position 23 rather than 11). A '~' indicates that the position was set by the termination of a gene transcript match, so the exact position of the splice site has not been determined. Such a position is unlikely to extend far past a splice site (into an intron), but may lie almost anywhere within an exon.
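To make the reading of a CNTX value concrete, here is a small Python sketch that turns it into a list of exon intervals and counts the introns on either side of a given exon. The function names and dictionary keys are hypothetical. The only format assumptions are the ones visible in the example: exons are comma-separated, written as 'start..end', and '~' prefixes an undetermined position.

```python
def parse_context(value):
    """Parse the text after 'CNTX:' into a list of exon intervals."""
    exons = []
    for chunk in value.split(","):
        left, right = chunk.strip().split("..")
        exons.append({
            "start": int(left.lstrip("~")),
            "end": int(right.lstrip("~")),
            # False when the position is marked '~' (transcript match ended here).
            "start_exact": not left.startswith("~"),
            "end_exact": not right.startswith("~"),
        })
    return exons


def flanking_intron_counts(exons, start, end):
    """Return (upstream, downstream) intron counts for exon start..end."""
    index = next(i for i, e in enumerate(exons)
                 if (e["start"], e["end"]) == (start, end))
    # One intron lies between each pair of adjacent exons.
    return index, len(exons) - 1 - index


context = parse_context("~1..11,165..213,396..474,2015..2184,2441..2541,3171..~3204")
print(flanking_intron_counts(context, 2441, 2541))
```

For the first CNTX of the example entry this gives 4 upstream introns and 1 downstream intron, in agreement with the description above.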
SSIS: 5.64, 7.41
Splice Site Information Scores. The 5' and 3' splice site information scores (acceptor, donor in the case of an exon).
END
The END tag signifies the end of the entry.
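The field descriptions above suggest a simple way to read one entry: take the '>' header, then split the rest on the field tags. The Python sketch below does this. The name `parse_exon_entry`, the `TAGS` tuple and the output layout are all hypothetical, and the tag list is taken only from the example entry, so other tags may exist. Whitespace is collapsed first, so it does not matter whether fields sit on one line or several. The sample is a shortened copy of the example entry with the sequence and overlap fields left out.

```python
import re

# Field tags seen in the example entry (ASSUMPTION: this list is complete).
TAGS = ("CDS", "TYPE", "ELM", "NUMT", "GB", "FSAI", "FSAE",
        "FSDE", "FSDI", "OVIN", "OVEX", "CNTX", "SSIS")
TAG_RE = re.compile(r"\b(" + "|".join(TAGS) + r"):")
HEADER_RE = re.compile(r"^>(\S+?)\((\d+)\.\.(\d+)\)")


def parse_exon_entry(text):
    """Parse one flat-file entry into a header plus {tag: [values]}."""
    text = " ".join(text.split())          # collapse newlines and runs of spaces
    if text.endswith("END"):
        text = text[:-3].rstrip()          # drop the END tag
    header = HEADER_RE.match(text)
    if header is None:
        raise ValueError("entry does not start with '>gene(start..end)'")
    fields = {}
    # re.split with a capture group yields [prefix, tag, value, tag, value, ...].
    pieces = TAG_RE.split(text[header.end():])
    for tag, value in zip(pieces[1::2], pieces[2::2]):
        # Tags such as OVIN, OVEX and CNTX may repeat, so keep a list per tag.
        fields.setdefault(tag, []).append(value.strip())
    return {
        "gene": header.group(1),
        "start": int(header.group(2)),
        "end": int(header.group(3)),
        "fields": fields,
    }


SAMPLE = """>IDB60061(2441..2541)
CDS: inCDS, p0, tf0
TYPE: GT-AG A3
ELM: e5(1..101 101)
NUMT: 294
GB: D88010 2466..2566
CNTX: ~1..11,165..213,396..474,2015..2184,2441..2541,3171..~3204
CNTX: ~1..23,165..213,396..474,2015..2184,2441..2541,3171..~3204
SSIS: 5.64, 7.41
END"""

entry = parse_exon_entry(SAMPLE)
print(entry["gene"], entry["start"], entry["end"], sorted(entry["fields"]))
```

Splitting on 'TAG:' rather than on line breaks is a deliberate choice here: 'GB(' inside an OVIN or OVEX value is not followed by a colon, so it is not mistaken for the GB field.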
Format of the exon table file
The format of the exon table file is explained below through consideration of the example entry:
IDB60038(606..721) GT-AG_A3 gc:0.46 p1,2 bp:-6 ppt:-25..-3 acc:12.49 don:12.13
The first field is the exon identifier.
The second field is the donor site classification.
The third field gives the G+C content of the gene.
The fourth field gives the start and end phase of the exon (as p#,#).
The fifth field gives the position of the bulge adenosine for the highest-scoring candidate U2 BPS in the region -10 to -50 (relative to the acceptor site, i.e. in the 5' flanking intron) with a score of at least 5 bits. The null value is '0' (used when there is no such candidate U2 BPS).
The sixth field gives the location (relative to the acceptor splice site, and hence in the 5' flanking intron) of the most 3' candidate PPT that extends 3' of -50. The null value is '0..0'.
The seventh field is the 'bit score' of the acceptor site.
The eighth field gives the donor site 'bit score'.
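The eight fields can be read with a few lines of Python, sketched below. The name `parse_exon_table_line` and the returned keys are hypothetical. The sketch assumes the fields are separated by whitespace, which the example line suggests but the description does not state, and it maps the documented null values ('0' and '0..0') to `None`.

```python
import re

IDENT_RE = re.compile(r"^(\S+?)\((\d+)\.\.(\d+)\)$")


def parse_exon_table_line(line):
    """Parse one line of the exon table file into a dictionary."""
    # ASSUMPTION: exactly eight whitespace-separated fields per line.
    ident, site_class, gc, phase, bp, ppt, acc, don = line.split()

    match = IDENT_RE.match(ident)
    if match is None:
        raise ValueError("unrecognised exon identifier: %r" % ident)

    # 'p1,2' gives start phase 1 and end phase 2; non-digits become None.
    phases = tuple(int(p) if p.isdigit() else None
                   for p in phase[1:].split(","))

    bp_pos = int(bp.split(":", 1)[1])
    ppt_start, ppt_end = (int(x) for x in ppt.split(":", 1)[1].split(".."))

    return {
        "gene": match.group(1),
        "start": int(match.group(2)),
        "end": int(match.group(3)),
        "donor_class": site_class,
        "gc": float(gc.split(":", 1)[1]),
        "phases": phases,
        "branch_point": bp_pos if bp_pos != 0 else None,        # null is '0'
        "ppt": (ppt_start, ppt_end) if (ppt_start, ppt_end) != (0, 0) else None,
        "acceptor_bits": float(acc.split(":", 1)[1]),
        "donor_bits": float(don.split(":", 1)[1]),
    }


record = parse_exon_table_line(
    "IDB60038(606..721) GT-AG_A3 gc:0.46 p1,2 bp:-6 ppt:-25..-3 acc:12.49 don:12.13"
)
print(record)
```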