Data on all the observed introns is given in both a flat file and a table file. The flat file is the central file for intron information, and its format is described first. The table file uses one line per intron, gives only partial information, and is described second. Note that the exon flat and table files have similar, but not identical, formats and are described separately.
Format of the intron flat file
By way of example, consider the entry:
>IDB60041(1640..2350)
CDS: inCDS, p0
TYPE: GT-AG G3
ELM: i3(1..711 711)
NUMT: 16
GB: AB021866 1682..2392
FSDE: gcggaccgtggagtcgtcacttcgggcacaagtgcccttcgagcagattctcagccttccagagctcaag
FSDI: GTGCAAGCGCTCCCCTCCTTTGACACCTCTCCCACCACTCCCTCCCTGCTAGACCCCCTAACTCCATCTG..
FSAI: ..CTCTCAAGTTTCTGGTAGGCTTTAATGAGCGTGTGACCTGGGCCACGTCCTGTGGCGTTTGTTCTCCTAG
FSAE: gccaaccccttcaaggagcgaatctgcagggtcttctccacatccccagccaaagacagccttagctttg
OVIN: i:218..2350, GB(260..2392), i2(1..1313 1313)e3(1..109 109)i3(1..711 711)
CNTX: ~1..51,183..217,1531..1639,2351..2501,2641..2759,2832..~2917
CNTX: ~1579..1639,2351..2501,2641..2759,2832..2920,3350..~3371
SSIS: 4.95, 8.71 (U12 -9.644)
BPPPT: BP(-77, 4.29), PPT(-76, -66), BP(-46, 6.02), BP(-35, 4.29), PPT(-28, -18), PPT(-13, -3)
GGCC: 0.557
END
Examining the fields in turn:
>IDB60041(1640..2350)
This is the first field and gives the gene identifier and position of the observed intron within the gene.
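As an illustration only, the first field can be split into its parts with a small Python sketch. The function name parse_intron_id and the regular expression are assumptions based on the example entry, not part of the database's own tooling.

```python
import re

# Assumed layout, taken from the example entry: optional '>' + gene identifier + '(start..end)'.
ID_FIELD = re.compile(r"^>?(?P<gene>[^\s(]+)\((?P<start>\d+)\.\.(?P<end>\d+)\)$")


def parse_intron_id(field):
    """Return (gene_id, start, end) for a field such as '>IDB60041(1640..2350)'."""
    m = ID_FIELD.match(field.strip())
    if m is None:
        raise ValueError(f"unrecognised intron identifier: {field!r}")
    return m.group("gene"), int(m.group("start")), int(m.group("end"))


gene, start, end = parse_intron_id(">IDB60041(1640..2350)")
length = end - start + 1  # coordinates are treated as inclusive
print(gene, start, end, length)  # IDB60041 1640 2350 711
```

The computed length of 711 agrees with the '1..711 711' given in the ELM field of the same entry, which suggests the coordinates are inclusive.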
CDS: inCDS, p0
States whether or not the intron lies within the annotated coding sequence. If it does, the phase of the intron is given as 'p0', 'p1', 'p2' or, where the phase cannot be determined, 'p-'. The phase is determined by examining both the position of the intron in the annotated CDS and the context of the intron (see below).
TYPE: GT-AG G3
This field describes which of the six donor site groups this intron has been classified as belonging to. This will be one of the groups: 'GT-AG A3', 'GT-AG G3', 'GT-AG N3', 'GT-AG weak', 'GC-AG' or 'GT-AG U12'.
ELM: i3(1..711 711)
This field describes how the observed intron compares with the annotated introns and exons. In this case the observed intron is an annotated intron.
NUMT: 16
The number of transcripts observed to confirm this intron.
GB: AB021866 1682..2392
The GenBank/EMBL/DDBJ accession and intron location. In cases where the gene is on the complement strand of the annotated sequence, this is signified with 'complement(position)'.
FSDE: gcggaccgtggagtcgtcacttcgggcacaagtgcccttcgagcagattctcagccttccagagctcaag
Up to 70 nts of flanking sequence from the donor/upstream exon.
FSDI: GTGCAAGCGCTCCCCTCCTTTGACACCTCTCCCACCACTCCCTCCCTGCTAGACCCCCTAACTCCATCTG..
Up to 70 nts of sequence from the 5' end of the intron.
FSAI: ..CTCTCAAGTTTCTGGTAGGCTTTAATGAGCGTGTGACCTGGGCCACGTCCTGTGGCGTTTGTTCTCCTAG
Up to 70 nts of sequence from the 3' end of the intron.
FSAE: gccaaccccttcaaggagcgaatctgcagggtcttctccacatccccagccaaagacagccttagctttg
Up to 70 nts of sequence from the acceptor/downstream exon.
OVIN: i:218..2350, GB(260..2392), i2(1..1313 1313)e3(1..109 109)i3(1..711 711)
This field may occur 0 or more times, and each occurrence describes an observed intron that shares sequence with the current intron. There is a similar field for exons, 'OVEX', of which there are 0 in this example entry.
CNTX: ~1..51,183..217,1531..1639,2351..2501,2641..2759,2832..~2917
CNTX: ~1579..1639,2351..2501,2641..2759,2832..2920,3350..~3371
The 'context' field occurs 1 or more times, and describes the context(s) in which this intron was observed. Each context lists the exon segments of the matching transcript(s), so the gaps between consecutive segments are the introns. In this case, intron 1640..2350 was seen (first CNTX) in one or more transcripts with 2 upstream introns and 2 downstream introns. In the second CNTX, the intron is seen in one or more transcripts with no upstream introns and three downstream introns. The use of '~' indicates that the position was set by the end of a gene transcript match, so the exact position of the splice site has not been determined. Such a position can be assumed not to extend far past a splice site (into an intron), but may lie almost anywhere within the exon.
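The reading of a CNTX value can be sketched in Python as follows. This is a minimal illustration, assuming that each comma-separated 'a..b' item is an exon segment in gene coordinates and that '~' only flags an undetermined end; the function names are hypothetical.

```python
def parse_context(cntx):
    """Split a CNTX value into (start, end) exon segments, dropping '~' markers."""
    segments = []
    for part in cntx.split(","):
        start, end = part.replace("~", "").strip().split("..")
        segments.append((int(start), int(end)))
    return segments


def introns_in_context(cntx):
    """Introns implied by the gaps between consecutive exon segments."""
    segs = parse_context(cntx)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(segs, segs[1:])]


def neighbours(cntx, intron):
    """Count introns upstream and downstream of `intron` within one context."""
    introns = introns_in_context(cntx)
    i = introns.index(intron)  # raises ValueError if the intron is not in this context
    return i, len(introns) - i - 1


first = "~1..51,183..217,1531..1639,2351..2501,2641..2759,2832..~2917"
print(introns_in_context(first))
print(neighbours(first, (1640, 2350)))  # (2, 2)
```

For the first CNTX of the example entry this gives two introns on either side of 1640..2350, in agreement with the description above.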
SSIS: 4.95, 8.71 (U12 -9.644)
Splice Site Information Scores. The 5' and 3' splice site information scores (donor, acceptor in the case of an intron). The score of the donor site against the U12 consensus is given in parentheses.
BPPPT: BP(-77, 4.29), PPT(-76, -66), BP(-46, 6.02), BP(-35, 4.29), PPT(-28, -18), PPT(-13, -3)
The Branch Point and Polypyrimidine Tract data. Polypyrimidine tracts are described as PPT(a, b), where a and b are the terminal positions of the candidate PPT in relation to the acceptor splice site. Candidate U2 and U12 branch point sequences are described by BP(a, s) and U12BP(a, s), where a is the position of the bulged adenosine relative to the acceptor splice site and s is the score (in bits) of the candidate.
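A BPPPT value can be separated into branch point and PPT candidates with a short Python sketch such as the one below. The token shapes are inferred from the example entry and the description above; the name parse_bpppt is an assumption.

```python
import re

# Assumed token shapes: BP(pos, score), U12BP(pos, score), PPT(start, end).
TOKEN = re.compile(r"(U12BP|BP|PPT)\(\s*(-?\d+)\s*,\s*(-?\d+(?:\.\d+)?)\s*\)")


def parse_bpppt(value):
    """Return (branch_points, ppts) from a BPPPT field value.

    branch_points: list of (kind, position, score_in_bits), kind is 'BP' or 'U12BP'
    ppts:          list of (start, end), positions relative to the acceptor site
    """
    branch_points, ppts = [], []
    for kind, a, b in TOKEN.findall(value):
        if kind == "PPT":
            ppts.append((int(a), int(b)))
        else:
            branch_points.append((kind, int(a), float(b)))
    return branch_points, ppts


bps, ppts = parse_bpppt(
    "BP(-77, 4.29), PPT(-76, -66), BP(-46, 6.02), BP(-35, 4.29), "
    "PPT(-28, -18), PPT(-13, -3)"
)
print(bps)
print(ppts)
```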
GGCC: 0.557
The G+C content of the gene.
END
The END tag signifies the end of the entry.
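Putting the fields together, one entry can be split on its tags with a Python sketch like the following. It assumes the tag names are exactly those shown in the example entry (plus OVEX), that each tag is followed by a colon, and that whitespace of any kind separates fields; parse_entry is a hypothetical name. Repeatable fields such as CNTX and OVIN are kept as lists.

```python
import re
from collections import defaultdict

# Tags seen in the example entry, plus OVEX as mentioned in the field descriptions.
TAGS = ("CDS", "TYPE", "ELM", "NUMT", "GB", "FSDE", "FSDI", "FSAI", "FSAE",
        "OVIN", "OVEX", "CNTX", "SSIS", "BPPPT", "GGCC")
TAG_RE = re.compile(r"(?<!\S)(" + "|".join(TAGS) + r"):\s*")


def parse_entry(text):
    """Map each tag of one flat-file entry to the list of its values."""
    text = text.strip()
    if text.endswith("END"):
        text = text[:-3].rstrip()
    # re.split with one capturing group yields [identifier, tag1, value1, tag2, value2, ...]
    pieces = TAG_RE.split(text)
    fields = defaultdict(list)
    fields["ID"].append(pieces[0].strip())
    for tag, value in zip(pieces[1::2], pieces[2::2]):
        fields[tag].append(value.strip())
    return dict(fields)


# Part of the example entry; the sequence, OVIN and BPPPT fields are left out for brevity.
entry = (">IDB60041(1640..2350) CDS: inCDS, p0 TYPE: GT-AG G3 "
         "ELM: i3(1..711 711) NUMT: 16 GB: AB021866 1682..2392 "
         "CNTX: ~1..51,183..217,1531..1639,2351..2501,2641..2759,2832..~2917 "
         "CNTX: ~1579..1639,2351..2501,2641..2759,2832..2920,3350..~3371 "
         "SSIS: 4.95, 8.71 (U12 -9.644) GGCC: 0.557 END")
fields = parse_entry(entry)
print(fields["TYPE"], fields["NUMT"], len(fields["CNTX"]))
```

Because the split is on tags rather than on line breaks, the same sketch should work whether an entry occupies one line or several.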
Format of the intron table file
One advantage of the intron data given in the table file is that probable (U2) BP and PPT signals have been chosen for listing (as opposed to the listing of all candidates in the flat file). The format of the table file is explained below through consideration of the example entry:
IDB60003(5759..6754) GT-AG_G3 gc:0.429 p0 don:9.34 bp:-16 ppt:-15..-3 acc:6.61
The first field is the intron identifier.
The second field is the donor site classification.
The third field gives the G+C content of the gene.
The fourth field gives the phase of the intron as p0, p1, p2 or p- (for not determined / not in CDS).
The fifth field gives the donor site 'bit score'.
The sixth field gives the position of the bulged adenosine for the highest scoring candidate U2 BPS in the region -10 to -50 (relative to the acceptor site), and with a score of at least 5 bits. The null value is '0' (for when there is no such candidate U2 BPS).
The seventh field describes the location (with respect to the acceptor site) of the most 3' candidate PPT that extends 3' of -50. The null value is '0..0'.
The eighth field is the 'bit score' of the acceptor site.
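The field list above can be sketched in Python as a line parser, together with one possible reading of the selection rules for the sixth and seventh fields. All function names are hypothetical. The sketch assumes whitespace-separated fields, that the -10 to -50 region includes both end points, and that "extends 3' of -50" means the PPT's 3' end lies beyond -50; none of this is confirmed by the database documentation.

```python
def parse_table_line(line):
    """Parse one table-file line into a dict; null values become None."""
    ident, donor_type, gc, phase, don, bp, ppt, acc = line.split()
    ppt_start, ppt_end = ppt.split(":", 1)[1].split("..")
    return {
        "id": ident,
        "type": donor_type,
        "gc": float(gc.split(":", 1)[1]),
        "phase": phase,
        "donor_bits": float(don.split(":", 1)[1]),
        "bp": int(bp.split(":", 1)[1]) or None,  # '0' is the documented null value
        "ppt": None if (ppt_start, ppt_end) == ("0", "0")
               else (int(ppt_start), int(ppt_end)),
        "acceptor_bits": float(acc.split(":", 1)[1]),
    }


def choose_bp(candidates):
    """Pick the best U2 BP from (position, score) pairs; 0 if none qualifies."""
    eligible = [(pos, s) for pos, s in candidates if -50 <= pos <= -10 and s >= 5.0]
    return max(eligible, key=lambda c: c[1])[0] if eligible else 0


def choose_ppt(candidates):
    """Pick the most 3' PPT (start, end) whose end lies 3' of -50; (0, 0) if none."""
    eligible = [(a, b) for a, b in candidates if b > -50]
    return max(eligible, key=lambda c: c[1]) if eligible else (0, 0)


row = parse_table_line(
    "IDB60003(5759..6754) GT-AG_G3 gc:0.429 p0 don:9.34 bp:-16 ppt:-15..-3 acc:6.61"
)
print(row["bp"], row["ppt"])

# Candidates from the flat-file example entry (IDB60041), a different intron from the table example.
print(choose_bp([(-77, 4.29), (-46, 6.02), (-35, 4.29)]))
print(choose_ppt([(-76, -66), (-28, -18), (-13, -3)]))
```

Under these assumptions, the candidates listed for IDB60041 in the flat file would give a branch point at -46 and a PPT at -13..-3.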