AltExtron - gb147 - flatfiles of transcript confirmed introns

Data on all observed introns are given in flat files, one per species, listed below. Note that the exon flat files have a similar, but not identical, format and are described separately.


Homo sapiens ae_gb147_human_introns.flat.gz
Mus musculus ae_gb147_mouse_introns.flat.gz
Rattus norvegicus ae_gb147_rat_introns.flat.gz
Drosophila melanogaster ae_gb147_dros_introns.flat.gz
Caenorhabditis elegans ae_gb147_elegan_introns.flat.gz
Arabidopsis thaliana ae_gb147_arab_introns.flat.gz
Danio rerio ae_gb147_zfish_introns.flat.gz
Gallus gallus ae_gb147_chicken_introns.flat.gz
Xenopus laevis ae_gb147_frog_introns.flat.gz
Bos taurus ae_gb147_cow_introns.flat.gz
Anopheles gambiae ae_gb147_mosquito_introns.flat.gz

Format of the intron flat file

By way of example, consider the entry:

>IDB60041(1640..2350)
CDS:    inCDS, p0
TYPE:   GT-AG G3
ELM:    i3(1..711 711)
NUMT:   16
GB:     AB021866.1  1682..2392
FSDE:   gcggaccgtggagtcgtcacttcgggcacaagtgcccttcgagcagattctcagccttccagagctcaag
FSDI:   GTGCAAGCGCTCCCCTCCTTTGACACCTCTCCCACCACTCCCTCCCTGCTAGACCCCCTAACTCCATCTG..
FSAI:   ..CTCTCAAGTTTCTGGTAGGCTTTAATGAGCGTGTGACCTGGGCCACGTCCTGTGGCGTTTGTTCTCCTAG
FSAE:   gccaaccccttcaaggagcgaatctgcagggtcttctccacatccccagccaaagacagccttagctttg
OVIN:   i:218..2350,  GB(260..2392),  i2(1..1313 1313)e3(1..109 109)i3(1..711 711)
CNTX:   ~1..51,183..217,1531..1639,2351..2501,2641..2759,2832..~2917
CNTX:   ~1579..1639,2351..2501,2641..2759,2832..2920,3350..~3371
SSIS:   4.95, 8.71
GGCC:   0.557
END

Examining the fields in turn:

>IDB60041(1640..2350)

This is the first field and gives the gene identifier and position of the observed intron within the gene.
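
As an illustration, the header line can be split into its parts with a regular expression. This is a minimal sketch: the pattern is inferred from the single example above, the name parse_header is hypothetical, and the coordinates are assumed to be 1-based and inclusive (which agrees with the length of 711 shown in the ELM field).

```python
import re

# Assumed layout: '>' + gene identifier + '(start..end)', as in the example entry.
HEADER_RE = re.compile(r"^>(?P<gene>[^()\s]+)\((?P<start>\d+)\.\.(?P<end>\d+)\)\s*$")


def parse_header(line):
    """Return (gene_id, start, end) from an intron header line."""
    match = HEADER_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognised header line: {line!r}")
    return match.group("gene"), int(match.group("start")), int(match.group("end"))


gene_id, start, end = parse_header(">IDB60041(1640..2350)")
# Assumption: positions are 1-based and inclusive, so length = end - start + 1.
intron_length = end - start + 1
print(gene_id, start, end, intron_length)
```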

CDS:    inCDS, p0

Describes whether or not the intron lies within the annotated coding sequence. If it does, the phase of the intron is given as 'p0', 'p1' or 'p2', or as 'p-' where the phase cannot be determined. The phase is determined by examining both the position of the intron in the annotated CDS and the context of the intron (see below).

TYPE:   GT-AG G3

This field describes which of the six donor site groups this intron has been classified as belonging to. The six groups are 'GT-AG A3', 'GT-AG G3', 'GT-AG Y3', 'GT-AG weak', 'GC-AG' and 'AT-AC'. In addition, the field may flag 'Annotated-Non-Canonical' events (which are usually annotation errors).

ELM:    i3(1..711 711)

This field describes how the observed intron compares with the annotated introns and exons. In this case the observed intron is an annotated intron.

NUMT:   16

The number of transcripts observed to confirm this intron.

GB:     AB021866.1  1682..2392

The GenBank/EMBL/DDBJ accession.version and the intron location within that sequence. In cases where the gene is on the complement strand of the annotated sequence, this is signified with 'complement(position)'.
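
The GB value can be read along the following lines. This is a sketch only: parse_gb is a hypothetical helper, and the exact spelling of the complement form, 'complement(start..end)', is an assumption based on the description above rather than on an entry shown here.

```python
def parse_gb(value):
    """Parse a GB value such as 'AB021866.1  1682..2392' into a dict."""
    accession_version, location = value.split()
    accession, version = accession_version.rsplit(".", 1)
    # Assumed complement-strand form: 'complement(start..end)'.
    on_complement = location.startswith("complement(") and location.endswith(")")
    if on_complement:
        location = location[len("complement("):-1]
    start, end = (int(part) for part in location.split(".."))
    return {
        "accession": accession,
        "version": int(version),
        "start": start,
        "end": end,
        "complement": on_complement,
    }


gb = parse_gb("AB021866.1  1682..2392")
# Made-up complement-strand value, used only to exercise the assumed syntax.
gb_rev = parse_gb("AB021866.1  complement(1682..2392)")
print(gb)
print(gb_rev["complement"])
```

In the example entry the GenBank coordinates (1682..2392) are the gene coordinates from the header (1640..2350) shifted by 42 at both ends.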

FSDE:   gcggaccgtggagtcgtcacttcgggcacaagtgcccttcgagcagattctcagccttccagagctcaag

Up to 70 nts of flanking sequence from the donor/upstream exon (Flanking Seq. Donor Exon).

FSDI:   GTGCAAGCGCTCCCCTCCTTTGACACCTCTCCCACCACTCCCTCCCTGCTAGACCCCCTAACTCCATCTG..

Up to 70 nts of sequence from the 5' end of the intron.

FSAI:   ..CTCTCAAGTTTCTGGTAGGCTTTAATGAGCGTGTGACCTGGGCCACGTCCTGTGGCGTTTGTTCTCCTAG

Up to 70 nts of sequence from the 3' end of the intron.

FSAE:   gccaaccccttcaaggagcgaatctgcagggtcttctccacatccccagccaaagacagccttagctttg

Up to 70 nts of sequence from the acceptor/downstream exon.
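
The intron flanking fields can be used to cross-check the splice site dinucleotides against the TYPE field. The sketch below assumes that the '..' at the end of FSDI and at the start of FSAI only marks where the intron sequence continues and is not part of the sequence itself; splice_dinucleotides is a hypothetical helper name.

```python
def splice_dinucleotides(fsdi, fsai):
    """Return (donor, acceptor) dinucleotides from the FSDI and FSAI values."""
    # Assumption: '..' is a continuation marker, so it is stripped before slicing.
    donor_side = fsdi.strip().rstrip(".")
    acceptor_side = fsai.strip().lstrip(".")
    return donor_side[:2].upper(), acceptor_side[-2:].upper()


fsdi = "GTGCAAGCGCTCCCCTCCTTTGACACCTCTCCCACCACTCCCTCCCTGCTAGACCCCCTAACTCCATCTG.."
fsai = "..CTCTCAAGTTTCTGGTAGGCTTTAATGAGCGTGTGACCTGGGCCACGTCCTGTGGCGTTTGTTCTCCTAG"

donor, acceptor = splice_dinucleotides(fsdi, fsai)
print(f"{donor}-{acceptor}")
```

For the example entry this gives GT-AG, in agreement with the first part of the TYPE field.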

OVIN:   i:218..2350,  GB(260..2392),  i2(1..1313 1313)e3(1..109 109)i3(1..711 711)

This field may occur 0 or more times, and each occurrence describes another observed intron that shares sequence with the current intron. There is a similar field for exons, 'OVEX', of which there are 0 in this example entry.

CNTX:   ~1..51,183..217,1531..1639,2351..2501,2641..2759,2832..~2917
CNTX:   ~1579..1639,2351..2501,2641..2759,2832..2920,3350..~3371

The 'context' field occurs 1 or more times, and describes the context(s) in which this intron was observed. Each CNTX lists the exon segments of the matching transcript(s), so the introns are the gaps between consecutive segments. In this case, the first CNTX shows that intron 1640..2350 was seen in one or more transcripts with 2 upstream introns and 2 downstream introns. In the second CNTX, the intron is seen in one or more transcripts with no upstream introns and 3 downstream introns.

The use of '~' indicates that this position has been determined by the termination of a gene transcript match, so the exact position of the splice site has not been determined. Such a position may be supposed not to extend a long way past a splice site (into an intron), but may still be just about anywhere within the exon.
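
The counting of upstream and downstream introns can be sketched as follows. The code assumes, from the example, that each comma-separated CNTX segment is an exon and that introns are the gaps between consecutive segments; the function names are hypothetical.

```python
def parse_cntx(value):
    """Split a CNTX value into exon segments.

    Each segment becomes (start, end, start_is_approximate, end_is_approximate),
    where the flags record a leading '~' on the respective coordinate.
    """
    exons = []
    for segment in value.split(","):
        left, right = segment.strip().split("..")
        exons.append(
            (int(left.lstrip("~")), int(right.lstrip("~")),
             left.startswith("~"), right.startswith("~"))
        )
    return exons


def introns_between(exons):
    """Assumed rule: an intron is the gap between two consecutive exon segments."""
    return [(prev[1] + 1, nxt[0] - 1) for prev, nxt in zip(exons, exons[1:])]


def count_neighbours(value, intron):
    """Return (upstream, downstream) intron counts around `intron` in one CNTX."""
    introns = introns_between(parse_cntx(value))
    index = introns.index(intron)
    return index, len(introns) - index - 1


cntx1 = "~1..51,183..217,1531..1639,2351..2501,2641..2759,2832..~2917"
cntx2 = "~1579..1639,2351..2501,2641..2759,2832..2920,3350..~3371"

print(count_neighbours(cntx1, (1640, 2350)))
print(count_neighbours(cntx2, (1640, 2350)))
```

Applied to the two CNTX lines of the example entry, this reports 2 upstream and 2 downstream introns for the first context, and 0 upstream and 3 downstream introns for the second.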

SSIS:   4.95, 8.71

Splice Site Information Scores. The two values are the information scores of the 5' and 3' splice sites respectively (donor and acceptor in the case of an intron).

GGCC:   0.557

The G+C content of the gene.

END

The END tag signifies the end of the entry.
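
Putting the pieces together, a file in this format can be read entry by entry. The following is a minimal sketch rather than a reference parser: it assumes every entry starts with a '>' line, that every other line is 'TAG: value', and that END closes the entry, as in the example. Values are stored in lists because fields such as CNTX and OVIN may repeat. The sample text is an abbreviated copy of the example entry.

```python
import io
from collections import defaultdict


def read_entries(handle):
    """Yield (header, fields) per entry; fields maps each tag to a list of values."""
    header, fields = None, defaultdict(list)
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            # A '>' line opens a new entry.
            header, fields = line[1:], defaultdict(list)
        elif line.strip() == "END":
            # END closes the current entry.
            if header is not None:
                yield header, dict(fields)
            header = None
        elif ":" in line and header is not None:
            # Split on the first colon only, since values may contain colons.
            tag, value = line.split(":", 1)
            fields[tag.strip()].append(value.strip())


sample = """\
>IDB60041(1640..2350)
CDS:    inCDS, p0
TYPE:   GT-AG G3
NUMT:   16
OVIN:   i:218..2350,  GB(260..2392),  i2(1..1313 1313)e3(1..109 109)i3(1..711 711)
CNTX:   ~1..51,183..217,1531..1639,2351..2501,2641..2759,2832..~2917
CNTX:   ~1579..1639,2351..2501,2641..2759,2832..2920,3350..~3371
SSIS:   4.95, 8.71
GGCC:   0.557
END
"""

entries = list(read_entries(io.StringIO(sample)))
header, fields = entries[0]
print(header, fields["TYPE"], len(fields["CNTX"]))
```

To read one of the compressed files listed at the top of this page, the same function can be given a text-mode handle, for example gzip.open("ae_gb147_human_introns.flat.gz", "rt") from the Python standard library.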