AltExtron - gb147 - gene data
The gene data, for each organism, is provided in three (compressed) files:
a nucleotide sequence file
a protein sequence file
an information file
Not all of the genes have protein sequences given.
The information file format is detailed below.
See the other notes at the bottom of this page.
The information file contains entries of the type:
>IDB1078032
LOCUS AB005803 15499 bp DNA linear PRI 20-MAR-1999
DEFINITION Homo sapiens DNA for histidine-rich glycoprotein, complete cds.
ACC: AB005803.1 (1801..14648)
GI: 2280513
ORG: Homo sapiens
FET: CDS join(2301..2483,5205..5321,6208..6298,7892..8058,9055..9135,11357..11458,13312..14148)
ELMS: uu:1..500,e1:501..683,i1:684..3404,e2:3405..3521,i2:3522..4407,e3:4408..4498,i3:4499..6091,e4:6092..6258,i4:6259..7254,e5:7255..7335,i5:7336..9556,e6:9557..9658,i6:9659..11511,e7:11512..12348,ud:12349..12848,
AFETS:
PRODUCT: histidine-rich glycoprotein,
EVIDENCE: not available
PROTEIN_ID: BAA21613.1,
END
where the fields have the following meanings:
> AltExtron gene identifier, always of the form IDB#.
LOCUS & DEFINITION As given in the GenBank flat files.
ACC: The GenBank accession number, version, and the position of
the gene (not including complement information - see FET).
GI: The GenBank GI number (GenInfo Identifier)
ORG: Organism (always Homo sapiens in this data set)
FET: The feature in the GenBank annotation used to define the gene.
ELMS: The positions of the exons (e#:) and introns (i#:) in the local
sequence, including up to 500 nt of unknown upstream (uu:) and
downstream (ud:) sequence - based on FET. In the example above, local
position 1 corresponds to position 1801, the start of the ACC range.
AFETS: Any other CDS or mRNA feature descriptions in the GenBank annotation.
PRODUCT: The gene product if and as parsed from GenBank.
EVIDENCE: "experimental", "not_experimental" or "unknown" (note that the
example entry above shows "not available")
PROTEIN_ID: The protein ID (may be more than one)
END A useful tag for parsing purposes.
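The field layout above can be read with a small parser. The following Python sketch is illustrative only and is not part of the AltExtron distribution. It assumes that each field occupies a single line, that an entry starts at a ">" line and stops at a line containing only END, and that the file has already been decompressed. The function names (parse_entries, parse_acc, parse_elms, to_genbank) are hypothetical. The local-to-GenBank offset used in to_genbank is inferred from the example entry only and ignores complement-strand genes.

```python
import re

# Sample entry copied from the example above.
SAMPLE = """\
>IDB1078032
LOCUS AB005803 15499 bp DNA linear PRI 20-MAR-1999
DEFINITION Homo sapiens DNA for histidine-rich glycoprotein, complete cds.
ACC: AB005803.1 (1801..14648)
GI: 2280513
ORG: Homo sapiens
FET: CDS join(2301..2483,5205..5321,6208..6298,7892..8058,9055..9135,11357..11458,13312..14148)
ELMS: uu:1..500,e1:501..683,i1:684..3404,e2:3405..3521,i2:3522..4407,e3:4408..4498,i3:4499..6091,e4:6092..6258,i4:6259..7254,e5:7255..7335,i5:7336..9556,e6:9557..9658,i6:9659..11511,e7:11512..12348,ud:12349..12848,
AFETS:
PRODUCT: histidine-rich glycoprotein,
EVIDENCE: not available
PROTEIN_ID: BAA21613.1,
END
"""

ACC_RE = re.compile(r"(\S+)\s+\((\d+)\.\.(\d+)\)")
ELEMENT_RE = re.compile(r"(\w+):(\d+)\.\.(\d+)")


def parse_entries(lines):
    """Yield one dict per entry (assumption: one field per line)."""
    entry = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new entry begins; the identifier follows the '>'.
            entry = {"ID": line[1:].strip()}
        elif entry is None or not line:
            continue
        elif line == "END":
            yield entry
            entry = None
        else:
            # The tag is the first token: 'ACC:' style or a bare keyword (LOCUS).
            tag, _, value = line.partition(" ")
            entry[tag.rstrip(":")] = value.strip()


def parse_acc(value):
    """Split 'AB005803.1 (1801..14648)' into (accession.version, start, end)."""
    match = ACC_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognised ACC value: {value!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


def parse_elms(value):
    """Return a list of (element name, local start, local end) tuples."""
    return [(name, int(start), int(end))
            for name, start, end in ELEMENT_RE.findall(value)]


def to_genbank(local_pos, acc_start):
    """Map a local position to a GenBank position (offset inferred from the example)."""
    return local_pos + acc_start - 1


entries = list(parse_entries(SAMPLE.splitlines()))
accession, acc_start, acc_end = parse_acc(entries[0]["ACC"])
for name, start, end in parse_elms(entries[0]["ELMS"]):
    if name.startswith("e"):
        # For e1 this gives 2301..2483, matching the first CDS segment in FET.
        print(name, to_genbank(start, acc_start), to_genbank(end, acc_start))
```

Trailing commas in the PRODUCT and PROTEIN_ID values are kept as they appear in the file; a caller that needs a list of protein IDs could split the value on "," and drop empty items.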
NOTES
* The genes provided here are the subset of genes extracted from GenBank
that have one or more transcript-confirmed introns/exons.