The start-up gene data
1. The gene data set (extracted from GenBank release 117) is provided in three
major (compressed) files:
human_genes.eds.info.Z - containing information about each gene
human_genes.eds.fna.Z - containing the gene sequences
human_genes.eds.faa.Z - containing the protein sequences
Not all of the genes have protein sequences given.
The information file contains entries of the following form:
>IDB60265
ACC: X87344 94885..100219
GI: 1054740
ORG: Homo sapiens
TAXA: Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
FET: mRNA complement(join(94885..95043,95943..96084,96314..96443,97054..97185,98244..98311,100160..100219))
ELMS: e1:1..60,i1:61..1908,e2:1909..1976,i2:1977..3034,e3:3035..3166,i3:3167..3776,e4:3777..3906,i4:3907..4135,e5:4136..4277,i5:4278..5176,e6:5177..5335,
AFETS: CDS complement(join(94916..95043,95943..96084,96314..96443,97054..97185,98244..98311,100160..100219))
PRODUCT: not available
EVIDENCE: unknown
PROTEIN_ID: CAA60784.1
PROT: given
END
where the fields have the following meanings:
> Our local identifier.
ACC: The GenBank accession number and the position of the gene (not including
complement information).
GI: The GenBank GI number (GenInfo Identifier) of the sequence record.
ORG: Organism (always Homo sapiens in this data set)
TAXA: The taxonomic lineage of the organism, as given in GenBank.
FET: The feature in the GenBank annotation used to define the gene.
ELMS: The positions of the exons (e1, e2, ...) and introns (i1, i2, ...) in
the local sequence, given as start..end ranges counted from 1.
AFETS: Any other CDS or mRNA feature descriptions in the GenBank annotation.
PRODUCT: The gene product, if it could be parsed; otherwise "not available".
EVIDENCE: "experimental", "not_experimental" or "unknown"
PROTEIN_ID: The GenBank protein identifier (e.g. CAA60784.1).
PROT: "given" means the protein sequence is given in the local files.
END Marks the end of the entry.
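
The entry layout described above can be read with a few lines of Python. The
following is only a minimal sketch, not a tool shipped with the data set: the
function names parse_info and parse_elms are made up for illustration, and the
sketch assumes that every field sits on a single line of the form "NAME: value",
as in the example entry. To read the real file, it would first have to be
decompressed (for instance with uncompress) and then passed in as an open text
file.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple

# Shortened copy of the example entry, used only for demonstration.
SAMPLE = """\
>IDB60265
ACC: X87344 94885..100219
GI: 1054740
ORG: Homo sapiens
ELMS: e1:1..60,i1:61..1908,e2:1909..1976,i2:1977..3034,e3:3035..3166,i3:3167..3776,e4:3777..3906,i4:3907..4135,e5:4136..4277,i5:4278..5176,e6:5177..5335,
PROT: given
END
"""

# One element looks like "e1:1..60" (exon) or "i1:61..1908" (intron).
ELM_RE = re.compile(r"([ei])(\d+):(\d+)\.\.(\d+)")


def parse_info(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per entry; the local identifier is stored under "ID"."""
    entry = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A ">" line opens a new entry and carries the local identifier.
            entry = {"ID": line[1:].strip()}
        elif entry is None:
            # Ignore anything that appears outside an entry.
            continue
        elif line == "END":
            yield entry
            entry = None
        else:
            # Split on the first colon only: ELMS values contain colons too.
            key, sep, value = line.partition(":")
            if sep:
                entry[key.strip()] = value.strip()


def parse_elms(value: str) -> List[Tuple[str, int, int, int]]:
    """Return (kind, number, start, end) tuples; kind is "e" or "i"."""
    return [(k, int(n), int(s), int(e)) for k, n, s, e in ELM_RE.findall(value)]


if __name__ == "__main__":
    for entry in parse_info(SAMPLE.splitlines()):
        exons = [elm for elm in parse_elms(entry["ELMS"]) if elm[0] == "e"]
        exon_len = sum(end - start + 1 for _, _, start, end in exons)
        print(entry["ID"], len(exons), "exons,", exon_len, "exonic bases")
```

Keeping each entry as a plain dictionary means that unknown or missing fields
do not need special handling; for example, entry.get("PROT") == "given" selects
the genes whose protein sequence is present in the local files.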