*** WARNING ***
This analysis is PRELIMINARY, as will be discussed below, and we present this data as such. The data may be useful for initial examination of individual cases, as well as for providing insights into the problems and issues involved in the generation of this type of data. It may be possible to generate a data set of alternative protein isoforms by careful inspection, selection and validation of the data presented here.
*** WARNING ***
alt_mRNA_translations.flat (3.8 MB)
While the analysis of alternative splicing presented in the manuscript considered ONLY those cases where transcript data indicated multiple isoforms, here we examine the translations of observed isoforms that differ from annotation.
There are several inherent problems in constructing translations on the basis of gene-EST alignments. The first is that an EST sequence generally provides only partial coverage of the gene, so it is necessary to assume that the pattern of splicing outside the observed region is as described in the annotation. This assumption has two parts: first, that the annotation is itself correct; and second, that the observed alternative splicing is a single independent event, that is, it does not occur as one event in a pair. Our impression is that this second assumption usually holds, but it is reasonable to suppose that it will fail in at least a few cases. However, we have little choice but to adopt it if we wish to proceed.
The second problem is that of determining a translation. Alternative splicing may insert or delete some number of whole codons from the transcript, but may also insert or delete a number of nucleotides that is not a multiple of three, and hence introduce a frame shift into the transcript (if, of course, the alternative event occurs within the CDS rather than the UTRs). Such events change the location of the stop codon, and it is not at all clear when such events act to create protein isoforms, and when they produce nonsense - either as a regulatory event, or as a "splicing error". In some instances it may be that different isoforms utilize different start codons. There are many possibilities, and at this stage we choose to view the data in the most general way we can (as will become apparent shortly).
A further problem lies in the fact that many of the gene sequences that we are working with have been truncated at the boundaries of the annotated CDS.
The approach that we have taken is to overlay the partial exon structure determined by gene-EST alignment with the annotated exonic structure, and if these alignments agree at the boundaries, then to replace the annotated part with the observed part. Thus the first five fields given for each entry in the data flat file are:
  anot: exons in the annotated form
  alt: (inferred) exons in the alternative form
  over: regions of overlap
  ctx: the transcript alignment containing unannotated splice site(s)
  cds: the annotated CDS start and end points
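The overlay step described above can be sketched roughly as follows. This is a minimal illustration, not the code used to generate the file: the function names are hypothetical, the "~" prefix is assumed to mark an alignment end that is not a splice site, and the check that the alignments "agree at the boundaries" is left out.

```python
def parse_segments(field):
    """Parse a field such as '~1971..2010,2507..~2600' into (start, end) tuples.

    Assumption: a leading '~' marks an open alignment end and is dropped here.
    """
    segments = []
    for part in field.split(","):
        start, end = part.split("..")
        segments.append((int(start.lstrip("~")), int(end.lstrip("~"))))
    return segments


def overlay(anot, ctx):
    """Replace the annotated exons spanned by ctx with the observed structure."""
    # Annotated exons that contain the open ends of the transcript alignment.
    first = next(i for i, (s, e) in enumerate(anot) if s <= ctx[0][0] <= e)
    last = next(i for i, (s, e) in enumerate(anot) if s <= ctx[-1][1] <= e)
    observed = list(ctx)
    # Outside the observed region, fall back on the annotated exon boundaries.
    observed[0] = (anot[first][0], observed[0][1])
    observed[-1] = (observed[-1][0], anot[last][1])
    return anot[:first] + observed + anot[last + 1:]
```

Applied to the anot and ctx fields of the two example entries given later in this file, this sketch reproduces their alt fields. It raises StopIteration if an alignment end falls outside every annotated exon.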
Following these fields is the translation of the annotated protein, covering only the region of the mRNA that is annotated as CDS. Then FULL translations of the alternative mRNA are given in each of the three phases/frames. All three frames are shown because we are reluctant to assume that the starting methionine codon is necessarily the same between alternative isoforms. If that assumption is justifiable, one could simplify this data considerably by looking only at the alternative translation in the frame that includes the constitutive start codon.
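For readers who want to reproduce this kind of output, a three-frame translation can be sketched as below. This is an illustration under assumptions, not the original pipeline: it uses the standard genetic code, translates straight through stop codons (written "*", as in the entries here), assumes that "FRAME 1" corresponds to an offset of zero, and writes "x" for any codon it cannot resolve. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna, offset=0):
    """Translate from the given offset to the end, without stopping at '*'."""
    seq = mrna.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "x") for codon in codons).lower()


def three_frame_translations(mrna):
    """Return the translations for offsets 0, 1 and 2."""
    return [translate(mrna, offset) for offset in range(3)]
```

For example, `three_frame_translations("ATGGCCTAAG")` gives `["ma*", "wpk", "gl"]`; trailing nucleotides that do not complete a codon are dropped.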
It is usually easy to look at the three translations and pick the part of one of them that is (potentially) the alternative form. However, it is quite another thing to get the computer to work out, in general, what is interesting, and so we are stuck with looking at all three translations, at least for now.
Indicated in upper case are those regions of the translation that are derived from mRNA that is not shared between the annotated and alternative form, regardless of frame. This feature is a little rough in places, and is not 100% correct at boundary points (including, sometimes, shared intron boundaries).
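In the example entries shown in this file the marked regions appear as upper-case residues, so they can be pulled out of a translation with a short script. The following is a minimal sketch on that assumption; the function name is hypothetical, and the boundary inaccuracies mentioned above carry over to whatever it returns.

```python
import re


def marked_regions(translation):
    """Return (start, end, residues) for each upper-case run.

    Positions are 0-based and end-exclusive, counted after removing any
    whitespace introduced by line wrapping.
    """
    residues = "".join(translation.split())
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r"[A-Z]+", residues)]
```

For instance, `marked_regions("adqecsRsgrllwrsgcspglfpq*")` returns `[(6, 7, "R")]`.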
Below are two example entries, each followed by a comment.
Example one:
>IDB60184
anot: 1..58,418..684,1309..1578,1836..2114,2507..2604
alt: 1..58,418..684,1309..1578,1836..2010,2507..2604
over: 1..58,418..684,1309..1578,1836..2010,2507..2604
ctx: ~1971..2010,2507..~2600
cds: 1,2604
ANNOTATED PROTEIN:
mlllfllfeglccpgentaaaeeqlsfrmlqtssfanhswahsegsgwlgdlqthgwdtvlgtirflkpwshgnfskqel
knlqslfqlyfhsfiqivqasagqfqleypfeiqilagcrmnapqiflnmayqgsdflsfqgiswepspgagiraqnick
vlnryldikeilqsllghtcprflaglmeageselkrkvkpeawlscgpspgpgrlqlvchvsgfypkpvwvmwmrgeqe
qrgtqrgdvlpnadetWYLRATLDVAAGEAAGLSCRVKHSSLGGHDLIIHWggysiflilicltvivtlvilvvvdsrlk
kqr*
FRAME 1 TRANSLATION:
mlllfllfeglccpgentaaaeeqlsfrmlqtssfanhswahsegsgwlgdlqthgwdtvlgtirflkpwshgnfskqel
knlqslfqlyfhsfiqivqasagqfqleypfeiqilagcrmnapqiflnmayqgsdflsfqgiswepspgagiraqnick
vlnryldikeilqsllghtcprflaglmeageselkrkvkpeawlscgpspgpgrlqlvchvsgfypkpvwvmwmrgeqe
qrgtqrgdvlpnadetwwifhlshpdlfdcdsypghigcs*ltvkktev
FRAME 2 TRANSLATION:
ccscssssrvsavlgkiqqqqrsscpsacsklpplpttaghtvraqdgwvtcrlmagtlswapsaf*spgpmetsasrs*
ktyshcssytsivlsr*cklllvnfslntpsrsry*lave*mphkss*iwhikgqis*vskefpgshlqeqgsgprtsvk
csiat*ilrkyckaflvtpaldf*rgswkqgsqn*ngk*sqrpgcpvapvlalavcslcamsqdstqspcg*cgcgvsrs
sgalsegtsclmltrhggysiflilicltvivtlvilvvvdsrlkkqr*
FRAME 3 TRANSLATION:
aapvpplrgsllswgkysSsrgaavlphapnfllcqpqlgtq*glrmag*padswlghclghhplsealvpwklqqagae
kltvtvpvilp*fypdsasfcwsisa*Iplrdpdiswl*necptnllkygisrvrfpefprnflgaisrsrdpgpehl*s
aqslpry*gntakpswshlpsissgahgsrgvrteteSearglavlwpqswpwpsaacvpclrilpkarvgdvdag*aga
aghsargrpa*c*rdmvdipsfss*sv*l**lpwsywl*lthg*knrg
END
Comment: In this case exon four is truncated at the 3' end in the alternative form by 104 nucleotides (exon 4 in the annotated form is "1836..2114", while in the alternative form it is "1836..2010"). This corresponds to 34 and 2/3 codons, and thus a frame shift is introduced. These codons (roughly) are indicated by the use of upper case in the annotated translation. The translation of the alternative mRNA is given in each of the three possible frames, and it is apparent that the translations in frame 2 and frame 3 are nonsense. The frame 1 translation is the same as the annotated translation up to the point in exon four where the truncation takes place. From this point the alternative form is seen to end with "wwifhlshpdlfdcdsypghigcs*", rather than "WYLRATLDVAAGEAAGLSCRVKHSSLGGHDLIIHWggysiflilicltvivtlvilvvvdsrlkkqr*" as seen in the annotated form. This is an example of a frame-breaking alternative splicing event resulting in an alternative C-terminus.
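The point at which a frame translation departs from the annotated protein can also be located automatically. The sketch below is one simple way to do so, assuming both translations start at the same residue (as the annotated protein and frame 1 do in example one) and ignoring the upper/lower case marking; the function name is hypothetical.

```python
def divergence_point(annotated, alternative):
    """Return the number of leading residues shared by the two translations."""
    shared = 0
    for a, b in zip(annotated.lower(), alternative.lower()):
        if a != b:
            break
        shared += 1
    return shared
```

On short excerpts taken from example one, `divergence_point("adetWYLR", "adetwwif")` returns 5: the two forms share "adetw" and differ from the next residue onward.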
Example two:
>IDB60241
anot: 1..198,280..413,1497..1607,1925..2019
alt: 1..198,1497..1607,1925..2019
over: 1..198,1497..1607,1925..2019
ctx: ~79..198,1497..1607,1925..~2018
cds: 72,1948
ANNOTATED PROTEIN:
marslvclgviillsafsgpgvrggpmpkladrklcadqecsHPISMAVALQDYMAPDCRFLTIHRGQVVYVFSKLKGRG
RLFWGGSvqgdyygdlaarlgyfpssivredqtlkpgkvdvktdkwdfycq*
FRAME 1 TRANSLATION:
rergrggnwrpqhplahslahsprwpgpwcalvssscclpspdlvsgvvlcpswltgscvrtrsaavqgdyygdlaarlg
yfpssivredqtlkpgkvdvktdkwdfycq*aqptagpavsppwvyantispvq
FRAME 2 TRANSLATION:
greggeeigdpstplltllltvhdgpvpgvpwchhlavcllrtwcqgwsyaqag*peavcgpgvqPfreitmeiwllawa
ispvalsertrp*nlaksm*rqTngistaselslplalpfpllgfmqiqsaqck
FRAME 3 TRANSLATION:
geregrkletpappcslscsqstmarslvclgviillsafsgpgvrggpmpkladrklcadqecsRsgrllwrsgcspgl
fpq*hcprgpdpetwqsrcedrQmgfllpvssayrwpcrfpslglckynqpsan
END
Comment: In this case the alternative form skips exon two (280..413). This exon is 134 nucleotides in length (44 and 2/3 codons), and thus the alternative form contains a frame shift. The frame 3 translation of the alternative mRNA picks up the annotated form after translation of the 5' UTR. Downstream of the skipped exon, visible in this case by the presence of an uppercase 'R', the protein takes on an alternative translation, terminating in a stop codon after 17 amino acids.
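The length arithmetic in these comments can be checked directly from the anot and alt fields. The sketch below assumes inclusive coordinates, so that an exon "280..413" spans 413 - 280 + 1 = 134 nucleotides. It reports only the net change in transcript length and does not check whether the change falls within the CDS. The function names are hypothetical.

```python
def parse_exons(field):
    """Parse a field such as '1..198,280..413' into (start, end) tuples."""
    return [tuple(int(x) for x in part.split("..")) for part in field.split(",")]


def total_length(exons):
    """Sum of exon lengths, treating both coordinates as inclusive."""
    return sum(end - start + 1 for start, end in exons)


def frame_effect(anot_field, alt_field):
    """Return (nucleotides removed, whole codons, leftover nucleotides)."""
    removed = total_length(parse_exons(anot_field)) - total_length(parse_exons(alt_field))
    whole_codons, remainder = divmod(abs(removed), 3)
    return removed, whole_codons, remainder
```

For example two this gives `(134, 44, 2)` and for example one `(104, 34, 2)`. In both cases the non-zero remainder indicates a frame shift.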
fc - 14/9/2001