Translations of alternative isoforms

*** WARNING ***

This analysis is PRELIMINARY, as will be discussed below, and we present this data as such. The data may be useful for initial examination of individual cases, as well as for providing insights into the problems and issues involved in the generation of this type of data. It may be possible to generate a data set of alternative protein isoforms by careful inspection, selection and validation of the data presented here.

*** WARNING ***

 

alt_mRNA_translations.flat         (3.8 Mb)

While the analysis of alternative splicing presented in the manuscript considered ONLY those cases where transcript data indicated multiple isoforms, here we examine the translations of observed isoforms that differ from annotation.

There are several inherent problems in constructing translations on the basis of gene-EST alignments. The first is that the EST sequence generally provides only partial coverage of the gene, and consequently it is necessary to assume that the pattern of splicing outside of the observed region is as described in the annotation. This assumption has two parts: first, that the annotation is itself correct, and second, that the observed alternative splicing is a single independent event - that is, it is not one of a pair of linked events. While we have formed an impression that this second assumption usually holds, it is reasonable to suppose that it will fail in at least a few cases. However, we have little choice but to adopt this assumption if we wish to proceed.

The second problem is that of determining a translation. Alternative splicing may insert or delete some number of whole codons from the transcript, but may also insert or delete a number of nucleotides that is not a multiple of three, and hence introduce a frame shift into the transcript (if, of course, the alternative event occurs within the CDS rather than the UTRs). Such events change the location of the stop codon, and it is not at all clear when such events act to create protein isoforms, and when they produce nonsense - either as a regulatory event, or as a "splicing error". In some instances it may be that different isoforms utilize different start codons. There are many possibilities, and at this stage we choose to view the data in the most general way we can (as will become apparent shortly).
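The distinction drawn above (whole codons versus a frame shift, CDS versus UTR) comes down to simple arithmetic, sketched below. This is an illustration only, not the code used to generate the data: the function name and its arguments are hypothetical, coordinates are assumed to be 1-based and inclusive, and an event that straddles a CDS boundary is treated as lying within the CDS.

```python
def classify_event(length_change, event_start, event_end, cds_start, cds_end):
    """Classify a splicing difference as 'utr', 'in-frame' or 'frame-shift'.

    length_change is the number of nucleotides inserted or deleted;
    event_start..event_end is the affected region on the gene sequence.
    """
    # An event entirely outside the annotated CDS cannot change the reading frame.
    if event_end < cds_start or event_start > cds_end:
        return "utr"
    # Inside the CDS, only a multiple of three preserves the frame.
    return "in-frame" if abs(length_change) % 3 == 0 else "frame-shift"
```

Note that a "frame-shift" result says nothing about whether the resulting transcript encodes a genuine protein isoform or nonsense; that question remains open, as discussed above.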

A further problem lies in the fact that many of the gene sequences that we are working with have been truncated at the boundaries of the annotated CDS.

The approach that we have taken is to overlay the partial exon structure determined by gene-EST alignment with the annotated exonic structure, and if these alignments agree at the boundaries, then to replace the annotated part with the observed part. Thus the first five fields given for each entry in the data flat file are:


anot:   exons in the annotated form
alt:    (inferred) exons in the alternative form
over:   regions of overlap
ctx:    the transcript alignment containing unannotated splice site(s)
cds:    the annotated CDS start and end points
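As a rough illustration of the overlay step, the sketch below parses the coordinate fields and rebuilds the "alt" exon list from the "anot" and "ctx" fields. It is one reading of the description above, not the original implementation. It assumes that coordinates are 1-based and inclusive, that a leading "~" marks an approximate alignment end, and that the boundaries "agree" when both ends of the alignment fall inside annotated exons. The function names are hypothetical.

```python
def parse_segments(field):
    """Parse '1..58,418..684' into [(1, 58), (418, 684)]; '~' marks are dropped."""
    segments = []
    for part in field.split(","):
        start, end = part.replace("~", "").split("..")
        segments.append((int(start), int(end)))
    return segments


def overlay(annotated, observed):
    """Replace the annotated exons spanned by the observed alignment.

    Returns None if the alignment contains no splice junction, or if either
    of its ends falls outside every annotated exon.
    """
    if len(observed) < 2:
        return None
    obs_start, obs_end = observed[0][0], observed[-1][1]
    first = next((i for i, (s, e) in enumerate(annotated) if s <= obs_start <= e), None)
    last = next((i for i, (s, e) in enumerate(annotated) if s <= obs_end <= e), None)
    if first is None or last is None:
        return None
    # Keep the annotated outer edges; take every splice junction from the alignment.
    spliced = list(observed)
    spliced[0] = (annotated[first][0], observed[0][1])
    spliced[-1] = (observed[-1][0], annotated[last][1])
    return annotated[:first] + spliced + annotated[last + 1:]
```

Applied to the "anot" and "ctx" fields of the two example entries given later in this file, this sketch reproduces their "alt" fields.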

Following this is given the translation for the annotated protein, including only that region of the mRNA that is annotated as CDS. Then FULL translations of the alternative mRNA, in each of the three phases/frames, are given. The reason for showing all three frames is a reluctance to make the assumption that the starting methionine codon is necessarily going to be the same between alternative isoforms. It may be that such an assumption is justifiable, and if so, one could simplify this data considerably by only looking at the alternative translation in the frame that includes the constitutive start codon.
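A minimal sketch of a full three-frame translation is given below. It assumes the standard genetic code, a lower-case DNA alphabet, and 1-based inclusive exon coordinates on the gene sequence; the function names are hypothetical and this is not the code that produced the flat file. As in the data, translation does not halt at a stop codon: stops are written as "*" and translation continues to the end of the mRNA.

```python
from itertools import product

BASES = "tcag"
# Standard genetic code, codons ordered ttt, ttc, tta, ttg, tct, ... ggg.
AMINO = "ffllssssyy**cc*wllllpppphhqqrrrriiimttttnnkkssrrvvvvaaaaddeegggg"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def splice(gene, exons):
    """Concatenate the exon regions (1-based, inclusive) of a gene sequence."""
    return "".join(gene[start - 1:end] for start, end in exons)


def translate(mrna, offset=0):
    """Translate every complete codon from offset onward; unknown codons give 'x'."""
    mrna = mrna.lower()
    return "".join(
        CODON_TABLE.get(mrna[i:i + 3], "x")
        for i in range(offset, len(mrna) - 2, 3)
    )


def three_frames(mrna):
    """Return the frame 1, frame 2 and frame 3 translations, in that order."""
    return [translate(mrna, offset) for offset in range(3)]
```

If one were willing to assume a shared start codon, only the frame containing the annotated start would need to be kept.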

It is usually easy to look at the three translations and pick the part of one of them that is (potentially) the alternative form. However, it is quite another thing to get the computer to work out, in general, what is interesting, and so we are stuck with looking at all three translations, at least for now.

Indicated in upper case are those regions of the translation that are derived from mRNA that is not shared between the annotated and alternative form, regardless of frame. This feature is a little rough in places, and is not 100% correct at boundary points (including, sometimes, shared intron boundaries).

Below are two example entries followed by comment.

Example one:

>IDB60184
anot:   1..58,418..684,1309..1578,1836..2114,2507..2604
alt:    1..58,418..684,1309..1578,1836..2010,2507..2604
over:   1..58,418..684,1309..1578,1836..2010,2507..2604
ctx:    ~1971..2010,2507..~2600
cds:    1,2604
ANNOTATED PROTEIN:
mlllfllfeglccpgentaaaeeqlsfrmlqtssfanhswahsegsgwlgdlqthgwdtvlgtirflkpwshgnfskqel
knlqslfqlyfhsfiqivqasagqfqleypfeiqilagcrmnapqiflnmayqgsdflsfqgiswepspgagiraqnick
vlnryldikeilqsllghtcprflaglmeageselkrkvkpeawlscgpspgpgrlqlvchvsgfypkpvwvmwmrgeqe
qrgtqrgdvlpnadetWYLRATLDVAAGEAAGLSCRVKHSSLGGHDLIIHWggysiflilicltvivtlvilvvvdsrlk
kqr*

FRAME 1 TRANSLATION:
mlllfllfeglccpgentaaaeeqlsfrmlqtssfanhswahsegsgwlgdlqthgwdtvlgtirflkpwshgnfskqel
knlqslfqlyfhsfiqivqasagqfqleypfeiqilagcrmnapqiflnmayqgsdflsfqgiswepspgagiraqnick
vlnryldikeilqsllghtcprflaglmeageselkrkvkpeawlscgpspgpgrlqlvchvsgfypkpvwvmwmrgeqe
qrgtqrgdvlpnadetwwifhlshpdlfdcdsypghigcs*ltvkktev

FRAME 2 TRANSLATION:
ccscssssrvsavlgkiqqqqrsscpsacsklpplpttaghtvraqdgwvtcrlmagtlswapsaf*spgpmetsasrs*
ktyshcssytsivlsr*cklllvnfslntpsrsry*lave*mphkss*iwhikgqis*vskefpgshlqeqgsgprtsvk
csiat*ilrkyckaflvtpaldf*rgswkqgsqn*ngk*sqrpgcpvapvlalavcslcamsqdstqspcg*cgcgvsrs
sgalsegtsclmltrhggysiflilicltvivtlvilvvvdsrlkkqr*

FRAME 3 TRANSLATION:
aapvpplrgsllswgkysSsrgaavlphapnfllcqpqlgtq*glrmag*padswlghclghhplsealvpwklqqagae
kltvtvpvilp*fypdsasfcwsisa*Iplrdpdiswl*necptnllkygisrvrfpefprnflgaisrsrdpgpehl*s
aqslpry*gntakpswshlpsissgahgsrgvrteteSearglavlwpqswpwpsaacvpclrilpkarvgdvdag*aga
aghsargrpa*c*rdmvdipsfss*sv*l**lpwsywl*lthg*knrg

END

Comment: In this case exon four is truncated at the 3' end in the alternative form by 104 nucleotides (exon 4 in the normal form is "1836..2114", while in the alternative form it is "1836..2010"). This corresponds to 34 and 2/3 codons, and thus a frame shift is introduced. These codons (roughly) are indicated by the use of upper case in the annotated translation. The translation of the alternative mRNA is given in each of the three possible frames, and it is apparent that the translations in frame 2 and frame 3 are nonsense. The frame 1 translation is the same as the annotated translation up to the point in exon four where the truncation takes place. From this point the alternative form is seen to end with "wwifhlshpdlfdcdsypghigcs*", rather than "WYLRATLDVAAGEAAGLSCRVKHSSLGGHDLIIHWggysiflilicltvivtlvilvvvdsrlkkqr*" seen in the normal form. This is an example of a frame breaking alternative splicing event resulting in an alternative C-terminus.
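The comparison made in this comment, finding where an alternative translation departs from the annotated one and reading on to the first stop, can be sketched as follows. The function names are hypothetical, the comparison ignores case (since case is used only as a marker), and the sketch is offered as an aid to inspection rather than as a general method for picking out the interesting frame.

```python
def shared_prefix_length(annotated, alternative):
    """Number of leading residues the two translations have in common."""
    count = 0
    for a, b in zip(annotated.lower(), alternative.lower()):
        if a != b:
            break
        count += 1
    return count


def alternative_tail(annotated, alternative):
    """Residues of the alternative form from the point of divergence
    up to and including its first stop codon ('*'), if there is one."""
    tail = alternative[shared_prefix_length(annotated, alternative):]
    stop = tail.find("*")
    return tail if stop == -1 else tail[:stop + 1]
```

Run on the annotated protein and the frame 1 translation above (with line breaks removed), the tail returned begins at the first residue that differs, so it starts one residue later than the "wwifhl..." string quoted in the comment, which includes the last shared "w".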

Example two:

>IDB60241
anot:   1..198,280..413,1497..1607,1925..2019
alt:    1..198,1497..1607,1925..2019
over:   1..198,1497..1607,1925..2019
ctx:    ~79..198,1497..1607,1925..~2018
cds:    72,1948
ANNOTATED PROTEIN:
marslvclgviillsafsgpgvrggpmpkladrklcadqecsHPISMAVALQDYMAPDCRFLTIHRGQVVYVFSKLKGRG
RLFWGGSvqgdyygdlaarlgyfpssivredqtlkpgkvdvktdkwdfycq*

FRAME 1 TRANSLATION:
rergrggnwrpqhplahslahsprwpgpwcalvssscclpspdlvsgvvlcpswltgscvrtrsaavqgdyygdlaarlg
yfpssivredqtlkpgkvdvktdkwdfycq*aqptagpavsppwvyantispvq

FRAME 2 TRANSLATION:
greggeeigdpstplltllltvhdgpvpgvpwchhlavcllrtwcqgwsyaqag*peavcgpgvqPfreitmeiwllawa
ispvalsertrp*nlaksm*rqTngistaselslplalpfpllgfmqiqsaqck

FRAME 3 TRANSLATION:
geregrkletpappcslscsqstmarslvclgviillsafsgpgvrggpmpkladrklcadqecsRsgrllwrsgcspgl
fpq*hcprgpdpetwqsrcedrQmgfllpvssayrwpcrfpslglckynqpsan

END

Comment: In this case the alternative form skips exon two (280..413). This exon is 134 nucleotides in length (44 and 2/3 codons), and thus the alternative form contains a frame shift. The frame 3 translation of the alternative mRNA picks up the annotated form after translation of the 5' UTR. Downstream of the skipped exon, visible in this case by the presence of an uppercase 'R', the protein takes on an alternative translation, terminating in a stop codon after 17 amino acids.
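Which of the three frames contains the annotated start codon can be worked out from the "alt" and "cds" fields, as in the sketch below. It assumes 1-based inclusive coordinates and that the first "cds" value is the position of the first nucleotide of the start codon; the function names are hypothetical.

```python
def mrna_position(genomic_pos, exons):
    """Map a position on the gene sequence to a 1-based position on the spliced mRNA."""
    offset = 0
    for start, end in exons:
        if start <= genomic_pos <= end:
            return offset + genomic_pos - start + 1
        offset += end - start + 1  # inclusive exon length
    raise ValueError("position is not inside any exon")


def frame_of(genomic_pos, exons):
    """Frame (1, 2 or 3) whose codons begin at the given position."""
    return (mrna_position(genomic_pos, exons) - 1) % 3 + 1


alt_exons = [(1, 198), (1497, 1607), (1925, 2019)]
start_frame = frame_of(72, alt_exons)      # 71 nt of 5' UTR precede the start codon
skipped_length = 413 - 280 + 1             # counting both end coordinates
```

Here start_frame is 3, in agreement with the comment above, and skipped_length leaves a remainder of 2 when divided by three, which is the source of the frame shift.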


fc - 14/9/2001