MATCHING THE MOUSE EXON CONSTRUCTS WITH THE MOUSE GENOME SEQUENCE.

To summarize the analysis so far: for every human splice junction considered, an exon construct was made from the flanking exon regions. Mouse transcript sequences that make acceptable matches with these human exon constructs were identified. False positives, in which the mouse transcript sequence arose from a duplicate gene rather than the intended gene, were removed. This analysis identified 1198 human exon constructs, each of which has an acceptable match with one or more mouse transcript sequences.

Although the above analysis indicates that the matching mouse exon construct is expressed in mouse, it remains necessary to ascertain that the splice junction actually exists in mouse genes. For every human splice junction that made an acceptable match with mouse transcript sequences, the matching mouse exon construct was therefore retrieved and examined for its ability to make a gapped alignment with the mouse genome sequence. For this genome-wide search, we used the SSAHA (Sequence Search and Alignment by Hashing Algorithm) software provided with the mouse Ensembl resources (Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: a fast search method for large DNA databases. Genome Res. 11, 1725-1729 (2001); http://www.ensembl.org/Mus_musculus/ssahaview). The mouse genome draft sequence is release MGSC v3 (Jan 2003), with 2740 contigs, a sequence quality of 7X coverage, and an estimated coverage of 96% of the total DNA. SSAHA is a software tool for very fast matching and alignment of DNA sequences that identifies exact or 'almost exact' matches; it achieves its speed by converting the sequence information into a 'hash table' data structure, which can then be searched very rapidly for matches.
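As an illustration only, the hash-table idea behind this kind of search can be sketched in a few lines of Python. This is a simplified sketch and not the SSAHA implementation: the function names, the toy sequences and the word length k = 4 are assumptions chosen for the example. The subject sequence is cut into consecutive non-overlapping k-mers whose positions are stored in a hash table, and every overlapping k-mer of the query is then looked up in that table.

```python
from collections import defaultdict

def build_kmer_index(subject, k):
    """Map each non-overlapping k-mer of the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(subject) - k + 1, k):
        index[subject[pos:pos + k]].append(pos)
    return index

def find_hits(query, index, k):
    """Return (query_offset, subject_position) pairs for every shared k-mer."""
    hits = []
    for q in range(len(query) - k + 1):
        # .get avoids adding empty entries to the defaultdict during lookup
        for s in index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits

# Toy data (hypothetical): the query is an exact substring of the subject.
subject = "ACGTACGGTTCAGGCTTAACCGGATT"
index = build_kmer_index(subject, k=4)
hits = find_hits("GGTTCAGGCT", index, k=4)
```

Hits that share the same difference between subject position and query offset lie on one diagonal and can be merged into a single ungapped match; a gapped alignment of an exon construct would appear as two such groups separated by the intron.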

The SSAHA matches were checked against the following criteria: (1) the 5' and 3' exon regions of the construct match the same chromosome region with a gap between them; (2) the orientation of the matches (forward-forward; reverse-forward) is the same for the 5' and 3' exon regions of a construct and for all the constructs from the same human gene; and (3) all the splice junctions from a human gene map to the same chromosome region in the same positional order as in the human gene. In general, the length of the gap between the matches of the 5' and 3' exon regions was of a comparable order to the length of the human intron (this could be shown with a scatter plot). Of the 1198 human splice junctions checked, the SSAHA matches showed a gapped alignment in 1134 cases. For 25 of the remaining 64 cases, regular BLAST showed matches of the above types (admittedly, in 11 of these 25 cases the average percent identity is less than 96%, but the related splice junctions mapped to the same chromosome regions with positional compatibility).
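The three criteria above can be expressed as a small check. The following is a minimal sketch under stated assumptions, not the procedure actually used: the `Hit` record, the function names and the coordinates are hypothetical, "same chromosome region" is approximated by "same chromosome", and orientation is represented as a strand sign.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hit:
    """Hypothetical record for one exon-region match on the genome."""
    chrom: str
    start: int   # 1-based inclusive genome coordinate
    end: int
    strand: str  # "+" or "-"

def gap_length(five, three):
    """Gap between the 5' and 3' exon hits, or None if there is no gapped match."""
    if five.chrom != three.chrom or five.strand != three.strand:
        return None
    if five.strand == "+":
        gap = three.start - five.end - 1
    else:
        # On the reverse strand the 5' exon lies at the higher coordinates.
        gap = five.start - three.end - 1
    return gap if gap > 0 else None

def gene_is_consistent(junctions):
    """junctions: (five, three) hit pairs listed in human gene order."""
    if not junctions or any(gap_length(f, t) is None for f, t in junctions):
        return False
    first = junctions[0][0]
    if any(f.chrom != first.chrom or f.strand != first.strand
           for f, _ in junctions):
        return False
    starts = [f.start for f, _ in junctions]
    # Positional order must follow the gene order along the matched strand.
    return starts == sorted(starts, reverse=(first.strand == "-"))

# Toy coordinates (hypothetical) for two junctions of one gene.
j1 = (Hit("chr5", 1000, 1049, "+"), Hit("chr5", 3050, 3099, "+"))
j2 = (Hit("chr5", 3150, 3199, "+"), Hit("chr5", 4700, 4749, "+"))
ok = gene_is_consistent([j1, j2])
```

The value returned by `gap_length` is also the quantity that would be compared with the human intron length for each junction.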

The remaining 39 human splice junctions (35 constitutive and 4 alternative), for which the mouse constructs failed to show a gapped alignment with the mouse draft genome sequence, were carefully examined. In one case, the splice junction mapped with a gapped alignment to the mouse sequence, but in a region different from that of the other related splice junctions; in 25 cases (21 constitutive and 4 alternative), either the 5p or the 3p exonic region of the splice junction, but not both, mapped to the mouse sequence; in 6 cases, the mouse construct matched the mouse sequence but without a gap; and in the remaining 7 cases, there was no match or no significant match. The possible reasons for these anomalies are listed below:

(i) Only the 5p or the 3p exonic region matches the mouse draft sequence (25 entries): Since both the 5p and 3p exonic regions are observed in a mouse transcript sequence, the absence of a gapped alignment, with only one of the two exonic regions matching the mouse draft genome sequence, can be due to omissions in the draft sequence (note that the current version covers only 96% of the total DNA) and/or to low sequence quality in either the EST sequences or the genome sequence (note that the current version has only 7X coverage), leading to insignificant matches that were not picked up in our analysis. (For the record: 3 pairs of these splice junctions are involved in 3 conserved alternative events.) These cases cannot be explained by intron or exon loss, or by variation in the exon between human and mouse.

(ii) Ungapped alignment of the mouse construct with the mouse genome sequence (6 entries): All 6 of these cases may result from intron loss between human and mouse.

(iii) Ambiguous cases (7 entries): These cases either made no significant hits or were difficult to resolve unambiguously. This is most probably due to sequence quality issues.

Thus, for 1159 (96.7%) of the 1198 splice junctions there is evidence that the human intron also occurs in mouse genes, whereas intron loss is indicated in only 6 cases. The human splice junctions whose exon constructs matched mouse transcript sequences are therefore, for the most part, also present as splice junctions in mouse. Since the cases of intron loss together with the other unresolved cases constitute only 3.3% of the data set, we decided not to remove them from the data set for further analysis.
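The counts reported in this section can be cross-checked with a short tally. The numbers are those given in the text; the variable names and category labels are only illustrative.

```python
total = 1198          # human splice junctions checked
ssaha_gapped = 1134   # gapped alignment found with SSAHA
blast_gapped = 25     # additional gapped alignments found with BLAST

# Breakdown of the junctions without a confirmed gapped alignment.
unresolved = {
    "gapped, but in a different region": 1,
    "only the 5p or the 3p region matched": 25,
    "ungapped match (possible intron loss)": 6,
    "no match or no significant match": 7,
}

confirmed = ssaha_gapped + blast_gapped
remaining = total - confirmed

pct_confirmed = round(100 * confirmed / total, 1)
pct_remaining = round(100 * remaining / total, 1)

print(f"confirmed: {confirmed} ({pct_confirmed}%)")
print(f"remaining: {remaining} ({pct_remaining}%)")
```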