- Aligning cDNA/ESTs to genomic sequences
- Mapping poly(A) sites
- Clustering poly(A) cleavage sites
- Gene types
- Signal mapping
- Poly(A) site ID format
- Poly(A) site position
- Poly(A) site supporting cDNA/EST
Aligning cDNA/ESTs to genomic sequences:
- We used all sequences listed in the human and mouse UniGene databases (NCBI, March 2004 versions) that are associated with LocusLink IDs, and aligned them to the genome sequences (human genome Build 34.2 and mouse genome Build 32, both from NCBI). RefSeq mRNA and cDNA sequences (NCBI GenBank, March 2004 release) and ESTs (NCBI dbEST, March 2004 release) were aligned to the genomes using BLAST and MegaBLAST. We require 95% identity at the MegaBLAST stage to locate a gene and 90% identity at the BLAST stage to fill the gaps. The transcriptional orientation of a sequence on the genome was determined first by its splice sites, e.g., GT.AG (5' and 3' splice sites, respectively) indicates a sense orientation whereas CT.AC indicates an antisense orientation, and/or by its poly(A) tail, since polyadenylation occurs only at the 3' end. If the orientation indicated by splicing conflicts with that indicated by the poly(A) tail, the sequence is discarded. If neither piece of information can be obtained from a sequence, which necessarily means that the sequence has no poly(A) tail and is therefore not of interest in this study, the sequence is also discarded.
- In this study, genes are represented by LocusLink entries, which were obtained from NCBI. RefSeq mRNA sequences were used to represent the transcripts of genes. If a gene has more than one RefSeq sequence, their corresponding genomic regions are required to overlap, and their transcriptional orientations are required to be the same. Genes whose RefSeq sequences do not meet these two criteria are discarded. Thus each gene's orientation and genomic location can be unequivocally determined from its RefSeq(s). cDNA/ESTs that are associated with a gene, as listed in the UniGene database, are required to meet the following criteria: (1) The sequence's transcriptional orientation must agree with that of its associated gene. (2) The genomic regions aligned with the cDNA/EST must overlap with those of the RefSeq sequences of its corresponding gene by either a 32 nt sequence or an entire exon (either the cDNA/EST's or the RefSeq's). These criteria eliminate sequences incorrectly associated with LocusLink IDs in the UniGene database, and also discard sequences that reside in the intron regions of other genes. The start and stop codons of a gene are located using the RefSeq GenBank annotation file.
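The orientation rule and the overlap criterion described above can be sketched as follows. This is a minimal illustration, not the pipeline that was actually run: the function names, the half-open `(start, end)` exon intervals, and the choice to treat introns that disagree with each other as uninformative are all assumptions of the sketch.

```python
from typing import Iterable, List, Optional, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the contig; an assumption of this sketch


def orientation_from_introns(introns: Iterable[Tuple[str, str]]) -> Optional[str]:
    """Infer orientation from intron boundary dinucleotides read on the genome strand.

    Each item is (first two intron bases, last two intron bases).
    Returns "sense", "antisense", or None if uninformative or contradictory.
    """
    votes = set()
    for first, last in introns:
        pair = (first.upper(), last.upper())
        if pair == ("GT", "AG"):
            votes.add("sense")
        elif pair == ("CT", "AC"):
            votes.add("antisense")
    return votes.pop() if len(votes) == 1 else None


def resolve_orientation(splice: Optional[str], tail: Optional[str]) -> Optional[str]:
    """Combine splice-site and poly(A)-tail evidence; None means 'discard'."""
    if splice is not None and tail is not None and splice != tail:
        return None  # the two lines of evidence conflict
    return splice if splice is not None else tail  # None when neither exists


def overlaps_gene(est_exons: List[Interval], refseq_exons: List[Interval],
                  min_overlap: int = 32) -> bool:
    """True if some aligned block overlaps a RefSeq exon by >= 32 nt or by a whole exon."""
    for es, ee in est_exons:
        for rs, re_ in refseq_exons:
            overlap = min(ee, re_) - max(es, rs)
            if overlap <= 0:
                continue
            # A whole exon is covered when the overlap equals that exon's length.
            if overlap >= min_overlap or overlap == ee - es or overlap == re_ - rs:
                return True
    return False
```

A sequence would be kept only if `resolve_orientation` returns a value that matches the gene's orientation and `overlaps_gene` is true.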
Mapping poly(A) sites:
- cDNA/EST sequences aligned to genomic sequences were examined for poly(A) tails after the alignment. The unaligned sequences at the 5' and 3' termini of each cDNA/EST were checked for stretches of Ts and As, respectively. For the 3' end, a sequence is considered to have a poly(A) tail if, starting from the first unaligned position, (1) it contains 8 or more consecutive As, or (2) it has one other nucleotide followed by 8 or more consecutive As. The criteria are the same for the 5' end, except that consecutive Ts are searched for.
- For sequences that contain poly(A) tails, the poly(A) cleavage site on the genome is taken to be immediately after the last aligned position of the cDNA/EST on the genome. To address the internal priming issue, the genomic sequence from -10 to +10 nt surrounding the cleavage site was examined. If this sequence has six or more consecutive As, or more than 7 As in a 10 nt window, the site is considered an internal priming candidate, similar to the criteria used by other groups (Beaudoing et al. 2000; Kan et al. 2000). However, we found that this criterion is sometimes too stringent. Thus, if a poly(A) cleavage site is supported by more than one cDNA/EST and has one of the 12 PAS hexamers in the -40 to -1 nt region (Beaudoing et al. 2000), it is considered a real site. We manually checked several poly(A) cleavage sites and found that this method achieved better selectivity than other methods for detecting internal priming.
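The tail test and the internal priming filter above can be sketched as follows. The sketch makes several assumptions: the A run must begin at the first or second unaligned base, the 5' unaligned sequence is given in read order (so the T run sits at its end, next to the alignment), and `flank` is the 20 nt genomic string covering -10 to +10. All function names are hypothetical.

```python
import re

# At most one non-A base, then 8 or more As, starting at the first unaligned base.
_TAIL_3P = re.compile(r"^[ACGTN]?A{8,}")
# Mirror image for the 5' end: 8 or more Ts, then at most one other base, ending at the alignment.
_HEAD_5P = re.compile(r"T{8,}[ACGTN]?$")


def has_poly_a_tail(unaligned_3p: str) -> bool:
    """Check the unaligned 3' sequence for a poly(A) tail."""
    return _TAIL_3P.search(unaligned_3p.upper()) is not None


def has_poly_t_head(unaligned_5p: str) -> bool:
    """Check the unaligned 5' sequence for a poly(T) stretch."""
    return _HEAD_5P.search(unaligned_5p.upper()) is not None


def is_internal_priming_candidate(flank: str, window: int = 10) -> bool:
    """Six consecutive As, or more than 7 As in any 10 nt window of the flank."""
    flank = flank.upper()
    if "AAAAAA" in flank:
        return True
    last_start = max(len(flank) - window, 0)
    return any(flank[i:i + window].count("A") > 7 for i in range(last_start + 1))


def accept_cleavage_site(flank: str, n_supporting: int, has_pas_hexamer: bool) -> bool:
    """Keep a site unless it looks internally primed and cannot be rescued."""
    if not is_internal_priming_candidate(flank):
        return True
    # Rescue rule from the text: more than one supporting cDNA/EST and a PAS hexamer.
    return n_supporting > 1 and has_pas_hexamer
```

For instance, `has_poly_a_tail("GAAAAAAAAA")` passes through the one-mismatch branch, while a flank such as `"CCCCAAAAAACCCCCCCCCC"` is flagged by the six-consecutive-A rule.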
Clustering poly(A) cleavage sites:
- Because cleavage is imprecise (Pauws et al. NAR 2001; Hajarnavis et al. NAR 2004), we iteratively clustered poly(A) cleavage sites that are located within 24 nt of each other, and used the first cleavage site of each cluster to represent the cleavage site for a poly(A) signal. The number of cDNA/ESTs associated with a poly(A) signal is the sum of all cDNA/ESTs supporting its constituent cleavage sites. After clustering, a poly(A) signal is considered genuine if it has at least two supporting cDNA/EST sequences or has a PAS hexamer (AAUAAA or one of its 11 variants) in the -40 to -1 nt region.
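One way to read the iterative clustering is as chaining: sites are sorted, and a site joins the current cluster when it lies within 24 nt of the previous site. The sketch below follows that reading, which is an assumption, as are the function names and the convention that positions increase from 5' to 3' along the transcript.

```python
from typing import Dict, List, Tuple


def cluster_cleavage_sites(support: Dict[int, int], max_gap: int = 24) -> List[Tuple[int, int]]:
    """Cluster cleavage sites and sum their supporting cDNA/EST counts.

    support maps a cleavage position to its number of supporting cDNA/ESTs.
    Returns (representative position, total support) per cluster, where the
    representative is the first site of the cluster.
    """
    clusters: List[Tuple[int, int]] = []
    previous = None
    for position in sorted(support):
        if previous is not None and position - previous <= max_gap:
            representative, total = clusters[-1]
            clusters[-1] = (representative, total + support[position])
        else:
            clusters.append((position, support[position]))
        previous = position
    return clusters


def is_genuine_signal(total_support: int, has_pas_hexamer: bool) -> bool:
    """Post-clustering rule: at least two supporting sequences, or a PAS hexamer."""
    return total_support >= 2 or has_pas_hexamer
```

With `{100: 1, 110: 2, 130: 1, 200: 3}`, the sites at 100, 110 and 130 chain into one cluster represented by position 100 with total support 4, and the site at 200 forms its own cluster.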
Gene types:
Signal mapping:
- AAUAAA and its 11 single-base variants (Beaudoing et al. Genome Res 2000) are mapped in the -40 to -1 nt region relative to each poly(A) site.
| Signal | Legend |
| AAUAAA |  |
| AUUAAA |  |
| AGUAAA |  |
| UAUAAA |  |
| CAUAAA |  |
| GAUAAA |  |
| AAUAUA |  |
| AAUACA |  |
| AAUAGA |  |
| ACUAAA |  |
| AAGAAA |  |
| AAUGAA |  |
- Each reported position is that of the 3'-most nucleotide of the hexamer, measured relative to the poly(A) site, which is set to 0.
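The hexamer scan and its position convention can be illustrated with a short sketch. It assumes that `upstream` is the genomic sequence ending immediately before the poly(A) site, so that its last base is position -1, and that DNA input uses T where the table uses U. The function name is hypothetical.

```python
from typing import List, Tuple

# The 12 hexamers from the table above, in RNA notation.
PAS_HEXAMERS = frozenset([
    "AAUAAA", "AUUAAA", "AGUAAA", "UAUAAA", "CAUAAA", "GAUAAA",
    "AAUAUA", "AAUACA", "AAUAGA", "ACUAAA", "AAGAAA", "AAUGAA",
])


def map_pas_hexamers(upstream: str) -> List[Tuple[str, int]]:
    """Return (hexamer, position) pairs found in the upstream region.

    The position is that of the hexamer's 3'-most nucleotide relative to the
    poly(A) site at 0, so the last base of `upstream` is -1.
    """
    rna = upstream.upper().replace("T", "U")
    length = len(rna)
    hits = []
    for start in range(length - 5):
        hexamer = rna[start:start + 6]
        if hexamer in PAS_HEXAMERS:
            hits.append((hexamer, start + 5 - length))
    return hits
```

For a 40 nt region built as 20 Cs, `AATAAA`, then 14 Cs, the scan reports `("AAUAAA", -15)`: the final A of the hexamer sits 15 nt upstream of the poly(A) site.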
Poly(A) site ID format:
- The poly(A) site ID has the format p.###.*, where ### is the LocusLink ID of the corresponding gene and * is an order number (the poly(A) sites of a gene are numbered from 5' to 3' along the transcript).
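As an illustration of the ID format, the sketch below splits an ID into its two parts. It assumes that both the LocusLink ID and the order number are plain decimal integers; the ID `p.1234.2` is a made-up example, not an entry from the database.

```python
import re
from typing import Tuple

_SITE_ID = re.compile(r"p\.(\d+)\.(\d+)")


def parse_site_id(site_id: str) -> Tuple[int, int]:
    """Split 'p.<LocusLink ID>.<order number>' into (locus_id, order)."""
    match = _SITE_ID.fullmatch(site_id)
    if match is None:
        raise ValueError(f"not a poly(A) site ID: {site_id!r}")
    return int(match.group(1)), int(match.group(2))


# Sorting by the parsed tuple groups sites by gene, then 5' to 3' within a gene.
ids = ["p.1234.2", "p.99.1", "p.1234.1"]
ordered = sorted(ids, key=parse_site_id)
```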
Poly(A) site position:
- The poly(A) site position is its coordinate on the contig on which the gene is located.
Poly(A) site supporting cDNA/EST:
- Supporting cDNA/ESTs are cDNA/ESTs with poly(A)/poly(T) tails supporting the usage of one poly(A) site.