Align cDNA/ESTs to genomic sequences:
- We used all sequences listed in the human, mouse, rat, chicken, and zebrafish UniGene databases that are associated with NCBI Gene IDs, and aligned them to genome sequences obtained from the UCSC Genome Bioinformatics Site. cDNAs (including RefSeq sequences) and ESTs were aligned to genomes using BLAT. We required 95% identity between a cDNA/EST sequence and a genome sequence.
- The transcriptional orientation of a sequence on the genome was determined by its splicing signal, i.e. GT-AG (5' splice signal-3' splice signal), GC-AG, or AT-AC for sense orientation, and CT-AC, CT-GC, or GT-AT for anti-sense orientation, and/or by its poly(A) tail, since polyadenylation only occurs at the 3' end. If the orientation indicated by splicing conflicts with that indicated by the poly(A) tail, the sequence is discarded.
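The orientation rule above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the "+"/"-" labels, and the choice to treat disagreeing introns within one sequence as "unknown" are all assumptions made for the example.

```python
# Intron boundary dinucleotides as read on the genome's plus strand.
SENSE = {("GT", "AG"), ("GC", "AG"), ("AT", "AC")}
ANTISENSE = {("CT", "AC"), ("CT", "GC"), ("GT", "AT")}

def orientation_from_introns(introns):
    """introns: list of (first two nt, last two nt) for each aligned intron.

    Returns "+", "-", or None when no intron is informative.
    Assumption: introns that disagree with each other also yield None.
    """
    votes = set()
    for pair in introns:
        if pair in SENSE:
            votes.add("+")
        elif pair in ANTISENSE:
            votes.add("-")
    return votes.pop() if len(votes) == 1 else None

def resolve_orientation(splice, polya):
    """Combine splice-based and poly(A)-based orientation (None = unknown).

    Returns "discard" when the two sources of evidence conflict.
    """
    if splice and polya and splice != polya:
        return "discard"
    return splice or polya
```

For example, a sequence with a single GT-AG intron and a poly(A) tail implying the minus strand would be discarded under this sketch.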
Map poly(A) sites:
- cDNA/EST sequences aligned to genomic sequences were examined for poly(A) tails after the alignment. Unaligned sequences at the 5' and 3' termini of the cDNA/EST were checked for stretches of Ts and As, respectively. For the 3' end, a sequence is considered to have a poly(A) tail if, starting from the first unaligned position, (1) the sequence contains 8 or more consecutive As, or (2) it contains one other nucleotide followed by 8 or more consecutive As. The criteria are the same for the 5' end except that consecutive Ts are searched.
- To address the internal priming issue, the genomic sequence -10 to +10 nt surrounding the cleavage site was examined. If the sequence has 6 consecutive As, or more than 7 As in a 10 nt window, the site is considered an internal priming candidate.
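A minimal sketch of the two checks above, under stated assumptions: the unaligned overhang is passed so that its first character is the one adjacent to the aligned region (for the 5' end this means reversing the overhang first), and the genomic window is given in the orientation of the transcript. The function names are hypothetical.

```python
import re

def has_tail(overhang, base="A", min_run=8):
    """True if the overhang starts with >= min_run copies of `base`,
    optionally preceded by exactly one other nucleotide."""
    pattern = rf"^[^{base}]?{base}{{{min_run},}}"
    return re.match(pattern, overhang.upper()) is not None

def is_internal_priming_candidate(window):
    """window: genomic sequence from -10 to +10 nt around the cleavage site.

    Flags 6 consecutive As, or more than 7 As in any 10 nt sub-window.
    """
    w = window.upper()
    if "A" * 6 in w:
        return True
    return any(w[i:i + 10].count("A") > 7 for i in range(len(w) - 9))
```

Usage: `has_tail(overhang_3p)` tests a 3' overhang for a poly(A) tail, and `has_tail(overhang_5p[::-1], base="T")` applies the same criteria to a 5' overhang.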
Group poly(A) sites according to locations and genes:
- We group together poly(A) sites belonging to the same gene using the NCBI UniGene database. To eliminate anti-sense transcripts and other erroneous transcripts, we clean up UniGene Bins and call each cleaned UniGene Bin a CLUB (CLeaned UniGene Bin). This step is carried out by first selecting a representative sequence, called the initiator, for the CLUB, and then iteratively including cDNA/ESTs that have the same transcriptional orientation as the initiator and overlap in sequence with the cDNA/ESTs already in the CLUB. The initiator is selected based on the order RefSeqs > other cDNAs > ESTs. Sequences included in a CLUB are called CLUB members. One UniGene Bin may have more than one CLUB.
- To maximize the number of supporting sequences for a poly(A) site, the 3’ ends of sequences without poly(A/T) tails are compared with identified poly(A) sites. A cDNA/EST is considered to support a poly(A) site if its 3’ end is within 24 nt of the poly(A) site. Transcripts with unknown transcriptional orientation can be assigned as associated CLUB members if one of their sequence ends is within 24 nt of a poly(A) site and the transcriptional orientation inferred from the poly(A) site agrees with the orientation of the CLUB. They are also included as supporting cDNA/ESTs.
- Poly(A) sites that are located within 24 nt of one another, due to heterogeneous cleavage, are clustered together in the 5’ to 3’ direction. The position of the first cleavage site in a cluster is used to represent the cluster, or poly(A) site. The number of cDNA/ESTs associated with a poly(A) site is the sum of all cDNA/ESTs supporting all cleavage sites in the cluster.
- A poly(A) site ID is composed of three parts, i.e. UniGene ID, CLUB number, and site number. For example, Hs.44402.1.46 is based on UniGene ID Hs.44402, CLUB number 1, and site number 46.
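The clustering and ID conventions above can be illustrated with a short sketch. It assumes that positions are given in transcript coordinates (increasing 5' to 3'), that "within 24 nt" means a gap of at most 24 nt from the previous cleavage site in the cluster, and that support counts are supplied per cleavage position. The function names and dictionary keys are hypothetical.

```python
def cluster_sites(sites, max_gap=24):
    """sites: dict mapping cleavage position -> number of supporting cDNA/ESTs.

    Walks positions 5' to 3' and merges each site into the current cluster
    when it lies within max_gap nt of the cluster's last cleavage site.
    """
    clusters = []
    for pos in sorted(sites):
        if clusters and pos - clusters[-1]["last"] <= max_gap:
            c = clusters[-1]
            c["last"] = pos
            c["support"] += sites[pos]
            c["n_cleavage"] += 1
        else:
            # "pos" is the first cleavage site and represents the cluster.
            clusters.append(
                {"pos": pos, "last": pos, "support": sites[pos], "n_cleavage": 1}
            )
    return clusters

def parse_site_id(site_id):
    """Split an ID such as 'Hs.44402.1.46' into (UniGene ID, CLUB, site)."""
    unigene, club, site = site_id.rsplit(".", 2)
    return unigene, int(club), int(site)
```

With cleavage sites at 100, 110, and 130 supported by 3, 2, and 1 sequences, this sketch yields one poly(A) site at position 100 with 6 supporting sequences.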
Use Trace sequences:
- Trace sequences were downloaded from NCBI Trace Archive. Each Trace sequence was compared with its corresponding cDNA/EST by BLAST. Regions in the Trace sequence that were not present in the cDNA/EST sequence were retrieved and inspected for poly(A/T) sequences.
- To ensure quality, we discarded Trace sequences whose poly(A/T) tail information conflicted with that of their corresponding cDNA/ESTs, e.g. poly(T) sequence in Trace sequence but poly(A) sequence in EST, or the opposite situation. Poly(A) tail information is then added to the cDNA/EST for poly(A) site identification.
Gene View:
- Gene View contains information about genes and their poly(A) sites, including Gene ID, Organism, Official Symbol, Gene Name, RefSeq sequence(s), Chromosome number, UniGene ID, and Strand of transcription. Most of these are obtained from the NCBI Gene database and the NCBI UniGene database.
- For each gene, the gene structures of its RefSeq sequences are plotted. ORFs, UTRs, and poly(A) sites are indicated in the graph.
- Poly(A) sites are listed with links to Site View, Synteny View, and Cis-element View.
- Links to cDNA/EST evidence View, Ortholog View, Library View, and PAS view are provided.
- Gene View is available for human, mouse, rat, chicken, and zebrafish genes, and can be queried using Gene IDs or UniGene IDs.
Site View:
- Site View contains information about individual poly(A) sites. A poly(A) site ID starts with a UniGene ID followed by 2 numbers reflecting the poly(A) site identification process. Each site has a corresponding Gene ID, a chromosome number, and its position on the chromosome.
- Since each site may have several cleavage sites, the left-most and right-most cleavage locations are indicated, as well as the number of cleavage sites.
- The maximum length of poly(A/T) tail based on all supporting cDNA/ESTs and the maximum length of poly(A/T) tail based on all supporting cDNA/ESTs + Trace sequences are provided to help determine the authenticity of a poly(A) site. Accession numbers for all supporting cDNA/ESTs are listed.
- The genomic sequence -125 to +125 nt flanking the site is shown. Upstream sequences are indicated by ‘<’, downstream sequences are indicated by ‘>’, and sequences between left-most and right-most cleavage sites are indicated by ‘-’.
cDNA/EST evidence View:
- cDNA/EST evidence View shows all supporting cDNA/ESTs for all poly(A) sites of a gene, both in a table listing their Accession numbers, and in a graph showing their gene structures. RefSeq sequences are also included in the graph (shown at the top) for comparison purposes.
Ortholog View:
- Ortholog View shows orthologous groups among human, mouse, rat, and chicken genes. The data are based on the NCBI HomoloGene database. For each gene, the gene structure of a representative sequence (initiator, see above for details) and all poly(A) sites are indicated in a graph.
PAS View:
- Polyadenylation signals (PAS) for all poly(A) sites of a gene are shown in a table. AAUAAA, AUUAAA, and 11 other single nucleotide variants, i.e. AGUAAA, UAUAAA, CAUAAA, GAUAAA, AAUAUA, AAUACA, AAUAGA, ACUAAA, AAGAAA, AAUGAA, and UUUAAA, are shown in black, dark gray, and light gray, respectively.
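As an illustration of the three PAS classes, the following sketch scans a sequence for the 13 hexamers and reports the display shade for each hit. The region to scan is not specified here, so the caller chooses it; the function names are hypothetical, and DNA input is converted to RNA letters for comparison.

```python
# The 11 single nucleotide variants listed above (RNA alphabet).
VARIANTS = {
    "AGUAAA", "UAUAAA", "CAUAAA", "GAUAAA", "AAUAUA", "AAUACA",
    "AAUAGA", "ACUAAA", "AAGAAA", "AAUGAA", "UUUAAA",
}

def classify_hexamer(hexamer):
    """Return the display shade for a hexamer, or None if it is not a PAS."""
    h = hexamer.upper().replace("T", "U")
    if h == "AAUAAA":
        return "black"
    if h == "AUUAAA":
        return "dark gray"
    if h in VARIANTS:
        return "light gray"
    return None

def find_pas(sequence):
    """List (offset, hexamer, shade) for every PAS hexamer in the sequence."""
    s = sequence.upper().replace("T", "U")
    hits = []
    for i in range(len(s) - 5):
        shade = classify_hexamer(s[i:i + 6])
        if shade is not None:
            hits.append((i, s[i:i + 6], shade))
    return hits
```

For instance, `find_pas("GGAATAAAGG")` reports a single AAUAAA hit at offset 2.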
Cis-element View:
- The -125 to +125 nt region of a poly(A) site is searched for 15 elements identified by a bioinformatic method described in Hu et al. (2005, RNA).
- For each cis element, symbols are used to indicate similarity to its matching sequences: '+', very strong match; '|', strong match; ':', weak match; '.', very weak match; '-', no match.
- '<' is used to indicate sequence upstream of the site, and '>' sequence downstream of the site.
Library View:
- Library View shows all cDNA library annotations for ESTs supporting poly(A) sites, including cDNA library title, tissue source, developmental stage, and cancer information. This information is derived from the NCBI UniGene database.
Synteny View:
- For each human poly(A) site, we parse out a multiple genome alignment from the 8-way genome alignment file obtained from the UCSC Genome Bioinformatics Site. The alignment includes eight species: Homo sapiens, Pan troglodytes, Canis familiaris, Mus musculus, Rattus norvegicus, Gallus gallus, Danio rerio, and Takifugu rubripes. Upstream and downstream sequences of the human poly(A) site are indicated by '<' and '>', respectively.