Associate Professor Trees-Juen Chuang (Genomics Research Center, Academia Sinica)

Guidelines/documentation of CNVVdb:

Outline:

1. Description of the CNVVdb parameters

2. Parameter settings

3. Explanations of columns in the results

4. Data resources

5. Working example

6. References

1. Description of the CNVVdb parameters

1-1. Search:

1-1-1. Single Range: input of one target species, one chromosome, and a pair of genomic coordinates to search for paralogous sequences in the same species or orthologous sequences in the other species.

1-1-2. Multi Range: input of more than one set of genomic coordinates in one target species. The interface will use all of the input sequences as queries against the genomes of the selected subject species.

1-2. Overlap: the proportion (%) of overlapped regions between the query sequence in the target species and the matched sequence in the subject species. A higher proportion of overlap will lead to fewer but more accurate matches.

1-3. Identity: the sequence identity (%) between the query sequence in the target species and the matched sequence in the subject species.
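As a rough illustration of how the Overlap and Identity thresholds act together, the following Python sketch keeps only the matches that satisfy both. The `Match` record, the `filter_matches` helper, and the example values are hypothetical and are not part of CNVVdb; the sketch only mirrors the definitions given above.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """Hypothetical record for one matched sequence (not a CNVVdb data type)."""
    species: str
    identity: float  # sequence identity (%)
    overlap: float   # overlap with the query (%)


def filter_matches(matches, min_identity=60.0, min_overlap=80.0):
    """Keep matches that meet both the identity and the overlap threshold."""
    return [
        m for m in matches
        if m.identity >= min_identity and m.overlap >= min_overlap
    ]


# Illustrative values only; they are not real CNVVdb results.
example_matches = [
    Match("chimpanzee", 98.5, 95.0),
    Match("mouse", 65.0, 82.0),
    Match("dog", 72.0, 70.0),
]

kept = filter_matches(example_matches, min_identity=60.0, min_overlap=80.0)
print([m.species for m in kept])
```

Raising either threshold can only shrink the list, which is why a higher overlap or identity gives fewer matches.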

For guidelines of parameter settings, please refer to the next section.

2. Parameter settings:

Sequence identity is the basis for identifying paralogues/orthologues in this interface. As a reference, segmental duplications are defined as sequences of ≥1 kb with ≥90% sequence identity (Bailey, et al., 2004; Bailey, et al., 2001; Cheung, et al., 2003; Zhang, et al., 2005). Because cross-species genome comparisons in our interface may span distantly related species (from mammals to fishes), we let the user choose the level of sequence identity for CNV identification to accommodate different research purposes. For example, the user can select a high level of sequence identity to find recent duplications between closely related species (e.g., >94% for the human and chimpanzee comparison (Cheng, et al., 2005; Chimpanzee Sequencing and Analysis Consortium, 2005)).

We suggest starting with a low sequence identity (e.g., ≥60%), so that more matches are identified for further analyses than with a high sequence identity. For example, the working example (see the fifth part of the guidelines/documentation) shows that CNVs from primates and other mammals (mouse, dog, and horse) can be found if the sequence identity is set to ≥60%. If a higher sequence identity (e.g., ≥90%) is used, only CNVs from primates remain. Furthermore, if only CNVs with sequence identity ≥94% are considered, recent duplications among hominoids (i.e., human, chimpanzee, and orangutan) can be identified.

In addition, since genic regions may be subject to more stringent selection pressures than intergenic regions, genic regions tend to be more conserved across species. The evolutionary relationships among some model animals, the coordinates of Ensembl-identified genic regions for the eight target species, and a summary of some general rules for parameter settings are given below. Users can adjust the parameters accordingly.

The relationships and approximate divergence times of some model animals:

Note: The branch lengths are not scaled to divergence times.

Coordinates of Ensembl-identified genic regions for the eight target species (.rar file)

Summary of some general rules for parameter settings:

(1) Since the lowest overlap that CNVVdb allows is 80%, the user is advised to specify a short query length when searching between species with large genetic distances, particularly in the case of non-coding sequences. For example, a 10 kb query sequence from a human non-coding region frequently returns no match from species other than the great apes.

(2) For non-coding sequences, use low overlap and low identity in addition to short query lengths.

(3) The search time increases with the length of the query sequence and the number of subject species.

(4) Users can start with closely related species and then extend the search to more distantly related ones, which gives better control over the parameters. For example, human-chimpanzee-orangutan, mouse-rat, and stickleback-zebrafish are appropriate starting sets.

(5) It is also suggested that users start with short sequences, low overlap, and low identity. Although this practice returns more noise, the thresholds can always be raised afterwards to obtain more accurate results.
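Rule (1) above can be made concrete with a little arithmetic. The Python sketch below computes how many bases of the query must be covered by a match to reach a given overlap threshold, assuming that overlap is measured as a percentage of the query length (an assumption; the exact computation used by CNVVdb is not described here). The helper name `min_aligned_length` is hypothetical.

```python
def min_aligned_length(query_length_bp, overlap_percent):
    """Smallest number of query bases a match must cover to reach the threshold.

    Assumes overlap is expressed as a whole-number percentage of the query length.
    """
    if query_length_bp <= 0:
        raise ValueError("query length must be positive")
    if not 0 <= overlap_percent <= 100:
        raise ValueError("overlap must be between 0 and 100")
    # Integer ceiling division avoids floating-point rounding.
    return -(-query_length_bp * overlap_percent // 100)


for length in (10_000, 1_000):
    print(length, min_aligned_length(length, 80))
```

Under this assumption, a 10 kb query needs 8,000 covered bases at the 80% threshold, whereas a 1 kb query needs only 800, which is easier to satisfy between distantly related species.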

3. Explanations of columns in the results

There are four main parts in the CNVVdb results.

Part I. Information summary of the query region in the target species

(Note: for all columns, users can click the hyperlink to obtain more detailed information.)

This part provides the following information:

#Ensembl gene: number of Ensembl-annotated genes that overlap with the query sequence

#Ensembl transcript: number of Ensembl-annotated transcripts that overlap with the query sequence

#Ensembl protein: number of Ensembl-annotated proteins that overlap with the query sequence

#UCSC KnownGene (Transcript): number of UCSC-annotated transcripts that overlap with the query sequence

#NCBI Refseq (Transcript): number of NCBI-annotated transcripts that overlap with the query sequence

#Processed pseudogene: number of PseudoPipe-identified processed pseudogenes that overlap with the query sequence

#Duplicated pseudogene: number of PseudoPipe-identified duplicated pseudogenes that overlap with the query sequence

Evidence of CNVs between individuals: Whether the query sequence overlaps with the experimentally validated CNVs (including four CNV databases: DGV, 500KEA, Redon, and WGTP) between individuals. (Yes/No) (This result is provided for human only)

WGAC: Whether the query sequence overlaps with the segmental duplications identified by the whole-genome assembly comparison (WGAC) method (Yes/No) (This result is provided for human, chimpanzee, and dog only)

#dbSNP: number of SNPs that overlap with the query sequence

#Watson SNP: number of SNPs (the reference genome vs. the Watson genome) that overlap with the query sequence (human only)

#Venter SNP: number of SNPs (the reference genome vs. the Venter genome) that overlap with the query sequence (human only)
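Most of the columns above are counts of annotated features that overlap the query region. A minimal Python sketch of such a count is given below, assuming closed coordinates and features stored as (chromosome, start, end) tuples. The function names and the example coordinates are hypothetical and do not reflect how CNVVdb stores its annotations.

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """Return True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def count_overlapping(query, features):
    """Count features on the query's chromosome that overlap the query region."""
    chrom, q_start, q_end = query
    return sum(
        1
        for f_chrom, f_start, f_end in features
        if f_chrom == chrom and intervals_overlap(q_start, q_end, f_start, f_end)
    )


# Illustrative gene coordinates only; they are not real annotations.
genes = [
    ("chr1", 100, 500),
    ("chr1", 450, 900),
    ("chr1", 2000, 3000),
    ("chr2", 100, 500),
]

print(count_overlapping(("chr1", 400, 1000), genes))
```

The same counting logic would apply to transcripts, pseudogenes, or SNPs; only the feature list changes.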

Part II. Information summary of the matched region(s) in the subject species

This part includes three sub-parts:

(1) Summary table of the number of the matched CNVs for each subject species.

(2) Genome-wide visualization of the matched CNVs for each subject species. The graph shows the approximate localization of each matched region in the chromosomes of the subject species. The user can click on the species name of the table to view the corresponding graph. More detailed information can be obtained in Part III.

(3) Summary table of the number of the matched CNVs with additional information. The column explanations of this sub-part are similar to those in Part I.

Part III. Detailed information of each matched region in the subject species

The user can click on the “species tag” to see the information of the matched regions in this specific species.

View: clicking on the “Alignment” hyperlink brings up the BLAST pairwise alignment between the query sequence in the target species and the matched sequence in the subject species

Identity (%): sequence identity between the query sequence in the target species and the matched sequence in the subject species

Coverage (%): the proportion (%) of overlapped regions between the query sequence in the target species and the matched sequence in the subject species

Position of the query sequence in the target species:

Chr#: in which chromosome the query sequence is located

tStrand: on which strand (+ or -) the query sequence is located

tStart: the start coordinate of the query sequence

tEnd: the end coordinate of the query sequence

Detailed information about the matched sequence (Most of the results are hyperlinked to outside resources. The users can click on the hyperlinks, whenever available, to obtain more detailed information):

Chr#: in which chromosome the matched sequence is located

tStrand: on which strand the matched sequence is located

tStart: the start coordinate of the matched sequence

tEnd: the end coordinate of the matched sequence

Ensembl gene: the number of Ensembl-annotated genes that overlap with the matched sequence

Ensembl transcript: the number of Ensembl-annotated transcripts that overlap with the matched sequence

Ensembl protein: number of Ensembl-annotated proteins that overlap with the matched sequence

UCSC KnownGene (Transcript): the number of UCSC-annotated transcripts that overlap with the matched sequence

NCBI Refseq (Transcript): the number of NCBI-annotated transcripts that overlap with the matched sequence

Processed pseudogene: the number of PseudoPipe-identified processed pseudogenes that overlap with the matched sequence

Duplicated pseudogene: the number of PseudoPipe-identified duplicated pseudogenes that overlap with the matched sequence

Evidence of CNVs between individuals: Whether the matched sequence overlaps with the experimentally validated CNVs (from four CNV databases: DGV, 500KEA, Redon, and WGTP) (Yes/No) (human only)

WGAC: Whether the matched sequence overlaps with the segmental duplications identified by the whole-genome assembly comparison (WGAC) method (Yes/No) (human, chimpanzee, and dog only)

dbSNP: the number of SNPs that overlap with the matched sequence

Watson SNP: the number of SNPs (the reference genome vs. the Watson genome) that overlap with the matched sequence (human only)

Venter SNP: the number of SNPs (the reference genome vs. the Venter genome) that overlap with the matched sequence (human only)

Information about the original UCSC alignment that covers the query and matched sequences:

score: the identity score of the UCSC Blastz alignment

ot_Start: the start coordinate of the UCSC alignment that covers the query sequence in the target species

ot_End: the end coordinate of the UCSC alignment that covers the query sequence in the target species

oq_Start: the start coordinate of the UCSC alignment that covers the matched sequence in the subject species

oq_End: the end coordinate of the UCSC alignment that covers the matched sequence in the subject species
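To show how the query coordinates (tStart, tEnd) and the coordinates of the covering UCSC alignment (ot_Start, ot_End) relate to a coverage percentage, a short Python sketch follows. It assumes closed, 1-based coordinates and defines coverage as the share of the query covered by the alignment. Both points are assumptions made for illustration, and the function name `coverage_percent` is hypothetical.

```python
def coverage_percent(t_start, t_end, ot_start, ot_end):
    """Percentage of the query [t_start, t_end] covered by [ot_start, ot_end].

    Coordinates are treated as closed and 1-based (an assumption).
    """
    if t_end < t_start:
        raise ValueError("query end must not precede query start")
    query_length = t_end - t_start + 1
    shared = min(t_end, ot_end) - max(t_start, ot_start) + 1
    # Disjoint intervals give a non-positive value; clamp it to zero.
    return 100.0 * max(shared, 0) / query_length


print(coverage_percent(1001, 2000, 1201, 2500))  # alignment covers part of the query
print(coverage_percent(1001, 2000, 1, 5000))     # alignment spans the whole query
print(coverage_percent(1001, 2000, 3000, 4000))  # no shared bases
```

In the first call, 800 of the 1,000 query bases are shared, so the coverage just reaches an 80% threshold.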

Part IV. This part includes two functions: downloading all the identified paralogous/orthologous sequences, and performing multiple sequence alignments among the query sequence and all paralogues/orthologues.

4. Data resources

(1) The target-subject species orthologous alignments, human/dog self-alignments, UCSC-annotated transcripts, and RefSeq transcripts: the UCSC genome browser at http://genome.ucsc.edu/

(2) The self-alignments of chimpanzee, rhesus macaque, mouse, rat, chicken, and stickleback: the system manager of the UCSC genome browser

(3) The gene annotation, GO descriptions, expression information, genomic sequences, and SNP data: the Ensembl genome browser (release 49) at http://www.ensembl.org/

(4) The polymorphism data of the human reference genome vs. the Venter genome: ftp://ftp.jcvi.org/pub/data/huref/

(5) The polymorphism data of the human reference genome vs. the Watson genome: ftp://ftp.hgsc.bcm.tmc.edu/pub/uers/wheeler/

(6) The protein domain descriptions (Interpro (Mulder, et al., 2007), SMART, and PFAM), the KEGG pathways (Kanehisa, et al., 2008), and the disease association information (OMIM, HIV interaction, and the Genetic Association Database (Becker, et al., 2004)): the DAVID knowledgebase (Huang da, et al., 2007) at http://david.abcc.ncifcrf.gov/

(7) The Database of Genomic Variants (DGV) (Iafrate, et al., 2004): http://projects.tcag.ca/variation/

(8) The WGTP, 500KEA, and Redon CNV databases: http://www.sanger.ac.uk/humgen/cnv/redon2006/cnv_data/

(9) Duplications identified by the WGAC (whole-genome assembly comparison) approach (Bailey, et al., 2001): http://eichlerlab.gs.washington.edu/database.html

(10) The PseudoPipe program (Zhang, et al., 2006): http://www.pseudogene.org/utilities.html

(11) The MUSCLE package (Edgar, 2004; Edgar, 2004): http://www.drive5.com/muscle/download3.6.html

5. Working example

(1) The user sets the parameters and inputs the coordinates of the query sequence.

(2) The submitted job is under processing.

(3) The results (four parts)

Part I

Part II

Part III

Part IV

=============================================

Numbers of matches according to different levels of sequence identity

Subject species	Identity ≥60%	Identity ≥70%	Identity ≥80%	Identity ≥90%	Identity ≥94%
Human	7	7	7	7	7
Chimpanzee	4	3	3	3	3
Orangutan	2	2	1	1	1
Macaque	1	1	1	1	0
Dog	1	1	0	0	0
Mouse	1	0	0	0	0
Horse	1	1	0	0	0
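The counts in the table above can be read programmatically to see which subject species retain matches at each identity level. The Python sketch below hard-codes the table values; the variable and function names are chosen for illustration only.

```python
# Identity thresholds (%) and match counts copied from the table above.
THRESHOLDS = [60, 70, 80, 90, 94]
MATCHES = {
    "Human": [7, 7, 7, 7, 7],
    "Chimpanzee": [4, 3, 3, 3, 3],
    "Orangutan": [2, 2, 1, 1, 1],
    "Macaque": [1, 1, 1, 1, 0],
    "Dog": [1, 1, 0, 0, 0],
    "Mouse": [1, 0, 0, 0, 0],
    "Horse": [1, 1, 0, 0, 0],
}


def species_with_matches(threshold):
    """List the species that still have at least one match at this threshold."""
    column = THRESHOLDS.index(threshold)
    return [name for name, counts in MATCHES.items() if counts[column] > 0]


for threshold in THRESHOLDS:
    print(threshold, species_with_matches(threshold))
```

At 90% only the four primates remain, and at 94% only human, chimpanzee, and orangutan remain, in line with the discussion in the parameter settings section.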

6. References

Bailey, J.A., Church, D.M., Ventura, M., Rocchi, M. and Eichler, E.E. (2004) Analysis of segmental duplications and genome assembly in the mouse, Genome research, 14, 789-801.

Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J. and Eichler, E.E. (2001) Segmental duplications: organization and impact within the current human genome project assembly, Genome research, 11, 1005-1017.

Becker, K.G., Barnes, K.C., Bright, T.J. and Wang, S.A. (2004) The genetic association database, Nature genetics, 36, 431-432.

Cheng, Z., Ventura, M., She, X., Khaitovich, P., Graves, T., Osoegawa, K., Church, D., DeJong, P., Wilson, R.K., Paabo, S., Rocchi, M. and Eichler, E.E. (2005) A genome-wide comparison of recent chimpanzee and human segmental duplications, Nature, 437, 88-93.

Cheung, J., Wilson, M.D., Zhang, J., Khaja, R., MacDonald, J.R., Heng, H.H., Koop, B.F. and Scherer, S.W. (2003) Recent segmental and gene duplications in the mouse genome, Genome biology, 4, R47.

Chimpanzee Sequencing and Analysis Consortium (2005) Initial sequence of the chimpanzee genome and comparison with the human genome, Nature, 437, 69-87.

Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity, BMC bioinformatics, 5, 113.

Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic acids research, 32, 1792-1797.

Huang da, W., Sherman, B.T., Tan, Q., Kir, J., Liu, D., Bryant, D., Guo, Y., Stephens, R., Baseler, M.W., Lane, H.C. and Lempicki, R.A. (2007) DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists, Nucleic acids research, 35, W169-175.

Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W. and Lee, C. (2004) Detection of large-scale variation in the human genome, Nature genetics, 36, 949-951.

Kanehisa, M., Araki, M., Goto, S., Hattori, M., Hirakawa, M., Itoh, M., Katayama, T., Kawashima, S., Okuda, S., Tokimatsu, T. and Yamanishi, Y. (2008) KEGG for linking genomes to life and the environment, Nucleic acids research, 36, D480-484.

Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bork, P., Buillard, V., Cerutti, L., Copley, R., Courcelle, E., Das, U., Daugherty, L., Dibley, M., Finn, R., Fleischmann, W., Gough, J., Haft, D., Hulo, N., Hunter, S., Kahn, D., Kanapin, A., Kejariwal, A., Labarga, A., Langendijk-Genevaux, P.S., Lonsdale, D., Lopez, R., Letunic, I., Madera, M., Maslen, J., McAnulla, C., McDowall, J., Mistry, J., Mitchell, A., Nikolskaya, A.N., Orchard, S., Orengo, C., Petryszak, R., Selengut, J.D., Sigrist, C.J., Thomas, P.D., Valentin, F., Wilson, D., Wu, C.H. and Yeats, C. (2007) New developments in the InterPro database, Nucleic acids research, 35, D224-228.

Zhang, L., Lu, H.H., Chung, W.Y., Yang, J. and Li, W.H. (2005) Patterns of segmental duplication in the human genome, Molecular biology and evolution, 22, 135-141.

Zhang, Z., Carriero, N., Zheng, D., Karro, J., Harrison, P.M. and Gerstein, M. (2006) PseudoPipe: an automated pseudogene identification pipeline, Bioinformatics (Oxford, England), 22, 1437-1439.