Background

High-throughput gene expression data can predict gene function through the guilt by association principle: coexpressed genes are likely to be functionally associated. Using expression data, functional annotation and known phenotype-gene associations, we provide candidate genes for several genetic diseases of unknown molecular basis.

Introduction

Among the open problems of molecular biology, the functional annotation of the human genome and the identification of genes involved in genetic diseases are especially important. Expression data on a genomic scale have been available for several years thanks to various experimental techniques, and are widely believed to contain a wealth of information relevant to the solution of such problems. Functional annotation based on expression data is usually founded on the guilt by association principle: since there is a strong correlation between coexpression and functional relatedness, a gene found to be coexpressed with several others involved in a given biological process can be predicted to be involved in the same process [1]–[3]. Recent systematic studies have demonstrated the soundness of the approach [4], [5]. Typically the analysis proceeds in three steps: (1) definition of a quantitative measure of dissimilarity between expression profiles; (2) identification of groups of coexpressed genes using clustering algorithms; (3) functional analysis of these groups using a controlled annotation vocabulary such as the Gene Ontology (GO) [6], [7]. In this work we analyze human normal tissue expression data with a procedure that combines data obtained with different experimental techniques, and interpreted with different definitions of coexpression, into a unified framework.
Thanks to a stringent definition of functional characterization, this approach allows the generation of a large set of high-confidence predictions in terms of functional annotation and the identification of new candidate disease genes. The distinctive features of our approach are: (1) integration of different datasets and measures of coexpression, on the working hypothesis that different experimental techniques and different definitions of dissimilarity explore different aspects of coexpression, and can therefore be combined to maximize the useful information obtained; (2) use of a rank-based procedure to generate groups of coexpressed genes (Ranked Coexpression Groups, RCGs), without clustering algorithms; (3) use of the majority rule to determine the functional characterization of the RCGs. This highly stringent criterion allows the generation of high-confidence functional predictions for the genes included in the functionally characterized RCGs.

The Ranked Coexpression Groups were generated from publicly available expression data on human normal tissues obtained with Affymetrix microarrays and SAGE. For the microarray data we used Euclidean distance and Pearson linear dissimilarity, while for SAGE we also used two measures of coexpression based on the Poisson distribution and originally introduced in [8] in a different context. The RCGs found to be functionally characterized by the majority rule were then used for two purposes: (1) to generate high-confidence functional predictions for the genes included in the functionally characterized RCGs; (2) to identify promising new candidate disease genes for OMIM [9] phenotypes of unknown molecular basis, but for which one or more genetic loci have been identified.
These predictions are based on the co-occurrence, in functionally characterized RCGs, of genes known to cause similar phenotypes.

Results and Discussion

Ranked Coexpression Groups

In this work we considered gene expression data derived from human normal tissues with Affymetrix microarrays and with SAGE, but the techniques we used are readily generalized to any high-throughput gene expression platform. Given a set of expression data and a quantitative measure of coexpression, for each gene in the dataset we defined a Ranked Coexpression Group as the gene itself together with the genes most closely coexpressed with it. Consequently, from a gene expression dataset and a quantitative measure of coexpression we generated a number of RCGs equal to the number of genes in the dataset. An RCG was considered functionally characterized when a majority of the genes it contains share a functional annotation (GO term). If an RCG was found to be functionally characterized by a GO term, we assigned the same term to all the genes in the RCG which.
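
The rank-based construction and the majority rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: the group size `k`, the form `1 - r` for the Pearson linear dissimilarity, the reading of "majority" as strictly more than half of the group, and the toy gene names, profiles and GO labels. The Poisson-based measures of [8] are not covered.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_dissimilarity(x, y):
    """Assumed form: 1 - r, with r the Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 1.0  # constant profile: r is undefined, treat as uncorrelated
    return 1.0 - cov / (sx * sy)


def ranked_coexpression_groups(profiles, dissimilarity, k):
    """One group per gene: the gene itself plus its k closest genes.

    `k` is a hypothetical parameter; ties keep dictionary insertion order.
    """
    groups = {}
    for gene, profile in profiles.items():
        others = [g for g in profiles if g != gene]
        others.sort(key=lambda g: dissimilarity(profile, profiles[g]))
        groups[gene] = [gene] + others[:k]
    return groups


def majority_terms(group, annotations):
    """Terms annotated to strictly more than half of the genes in the group."""
    counts = Counter(t for g in group for t in annotations.get(g, set()))
    return {t for t, c in counts.items() if c > len(group) / 2}


def assign_terms(group, annotations):
    """Give every gene in the group each term that characterizes the group."""
    shared = majority_terms(group, annotations)
    return {g: set(annotations.get(g, set())) | shared for g in group}


# Toy data (hypothetical): five genes measured in four tissues.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [1.1, 2.1, 2.9, 4.2],
    "g3": [2.0, 4.0, 6.0, 8.0],
    "g4": [4.0, 3.0, 2.0, 1.0],
    "g5": [4.1, 2.9, 2.2, 0.8],
}
# Placeholder labels, not real GO identifiers.
annotations = {
    "g1": {"GO:A"},
    "g2": {"GO:A", "GO:B"},
    "g3": set(),
    "g4": {"GO:C"},
    "g5": {"GO:C"},
}

rcg_euclid = ranked_coexpression_groups(profiles, euclidean, k=2)
rcg_pearson = ranked_coexpression_groups(profiles, pearson_dissimilarity, k=2)
predicted = assign_terms(rcg_pearson["g1"], annotations)
```

In this toy example the two measures rank the neighbors of `g1` differently: Euclidean distance favors profiles of similar magnitude, while the Pearson-based measure favors profiles of similar shape. This is the kind of complementarity that motivates combining several measures.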