A permutation test for determining significance of clusters with applications to spatial and gene expression data

https://doi.org/10.1016/j.csda.2009.05.031

Abstract

Hierarchical clustering is a common procedure for identifying structure in a dataset, and this is frequently used for organizing genomic data. Although more advanced clustering algorithms are available, the simplicity and visual appeal of hierarchical clustering have made it ubiquitous in gene expression data analysis. Hence, even minor improvements in this framework would have significant impact. There is currently no simple and systematic way of assessing and displaying the significance of various clusters in a resulting dendrogram without making certain distributional assumptions or ignoring gene-specific variances. In this work, we introduce a permutation test based on comparing the within-cluster structure of the observed data with those of sample datasets obtained by permuting the cluster membership. We carry out this test at each node of the dendrogram using a statistic derived from the singular value decomposition of variance matrices. The p-values thus obtained provide insight into the significance of each cluster division. Given these values, one can also modify the dendrogram by combining non-significant branches. By adjusting the cut-off level of significance for branches, one can produce dendrograms with a desired level of detail for ease of interpretation. We demonstrate the usefulness of this approach by applying it to illustrative datasets.

Introduction

Clustering and classification play an important role in many branches of science (Everitt, 1993, Gordon, 1999). Assuming that we can define distances between objects, clustering is done on the basis of some dissimilarity measure. Specifically, we have a collection of n objects in p-dimensional space, and we have a distance measure defined on this space. We are interested in identifying the cluster structure, if any, inherent in the data. A popular approach in cluster analysis is hierarchical clustering. In this agglomerative method, one starts with n objects, each as its own cluster, and combines the “closest” clusters into a larger one. This is done successively until a single cluster emerges. The Euclidean distance is perhaps the most commonly used distance, but the choice of distance measure depends on the type of data and the particular features in which one is interested. Pearson’s correlation coefficient is often used, for example in genetics and time series, although it is not a “distance measure” in the strict sense.
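The agglomerative procedure described above is available in standard libraries. The following is a minimal sketch, assuming SciPy (a choice of this illustration, not a tool named in the paper), showing both the Euclidean distance and a correlation-based dissimilarity on hypothetical data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Hypothetical data: n = 10 objects in p = 4 dimensions, drawn from two groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (5, 4)), rng.normal(4, 1, (5, 4))])

# Agglomerative clustering under the Euclidean distance, average linkage.
Z_euc = linkage(pdist(X, metric="euclidean"), method="average")

# 1 - Pearson correlation as the dissimilarity (common for expression data,
# although not a distance measure in the strict sense).
Z_cor = linkage(pdist(X, metric="correlation"), method="average")

# Each linkage matrix records the n - 1 successive merges.
print(Z_euc.shape)  # (9, 4)
```

Either linkage matrix can then be rendered as a dendrogram with `scipy.cluster.hierarchy.dendrogram`.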

While a hierarchical clustering algorithm can be implemented easily, interpretation of the resulting dendrograms is more difficult. It is well known that they suffer from many undesirable properties, such as non-uniqueness and inversion (Morgan and Ray, 1995), as well as sensitivity to outliers and small perturbations in the data (Cheng and Milligan, 1996). Also, the ordering of the leaves of a dendrogram contains some arbitrariness and can be misleading, although some efforts have been made towards a better arrangement algorithm (Bar-Joseph et al., 2001). Thus, the interpretation of a given dendrogram requires caution. Most importantly, by the nature of the algorithm, hierarchical clustering will always yield clusters even when no clear cluster structure exists. It is therefore very helpful to assign significance to the apparent clustering in order to determine whether the dendrogram reliably describes the true structure of the data. While there are methods for determining when to cut branches of the dendrogram, these rules generally rely on a prespecified branch height, with all connected branches below considered to be separate clusters. Dynamic tree cutting (Langfelder et al., 2008) has been developed as a more flexible method for cluster identification, but there is no measure of significance associated with the resultant dendrogram.
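The fixed-height cutting rule mentioned above can be sketched as follows, again assuming SciPy; the data and the cutoff height are hypothetical:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical data: two tight, well-separated groups of three points each.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
Z = linkage(X, method="average")

# Declare every connected branch below a prespecified height a cluster.
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)
```

Note that this rule attaches no measure of significance to the resulting clusters; any cutoff would produce some partition.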

Despite its many weaknesses, hierarchical clustering is pervasive in genomic data analysis, especially for gene expression data from microarrays. Its intuitive methodology, ease of implementation and simple, attractive visualization in two dimensions have made it popular with biologists and clinicians. A large number of more complicated and statistically sound algorithms from multivariate statistics and machine learning have been developed and compared (Gibbons and Roth, 2002). One reasonable approach has been based on the idea of consensus clustering (McShane et al., 2002, Monti et al., 2003), where multiple clustering runs are made with perturbed data and these are averaged to give the final result. Various methods have been used to generate perturbed data and to summarize results, including parametric bootstrap (Zhang and Zhao, 2000) and adding Gaussian noise (Kerr and Churchill, 2001, McShane et al., 2002). Machine learning algorithms for clustering include those based on genetic algorithms (Falkenauer, 1998) (see Di Gesu et al. (2005) for an example), neural networks, and Bayesian model selection (MacKay, 2003) (see Grotkjaer et al. (2006) for an example). Gene clustering methods have also been extended to situations involving repeated measure data (see Yeung et al. (2003) for an example). Yet, the same standard method first used for expression data several years ago (Eisen et al., 1998) continues to be used pervasively in the current literature.

Given the continued popularity of the simple hierarchical method, we introduce a method that will facilitate the interpretation of the result in that framework. We propose a permutation test that measures the significance of each division in a dendrogram. The method is based on the observation that if the clustering algorithm correctly finds the clusters, the within-cluster scatter will be relatively small. We compute a statistic that summarizes these variances and obtain a p-value based on this statistic. While the “tightness” of certain clusters can sometimes be discerned by visual inspection of the dendrogram, i.e., by comparing the lengths of different branches, this is not always reliable. With the permutation test, we are able to assign a level of significance for the division into two branches at each node in a systematic way, based on the original data. The method we introduce does not assume any parametric form for the distribution of the data, nor do we preselect the number of clusters in the data as is done, for example, in the Hierarchical Ordered Partitioning and Collapsing Hybrid (hopach) approach (van der Laan and Pollard, 2003). The method we propose is a simple way of annotating the dendrogram to indicate a degree of confidence in each node. We deal with a single tree from one clustering algorithm in this work, but it can also be applied to a consensus tree or to other situations.

Methods

After performing hierarchical clustering and obtaining a dendrogram, one has below each node a subset of the data divided into two clusters. We compute a statistic that measures the “goodness” of this apparent clustering. This procedure is similar, in spirit, to the multiresponse permutation procedure (MRPP) (Mielke et al., 1976, Good, 2005), which uses a permutation procedure on between-object distances to evaluate the hypothesis of non-random clustering. We propose that this statistic be
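A hypothetical sketch of such a node-level test, using the leading singular value of each cluster's covariance matrix as the tightness measure (the paper's exact statistic is not reproduced in this snippet, so this choice is an assumption of the illustration):

```python
import numpy as np

def node_statistic(X, labels):
    """Illustrative statistic (hypothetical, not the paper's exact form):
    sum over the two clusters of the leading singular value of the
    within-cluster covariance matrix."""
    stat = 0.0
    for k in np.unique(labels):
        C = np.cov(X[labels == k], rowvar=False)
        stat += np.linalg.svd(np.atleast_2d(C), compute_uv=False)[0]
    return stat

def node_p_value(X, labels, n_perm=499, seed=0):
    """Permute the cluster membership below one node; the p-value is the
    fraction of permutations with structure as tight as observed."""
    rng = np.random.default_rng(seed)
    obs = node_statistic(X, labels)
    count = sum(node_statistic(X, rng.permutation(labels)) <= obs
                for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)

# Hypothetical node: two well-separated groups of 10 points in 3 dimensions.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (10, 3)), rng.normal(6, 1, (10, 3))])
labels = np.array([0] * 10 + [1] * 10)
p = node_p_value(X, labels)
print(p)
```

Repeating this at every node yields one p-value per division, which can then be used to annotate or prune the dendrogram.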

Applications

Spatial data. In the analysis of spatial data, one is often interested in determining if there is any clustering of cases. This clustering can represent any number of interesting aspects, such as excess cases of disease (Bonetti and Pagano, 2005), point sources of exposure (Meselson et al., 1994) or the existence of health disparities within a community (Schulz et al., 2002). Consider the spatial data collected by Alt and Vach (1995) who map a medieval burial site in Neresheim,

Discussion

We have introduced a method to annotate dendrograms with a statistic that supplies additional information on the significance of each node. The length of the branches supplies some information already, but it is not a reliable measure. It is clear from the examples we provide that the p-values may have little to do with the branch lengths. When the number of remaining objects in a branch is small, the p-value should be interpreted with some caution since the number of permutations possible may

Acknowledgements

We thank the reviewer for insightful comments which greatly helped in improving this paper. This work was supported in part by grants ST32-AI07358 for P.J.P., T32 AI007535-10 for J.M., and R01-EB006195 for M.P. and M.B. from the National Institutes of Health.

References

  • K. Alt et al.

    Odontologic kinship analysis in skeletal remains: Concepts, methods, and results

    Forensic Science International

    (1995)
  • A. Gordon

    Identifying genuine clusters in a classification

    Computational Statistics and Data Analysis

    (1994)
  • Z. Bar-Joseph et al.

    Fast optimal leaf ordering for hierarchical clustering

    Bioinformatics

    (2001)
  • M. Bittner et al.

    Molecular classification of cutaneous malignant melanoma by gene expression profiling

    Nature

    (2000)
  • M. Bonetti et al.

    The interpoint distance distribution as a descriptor of point patterns, with an application to spatial disease clustering

    Statistics in Medicine

    (2005)
  • R. Cheng et al.

    Measuring the influence of individual data points in a cluster analysis

    Journal of Classification

    (1996)
  • V. Di Gesu et al.

    GenClust: A genetic algorithm for clustering gene expression data

    BMC Bioinformatics

    (2005)
  • M. Eisen et al.

    Cluster analysis and display of genome-wide expression patterns

    Proceedings of the National Academy of Sciences

    (1998)
  • B. Everitt

    Cluster Analysis

    (1993)
  • E. Falkenauer

    Genetic Algorithms and Grouping Problems

    (1998)
  • F.D. Gibbons et al.

    Judging the quality of gene expression-based clustering methods using gene annotation

    Genome Research

    (2002)
  • P. Good

    Permutation, Parametric, and Bootstrap Tests of Hypotheses

    (2005)
  • A. Gordon

    Classification

    (1999)
  • T. Grotkjaer et al.

    Robust multi-scale clustering of large DNA microarray datasets with the consensus algorithm

    Bioinformatics

    (2006)
  • R. Johnson et al.

    Applied Multivariate Statistical Analysis

    (1982)
  • M.K. Kerr et al.

    Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments

    Proceedings of the National Academy of Sciences of the United States of America

    (2001)
  • P. Langfelder et al.

    Defining clusters from a hierarchical cluster tree: The Dynamic Tree Cut package for R

    Bioinformatics

    (2008)
  • R. Ling

    A probability theory of cluster analysis

    Journal of the American Statistical Association

    (1973)
  • D. MacKay

    Information Theory, Inference and Learning Algorithms

    (2003)