Computational deconvolution: extracting cell type-specific information from heterogeneous samples

https://doi.org/10.1016/j.coi.2013.09.015

The quantum unit of the immune system is the cell, yet analyzed samples are often heterogeneous with respect to cell subsets, which can mislead result interpretation. Experimentally, researchers face a difficult choice: whether to profile heterogeneous samples with the ensuing confounding effects, or to focus a priori on a few cell subsets of interest, potentially limiting new discoveries. An attractive alternative is to extract cell subset-specific information directly from heterogeneous samples via computational deconvolution techniques, thereby capturing both cell-centered and whole-system-level context. Such approaches are capable of uncovering novel biology that would otherwise be undetectable. Here we review the present state of available deconvolution techniques, their advantages and limitations, with a focus on blood expression data and immunological studies in general.

Introduction

The cellular composition of many biological samples is heterogeneous and varying: multiple cell-type subsets are present in each sample, and samples differ widely from one another in relative cell subset proportions (from here on, heterogeneous samples). Moreover, many physiological and pathological processes involve cell motility (e.g. infiltration) and differentiation, ultimately resulting in marked shifts in sample cell subset composition (Figure 1a). An important example is peripheral blood, which is composed of many different immune cell subsets. These subsets are the fundamental units of a complex system whose states and interactions reflect the biological processes taking place, whether in health or disease. The ability to measure and interpret phenotypic changes between specific conditions at the cell subset level is therefore critical to obtaining a detailed understanding of the role of each cell subset within the immune system.

While the problem of sample heterogeneity has long been acknowledged [1, 2••, 3], researchers have been torn between focusing on a single cell subset and ignoring the problem by assaying heterogeneous samples. Whereas cell isolation entails the loss of a systems perspective (i.e. biologically meaningful changes happening within and between multiple cell subsets), ignoring sample heterogeneity yields misleading and difficult-to-interpret results. This has been an especially sore point in genomic-scale studies (predominantly whole-genome gene expression), where it is easy to lose sight of the natural cellular context of the data amongst thousands of measured features and to draw incorrect conclusions.

An emerging solution to this dilemma comes in the form of computational deconvolution methodologies, capable of extracting cell type-specific information directly from data generated from heterogeneous samples. Research addressing this issue started with the pioneering work of Venet et al. [4], but received relatively little attention. More recently, the increased application of genomic tools to human samples, which exhibit high sample heterogeneity, has spurred further developments. In particular, the availability of data from large efforts that profiled the gene expression of multiple known cell subsets and highlighted distinct differences between them [5••, 6••] both motivated such computational algorithm development and served as useful input for them.

This article reviews this active area of research and draws attention to the advantages and limitations of the different proposed approaches. For the sake of simplicity, we focus on peripheral blood, which is the primary source of samples in human immunology studies, and in particular on gene expression data and group difference analyses. However, research has concomitantly focused on other tissue and data types. Ultimately, such techniques have the potential to provide a valuable cell-centered view of the immune system, bringing new insights into inter- and intracellular dynamics, and cell subset states in health and disease.

The confounding effects of sample cell-type subset heterogeneity

Most genes are expressed to varying degrees across multiple cell subsets in a tissue or organism, implying that the measured abundance of any such transcript is confounded by the composition of the sample from which it is measured. More precisely, we may break down the total measured abundance of a gene in a sample into three Abundance Components: (1) that due to the characteristic condition of a sample (e.g. disease type, etc.), (2) that due to the individual variation, genotype-specific or

Extracting cell type-specific information from heterogeneous tissue

An attractive approach for gaining insight into cell subset-specific information is to estimate the proportion and/or gene expression profile of different cell subsets directly from the heterogeneous samples via computational methodologies, thereby preserving the whole-systems perspective, as well as obtaining an unbiased cell-based view (Figure 2).
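
As a rough illustration of proportion estimation, the sketch below assumes a simple linear mixing model, in which the measured expression of a heterogeneous sample is the proportion-weighted sum of reference profiles from purified cell subsets. Under that assumption, proportions can be estimated by non-negative least squares. The function name `estimate_proportions`, the signature matrix and all numeric values are hypothetical and chosen only for illustration; this is one possible formulation, not the procedure of any specific method reviewed here.

```python
import numpy as np
from scipy.optimize import nnls


def estimate_proportions(bulk, signatures):
    """Estimate cell subset proportions for one heterogeneous sample.

    bulk:       (n_genes,) expression of the mixed sample, linear scale.
    signatures: (n_genes, n_subsets) reference expression of pure subsets.
    Assumes bulk is approximately signatures @ proportions.
    """
    # Least-squares fit constrained to non-negative coefficients.
    coef, _residual_norm = nnls(signatures, bulk)
    total = coef.sum()
    if total == 0:
        raise ValueError("all coefficients are zero; cannot normalize")
    # Rescale so that the estimated proportions sum to 1.
    return coef / total


# Hypothetical reference profiles: 6 marker genes x 3 cell subsets.
signatures = np.array([
    [90.0,  5.0,  5.0],
    [80.0, 10.0,  5.0],
    [ 5.0, 95.0, 10.0],
    [10.0, 85.0,  5.0],
    [ 5.0,  5.0, 90.0],
    [10.0,  5.0, 80.0],
])

# Simulate a noise-free heterogeneous sample with known composition.
true_props = np.array([0.6, 0.3, 0.1])
bulk = signatures @ true_props

estimated = estimate_proportions(bulk, signatures)
print(np.round(estimated, 3))
```

In this noise-free toy case the estimate matches the simulated composition. With real data, measurement noise, subsets missing from the reference and non-linear preprocessing (e.g. log transformation) would all degrade the estimate.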

Present computational methodologies for extracting cell type-specific information from heterogeneous sample data may be divided into five main method

Gain in biological insights

Computational deconvolution methods aim to provide a cell-centered view of heterogeneous molecular data by decoupling the effect of proportion from cell type-specific phenotype. In particular, they have the potential to mine high-throughput data in a way that even upcoming laboratory techniques may not yet, or ever, handle, for example, due to limitations of cell surface markers for cell sorting or to the mere unavailability of biological material for past studies. Notably, they have already

Limitations of computational approaches

Despite the various successful applications of computational deconvolution methodologies, we believe several open issues remain to be investigated before they become widely adopted. First, a better understanding of the accuracy lower bound for estimates of cell subset proportions or differential gene expression detection must be developed. This general accuracy is difficult to assess because of the many factors to consider (proportion dependencies, individual variation, clinical condition,

Conclusion

Cell subset heterogeneity is inherent to most primary biological samples. If it is not taken into account, it may confound downstream data analysis and strongly restrict result interpretability. From a systems immunology perspective on health and disease, it is critical to be able to assess each cell subset's state and interactions over a range of conditions and molecular environments. In this respect, computational deconvolution methodologies have proven to be powerful tools capable of providing

References and recommended reading

Papers of particular interest, published within the period of review, have been highlighted as:

  • • of special interest

  • •• of outstanding interest

Acknowledgements

This work was supported by US National Institutes of Health (NIH) (U19 AI057229). SSO is a Taub Fellow. RG is supported by the Lady Davis Fellowship.

References (40)

  • D. Ramsköld et al. Full-length mRNA-Seq from single-cell levels of RNA and individual circulating tumor cells. Nat Biotechnol (2012)
  • D.A. Barbie et al. Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature (2009)
  • R.G.W. Verhaak et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell (2010)
  • C.R. Bolen et al. Cell subset prediction for blood genomic studies. BMC Bioinformatics (2011)
  • J.E. Shoemaker et al. CTen: a web-based platform for identifying enriched cell types from heterogeneous microarray data. BMC Genomics (2012)
  • M.D. Chikina et al. Global prediction of tissue-specific gene expression and context-dependent gene networks in Caenorhabditis elegans. PLoS Comput Biol (2009)
  • A.R. Abbas et al. Deconvolution of blood microarray data identifies cellular activation patterns in systemic lupus erythematosus. PLoS ONE (2009)
  • T. Gong et al. DeconRNASeq: a statistical framework for deconvolution of heterogeneous tissue samples based on mRNA-Seq data. Bioinformatics (Oxford, Engl) (2013)
  • S. Song et al. qpure: a tool to estimate tumor cellularity from genome-wide single-nucleotide polymorphism profiles. PLoS ONE (2012)
  • S.L. Carter et al. Absolute quantification of somatic DNA alterations in human cancer. Nat Biotechnol (2012)