NeuroExpresso: A cross-laboratory database of brain cell-type expression profiles with applications to marker gene identification and bulk brain tissue transcriptome interpretation

The identification of cell type marker genes, genes highly enriched in specific cell types, plays an important role in the study of the nervous system. In particular, marker genes can be used to identify cell types to enable studies of their properties. Marker genes can also aid the interpretation of bulk tissue expression profiles by revealing cell type specific changes. We assembled a database, NeuroExpresso, of publicly available mouse brain cell type-specific gene expression datasets. We then used stringent criteria to select marker genes highly expressed in individual cell types. We found a substantial number of novel markers previously unknown in the literature and validated a subset of them using in silico analyses and in situ hybridization. We next demonstrate the use of marker genes in analysis of whole tissue data by summarizing their expression into “cell type profiles” that can be thought of as surrogates for the relative abundance of the cell types across the samples studied. Further analysis of our cell type-specific expression database confirms some recent findings about brain cell types along with revealing novel properties, such as Ddc expression in oligodendrocytes. To facilitate further use of this expanding database, we provide a user-friendly web interface for the visualization of expression data.


Introduction
Brain cells can be classified based on features such as their primary type (e.g. neurons vs. glia), location (e.g. cortex, hippocampus, cerebellum), electrophysiological properties (e.g. fast spiking vs. regular spiking), morphology (e.g. pyramidal cells, granule cells) or the neurotransmitter/neuromodulator they release (e.g. dopaminergic cells, serotonergic cells, GABAergic cells) (Petilla Interneuron Nomenclature Group et al., 2008). Marker genes, genes that are expressed in a specific subset of cells, are often used in combination with other cellular features to define different types of cells (Hu et al., 2014;Margolis et al., 2006) and facilitate their characterization by tagging the cells of interest for further studies (Handley et al., 2015;Lobo et al., 2006;Tomomura et al., 2001). Marker genes have also found use in the analysis of whole tissue "bulk" gene expression profiling data, which are challenging to interpret due to the difficulty of determining the source cell type(s) of an observed expression change. For example, a decrease in a transcript can indicate a regulatory change in the expression level of the gene, a decrease in the number of cells expressing the gene, or both. To address this issue, computational methods have been proposed to estimate cell-type specific proportion changes based on expression patterns of known marker genes (Chikina et al., 2015;Westra et al., 2015;Xu et al., 2013). Finally, marker genes are obvious candidates for having cell-type specific functional roles.
An ideal cell type marker has a strongly enriched expression in a single cell type in the brain.
However, this criterion can rarely be met, and for many purposes, cell type markers can be defined within the context of a certain brain region; thus, a useful marker may be specific for the cell type in one region but not necessarily in another region or brain-wide. For example, the calcium binding protein parvalbumin is a useful marker of both fast spiking interneurons in the cortex and Purkinje cells in the cerebellum (Celio and Heizmann, 1981;Kawaguchi et al., 1987). Whether the markers are defined brain-wide or in a region-specific context, the confidence in their specificity is established by testing their expression in as many different cell types as possible. This is important because a marker identified by comparing just two cell types might turn out to be expressed in a third, untested cell type, reducing its utility.
During the last decade, targeted cell specific purification followed by gene expression profiling has been applied to many cell types in the brain. Such studies, targeted towards well-characterized cell types, have greatly promoted our understanding of the functional and molecular diversity of these cells (Cahoy et al., 2008;Chung et al., 2005;Doyle et al., 2008). However, individual studies of this kind are limited in their ability to discover specific markers as they often analyse only a small subset of cell types (Shrestha et al., 2015;Okaty et al., 2009;Sugino et al., 2006) or have limited resolution as they group subtypes of cells together (Cahoy et al., 2008). Recently, advances in technology have enabled the use of single cell transcriptomics as a powerful tool to dissect neuronal diversity and derive novel molecular classifications of cells (Poulin et al., 2016). However, with single cell analysis the classification of cells to different types is generally done post-hoc, based on the clustering similarity in their gene expression patterns. These molecularly defined cell types are often uncharacterized otherwise (e.g. electrophysiologically, morphologically), challenging their identification outside of the original study and understanding their role in normal and pathological brain function. Regardless of the methodology used, technical biases and artefacts introduced during different stages of cellular purification (e.g. stress) and analysis (e.g. normalization) are known to have an impact on the obtained expression profiles (Handley et al., 2015;Okaty et al., 2011;Poulin et al., 2016). These effects can obscure true differences between two cell types or introduce disparities resulting from differences in cell state rather than cell type.
Thus, individual studies may not be ideal for identification of cell type markers. We hypothesized that by aggregating cell-type specific studies, a more comprehensive data set more suitable for marker genes could be derived.
Here we report the analysis of an aggregated cross-laboratory data set of cell-type specific expression profiling experiments from mouse brain. We use this data set to identify sets of brain cell marker genes more comprehensive than any previously reported. Using homology relations, we also show how the markers can be applicable to human brain cell types. We validate the markers in external datasets and demonstrate their potential usage in analysis of whole tissue data via the summarization of marker gene expression into "marker gene profiles" (MGPs), which can be cautiously interpreted as correlates of cell type proportion. We have made the cell type expression profiles and marker sets available to the research community at neuroexpresso.org. Figure 1A depicts the workflow and the major steps of this study. All analysis was performed in R version 3.1.2; the R code and data files are available from the authors.

Collection of cell type specific data sets:
We began with a collection of seven studies of isolated cell types from the brain, compiled by Okaty et. al. (2011). We expanded this by querying PubMed (http://www.ncbi.nlm.nih.gov/pubmed) and Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/) ( Barrett et al., 2013;Edgar et al., 2002) for cell type-specific expression datasets from the mouse brain that used Mouse Expression 430A Array (GPL339) or Mouse Genome 430 2.0 Array (GPL1261) platforms.
These platforms were our focus as together, they are the most popular platform for analysis of mouse samples and are relatively comprehensive in gene coverage, and using a reduced range of platforms reduced technical issues in combining studies. Query terms included names of specific cell types (e.g. astrocytes, pyramidal cells) along with blanket terms such as "brain cell expression" and "purified brain cells". Only samples derived from postnatal (> 14 days), wild type, untreated animals were included. Datasets obtained from cell cultures or cell lines were excluded due to the reported expression differences between cultured cells and primary cells (Cahoy et al., 2008;Halliwell, 2003;Januszyk et al., 2015)). We considered the use of single cell RNA sequencing data (Darmanis et al., 2015;Habib et al., 2016;Lake et al., 2016;Tasic et al., 2016;Zeisel et al., 2015).
However, so far, these studies have been confined to a small number of brain regions and establishing the correspondence between the cell types as classified in these studies and the commonly defined cell types in the pooled cell data is not trivial. We thus leave this as a matter for future work. For the present study we instead used single cell data for validation (described below).
As a first step in the quality control of the data, we manually validated that each sample expressed the gene that was used as a marker for the purification step (expression greater than median expression among all gene signals in the dataset), along with other well established marker genes for the relevant cell type (eg. Pcp2 for Purkinje cells, Gad1 for GABAergic interneurons). We next excluded contaminated samples, namely, samples expressing established marker genes of nonrelated cell types in levels comparable to the cell type marker itself (for example neuronal samples expressing high levels of glial marker genes), which lead to the removal of 21 samples. The final database contains 30 major cell types compiled from 24 studies (summarized in Error! Reference source not found.; a complete list of all samples including those removed is available from the authors).
Grouping and re-assignment of cell-type samples: When possible, samples were assigned to specific cell types based on the descriptions provided in their associated original publications. When expression profiles of closely related cell types were too similar to each other and we could not find differentiating marker genes meeting our criteria, they were grouped together into a single cell type ((e.g. A10 and A9 dopaminergic cells were grouped as "dopaminergic neurons").
Because our focus was on finding markers specific to cell types within a single brain region, samples were grouped based on the brain region from which they were isolated, guided by the anatomical hierarchy of brain regions ( Figure 1B). Brain subregions (e.g. locus coeruleus) were added to the hierarchy if there were multiple cell types represented in the subregion. An exception to the region assignment process is glial samples. Since these samples were only available only for cortex or cerebellum regions or extracted from whole brain, the following assignments were made: Cerebral cortex-derived astrocyte and oligodendrocyte samples were included in the analysis of other cerebral regions as well as thalamus, brainstem and spinal cord. Bergmann glia and cerebellum-derived oligodendrocytes were used in the analysis of cerebellum. The only microglia samples available had been isolated from whole brain homogenates and were included in the analysis of all brain regions.

Selection of cell type markers:
Cell type-enriched genes were selected for each brain region, based on fold change and clustering quality (see below). Since samples in the database originate from multiple independent studies, using different purification methods, they are expected to have study-specific effects that could potentially affect the gene selection process. The fold change between cell type of interest and the rest of the dataset was determined by taking the the median expression of all replicates and averaging the resulting values per cell type. The quality of clustering was determined by the mean silhouette coefficient (a measure of group dissimilarity ranged between -1 and 1 (Rousseeuw, 1987)), assigning all samples from the same cell type to one cluster and all remaining cell types to another. Silhouette coefficients were calculated with the "silhouette" function from the "cluster" R package version 1.15.3 (Maechler et al., 2016), using the expression difference of the gene between samples as the distance metric.
Based on these metrics, genes were selected for each brain region using the following criteria (this process was run on each gene):  A threshold expression level of 8 was selected to help ensure that the gene's transcripts will be detectable in bulk tissue. Theoretically, if a gene has an expression level of 8 in a cell type, and the gene is specific to the cell type, an expression level of 6 would be observed if 1/8 th of a bulk tissue is composed of the cell type. As many of the cell types in the database are likely to be as rare as or rarer than 1/8 th , and 6 is generally close to background for these data, we picked 8 as a lower level of marker gene expression.  If the log2 expression level in the cell type is above 9.5, there must be at least a 10-fold difference from the median expression level of the rest of the cells in the region.
 If log2 expression level in the cell type is below 9.5, the expression level must be higher than the expression level of every other cell type in that region. This condition was added because at 9.5 log2 expression level, for a 10-fold expression change to occur, the expression median should be lower than ~6 which is the median expression level of all genes in the dataset. Values below the median often represent the background signal that do not convey meaningful information but can prevent potentially useful marker genes from being selected.
 The mean silhouette coefficient for the gene must be higher than 0.5 for the associated cell type.
 The conditions above must be satisfied only by a single cell type in the region.
To prevent individual large studies from dominating the silhouette coefficient, we used the following randomization procedure, repeated 500 times. If studies representing same cell types did not have an equal number of samples, from each study, N samples are picked randomly. N being the number of samples in the smallest study that represent the same cell type. One-third (rounded up) of the remaining samples were randomly removed to ensure robustness against outlier samples resulting in the removal of 40%-65% of the samples depending on the brain region. A gene was retained as a marker if it satisfied the above criteria in more than 95% of the 500 resamplings.

Microglia enriched genes:
Microglia expression profiles differ significantly between activated and inactivated states and to our knowledge, the samples in our database represent only the inactive state (Holtman et al., 2015). In order to acquire marker genes with stable expression levels regardless of microglia activation state, we removed the genes differentially expressed in activated microglia based on Holtman et al. (2015). This step resulted in removal of 359 out of the original 629 microglial genes. S100a10 + pyramidal cell enriched genes: The paper (Schmidt et al., 2012) describing the cortical S100a10 + pyramidal cells emphasizes the existence of non-neuronal cells expressing S100a10 + . Schmidt et al. therefore limited their analysis to 7853 genes specifically expressed in neurons and advised third-party users of the data to do so as well. In compliance, we removed the selected marker genes for S100a10 + pyramidal cells if they were not in the list of genes provided in Schmidt et al. (2012). Of note, we also removed S100a10 itself since based on the author's description it was not specific to this cell type. In total, 40 of the 51 S100a10 pyramidal genes were removed in this manner.

In situ hybridization
Male C57BL/6J mice aged 13-15 weeks at time of sacrifice were used (n=5). Mice were euthanized by cervical dislocation and then the brain was quickly removed, frozen on dry ice, and stored at -80°C until sectioned via cryostat. Brain sections containing the sensorimotor cortex were cut along the rostral-caudal axis using a block advance of 14 μm, immediately mounted on glass slides and dried at room temperature (RT) for 10 minutes, and then stored at -80°C until processed using multi-label fluorescent in situ hybridization procedures.
Fluorescent in situ hybridization probes were designed by Advanced Cell Diagnostics, Inc.
(Hayward, CA, USA) to detect mRNA encoding Cox6a2, Slc32a1, and Pvalb. Two sections per animal were processed using the RNAscope® 2.5 Assay as previously described (Wang et al., 2012). Briefly, tissue sections were incubated in a protease treatment for 30 minutes at RT and then the probes were hybridized to their target mRNAs for 2 hours at 40°C. The sections were exposed to a series of incubations at 40°C that amplifies the target probes, and then counterstained with NeuroTrace blue-fluorescent Nissl stain (1:50; Molecular Probes) for 20 minutes at RT. Cox6a2, Pvalb, and Slc32a1 were detected with Alexa Fluor® 488, Atto 550 and Atto 647, respectively.
Data were collected on an Olympus IX83 inverted microscope equipped with a Hamamatsu Orca-Flash4.0 V2 digital CMOS camera using a 60x 1.40 NA SC oil immersion objective. The equipment was controlled by cellSens (Olympus). 3D image stacks (2D images successively captured at intervals separated by 0.25 μm in the z-dimension) that are 1434 x 1434 pixels (155.35 μm x 155.35 μm) were acquired over the entire thickness of the tissue section. The stacks were collected using optimal exposure settings (i.e., those that yielded the greatest dynamic range with no saturated pixels), with differences in exposures normalized before analyses.
Laminar boundaries of the sensorimotor cortex were determined by cytoarchitectonic criteria using NeuroTrace labeling. Fifteen image stacks across the gray matter area spanning from layer 2 to 6 were systematic randomly sampled using a sampling grid of 220 x 220 μm 2 , which yielded a total of 30 image stacks per animal. Every NeuroTrace labeled neuron within a 700 x 700 pixels counting frame was included for analyses; the counting frame was placed in the center of each image to ensure that the entire NeuroTrace labeled neuron was in the field of view. The percentage (± standard deviation) of NeuroTrace labeled cells containing Cox6a2 mRNA (Cox6a2+) and that did not contain Slc32a1 mRNA (Slc32a1-), that contained Slc32a1 but not Pvalb mRNA (Slc32a1+/Pvalb-), and that contained both Slc32a1 and Pvalb mRNAs (Slc32a1+/Pvalb+) were manually assessed.

Allen Brain Atlas in situ hybridization (ISH) data:
We downloaded in situ hybridization (ISH) images using the Allen Brain Atlas API (http://help.brainmap.org/display/mousebrain/API). Assessment of expression patterns was done by visual inspection.

Single cell analysis:
Mouse cortex single cell RNA sequencing data were acquired from Zeisel et al. (2015)  pre-processed expression data were encoded in a binary matrix with 1 representing any nonzero value. For all marker gene sets, Spearman's ρ was used to quantify internal correlation. A null distribution was estimated by calculating the internal correlation of 1000 randomly-selected prevalence-matched gene groups. Gene prevalence was defined as the total number of cells with a non-zero expression value for the gene. Prevalence matching was done by choosing a random gene with a prevalence of +/-2.5% of the prevalence of the marker gene. P-values were calculated by comparing the internal correlation of marker gene set to the internal correlations of random gene groups using Wilcoxon rank-sum test.

Pre-processing of microarray data:
All microarray data used in the study were pre-processed and normalized with the "rma" function of the "oligo" or "affy" (Carvalho and Irizarry, 2010) R packages. Probeset to gene annotations were obtained from Gemma (Zoubarev et al., 2012) (http://chibi.ubc.ca/microannots). Any probeset with maximal expression level lower than the median among all probeset signals was removed. Of the remaining probesets that represent the same gene, the one with the highest variance among different cell types was selected for further analysis.
Cell type specific samples that make up the NeuroExpresso database were processed by a modified version of the "rma" function that enabled collective processing of data from Mouse

Estimation of marker gene profiles (MGPs):
For each cell type relevant to the brain region analysed, we used the first principal component of the corresponding marker gene set expression as a surrogate for cell type proportions. Principal component analysis was performed using the "prcomp" function from the "stats" R package, using the "scale = TRUE" option. Since the sign of the loadings of the rotation matrix is arbitrary, we multiplied the loadings by -1 if a majority of the loadings were negative. It is plausible that some marker genes will be transcriptionally regulated under different conditions (e.g. disease state), reducing the correspondence between their expression level with the relative cell proportion. A gene that is regulated will have reduced correlation to the other marker genes with expression levels primarily dictated by cell type proportions which will reduce their loading in the first principal component. To reduce the impact of regulated genes on the estimation process, we removed marker genes from a given analysis if their loadings were negative when calculated based on all samples in the dataset. For visualization purposes, the scores were normalized to the range 0-1.
Two sided Wilcoxon rank-sum test ("wilcox.test" function from the "stats" package in R, default options) was used to compare between the different experimental conditions. For estimations of cell type MGPs in samples from frontal cortex and white matter from the Trabzuni et al., (2013) study, results were subjected to multiple testing correction by the Benjamini & Hochberg method (Benjamini and Hochberg, 1995). For SMRI collection of psychiatric patients we estimated oligodendrocytes MGPs based on expression data available through SMRI website (as indicated above) and compared our results to experimental cell counts from the same cohort of subjects previously reported by Uranova et al. (2004). Figure 5B representing the oligodendrocyte cell counts in each disease group was adapted from Uranova et al. (2004). The data to re-create the figure was extracted using WebPlotDigitizer (http://arohatgi.info/WebPlotDigitizer/app/) by Ankit Rohatgi.

Compilation of a brain cell type expression database
A key input to our search for marker genes is expression data from purified brain cell types.
Expanding on work from Okaty et al., (2011), we assembled and curated a database of cell typespecific expression profiles from published data (see Methods). The database ("NeuroExpresso") represents 30 major cell types from 12 brain regions (Sample count -number of samples that representing the cell type; Gene count ) and a total of 263 samples from 23 published studies, and includes the expression profiles of 11509 genes. We used rigorous quality control steps to identify contaminated samples and outliers (see Methods). All cell types with the exception of ependymal cells are represented by at least 3 replicates and 8/30 cell types are represented by multiple independent studies (Error! Reference source not found.). The database is in constant growth as more cell type data becomes available. As more RNAseq data from pooled and single brain cell types will become available, these data will be added to our database. To facilitate access to the data and allow basic analysis we provide a simple search and visualization interface on the web, www.neuroexpresso.org ( Figure 1C). The app provides means of visualising gene expression in different brain regions based on the cell type, study or methodology, as well as differential expression analysis between groups of selected samples.

Identification of cell type enriched marker gene sets
We used the NeuroExpresso data to identify marker gene sets (MGS) for each of the 30 cell types.
An individual MGS is composed of genes highly enriched in a cell type in the context of a brain region (Error! Reference source not found.A). Marker genes were selected based on a) fold of change relative to other cell types in the brain region and b) a lack of overlap of expression levels in other cell types (see Methods for details). This approach captured previously known marker genes (e.g. TH for dopaminergic cells (Pickel et al., 1976), Tmem119 for microglia (Bennett et al., 2016)) along with numerous new candidate markers such as Cox6a2 for fast spiking parvalbumin (PV) + interneurons. Some commonly used marker genes, as well as marker genes previously reported by individual studies included in the database, were not selected by our analysis. For example, we found that the commonly used oligodendrocyte marker Olig1 is also highly expressed in the NeuroExpresso astrocyte samples, and thus it was not selected as an oligodendrocyte marker ( Figure 2B). Similarly, Fam114a1 (9130005N14Rik), identified as a marker of fast spiking basket cells by Sugino et al. (2006), is also highly expressed in oligodendrocytes and a subtype of pyramidal cells ( Figure 2B). These cell types were not available in the Sugino et al. (2006) study, and thus the lack of specificity of Fam114a1 could not be observed by the authors. In total we identified 1149 marker genes, with 2-140 markers per cell type (Error! Reference source not found.). The next sections focus on verification and validation of our proposed markers, using multiple methodologies.

Verification of markers by in-situ hybridization
Two cell types in our database (Purkinje cells of the cerebellum and hippocampal dentate gyrus granule cells) are organized in well-defined anatomical structures readily identified in tissue sections. We exploited this fact to use in-situ hybridization data from the Allen Brain Atlas (http://mouse.brain-map.org) (Sunkin et al., 2013) to verify co-localization of known and novel markers for these two cell types. There was general agreement (19/26 of dentate granule cell markers 27/38 of Purkinje cell markers) that the markers were correctly localized to the corresponding brain structures. Figure 2C shows representative examples for the two cell types.
We independently verified Cox6a2 as a marker of cortical fast spiking PV+ interneurons using triple label in situ hybridization of mouse cortical sections for Cox6a2, Pvalb and Slc32a1 (a pan-GABAergic neuronal marker) transcripts. As expected, we found that approximately 25% of all

Verification of marker gene sets in single-cell RNAseq data
As a further independent validation of our marker gene signatures, we analysed their properties in recently published single cell RNAseq datasets derived from mouse cortex Zeisel et al., 2015) and human cortex (Darmanis et al., 2015). In these studies, the authors used post hoc clustering to propose novel cell types (Poulin et al., 2016). We could not directly compare our MGSs to markers of these cluster-defined cell types because their correspondence to the cell types in NeuroExpresso was not clear. However, since these studies analyse a large number of cells, they are likely to include cells that correspond to the cortical cell types in our database. Thus if our MGSs are cell type specific, and the corresponding cells are present in the single cell data sets, MGS should have a higher than random rate of being co-detected in the same cells, relative to non-marker genes. A weakness of this approach is that a failure to observe a correlation might be due to absence of the cell type in the data set rather than a true shortcoming of the markers.
Overall, all MGSs were successfully validated in at least one of the single cell datasets (p<0.05; Error! Reference source not found.). More MGSs were significantly coexpressed in mouse single cell datasets  MGSs) than in the human dataset . While this difference might indicate inter-specie differences, it might be merely an outcome of the substantially smaller number of cells in Darmanis et al dataset, where less abundant cell types may not have been adequately sampled (Table 2).

Neuroexpresso as a tool for understanding the biological diversity and similarity of brain cells
One of the applications of NeuroExpresso is an exploratory tool for exposing functional and biological properties of cell types. For example, high expression of genes involved in GABA synthesis and release (Gad1, Gad2 and Slc32a1) in forebrain cholinergic neurons suggests the capability of these cells to release GABA in addition to their cognate neurotransmitter acetylcholine ( Figure 4A). Indeed, co-release of GABA and acetylcholine from forebrain cholinergic cells was recently demonstrated by Saunders et al. (2015). Similarly, the expression of the glutamate transporter Slc17a6, observed in thalamic (habenular) cholinergic cells suggests co-release of glutamate and acetylcholine from these cells, supported by experimental findings by Ren et al. (2011)) ( Figure 4A).
A closer analysis of the data revealed interesting expression patterns suggesting previously unknown characteristics of several cell types. For example, one of the novel oligodendrocyte markers identified by our analysis is Ddc (encoding dopa decarboxylase, an enzyme involved in biosynthesis of monoamines) ( Figure 4B). Expression of Ddc in oligodendrocytes is consistently high in three studies using different purification techniques and selection markers ( Figure 4B), reducing the possibility that this finding results from contamination or a technical artefact. We also observed an unexpected bimodality of gene expression patterns of midbrain dopaminergic cells taken from two studies using similar purification method ( Figure 4C), suggesting that the studies represent different sub-populations of this cell type.
Lastly, we found overlap between several markers of spinal cord and brainstem cholinergic cells, and midbrain noradrenergic cells, suggesting previously unknown functional similarity between cholinergic and noradrenergic cell types. The common markers included Chodl, Calca, Cda and Hspb8, shown to be expressed in brainstem cholinergic cells (Enjin et al., 2010) and Phox2b, a known marker of noradrenergic cells (Pattyn et al., 1997).

Marker Gene Profiles can be used to infer changes in cellular proportions in the brain
Marker genes are by definition cell type specific, and thus, changes in their expression observed in bulk tissue data might represent either changes in the number of cells or cell type specific transcriptional changes (or a combination). Expression profiles of marker genes (summarized as the first principal component of their expression, see Methods), can be treated as a surrogate for relative proportion of the corresponding cell type across the samples (Chikina et al., 2015;Westra et al., 2015;Xu et al., 2013). Marker genes of four major classes of brain cell types (namely neurons, astrocytes, oligodendrocytes and microglia) were previously used to gain cell type specific information from brain bulk tissue data (Bowling et al., 2016;Kuhn et al., 2011;Sibille et al., 2008;Skene and Grant, 2016;Tan et al., 2013b). Importantly, since marker genes are likely to correspond to the functional identities of the cell types, it is plausible for their expression to be conserved across different species.
In order to validate the use of MGPs as surrogates for relative cell type proportions, we used bulk tissue expression data from conditions with known changes in cellular proportions. Though we were not able to find a single human brain expression study with complementary cell count data, we were able to work with datasets with known cell type proportion differences.
Firstly, we calculated MGPs for human white matter and frontal cortex using data collected by Trabzuni et al., (2013). Comparing the MGPs in white vs. grey matter, we observed the expected increase in oligodendrocyte MGP, as well as increase in astrocyte and microglia MGPs, corroborating previously reported higher number of these cell types in white vs. grey matter (Ogura et al., 1994;Williams et al., 2013). We also observed decrease in MGPs of all neurons, corroborating the common knowledge of decreased neuronal density in white vs. grey matter ( Figure 5A).
We then identified a gene expression dataset from brains of psychiatric patients (study 2 from SMRI microarray database, see methods section) and a separate study involving experimental cell counts of oligodendrocytes from similar brain region in the same cohort of subjects (Uranova et al., 2004). We thus calculated oligodendrocyte MGPs based on the expression data and compared the results to experimental cell counts from Uranova et. al. (2004). Our results demonstrate high similarity between oligodendrocyte MGPs and experimental cell counts at a group level ( Figure   5B). Direct comparison between MGP and experimental cell count at a subject level was not possible, as Uranova et al. did not provide the subject identities corresponding to each of the cell count values.
To further assess and demonstrate the ability of MGPs to correctly represent cell type specific changes in neurological conditions, we calculated dopaminergic profiles of substantia nigra samples in two data sets of Parkinson's disease (PD) patients and controls from Moran et al. (2006) (GSE8397) and Lesnick et al. (2007) (GSE7621). We tested whether the well-known loss of dopaminergic cells in PD could be detected using our MGP approach. As expected, in both datasets MGP analysis correctly identified reduction in dopaminergic cells in substantia nigra of Parkinson's disease patients ( Figure 5C). Moran et al. (2006) identified 22 genes consistently differentially expressed in PD patients in their study and in earlier study of PD patients (Zhang et al., 2005), and suggested that these genes can be considered as a "PD expression signature". We hypothesized that differential expression of some of these genes might be better explained by a decrease in dopaminergic cells rather than intracellular regulatory changes. In order to address this issue we performed a principal component analysis of the 22 genes emphasized in Moran et al. (2006)

Cell type specific expression database as a resource for neuroscience
We present NeuroExpresso, a rigorously curated database of cell type specific gene expression data from pooled cell types (www.neuroexpresso.org). To our knowledge, this is the most comprehensive database of brain cell type expression data. The database will be expanded as more data become available and in the future will integrate RNAseq data from pooled and single cell samples.
NeuroExpresso allows simultaneous examination of gene expression numerous cell types across different brain regions. This approach promotes discovery of cellular properties that might have otherwise been unnoticed or overlooked when using gene-by-gene approaches or pathway enrichment analysis. For example, a simple examination of expression of genes involved in biosynthesis and secretion of GABA and glutamate, suggested the co-release of these neurotransmitters from forebrain and habenular cholinergic cells, respectively.
Studies that aim to identify novel properties of cell types can benefit from our database as an inexpensive and convenient way to seek novel patterns of gene expression. For instance, our database shows significant bimodality of gene expression in dopaminergic cell types from the midbrain ( Figure 4C). The observed bimodality might indicate heterogeneity in the dopaminergic cell population, which could prove a fruitful avenue for future investigation. Another interesting finding from NeuroExpresso is the previously unknown overlap of several markers of motor cholinergic and noradrenergic cells (Error! Reference source not found.). While the overlapping markers were previously shown to be expressed in spinal cholinergic cells, to our knowledge their expression in noradrenergic (as well as brain stem cholinergic) cells was previously unknown.
NeuroExpresso can be also used to facilitate interpretation of genomics and transcriptomics studies. Recently (Pantazatos et al., 2016) used an early release of the databases to interpret expression patterns in the cortex of suicide victims, suggesting involvement of microglia. Moreover, this database has further applications beyond the use of marker genes, such as interpreting other features of brain cell diversity, like in electrophysiological profiles (Tripathy et al., in preparation).
Importantly, NeuroExpresso is a cross-laboratory database. A consistent result observed across several studies raises the certainty that it represents a true biological finding rather than merely an artefact or contamination with other cell types. This is specifically important for findings such as the expression of Ddc in oligodendrocytes ( Figure 4B). This surprising result is suggestive of a previously unknown ability of oligodendrocytes to produce monoamine neurotransmitters upon exposure to appropriate precursor, as previously reported for several populations of cells in the brain (Ren et al., 2016;Ugrumov, 2013). Alternatively, this finding might indicate a previously unknown function of Ddc.

Validation of cell type markers
In situ hybridization results that we generated and those made available by the Allen Institute for Brain Sciences (Sunkin et al., 2013) validated a subset of our results (Cox6a2 as a marker of fast spiking basket cells and Purkinje and dentate granule cell markers). Further validation was performed with computational methods in independent single cell datasets from mouse and human. This analysis revealed some limitations in generalizing mouse markers identified in mouse cell types to a human context. The differences between mouse and human datasets might indicate evolutionary changes in the function of the homologous genes or functional differences between the human and mouse cell types (Shay et al., 2013;Zhang et al., 2016). However, since most MGSs did validate between mouse and human data, it suggests that most marker genes preserve their specificity despite cross-species gene expression differences. An additional caveat with using single cell data sets for validation is that rare cell types may not have been adequately sampled. In there is no substantial difference in the total number of cells between the two mouse datasets, the cre-line based sampling approach used by Tasic et al was designed to target relatively rare neuron types . As sample sizes for these single cells datasets continue to grow, and in particular those from human brain tissue, we expect that it will be easier to directly resolve this potential inconsistency.

Estimation of marker gene profiles in bulk tissue
Marker genes can assist with the interpretation of bulk tissue data in the form of marker gene profiles (MGPs). A parsimonious interpretation of a change in an MGP is a change in the relative abundance of the corresponding cell type, when compared to other samples. Because our approach focuses on the overall trend of MGS expression levels, it should be relatively insensitive to within-cell-type expression changes for a subset of genes. Still, we prefer to refer to "MGP expression" rather than "cell type proportions".
Our results show that MGPs based on NeuroExpresso marker gene sets (MGSs) can reliably recapitulate relative changes in cell type abundance across different conditions. Direct validation of cell count estimation based on MGSs in human brain was not feasible due to the unavailability of cell counts coupled with expression data. Instead, we compared oligodendrocyte MGPs based on a gene expression dataset available through the SMRI database to experimental cell counts taken from a separate study (Uranova et al., 2004) of the same cohort of subjects. While we were unable to match the samples one by one between the two studies, we show high similarity between oligodendrocyte MGPs and cell counts across the experimental groups.
Based on analysis of dopaminergic MGPs we were able to capture the well-known reduction in dopaminergic cell types in PD patients. Moreover, we show that multiple genes previously reported as differentially expressed in PD are highly correlated with dopaminergic MGPs in both control and PD subjects. This high correlation suggests that the observed differential expression of these genes might merely reflect changes in dopaminergic cell populations rather than PD related regulatory changes

Limitations and caveats
The NeuroExpresso database was assembled from multiple datasets, based on different mouse strains and cell type extraction methodologies. We attempted to address some of the inter-study variability by combined pre-processing of the raw data and quantile normalization of the probesets for each sample in the database. However, due to insufficient overlap between cell types represented by different studies, many of the potential confounding factors such as age, sex and methodology could not be explicitly corrected for. Thus, it is likely that some of the expression values in NeuroExpresso may be affected by confounding factors. While our confidence in the data is increased when expression signals are robust across multiple studies, many of the cell types in NeuroExpresso are represented by a single study. Hence, we advise that small differences in expression between cell types as well as previously unknown expression patterns based on a single data source should be treated with caution. In our analyses, we address these issues by enforcing a stringent set of criteria for the marker selection process, reducing the impact of outlier samples and ignoring small changes in gene expression.
An additional limitation of our study is that the representation for many of the brain cell types is still lacking in the NeuroExpresso database. Therefore, despite our considerable efforts to ensure cell type-specificity of the marker genes, we cannot rule out the possibility that some of them are also expressed in one or more of the non-represented cell types.
In summary, we believe that NeuroExpresso is an extremely valuable resource for neuroscientists.
We identified numerous novel markers for 30 major cell types and used them to estimate cell type profiles in bulk tissue data, demonstrating high correlation between our estimates and experimentbased cell counts. This approach can be used to reveal cell type specific changes in whole tissue samples and to re-evaluate previous analyses on brain whole tissues that might be biased by cell type-specific changes. Information about cell type-specific changes is likely to be very valuable since conditions like neuron death, inflammation, and astrogliosis are common hallmarks of in neurological diseases.