Gray Matter Features of Reading Disability: A Combined Meta-Analytic and Direct Analysis Approach1234

Abstract Meta-analysis of voxel-based morphometry dyslexia studies and direct analysis of 293 reading disability and control cases from six different research sites were performed to characterize defining gray matter features of reading disability. These analyses demonstrated consistently lower gray matter volume in left posterior superior temporal sulcus/middle temporal gyrus regions and left orbitofrontal gyrus/pars orbitalis regions. Gray matter volume within both of these regions significantly predicted individual variation in reading comprehension after correcting for multiple comparisons. These regional gray matter differences were observed across published studies and in the multisite dataset after controlling for potential age and gender effects, and despite increased anatomical variance in the reading disability group, but were not significant after controlling for total gray matter volume. Thus, the orbitofrontal and posterior superior temporal sulcus gray matter findings are relatively reliable effects that appear to be dependent on cases with low total gray matter volume. The results are considered in the context of genetics studies linking orbitofrontal and superior temporal sulcus regions to alleles that confer risk for reading disability.


Introduction
Reading disability affects at least 5-15% of the U.S. population (Shaywitz et al., 1990;Katusic et al., 2001;Yoshimasu et al., 2010) for which there are varied patterns of oral and written language problems (Torppa et al., 2007;Archibald et al., 2013;McArthur et al., 2013). There has been progress toward understanding the neuroanatomical bases for reading disability (Vandermosten et al., 2012;Richlan et al. 2013), but small sample sizes and uncontrolled demographic factors limit interpretation of reading disability findings. Meta-analysis and direct analysis of large multisite datasets can help to move beyond these limitations to advance our understanding of reading disability.
Meta-analyses of findings from voxel-based gray matter studies of reading disability (Linkersdörfer et al., 2012;Richlan et al., 2013) suggest that people with reading disability exhibit lower gray matter volume at the posterior end of the Sylvian fissure, the fusiform gyrus, and cerebellar regions where gross volumetric and/or atypical asymmetry effects have also been observed (Leonard et al., 2001;Rae et al., 2002;Fernandez et al., 2013). Each of these brain regions has been linked to reading and/or reading disability in functional imaging studies (Maisog et al., 2008;Paulesu et al., 2014) and are targets for understanding reading disability.
Integrating existing data across research sites can address the challenges noted above by increasing statistical power, but can be limited when investigators use different recruitment strategies. Children with dyslexia are typically included in studies if they exhibit impaired phonological processing based on pseudoword decoding and real word reading measures (Fletcher, 2009). Impairments in these core reading skills are also likely in children with specific language impairment or oral and written language learning disability (OWL LD) who exhibit problems in listening comprehension, reading comprehension, oral expression, and/or written expression (Silliman and Berninger, 2011). Distinguishing behavioral features of OWL LD are not always measured (Pennington and Bishop, 2009) and this increases the likelihood that reading disability samples include children with OWL LD and dyslexia, especially when participants are recruited from clinical settings (Catts et al., 2005).
Children with OWL LD often have below average verbal comprehension, (Silliman and Berninger, 2011) and low brain volume (specific language impairment; Leonard et al., 2006;Girbau-Massana et al., 2014). Low brain volume in children with low verbal comprehension appears be expressed before children learn to read and differentiates children with and without receptive language impairment within a reading disability sample (Kibby et al., 2009). In addition, there is evidence of below average head circumference at birth in children with specific language impairment compared to controls (Whitehouse et al., 2012) and a positive association between infant variation in brain size and childhood intelligence (Gale et al., 2006). Together, these findings suggest that verbal comprehension and brain volume can be used to characterize evidence for OWL LD in reading disability samples.
The current multisite study included data from research sites where verbal comprehension was measured so that the impact of verbal comprehension and brain volume on reading disability group differences could be examined. We addressed this and the other methodological concerns noted above to test the hypothesis that brain regions supporting reading are atypical in children with reading disability (Richlan, 2012). An integrated metaanalytic and direct voxel-based analysis of multisite data provided increased statistical power with the goal of establishing reliable predictors of reading disability that are targets for understanding the etiologies and treatment of reading disability.

Voxel-based meta-analysis
A Signed Differential Mapping (SDM v4.13; Radua et al., 2012) meta-analysis was performed to identify consistent voxel-based gray matter effects across reading disability studies (Table 1). This analysis included 58 peak coordinates that were weighted by the statistical thresholds used to identify the reported effects from comparisons of 218 control and 244 dyslexia cases across eleven publications (Brown et al., 2001;Brambati et al., 2004;Eckert et al., 2005;Silani et al., 2005;Vinckenbosch et al., 2005;Kronbichler et al., 2008;Menghini et al., 2008;Steinbrink et al., 2008;Raschle et al., 2011;Jednoróg et al., 2013;Tamboer et al., 2015). These papers were identified based on PubMed searches for dyslexia and reading disability voxel-based gray matter studies published through the end of April 2015 and included a study of 5-to 6-year-old 3 Please see www.dyslexiadata.org for more information about and contributors to the Dyslexia Data Consortium. children who were and were not at risk for dyslexia (Raschle et al., 2011). Studies involving cases with impaired learning of a logographic written language system were excluded. Studies by Hoeft et al. (2007) and Krafnick et al. (2014) were also excluded from the meta-analysis because the principal investigator from these studies contributed data to this project. Excluding these two studies from the meta-analysis ensured that the meta-and direct data analyses described below were performed with completely independent datasets. SDM was used because the method incorporates the effect sizes from studies in the meta-analysis. The statistical thresholds, determined based on the reported statistics in each study, ranged from p Ͻ 0.05 to p Ͻ 0.00004 (uncorrected) for the result with the smallest effect size reported in each study. Conversion between reported t scores, p values, and Z-scores was performed using the SDM "Statistics converter" functions (http://www.sdmproject.com/utilities) for all studies except one (Steinbrink et al., 2008) that used a false discovery rate (FDR)corrected threshold and did not report group difference statistics. We assumed a large effect size for this study (p Ͻ 0.0001) given the relatively small sample size (n ϭ 16) and significant effects for the FDR p Ͻ 0.05 threshold.
The meta-analysis included 500 permutations and a default 20 mm isotropic kernel parameter. These methods are consistent with those used by Richlan et al. (2013), who used a p Ͻ 0.005 uncorrected statistical threshold for defining consistently significant effects across studies. The results below are presented with p Ͻ 0.005 and p Ͻ 0.001 uncorrected thresholds, with the p Ͻ 0.001 threshold used to define regions-of-interest (ROIs) for the direct analysis of the multisite sample data.

Multisite samples for gray matter analyses
Data from six neuroimaging studies designed to examine reading disability and control group differences in brain structure and/or function were collected as part of a larger study on developing methods for multisite studies. Identification of the reading disability cases across contributing sites included recruitment from clinical settings, schools specializing in dyslexia remediation, and the local community. Institutional Review Board approval was obtained to receive each de-identified dataset at the Medical University of South Carolina and approval for sharing de-identified data were obtained at the institution of each contributing site. The research presented here was performed in accordance with the provisions of the Declaration of Helsinki. Summary statistics for demographic and reading performance information within and across research sites are presented in Table 2.

Multisite behavioral assessment
Children in each sample were administered real word identification (Letter-Word Identification), pseudoword decoding (Word Attack), and reading comprehension (Passage Comprehension cloze task) subtests from the Woodcock-Johnson IIIR (WJIII; 3 sites; Woodcock et al., 2001) and Woodcock Reading Mastery Tests (3 sites; Woodcock, 1987). Verbal comprehension was assessed using the Wechsler Intelligence Scales for Children (WISC; 1 site; Wechsler, 2004) and the Wechsler Abbreviated Scales of Intelligence (2 sites; Wechsler, 1999). The WISC Vocabulary subtest was administered at five sites. The WISC Similarities subtest was administered at four sites. The WISC Information and Comprehension subtests were administered at two sites. Finally, the Verbal Comprehension score was obtained at one site from an off-site neuropsychologist who referred the children to the study but did not provide subtest scores like Information, Vocabulary, Similarities, and Comprehension that contribute to the overall Verbal Comprehension Factor or Index Score.
There were differences in mean performance between sites for the standardized behavioral scores that appear to reflect site differences (Table 2), rather than test version. For example, word identification performance was significantly different between reading disability groups from two sites where the WJIII was administered (t (63) ϭ 4.67, p Ͻ 0.001). The pattern and severity of impaired reading skills therefore appeared to differ across sites. There was, however, consistently lower real word reading scores than pseudoword reading among reading disability cases across sites (t (160) ϭ Ϫ10.36, p Ͻ 0.001). This difference in real word relative to pseudoword reading impairments may indicate the presence of children with OWL LD in the samples (Berninger, 2008;Berninger and May, 2011, their  Thresholds are reported as uncorrected thresholds based on the reported t or Z-scores, except for^three studies for which only the statistical threshold was reported. Ages were rounded or ϳ estimated from the mean age of multiple groups or reported as a range if mean ages were not provided. Table 2) and motivated the analyses designed to control for potential OWL LD effects. Table 3 presents the acquisition parameters that were used to collect T1-weighted images for 306 cases across the six sites. The VBM8 Toolbox image covariance function (http://dbm.neuro.uni-jena.de/vbm/download) and visual inspection were used to identify and exclude 13 cases with poor image quality, resulting in 293 cases for analysis. Each of the images was rigidly transformed into anterior commissure-posterior commissure alignment. The images were then bias field-corrected using SPM8 and denoised using a non-local means filter that estimated and removed Rician noise in the images (Manjón et al., 2010). Probabilities for gray matter, white matter, and CSF in each voxel were estimated using the SPM8 New Segment algorithm. The gray matter probability images were spatially transformed into a study-specific normalized space using a diffeomorphic normalization procedure (SPM, DARTEL default settings;Ashburner, 2007). Each image was modulated to adjust for volumetric change during normalization and smoothed with an 8 mm Gaussian kernel. The normalized, modulated, and smoothed images were used to: (1) obtain average gray matter volume from within the meta-analysis ROI, and (2) perform exploratory voxel-based gray matter analyses. In summary, these methods were chosen to be consistent with commonly used voxel-based morphometry protocols (Kurth et al., 2015).

Multisite structural imaging data
The meta-analysis results were transformed from MNI coordinate space into the study-specific space of the DARTEL template using the default nonlinear normalization procedure in SPM. Specifically, the SPM a priori gray matter image was normalized to the DARTEL gray matter template and the normalization parameters were then applied to the SDM meta-analysis results image. MRI-CRon (http://www.mccauslandcenter.sc.edu/mricro/mricron) was then used to define the space of each SDM cluster or ROI in order to collect each case's average smoothed gray matter volume from within each ROI using MarsBaR (Brett et al., 2002). These meta-analysis and image processing steps generated 3 gray matter volume ROI variables that were correlated with the 4 behavioral measures and reading groups after controlling for age, age-squared, gender, site, and total gray matter volume. RD, reading disability; VC, verbal comprehension; WID, word identification; WA, word attack; PC, passage comprehension. †This site had complete verbal comprehension missingness. Younger controls were recruited to have reading level matches for the reading impaired cases and this produced a group difference in age (p Ͻ 0.05) that remained significant after discriminant function assignment of cases to reading groups. Gender was missing for 1 case. ‡ This site had complete passage comprehension missingness. Significant group differences were observed for all variables between reading disability and control groups (p Ͻ 0.001), with the exception of age (n.s.). There were no significant differences in the distribution of gender within sites or across the multisite sample before or after discriminant function analysis selection of cases. The standardized scores are age-normed values. The estimate of total gray matter volume was derived by summing the gray matter probability values from each participant's native space segmented gray matter image. Different scanners and acquisition protocols can differentially affect the gray to white matter contrast in T1weighted images and can therefore affect image processing and voxel-based results. We controlled for differences in T1-weighted contrast across research sites by using research site dummy variables in the analyses described below (Walhovd et al., 2011). These dummy variables together accounted for 95% of the variance for the gray to white matter ratio (GWR) that was collected from the denoised and bias field-corrected T1-weighted images. The GWR measure was obtained by normalizing the original images into the coordinate space of the studyspecific template using the DARTEL normalization parameters, using MarsBaR to collect and average the gray or white matter values from across regions that had at least 50% gray or white matter probabilities, and then dividing these two values. We chose to use a dummy variable to represent each site in the statistical analyses described below rather than the GWR measure because the dummy variables could also capture additional potential differences between the sites that were not observable (eg, recruitment strategies).

Statistical analyses Missingness
Multiple imputation (Rubin, 1996;Parker and Schenker, 2007;Vaden et al., 2012) was used to deal with behavioral data missingness within and across sites (percentage missingness across the multisite sample: Word Attack, 1%; Word Identification, 1%; Passage Comprehension, 14%; Verbal Comprehension, 25%). It is important to deal with missingness rather than use an available case analysis (only cases with data for all variables) because complete case analysis can produce biased results (Rubin, 1996). Specifically, available case analysis produces false negative results (eg, simulation results shown by Vaden et al., 2012, their Fig.5).
Multiple imputation was developed for dealing with missing data that are conditional on the observed data or missing at random (MAR; Little and Rubin, 2002). Although it is difficult to prove MAR, predictors of missingness from the observed data can provide support for the assumption of MAR in the use of multiple imputation with the additional assumption that missingness does not depend on unobserved data. We identified multiple predictors of missingness in our data. For example, children with higher Passage Comprehension scores were more likely to have missing Verbal Comprehension data (r ϭ 0.23, p ϭ 0.0003) and children with higher Word Identification, Word Attack, and Verbal Comprehension were more likely to have missing Passage Comprehension scores (r ϭ 0.17, p ϭ 0.004; r ϭ 0.14, p Ͻ 0.043; r ϭ 0.14, p ϭ 0.043, respectively). Given these associations, we also reasoned that Verbal Comprehension and Passage Comprehension missingness were not because of an intellectual or physical inability to perform the Verbal Comprehension tests [a condition for missing not at random (MNAR)] because the children were able to perform the reading-related tests. Therefore, missingness appeared to be MAR rather than MNAR and appropriately addressed using multiple imputation (Rubin, 1996).
Strong predictors of the observed data were also identified and used to inform the multiple imputation. For example, 78% of the variance in the observed Passage Comprehension and 50% of the variance in the observed Verbal Comprehension data were explained by variables in the multiple imputation model listed below. Monte Carlo Markov Chain imputation (Schafer, 1997) with predictive mean matching was performed with SPSS (v22) using variables that: (1) explained variance in the observed data for the variables with missingness, (2) were used in planned analyses, and (3) were used in potential control analyses to ensure the validity of pooled results from 10 multiply imputed datasets (Raghunathan et al., 2001). Pooling of results from the 10 imputed datasets was performed using the standard principle of combining point and variance estimates from imputed data (Little and Rubin, 2002). The point estimates are averages of the 10-point estimates from analysis of each imputed dataset, whereas the variance estimates account for both within and between imputation variability. Finally, the multiple imputation model included the four behavioral measures described above, dummy variables for site, age [and a quadratic age (age 2 ) variable because of nonlinear agerelated changes in regional gray matter volume], reading disability group, gender, a gender by reading disability interaction, and total gray matter volume.

Sample heterogeneity: discriminant function analysis
Retrospective multisite samples are likely to include groups of heterogeneous cases because of differences in sampling across sites and/or definitions of reading disability, as described in the Introduction. For each multiply imputed dataset, discriminant function analysis was performed using the Word Attack, Word Identification, and Passage Comprehension scores to identify cases that were not consistently classified as reading disability or control cases. The correctly classified cases were used for comparing the gray matter measures between the reading groups.

Control for OWL LD
Ideally, measures of expressive and receptive language would be used to control for OWL LD (or specific language impairment), but these data were not available for this retrospective dataset. We considered other defining features of OWL LD to identify and exclude these cases from the gray matter comparisons between reading disability groups. Specifically, we used measures of verbal comprehension and total gray matter volume to identify possible OWL LD cases based on evidence of significantly lower verbal comprehension (Snowling et al., 2000;Silliman and Berninger, 2011;Girbau-Massana et al., 2014) and significantly lower total gray matter volume in OWL LD cases compared with controls (Leonard et al., 2002;Girbau-Massana et al., 2014). The rationale for using these variables to identify potential cases of OWL LD is further supported by evidence that children with reading disability and lower receptive language performance are more likely to have significantly lower-left and right cerebral hemisphere volumes compared to children with reading disability with higher receptive language function (Kibby et al., 2009). A brain size effect on verbal comprehension in children with OWL LD appears to be present early in life (Whitehouse et al., 2012), but we considered the possibility that some children can exhibit low brain volume and have relatively normal oral and written language skills. We also considered evidence that poor reading skills can lead to lower verbal comprehension in older children who must read to learn (Ramsden et al., 2013). Thus, children classified as OWL LD in the current study had to have low values for both verbal comprehension and total gray matter volume.
The classification of children with OWL LD was based on the mean group differences reported in the OWL LD literature. In particular, children with OWL LD exhibit verbal comprehension scores ϳ1 SD below the mean of the population (ϳ16 th percentile; Snowling et al., 2000;Silliman and Berninger, 2011;Girbau-Massana et al., 2014) and brain volume measures ϳ.67 SD below the mean of control cases (ϳ25 th percentile of the sample; Leonard et al., 2002;Girbau-Massana et al., 2014). Thus, cases below the 16 th percentile in verbal comprehension and below the 25 th percentile in total gray matter volume for the sample were classified as having possible OWL LD (n ϭ 10) and excluded from group comparisons in control analyses. We used 25 th , 30 th , and 37 th percentile thresholds for both variables to determine the extent to which the ROI results were affected by excluding cases for these more liberal thresholds for defining OWL LD (n ϭ 12, 15, and 21 respectively). Although this approach may not have identified every case with OWL LD, it helped to characterize the extent to which group differences in the gray matter data could be attributed to oral and written language problems when these cases were removed from the group comparisons.

ROI gray matter comparisons
The three ROI gray matter variables identified from the meta-analysis were first correlated with the behavioral data to determine the extent to which they exhibited significant associations across the multisite sample (n ϭ 293). For example, we examined the extent to which the average gray matter volume in a left superior temporal sulcus region from the meta-analysis was significantly predictive of Passage Comprehension scores. Thus, we used a dimensional approach to examine the extent to which the broad range of reading scores related to the ROI measures across the entire dataset before and after accounting for reading group. As described above, the ROI gray matter volumes were also compared between reading groups to determine the extent to which ROI gray matter group differences from the meta-analysis results were also observed in the multisite data. These group comparisons did not assume homogeneity of variance as increased variance was observed in the reading disability group compared to control group. This analysis included only cases identified as control or reading disabled in the discriminant function analysis. This typical group difference design therefore included relatively more homogenous reading groups that were clearly distinct in their reading skills. Bonferroni correction was used to control for multiple comparisons for correlation and group difference analyses.

Voxel-based gray matter group differences
Voxel-based morphometry was performed to determine the extent to which there were reading disability group differences in gray matter volume across brain regions that were not represented by the meta-analysis ROI. The reading groups identified with the discriminant function analysis were used for this analysis.
The Levene test for homogeneity of variance was first used to examine the extent to which there were reading disability group differences in gray matter variance that could impact the likelihood of observing group differences in gray matter means between the reading groups. Levene tests were performed for each gray matter voxel within a 20% probability mask using R (library "car"). SD maps also are shown below to demonstrate regions of high variance before and after accounting for nuisance covariates. This analysis was designed to determine the extent to which increased anatomical variance occurred in brain regions that are inconsistently reported in gray matter studies of dyslexia and that might impact the likelihood of observing group differences.
Reading disability group differences in voxel-based gray matter volume were examined while controlling for age (and the age 2 variable), gender, and research site. The age 2 variable was included to account for nonlinear changes in gray matter that can occur during childhood (Wilke et al., 2007).
The default settings for group comparisons in SPM assume non-homogeneity of variance, and thus the group comparisons were appropriate for gray matter regions exhibiting group differences in gray matter variance. All analyses were limited to voxels with at least a 20% probability of gray matter across cases based on the DARTEL study-specific gray matter template. Voxel-based results from the 10 imputed datasets were pooled, as described in the multiple imputation section, to obtain group results. Statistical significance for these more exploratory analyses was defined with a familywise error-corrected, p Ͻ 0.05 peak voxel threshold. We did explore the possibility of gender effects using a gender by reading disability interaction term based on gender findings in dyslexia gray matter studies (Altarelli et al., 2013Evans et al., 2014), but these results were not significant and are not described further. Figure 1 presents the varied spatial locations of gray matter group differences across the 11 studies included in the meta-analysis. Significantly lower gray matter volume in reading disability compared with control cases was observed in left orbitofrontal cortex/pars orbitalis of the inferior frontal gyrus/(MNI: Ϫ38, 36, Ϫ14; Z ϭ 2.49), left posterior superior temporal sulcus/middle temporal gyrus (MNI: Ϫ56, Ϫ56, 8; Z ϭ 2.74), and right cerebellar hemisphere (MNI: 24, Ϫ68, Ϫ42; Z ϭ 2.36) regions across studies. There were no regions exhibiting significantly greater gray matter volume in the reading disability compared with controls across studies. There were too few studies to obtain reliable associations between varied methods or sampling approaches across studies (eg, gender ratio) and the varied locations of reported effects.

Associations between ROI gray matter and behavior
There was a wide range of reading skills among the reading disability and control cases, as demonstrated in Table 2, which motivated a dimensional data analysis approach. Table 4 shows that the left orbitofrontal, left  Jednoróg et al. (2015) was not included in the meta-analysis and did not observe cortical effects for p Ͻ 0.001 peak and p Ͻ 0.05 cluster extent thresholds. The yellow clusters were normalized into the study-specific DARTEL space of the multisite data to extract average gray matter volume estimates for the dimensional and group difference analyses. posterior superior temporal sulcus, and right cerebellar hemisphere ROI exhibited significant associations with passage comprehension (r ϭ 0.17 to r ϭ 0.20) and survived Bonferroni correction for 12 comparisons (p ϭ 0.004 to p ϭ 0.0006) after controlling for gender, age, age 2 , and research site. These passage comprehension results were nearly identical when excluding cases with multiplyimputed passage comprehension data (eg, left superior temporal sulcus ϫ passage comprehension: r ϭ 0.20, p ϭ 0.001). These associations were diminished when controlling for reading group (Passage Comprehension by left orbitofrontal cortex, partial r ϭ 0.14, p ϭ 0.019, by left superior temporal sulcus, partial r ϭ 0.11, p ϭ 0.054, and right cerebellar hemisphere, partial r ϭ 0.13, p Ͻ 0.028). The inclusion of reading disability cases with relatively high Passage Comprehension scores for their group explained why these structure-function associations were not completely explained by reading group.

Reading group ROI gray matter differences
Again, discriminant function analysis was used to establish behaviorally distinct reading groups using the Passage Comprehension, Word Identification, Word Attack, and Verbal Comprehension scores for the group comparisons. An average of 7.06% reading disability cases (n ϭ 11.6) were incorrectly classified as controls. An average of 17.36% control cases (n ϭ 22.4) were incorrectly classified as reading disabled across the 10 imputed datasets. This analysis yielded 150 reading disability and 105 control cases for the gray matter comparisons that exhibited significant group differences (all p Ͻ 0.001) in Word Attack, Word Identification, Passage Comprehension, and Verbal Comprehension ( Table 2). The control group exhibited significantly more gray matter volume than the reading disability group for the left orbitofrontal (t (246.9) ϭ 2.70, p ϭ 0.007; Cohen's d ϭ 0.34) and left superior temporal sulcus (t (248.8) ϭ 2.86, p ϭ 0.005; Cohen's d ϭ 0.37) ROI after Bonferroni correction for three comparisons (p Ͻ 0.017; Fig. 2), but not for the right cerebellar hemisphere ROI (t (242.2) ϭ 1.79, p ϭ 0.075; Cohen's d ϭ 0.24). These comparisons were performed without assuming equal variances because there was increased variance in the reading disability group compared with the control group for these three variables (Levene test: p ϭ 0.071, p ϭ 0.017, p ϭ 0.063, for each region, respectively).
We then examined the extent to which children with evidence of OWL LD accounted for the reading disability group differences in orbitofrontal cortex (OFC) and superior temporal sulcus (STS) gray matter volume. Table 5 Figure 2. Reading disability group differences in gray matter within meta-analysis ROI (n ‫؍‬ 255). Significantly lower gray matter volume [adjusted for age, age 2 , gender, and site (resid)] was observed in reading disability cases compared to controls within left OFC (including pars orbitalis), and left STS. There were not significant reading disability group differences in right cerebellar hemisphere (CrbH) ROI gray matter volume, perhaps because of the increased gray matter volume variance in the reading disability group. The ROI group differences in gray matter volume were statistically dependent on total gray matter volume. The 15 th /25 th percentile thresholds for classification of OWL LD were used for verbal comprehension and total gray matter volume, respectively. These thresholds were chosen based on the size of group differences reported in the literature for those variables. Otherwise, the same percentile threshold was used for both variables. The Cohen's d values were calculated based on the mean and SD of the ROI gray matter volumes for each reading disability group.
shows a small change in Cohen's d effect size for the ROI group differences with the exclusion of potential OWL LD cases. Moreover, there was relatively limited impact of more liberal classification criteria for OWL LD despite the decreasing sample size (Table 5). These results show that the OFC and STS group differences required a large sample size given the small effect sizes. These results also suggest that OWL LD cases can contribute to group differences for these gray matter measures rather than obscure or drive the group differences. Finally and most importantly, an ANCOVA demonstrated that there were no significant reading disability group differences in OFC and STS gray matter volume when controlling for total gray matter volume (OFC group: F (1,254) ϭ 0.83, n.s.; STS group: F (1,254) ϭ 0.16, n.s.). This result was consistent with the significant difference in total gray matter volume between reading groups (F (1,254) ϭ 10.41, p ϭ 0.001; Cohen's d ϭ 0.43). Thus, the reading group differences in OFC and STS gray matter volume were because of cases with low total gray matter volumes irrespective of cases that also had low verbal comprehension.

Voxel-based gray matter comparisons
Voxelwise Levene tests of the homogeneity of variance indicated significantly greater variance in the reading disability group compared to the control group, before and after accounting for total gray matter volume. This increased variance in the reading disability group includes superior temporal, supramarginal, inferotemporal, occipital, and cerebellar regions that have been implicated in dyslexia. Figure 3 shows these variance effects and demonstrates relatively more pronounced gray matter variance in the reading disability group compared to the control group. This increased variance is important when considering the magnitude of the group differences in voxel-based gray matter described below.
Exploratory voxel-based comparisons demonstrated widespread lower gray matter volume estimates in reading disabled compared with control cases for an uncorrected p Ͻ 0.001 threshold when controlling for gender, age, age 2 , and research site. Figure 4 shows that these results overlap the space of the left orbitofrontal and superior temporal sulcus meta-analysis ROI. There were no results that survived familywise error correction despite the relatively large sample size. In addition, removing cases with evidence of OWL LD (Ͻ16 th percentile verbal comprehension and Ͻ25 th percentile total gray matter volume) did not uncover any new results that had been obscured by the OWL LD cases. Finally, there were no voxel-based group differences at the p Ͻ 0.001 uncor- rected threshold when the total gray matter volume variable was included in the model. These results are consistent with results in Tables 4 and 5, which demonstrate a strong influence of total gray matter volume on the multisite results.

Discussion
Reading disability and poor cloze reading comprehension occurs with low gray matter volume in left orbitofrontal gyrus/pars orbitalis and left posterior superior temporal sulcus/middle temporal gyrus regions compared with typical readers based on a meta-analysis of the extant voxelbased reading disability literature and direct analysis of multisite data. These effects were present despite reading disability group differences in gray matter variance. These effects also were present when controlling for research site, age, and gender, but do require large sample sizes and children with reading disability who have relatively low total gray matter volume. These total gray matter volume-dependent results appear to reflect a primary source of reading difficulty that rises above other sources of atypical development among children with a complex disorder that can have multiple etiologies.

Varied behavioral profiles and sampling
Reading disability samples are often collected with varied sampling approaches that can yield samples with quite different behavioral profiles and therefore impact study results. For example, Eicher et al. (2014) observed relatively consistent genetic effects of DYX2 (KIAA0319) alleles on reading, language, and IQ measures, but the strength and pattern of associations varied across four samples. This sampling and behavioral heterogeneity issue could explain why so many different structural findings have been reported with limited replication (Fig. 1). The research site differences in real word reading among children with reading disability in the current study further supports the premise that children with quite different reading profiles are included in neuroimaging studies of dyslexia.
The current study included a discriminant function analysis of the behavioral data to select cases for control and reading disability groups that were clearly different in their reading skill abilities (Table 2). Despite these attempts to control for heterogeneity, the reading disability group was still composed of cases with quite varied behavioral (Table 2) and anatomical (Fig. 3) profiles. Nonetheless, reading disability group differences in left orbitofrontal and left posterior superior temporal sulcus gray matter volume were observed. These findings appear to be relatively robust compared with findings implicating other brain regions in reading disability. Moreover, the results are consistent with genetic evidence that KIAA0319 alleles confer risk for impaired reading and language functions across different samples .

Potential genetic explanations for the gray matter findings
Left orbitofrontal/pars opercularis finding The KIAA0319 minor risk allele for SNP rs9461045 has been associated with cortical thickness in left orbitofrontal cortex across a normative sample of 322 subjects that ranged from 3 to 22 years of age (Eicher et al., 2015). This result is consistent with the meta-analysis and direct data analysis findings from the current study that people with reading disability have lower left orbitofrontal gray matter volume. In addition, KIAA0319 SNP polymorphisms (rs2038136 and rs2038137) have been linked to the functional coherence of an orbitofrontal/pars orbitalis/inferior frontal sulcus and parietal network (Jamadar et al., 2013). Together, these results suggest a direct link between The multisite voxel-based reading group comparison results (cyan) are presented with the meta-analysis results (yellow) from voxel-based dyslexia studies on the study specific DARTEL gray matter template (all clusters p Ͻ 0.001, uncorrected). The multisite group comparison included covariates for age, age 2 , gender, and research site. The left hemisphere overlap of results (red) from each analysis falls within the left posterior superior temporal sulcus and left orbitofrontal cortex. All cyan clusters were no longer present using the p Ͻ 0.001 uncorrected threshold when total gray matter volume was included as a covariate. atypical development of orbitofrontal/pars orbitalis function and structure that limits reading development.
There is a possibility that despite these genetic associations, the orbitofrontal findings are indirectly related to reading disability. The findings described here could be explained, at least in part, by comorbid attention deficit (Sexton et al., 2012) and/or the social emotional consequences of having a reading disability (Martínez and Semrud-Clikeman, 2004;Snowling et al., 2007) on orbitofrontal cortex (van't Ent et al., 2007;Fernández-Jaén et al., 2014). In addition, externalizing behaviors impact the likelihood that teachers refer children with learning disabilities for special education services (Lloyd et al., 1991) and externalizing behaviors occur with relatively lower cortical thickness in left orbitofrontal cortex (Ameis et al., 2014). Early childhood studies may help to differentiate these potential direct and/or indirect effects on orbitofrontal development.

Left posterior superior temporal sulcus finding
The gray matter reading disability group differences in the left posterior superior temporal sulcus are consistent with the premise that reading disability occurs in children with atypical development of posterior temporal regions (Gabrieli, 2009) that support phonologic processing (Price, 2012). Moreover, the posterior superior temporal sulcus is a target for arcuate fasciculus projections where a DCDC2 allele (rs793842), which confers risk for dyslexia (Meng et al., 2005), occurs with lower temporoparietal white matter volume within a fiber distribution that appears to terminate in the posterior superior temporal sulcus (Darki et al., 2012, their Fig. 3C). Atypical white matter and the DCDC2 risk allele have also been observed in cases with low left middle temporal gyrus cortical thickness (rs793842; Darki et al., 2014). Together, these findings suggest that DCDC2 risk alleles contributed to the superior temporal sulcus findings in the current study.

Additive multigenic variation
The left posterior superior temporal sulcus and orbitofrontal effects both appeared to be dependent on total brain volume in the current study. This observation is consistent with the location of ectopic neurons in orbitofrontal and superior temporal regions, as well as an apparent negative association between low brain weight and the frequency of ectopias in the small sample of postmortem brains from reading disabled males, who were studied by Galaburda et al. (1985). An additional spatial correspondence with the current results, as well as and the central sulcus region in Figure 4, is that these regions show pronounced rates of growth in gestation during a critical period of cortical folding (20 -28 weeks; Rajagopalan et al., 2011) when the maximal number of cortical neurons is reached (de Courten-Myers et al., 1996). Thus, it is reasonable to hypothesize that genes associated with dyslexia and migrational error in animal models, such as KIAA0319 (Platt et al., 2013) and DCDC2 (Meng et al., 2005), contribute to orbitofrontal and superior temporal sulcus findings in people with reading disability.
The orbitofrontal/pars orbitalis and superior temporal sulcus results also are aligned with the diffusion imaging literature implicating arcuate fibers of the superior longitudinal fasciculus in dyslexia . Migrational errors in regions that are connected by the superior longitudinal fasciculus would be expected to produce atypical diffusional properties because fibers have lost their targets and/or the patterns of projection are less organized. In support of this premise, a KIAA0319 risk allele (rs6935076) has been associated with lower temporoparietal white matter volume (Darki et al., 2014). Moreover, spatial variance in the location of migrational errors that leads to atypical structure in a large tract could explain why diffusion-imaging findings from dyslexia studies have been relatively more consistent than gray matter findings.
The KIAA0319 and DCDC2 findings described above suggest that together these genes impact the development of orbitofrontal and superior temporal regions and may have additive effects. Interestingly, reading skills appear to be particularly impaired when KIAA0319 and DCDC2 risk haplotypes are both observed (Powers et al., 2013). This observation is generally consistent with the observation from the current study that cases with the poorest Passage Comprehension scores had relatively low gray matter volumes. Given the relatively low frequency of people having both KIAA0319 and DCDC2 risk haplotypes (see www.deeveybee.blogspot.org, June 2013), a small number of cases with pronounced atypical development would likely contribute to reading group gray matter differences. This observation is consistent with the increased variance in the reading disability sample that appeared to be dependent, in part, on low total gray matter volume in a subset of reading disability cases. Moreover, this would suggest that widespread and pronounced migrational errors are necessary to observe gross morphological effects that are spatially specific using normalized space voxel-based approaches and that rise above a statistical threshold in anatomically heterogeneous samples.

Inconsistencies in the meta-analysis and direct analysis results
The left superior temporal sulcus and left orbitofrontal cortex results were spatially overlapping across the metaanalysis and direct data analysis results. While there was a trend for group differences in right cerebellar hemisphere gray matter volume, the direct data analyses did not strongly replicate cerebellar findings from dyslexia studies that have been reported with relative consistency (Pernet et al., 2009;Stoodley, 2014) and that are demonstrated by the meta-analysis results (Fig. 1). One explanation for these results is the increased gray matter variance across cerebellar regions in the reading disability compared with the control group (Figs. 2, 3). This variance result replicates an observation that cerebellar gray matter measures are more extreme in dyslexic compared with control cases (Pernet et al., 2009) and suggests that anatomical variance impacts the likelihood of observing cerebellar effects in reading disability studies.
Increased gray matter variance in the reading disability group compared to control group was observed not only in the cerebellum but also across the brain. Left and right inferotemporal regions exhibited increased variance in the reading disability group, even after controlling for the impact of total gray matter on these group differences (Fig. 3). This is an important observation because reports of inferotemporal gray matter differences (Linkersdörfer et al., 2012) could be inconsistent because of increased anatomical variance in reading disability samples. We did observe reading disability group differences in inferotemporal cortex (p Ͻ 0.001, uncorrected) before controlling for total gray matter volume, which suggests that inferotemporal group differences are at least partially dependent on the inclusion of cases with low total gray matter volume.
The variance results suggest the possibility that anatomical variance is a defining feature of reading disability because of dysregulated cortical patterning or varied location of migrational errors. Perhaps more simply, the variance findings may reflect the etiological heterogeneity of reading disability. For example, Männel et al. (2015) have reported that children with dyslexia who have a dyslexia risk allele in the TNFRSF1B gene also have lower verbal working memory and higher left posterior STS gray matter volume. The risk allele(s) that some children with dyslexia carry therefore appears to differentially impact the amount of STS gray matter volume and would lead to increased anatomical variance.
Differential genetic effects on gray matter volume (eg, DCDC2 vs TNFRSF1B) would reduce replication across dyslexia studies, perhaps as demonstrated by the varied spatial locations of reported effects in Figure 1. This possibility is important when considering results from the meta-analysis localizer approach that was used in this study. For example, Linkersdörfer et al. (2012) reported meta-analysis results that dyslexia cases were more likely to have lower right SMG gray matter compared with controls. We also observed this effect when studies from two of the contributing sites for this project were included in the meta-analysis. This right SMG result (and an inferotemporal result discussed above) were no longer present when excluding these two studies to ensure that the direct data analysis results were independent of the metaanalysis results. Although the left STS, left OFC, and right cerebellar effects were insensitive to the inclusion or exclusion of these two studies, it is possible that there are other brain regions that are important for understanding reading disability and that were not included in our metaanalysis results. This idea might be supported by the widespread voxel-based results shown in Figure 4, but all of these effects (p Ͻ 0.001, uncorrected) were diminished after controlling for total gray matter volume.
There were some regions where controls exhibited increased gray matter variance compared to the reading disability cases (eg, left posterior end of the Sylvian fissure), but the vast majority of Levene test differences were driven by increased variance in the reading disability group. This observation is consistent with observations from Leonard et al. (2006) and Pernet et al. (2009) suggesting that we should focus on the anatomical variance in reading disability samples when con-sidering etiology for a complex disorder and when thinking about identifying children who are at risk for reading disability. This variance perspective could be particularly important for studying cases with dyslexia who have migrational errors that may not always fall in the same voxels across brain images (see the varied spatial position of ectopias and dysplasias from Galaburda et al., 1985, their Figs. 1, 4, 5, 6).

Conclusions
There is consistent, but modestly lower gray matter volume in left orbitofrontal cortex/pars orbitalis and left posterior superior temporal sulcus/middle temporal gyrus regions in people with reading disability based on meta-analysis and direct data analysis of multisite data. These findings are further supported by evidence of reduced cortical thickness across these same regions in children with dyslexia (Clark et al., 2014), but would not have been reported in this study given the uncorrected p Ͻ 0.001 effects for the direct data multisite analysis or if total gray matter volume was controlled (Jednoróg et al., 2015) without the meta-analysis results to support these effects. The local gray matter volume findings reported here appear to be dependent on total gray matter volume effects that increased the anatomical variance among children with reading disability. These total brain volume and variance results are perhaps observed because of widespread migrational errors that are concentrated in perisylvian regions (Galaburda et al., 1985) and/or because of atypical arcuate fasciculus development Wandell and Yeatman, 2013). Together, the results suggest that the brain volume effects associated with the left superior temporal sulcus and orbitofrontal gyrus development have the potential to isolate primary etiologies for reading disability that we predict include effects related to KIAA0319 and DCDC2 risk alleles.