Table 1

List of common “mines” of model selection and clustering discussed in the paper


| Mine | Issue | Suggestion | Example |
| --- | --- | --- | --- |
| Mine #1 | Selecting models without noticing it | Be aware of the assumptions behind analysis methods; treat the choice among different algorithms as a model selection problem | Chari et al. (2021) |
| Mine #2 | Overfitting with overly complex models | Use statistical model selection tools that penalize models with too many parameters (see the information-criterion sketch below) | Polynomial fitting |
| Mine #3 | Selecting from a pool of poorly fitting models might lead to false confidence | Simulate data from each of the tested models multiple times and test whether the real data are sufficient to distinguish among the competing models (see the model-recovery sketch below) | Figure 1b |
| Mine #4 | Different information criteria might favor different models | Consider the strengths and limitations of the different approaches (Table 2); simulated data can be used to test which model selection method is the most reliable for the given problem | Figure 1c (AIC favors an overfitted model), Figure 1e (BIC chooses an oversimplified model), Figure 1f; Evans (2019) |
| Mine #5 | Model selection might be sensitive to parameters ignored by the tested models | Avoid model classes that are too restrictive to account for data heterogeneity | Chandrasekaran et al. (2018) |
| Mine #6 | Cross-validation techniques are prone to overfitting | Consider the data-splitting approach proposed by Genkin and Engel, in which the optimal model complexity is determined by calculating the Kullback-Leibler (KL) divergence | Genkin and Engel (2020) |
| Mine #7 | Agglomerative hierarchical clustering is sensitive to outliers | Consider divisive methods | Figure 2c; Varshavsky et al. (2008) |
| Mine #8 | K-means clustering might converge to local minima | Repeat the algorithm several times from different starting centroid locations (see the K-means sketch below) | Figure 2e, right |
| Mine #9 | The number of clusters is not known a priori | Use the elbow method, the gap statistic, or model selection approaches (see the elbow-method sketch below) | Figure 2e, left |
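For Mines #2 and #4, a minimal sketch of penalized model selection on the polynomial-fitting example: polynomials of increasing degree are fit by least squares and scored with AIC and BIC under an assumed Gaussian noise model. The data, degrees, and noise level are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a quadratic trend with Gaussian noise (not from the paper).
x = np.linspace(-1, 1, 40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0, 0.3, size=x.size)

def information_criteria(x, y, degree):
    """Fit a polynomial by least squares and return (AIC, BIC) under Gaussian noise."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    n = y.size
    k = degree + 1 + 1          # polynomial coefficients plus the noise variance
    sigma2 = np.mean(resid**2)  # maximum-likelihood estimate of the noise variance
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    aic = 2 * k - 2 * loglik
    bic = k * np.log(n) - 2 * loglik
    return aic, bic

for degree in range(1, 9):
    aic, bic = information_criteria(x, y, degree)
    print(f"degree {degree}: AIC = {aic:7.2f}, BIC = {bic:7.2f}")

# The degree minimizing each criterion is the selected model; AIC and BIC can
# disagree because BIC penalizes extra parameters more heavily for large n.
```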
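For Mine #3, a minimal model-recovery sketch: data are simulated many times from each candidate model, every candidate is refit to each simulated dataset, and a confusion matrix records how often the generating model is the one selected. The candidate models (linear vs. quadratic), coefficients, and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1, 1, 40)
noise_sd = 0.5
n_sim = 200

# Candidate generating models (illustrative): coefficients in np.polyval order.
true_models = {
    1: np.array([2.0, 1.0]),        # linear: 2x + 1
    2: np.array([-3.0, 2.0, 1.0]),  # quadratic: -3x^2 + 2x + 1
}
degrees = sorted(true_models)

def bic_for_fit(x, y, degree):
    """BIC of a least-squares polynomial fit under a Gaussian noise model."""
    resid = y - np.polyval(np.polyfit(x, y, degree), x)
    n, k = y.size, degree + 2
    loglik = -0.5 * n * (np.log(2 * np.pi * np.mean(resid**2)) + 1)
    return k * np.log(n) - 2 * loglik

# Confusion matrix: rows = generating model, columns = selected model.
confusion = np.zeros((len(degrees), len(degrees)), dtype=int)
for i, gen_deg in enumerate(degrees):
    for _ in range(n_sim):
        y = np.polyval(true_models[gen_deg], x) + rng.normal(0, noise_sd, x.size)
        bics = [bic_for_fit(x, y, d) for d in degrees]
        confusion[i, int(np.argmin(bics))] += 1

print(confusion)
# A strong diagonal suggests the data can distinguish the candidate models;
# substantial off-diagonal mass warns that an apparent preference among
# poorly distinguishable models may be false confidence.
```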
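For Mine #8, a minimal K-means sketch comparing a single random initialization with repeated restarts, keeping the solution with the lowest within-cluster sum of squares. The use of scikit-learn and the synthetic blob data are assumptions for illustration; the paper does not prescribe a particular implementation.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Illustrative data: six Gaussian blobs (not from the paper).
X, _ = make_blobs(n_samples=600, centers=6, cluster_std=0.8, random_state=0)

# A single random initialization can converge to a poor local minimum.
single = KMeans(n_clusters=6, init="random", n_init=1, random_state=5).fit(X)

# Restarting from several random centroid locations and keeping the best
# solution (lowest within-cluster sum of squares, `inertia_`) mitigates this.
restarts = KMeans(n_clusters=6, init="random", n_init=25, random_state=5).fit(X)

print("inertia, single init :", round(single.inertia_, 1))
print("inertia, 25 restarts :", round(restarts.inertia_, 1))
```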
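For Mine #9, a minimal elbow-method sketch: the within-cluster sum of squares is computed over a range of cluster numbers, and the point where its decrease flattens (the "elbow") suggests a value of k. Again, scikit-learn and the synthetic data are illustrative assumptions; the gap statistic would additionally compare these values against a reference distribution with no cluster structure.

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Illustrative data with three clusters (not from the paper).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Elbow method: within-cluster sum of squares (inertia) as a function of k.
for k in range(1, 9):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k = {k}: inertia = {inertia:10.1f}")

# Inertia drops sharply up to the true number of clusters and then flattens;
# the 'elbow' of this curve is taken as the suggested k.
```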