| Mine | Pitfall | How to avoid it | Examples and references |
| --- | --- | --- | --- |
| Mine #1 | Selecting a model without noticing it | Be aware of the assumptions behind analysis methods; treat the choice among different algorithms as a model selection problem | Chari et al. (2021) |
| Mine #2 | Overfitting with overly complex models | Use statistical model selection tools that penalize extra parameters (sketch below) | Polynomial fitting |
| Mine #3 | Selecting from a pool of poorly fitting models can create false confidence | Simulate data from each of the tested models multiple times and check whether the real data are sufficient to distinguish among the competing models (sketch below) | Figure 1b |
| Mine #4 | Different information criteria might favor different models | Consider the strengths and limitations of the different approaches (Table 2); use simulated data to test which model selection method is the most reliable for the given problem (sketch below) | Figure 1c (AIC favors an overfit model), Figure 1e (BIC chooses an oversimplified model), Figure 1f; Evans (2019) |
| Mine #5 | Model selection might be sensitive to parameters ignored by the tested models | Avoid model classes that are too restrictive to account for data heterogeneity | Chandrasekaran et al. (2018) |
| Mine #6 | Cross-validation techniques are prone to overfitting | Consider the data-splitting approach proposed by Genkin and Engel, in which optimal model complexity is determined by calculating the Kullback-Leibler (KL) divergence (sketch below) | Genkin and Engel (2020) |
| Mine #7 | Agglomerative hierarchical clustering is sensitive to outliers | Consider divisive methods (sketch below) | Figure 2c; Varshavsky et al. (2008) |
| Mine #8 | K-means clustering might converge to local minima | Repeat the clustering several times from different starting centroid locations (sketch below) | Figure 2e, right |
| Mine #9 | The number of clusters is not known a priori | Use the elbow method, gap statistics, or model selection approaches (sketch below) | Figure 2e, left |
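
Mine #2 in practice: the sketch below fits polynomials of increasing degree to synthetic noisy data and scores each fit with the Akaike information criterion (AIC), whose 2k term penalizes extra parameters. The data, noise level, and candidate degrees are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of penalized model selection for polynomial fitting.
# Synthetic data; the true generating model is a degree-2 polynomial.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.2, x.size)

for degree in range(1, 8):
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    n, k = x.size, degree + 1          # k = number of fitted coefficients
    sigma2 = np.mean(residuals**2)     # maximum-likelihood noise variance
    log_lik = -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)
    aic = 2 * k - 2 * log_lik          # lower is better; 2k penalizes complexity
    print(f"degree {degree}: AIC = {aic:.1f}")
```

The raw residual error keeps shrinking as the degree grows, but the penalty term typically steers the AIC minimum back toward the true degree.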
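
Mine #3, a model-recovery check in the spirit of Figure 1b: simulate data from each candidate model many times, refit all candidates, and count how often the true generator wins. If recovery is poor, data of this size and noise level cannot distinguish the models, and a selection made on the real data would inspire false confidence. The two candidate models, noise level, and sample size here are illustrative assumptions.

```python
# Sketch of a model-recovery test with two hypothetical candidates,
# a linear and a quadratic model, compared by AIC.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 25)
models = {"linear": 1, "quadratic": 2}   # name -> polynomial degree

def simulate(model):
    mean = 2 * x if model == "linear" else 2 * x**2
    return mean + rng.normal(0, 0.3, x.size)

def aic(y, degree):
    resid = y - np.polyval(np.polyfit(x, y, degree), x)
    return 2 * (degree + 1) + x.size * np.log(np.mean(resid**2))

for true_model in models:
    wins = 0
    for _ in range(200):
        y = simulate(true_model)
        wins += min(models, key=lambda m: aic(y, models[m])) == true_model
    print(f"data from {true_model}: recovered {wins}/200 times")
```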
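
Mine #4: the same simulation idea can arbitrate between criteria. Below, data are repeatedly simulated from a known degree-2 polynomial, and AIC and BIC are scored on how often each recovers the true degree; whichever recovers it more often is the more reliable selector for this particular problem. All settings are illustrative assumptions.

```python
# Sketch: use data simulated from a known model to test which criterion
# (AIC or BIC) is the more reliable model selector for this problem.
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(-1, 1, 40)
true_degree, n_sims = 2, 200
correct = {"AIC": 0, "BIC": 0}

for _ in range(n_sims):
    y = np.polyval([1.5, -2.0, 0.5], x) + rng.normal(0, 0.3, x.size)
    best = {"AIC": (np.inf, None), "BIC": (np.inf, None)}
    for degree in range(1, 7):
        resid = y - np.polyval(np.polyfit(x, y, degree), x)
        n, k = x.size, degree + 1
        nll = n * np.log(np.mean(resid**2))   # -2 log-likelihood up to a constant
        for name, penalty in (("AIC", 2 * k), ("BIC", k * np.log(n))):
            best[name] = min(best[name], (nll + penalty, degree))
    for name in correct:
        correct[name] += best[name][1] == true_degree

for name, wins in correct.items():
    print(f"{name}: true degree recovered in {wins}/{n_sims} simulations")
```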
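
Mine #6: the table states only that Genkin and Engel (2020) determine optimal complexity via a KL divergence on split data, so the sketch below is a loose, generic illustration of that data-splitting idea, not their actual procedure. It fits a kernel density of varying flexibility to each half of a synthetic sample and measures how much the two fits disagree: overly flexible models chase noise and therefore diverge across halves.

```python
# Generic illustration only: flexibility assessment by comparing fits on
# independent data halves with a symmetrized KL divergence. This is NOT
# Genkin and Engel's exact method, just the data-splitting idea.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
data = rng.normal(0, 1, 400)
half_a, half_b = data[:200], data[200:]
grid = np.linspace(-4, 4, 500)

for bandwidth in (0.05, 0.2, 0.5, 1.0):   # smaller bandwidth = more flexible
    p = gaussian_kde(half_a, bw_method=bandwidth)(grid)
    q = gaussian_kde(half_b, bw_method=bandwidth)(grid)
    p, q = p / p.sum(), q / q.sum()        # normalize over the grid
    kl = 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
    print(f"bandwidth {bandwidth}: split disagreement (sym. KL) = {kl:.4f}")
```

Disagreement between halves rises as flexibility grows, which is the signal a split-based criterion can exploit; a usable criterion would balance this against goodness of fit.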
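
Mine #7, the outlier sensitivity in a few lines: with bottom-up (agglomerative) single-linkage clustering and two clusters requested, a single extreme point tends to claim a whole cluster for itself, merging the two genuine groups. The synthetic data are an illustrative assumption. scikit-learn offers no classic divisive hierarchical method; a top-down option such as `sklearn.cluster.BisectingKMeans` (scikit-learn >= 1.1) is one alternative in the spirit of the table's recommendation.

```python
# Sketch: one extreme outlier dominates agglomerative clustering.
# Synthetic data: two well-separated groups plus a single outlier.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(4)
group1 = rng.normal([0, 0], 0.3, (20, 2))
group2 = rng.normal([3, 0], 0.3, (20, 2))
X = np.vstack([group1, group2, [[50.0, 50.0]]])   # last row is the outlier

labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(X)
print(labels)   # the outlier typically forms its own cluster,
                # forcing the two real groups into a single cluster
```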
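
Mine #8 (Figure 2e, right): restarting K-means from different random centroids and keeping the lowest-inertia solution is exactly what scikit-learn's `n_init` parameter automates. The four-blob dataset below is an illustrative assumption.

```python
# Sketch: K-means from one random start can settle in a poor local
# minimum; many restarts keep the best (lowest-inertia) solution.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
centers = ([0, 0], [0, 3], [3, 0], [3, 3])
X = np.vstack([rng.normal(c, 0.2, (50, 2)) for c in centers])

single = KMeans(n_clusters=4, init="random", n_init=1, random_state=0).fit(X)
multi = KMeans(n_clusters=4, init="random", n_init=25, random_state=0).fit(X)
print(f"1 start:   inertia = {single.inertia_:.1f}")
print(f"25 starts: inertia = {multi.inertia_:.1f}")   # never worse, often much better
```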
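
Mine #9 (Figure 2e, left): the elbow method runs K-means over a range of cluster counts and looks for the point where within-cluster variance (inertia) stops dropping sharply; gap statistics formalize this by comparing against a null reference distribution. Three synthetic blobs are assumed below, so the elbow should appear near k = 3.

```python
# Sketch of the elbow method: inertia versus number of clusters k.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
centers = ([0, 0], [4, 0], [2, 4])
X = np.vstack([rng.normal(c, 0.3, (60, 2)) for c in centers])

for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(f"k = {k}: inertia = {inertia:.1f}")   # drop flattens past the true k
```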