Cross-validation pitfalls when selecting and assessing regression and classification models

Damjan Krstajic; Ljubomir J Buturovic; David E Leahy; Simon Thomas

doi:10.1186/1758-2946-6-10

J Cheminform. 2014 Mar 29;6(1):10. doi: 10.1186/1758-2946-6-10.

Authors

Damjan Krstajic¹ ² ³, Ljubomir J Buturovic⁴, David E Leahy⁵, Simon Thomas⁶

Affiliations

¹ Research Centre for Cheminformatics, Jasenova 7, 11030, Beograd, Serbia. damjan.krstajic@rcc.org.rs.
² Laboratory for Molecular Biomedicine, Institute of Molecular Genetics and Genetic Engineering, University of Belgrade, Vojvode Stepe 444a, 11010, Beograd, Serbia. damjan.krstajic@rcc.org.rs.
³ Clinical Persona Inc, 932 Mouton Circle, East Palo Alto, CA, 94303, USA. damjan.krstajic@rcc.org.rs.
⁴ Clinical Persona Inc, 932 Mouton Circle, East Palo Alto, CA, 94303, USA.
⁵ Molplex Pharmaceuticals, Alderley Park, Macclesfield, SK10 4TF, UK.
⁶ Cyprotex Discovery Ltd, 15 Beech Lane, Macclesfield, SK10 2DR, UK.

Abstract

Background: We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications, including QSAR. In this paper we describe and evaluate best practices that improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing, which enables routine use of previously infeasible approaches.

Methods: We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. For variable selection and parameter tuning, we define two algorithms (repeated grid-search cross-validation and double cross-validation) and provide arguments for using repeated grid-search in the general case.
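
As an illustration only, repeated grid-search V-fold cross-validation might be sketched as follows. This is a minimal sketch under stated assumptions, not the algorithm as specified in the paper: the toy one-dimensional ridge model, the synthetic data, the grid values, and all function names (`v_fold_indices`, `cv_error`, `repeated_grid_search_cv`) are hypothetical choices made for the example. The idea shown is that every grid point is scored on the same split within a repeat, and the parameter is chosen by its error averaged over all repeats.

```python
import random
from statistics import mean

def v_fold_indices(n, v, rng):
    """Shuffle 0..n-1 and deal the indices into v folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::v] for i in range(v)]

def fit_ridge_1d(xs, ys, lam):
    """Toy model (assumption): 1-D ridge slope through the origin."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def cv_error(xs, ys, lam, folds):
    """Mean squared error over all held-out samples for one split."""
    sq_err = []
    for test in folds:
        held = set(test)
        train = [i for i in range(len(xs)) if i not in held]
        b = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        sq_err += [(ys[i] - b * xs[i]) ** 2 for i in test]
    return mean(sq_err)

def repeated_grid_search_cv(xs, ys, grid, v=5, n_repeats=10, seed=0):
    """Repeat V-fold CV with fresh splits; pick the grid point with the lowest mean error."""
    rng = random.Random(seed)
    per_repeat = {lam: [] for lam in grid}
    for _ in range(n_repeats):
        folds = v_fold_indices(len(xs), v, rng)  # one split shared by all grid points
        for lam in grid:
            per_repeat[lam].append(cv_error(xs, ys, lam, folds))
    avg = {lam: mean(errs) for lam, errs in per_repeat.items()}
    best = min(avg, key=avg.get)
    return best, avg, per_repeat

# Synthetic data (assumption): y = 2x plus Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + data_rng.gauss(0, 0.3) for x in xs]
grid = [0.0, 0.1, 1.0, 10.0]

best, avg, per_repeat = repeated_grid_search_cv(xs, ys, grid)
print("selected lambda:", best)
for lam in grid:
    print(lam, "min/max CV error over repeats:", min(per_repeat[lam]), max(per_repeat[lam]))
```

The per-repeat errors are returned alongside the averages so that the spread caused by the choice of split can be inspected directly, rather than only the selected parameter.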

Results: We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models.

Conclusions: We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
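
To make the second point concrete, the sketch below shows one way repeated nested cross-validation could look. It is a hedged illustration, not the paper's implementation: the toy one-dimensional ridge model, the synthetic data, and the names `folds_of`, `nested_cv_mse`, and `repeated_nested_cv` are assumptions introduced here, and the inner loop uses a single (not repeated) grid-search cross-validation to keep the example short. The parameter is tuned using only the training part of each outer fold, and the whole procedure is repeated with new splits to obtain a distribution of error estimates instead of a single number.

```python
import random
from statistics import mean, stdev

def folds_of(indices, v, rng):
    """Shuffle the given indices and deal them into v folds."""
    idx = list(indices)
    rng.shuffle(idx)
    return [idx[i::v] for i in range(v)]

def fit(xs, ys, idx, lam):
    """Toy model (assumption): 1-D ridge slope through the origin, fitted on idx."""
    sxy = sum(xs[i] * ys[i] for i in idx)
    sxx = sum(xs[i] ** 2 for i in idx)
    return sxy / (sxx + lam)

def cv_mse(xs, ys, folds, lam):
    """V-fold CV error of lam, restricted to the samples contained in folds."""
    all_idx = [i for fold in folds for i in fold]
    errs = []
    for test in folds:
        held = set(test)
        train = [i for i in all_idx if i not in held]
        b = fit(xs, ys, train, lam)
        errs += [(ys[i] - b * xs[i]) ** 2 for i in test]
    return mean(errs)

def nested_cv_mse(xs, ys, grid, v_outer, v_inner, rng):
    """One nested CV run: tune lam inside each outer training set, score on the outer test fold."""
    errs = []
    for test in folds_of(range(len(xs)), v_outer, rng):
        held = set(test)
        train = [i for i in range(len(xs)) if i not in held]
        inner_folds = folds_of(train, v_inner, rng)  # built from training samples only
        best = min(grid, key=lambda lam: cv_mse(xs, ys, inner_folds, lam))
        b = fit(xs, ys, train, best)
        errs += [(ys[i] - b * xs[i]) ** 2 for i in test]
    return mean(errs)

def repeated_nested_cv(xs, ys, grid, n_repeats=10, v_outer=5, v_inner=5, seed=0):
    """Repeat nested CV with different splits and return every error estimate."""
    rng = random.Random(seed)
    return [nested_cv_mse(xs, ys, grid, v_outer, v_inner, rng) for _ in range(n_repeats)]

# Synthetic data (assumption): y = 2x plus Gaussian noise.
data_rng = random.Random(1)
xs = [data_rng.uniform(-1, 1) for _ in range(40)]
ys = [2.0 * x + data_rng.gauss(0, 0.3) for x in xs]

errs = repeated_nested_cv(xs, ys, grid=[0.0, 0.1, 1.0, 10.0])
print("nested CV error: mean", mean(errs), "sd", stdev(errs))
print("range over repeats:", min(errs), "to", max(errs))
```

Reporting the range or standard deviation over repeats, rather than the result of a single nested run, is what exposes the split-to-split variation the abstract refers to.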