Abstract
A key challenge in machine learning is to design interpretable
models that can reduce their inputs to the best subset for
making transparent predictions, especially in the clinical
domain. In this work, we propose a certifiably optimal feature
selection procedure for logistic regression, formulated as a
mixed-integer conic optimization problem, that can account for an
auxiliary cost of obtaining each feature. Based on an extensive review of
the literature, we carefully create a synthetic dataset
generator for clinical prognostic model research. This allows us
to systematically evaluate different heuristic and optimal
cardinality- and budget-constrained feature selection
procedures. The analysis shows key limitations of the methods
in the low-data regime and under label noise.
Our paper not only provides empirical recommendations for
suitable methods and dataset designs, but also paves the way for
future research in the area of meta-learning.
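
For orientation, the budget-constrained selection problem described above can be sketched as a generic mixed-integer program. This is an illustrative big-M formulation under stated assumptions, not necessarily the formulation used in the paper: the symbols n (samples), p (features), labels y_i in {-1, +1}, feature vectors x_i, coefficients beta, selection indicators z_j, per-feature costs c_j, budget B, ridge weight lambda, and bound M are all notation introduced here.

```latex
\begin{aligned}
\min_{\beta \in \mathbb{R}^p,\; z \in \{0,1\}^p} \quad
  & \sum_{i=1}^{n} \log\!\bigl(1 + \exp(-y_i\, x_i^{\top} \beta)\bigr)
    + \lambda \lVert \beta \rVert_2^2 \\
\text{s.t.} \quad
  & -M z_j \le \beta_j \le M z_j, \qquad j = 1, \dots, p, \\
  & \sum_{j=1}^{p} c_j z_j \le B .
\end{aligned}
```

Setting c_j = 1 for all j and B = k recovers the cardinality-constrained case with at most k selected features. The conic view arises because each loss term admits an exponential-cone representation: t >= log(1 + exp(u)) holds exactly when exp(u - t) + exp(-t) <= 1.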