The Missing Indicator Method: From Low to High Dimensions
November 16, 2022 · Declared Dead · Knowledge Discovery and Data Mining
"No code URL or promise found in abstract"
Evidence collected by the PWNC Scanner
Authors
Mike Van Ness, Tomas M. Bosschieter, Roberto Halpin-Gregorio, Madeleine Udell
arXiv ID
2211.09259
Category
cs.LG: Machine Learning
Cross-listed
cs.AI, stat.ML
Citations
22
Venue
Knowledge Discovery and Data Mining
Last Checked
4 months ago
Abstract
Missing data is common in applied data science, particularly for tabular data sets found in healthcare, social sciences, and natural sciences. Most supervised learning methods only work on complete data, thus requiring preprocessing such as missing value imputation to work on incomplete data sets. However, imputation alone does not encode useful information about the missing values themselves. For data sets with informative missing patterns, the Missing Indicator Method (MIM), which adds indicator variables to indicate the missing pattern, can be used in conjunction with imputation to improve model performance. While commonly used in data science, MIM is surprisingly understudied from an empirical and especially theoretical perspective. In this paper, we show empirically and theoretically that MIM improves performance for informative missing values, and we prove that MIM does not hurt linear models asymptotically for uninformative missing values. Additionally, we find that for high-dimensional data sets with many uninformative indicators, MIM can induce model overfitting and thus reduce test performance. To address this issue, we introduce Selective MIM (SMIM), a novel MIM extension that adds missing indicators only for features that have informative missing patterns. We show empirically that SMIM performs at least as well as MIM in general, and improves on MIM for high-dimensional data. Lastly, to illustrate the utility of MIM on real-world data science tasks, we demonstrate the effectiveness of MIM and SMIM on clinical tasks generated from the MIMIC-III database of electronic health records.
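The core idea of MIM described in the abstract (impute, then append one 0/1 indicator per feature marking where values were missing) can be sketched in a few lines. This is an illustrative sketch only, not the authors' code: the function name `mean_impute_with_indicators`, the use of mean imputation, the `None` marker for missing values, and the `indicator_columns` parameter are all assumptions made for this example. Passing a subset of columns mimics the selective variant, but how SMIM decides which features have informative missingness is not stated in the abstract and is left to the caller here.

```python
from statistics import mean


def mean_impute_with_indicators(rows, indicator_columns=None):
    """Mean-impute None entries and append 0/1 missing indicators.

    rows: list of equal-length lists of floats, with None marking a missing value.
    indicator_columns: column indices that receive an indicator. None means
    every column (plain MIM). A subset is a stand-in for a selective scheme;
    the selection rule itself is NOT implemented here.
    """
    n_cols = len(rows[0])
    if indicator_columns is None:
        indicator_columns = range(n_cols)
    indicator_columns = list(indicator_columns)

    # Column means over observed values only (0.0 if a column is entirely missing).
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)

    augmented = []
    for row in rows:
        # Replace each missing entry with its column mean.
        imputed = [col_means[j] if v is None else v for j, v in enumerate(row)]
        # Record where the ORIGINAL row was missing, for the chosen columns.
        indicators = [1 if row[j] is None else 0 for j in indicator_columns]
        augmented.append(imputed + indicators)
    return augmented


# Toy data: two features, one missing value in each.
X = [
    [1.0, None],
    [3.0, 4.0],
    [None, 8.0],
]

X_mim = mean_impute_with_indicators(X)  # indicators for both features
X_selective = mean_impute_with_indicators(X, indicator_columns=[1])  # feature 1 only
```

On the toy data, `X_mim` has four columns (two imputed features plus two indicators), while `X_selective` has three. In practice a library routine would likely be preferable; for instance, scikit-learn's `SimpleImputer` accepts an `add_indicator=True` option that appends indicator columns in a similar way.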
Similar Papers
In the same crypt: Machine Learning
The Ethereal
The Ethereal
Continuous control with deep reinforcement learning
Old Age
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
Old Age
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
Old Age
SGDR: Stochastic Gradient Descent with Warm Restarts
The Ethereal
Asynchronous Methods for Deep Reinforcement Learning
Died the same way: Ghosted
R.I.P.
Ghosted
Federated Learning: Strategies for Improving Communication Efficiency
R.I.P.
Ghosted
In-Datacenter Performance Analysis of a Tensor Processing Unit
R.I.P.
Ghosted
Deep Convolutional Neural Networks for Computer-Aided Detection: CNN Architectures, Dataset Characteristics and Transfer Learning
R.I.P.
Ghosted