Most Ligand-Based Classification Benchmarks Reward Memorization Rather than Generalization

June 20, 2017 · Declared Dead · 🏛 Journal of Chemical Information and Modeling

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Izhar Wallach, Abraham Heifets
arXiv ID: 1706.06619
Category: q-bio.QM
Cross-listed: cs.LG, stat.ML
Citations: 143
Venue: Journal of Chemical Information and Modeling
Last Checked: 2 months ago

Abstract

Undetected overfitting can occur when there are significant redundancies between training and validation data. We describe AVE, a new measure of training-validation redundancy for ligand-based classification problems that accounts for the similarity among inactive molecules as well as among active ones. We investigate seven widely used benchmarks for virtual screening and classification, and show that the amount of AVE bias strongly correlates with the performance of ligand-based predictive methods irrespective of the predicted property, chemical fingerprint, similarity measure, or previously applied unbiasing techniques. Therefore, it may be that the previously reported performance of most ligand-based methods can be explained by overfitting to benchmarks rather than good prospective accuracy.
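
The abstract does not give the formula for AVE, so the following is only a rough sketch of the kind of measure it describes: a score that grows when validation actives sit closer to training actives than to training inactives, and likewise for validation inactives. The nearest-neighbor formulation, the threshold grid, the set-of-bits fingerprints, the Tanimoto distance, and every function name here are assumptions made for illustration, not the authors' code.

```python
from typing import FrozenSet, Sequence

# Hypothetical fingerprint representation: the set of "on" bit indices.
Fingerprint = FrozenSet[int]


def tanimoto_distance(a: Fingerprint, b: Fingerprint) -> float:
    """1 - Tanimoto similarity; two empty fingerprints count as identical."""
    union = len(a | b)
    if union == 0:
        return 0.0
    return 1.0 - len(a & b) / union


def nn_coverage(valid: Sequence[Fingerprint],
                train: Sequence[Fingerprint],
                steps: int = 100) -> float:
    """Average, over thresholds d = 0, 1/steps, ..., 1, of the fraction of
    validation molecules whose nearest training neighbor lies within d."""
    if not valid or not train:
        raise ValueError("both molecule sets must be non-empty")
    # Distance from each validation molecule to its closest training molecule.
    nearest = [min(tanimoto_distance(v, t) for t in train) for v in valid]
    total = 0.0
    for k in range(steps + 1):
        d = k / steps
        total += sum(1 for x in nearest if x <= d) / len(nearest)
    return total / (steps + 1)


def redundancy_bias(train_act: Sequence[Fingerprint],
                    train_inact: Sequence[Fingerprint],
                    valid_act: Sequence[Fingerprint],
                    valid_inact: Sequence[Fingerprint]) -> float:
    """Assumed form: (AA - AI) + (II - IA). Positive values mean each
    validation class is closer to its own training class than to the other."""
    aa = nn_coverage(valid_act, train_act)
    ai = nn_coverage(valid_act, train_inact)
    ii = nn_coverage(valid_inact, train_inact)
    ia = nn_coverage(valid_inact, train_act)
    return (aa - ai) + (ii - ia)


if __name__ == "__main__":
    actives = [frozenset({1, 2, 3}), frozenset({2, 3, 4})]
    inactives = [frozenset({10, 11}), frozenset({11, 12})]
    # Validation set duplicates the training set: a fully redundant split.
    print(redundancy_bias(actives, inactives, actives, inactives))
```

Under these assumptions, a split whose validation molecules duplicate their own training class yields a strongly positive score, while a split in which each validation molecule is equally far from both training classes yields zero. A nearest-neighbor classifier could score well on the first kind of split purely by memorization, which is the concern the abstract raises.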

