Data Smells in Public Datasets

March 15, 2022 · Declared Dead · 🏛 2022 IEEE/ACM 1st International Conference on AI Engineering – Software Engineering for AI (CAIN)

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Arumoy Shome, Luis Cruz, Arie van Deursen
arXiv ID: 2203.08007
Category: cs.SE (Software Engineering)
Cross-listed: cs.LG
Citations: 23
Venue: 2022 IEEE/ACM 1st International Conference on AI Engineering – Software Engineering for AI (CAIN)
Last Checked: 4 months ago

Abstract
The adoption of Artificial Intelligence (AI) in high-stakes domains such as healthcare, wildlife preservation, autonomous driving and criminal justice system calls for a data-centric approach to AI. Data scientists spend the majority of their time studying and wrangling the data, yet tools to aid them with data analysis are lacking. This study identifies the recurrent data quality issues in public datasets. Analogous to code smells, we introduce a novel catalogue of data smells that can be used to indicate early signs of problems or technical debt in machine learning systems. To understand the prevalence of data quality issues in datasets, we analyse 25 public datasets and identify 14 data smells.