The (ab)use of Open Source Code to Train Large Language Models

February 27, 2023 · Declared Dead · 🏛 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Ali Al-Kaswan, Maliheh Izadi
arXiv ID: 2302.13681
Category: cs.SE (Software Engineering)
Cross-listed: cs.AI
Citations: 27
Venue: 2023 IEEE/ACM 2nd International Workshop on Natural Language-Based Software Engineering (NLBSE)
Last Checked: 4 months ago
Abstract
In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large unsanitized corpora of source code scraped from the Internet. The content of these datasets is memorized and emitted by the models, often in a verbatim manner. In this work, we will discuss the security, privacy, and licensing implications of memorization. We argue why the use of copyleft code to train LLMs is a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.
