Failures and Fixes: A Study of Software System Incident Response

August 25, 2020 · Declared Dead · 🏛 IEEE International Conference on Software Maintenance and Evolution

👻 CAUSE OF DEATH: Ghosted
No code link whatsoever

"No code URL or promise found in abstract"

Evidence collected by the PWNC Scanner

Authors: Jonathan Sillito, Esdras Kutomi
arXiv ID: 2008.11192
Category: cs.SE (Software Engineering)
Citations: 17
Venue: IEEE International Conference on Software Maintenance and Evolution
Last Checked: 4 months ago

Abstract
This paper presents the results of a research study related to software system failures, with the goal of understanding how we might better evolve, maintain and support software systems in production. We have qualitatively analyzed thirty incidents: fifteen collected through in-depth interviews with engineers, and fifteen sampled from publicly published incident reports (generally produced as part of postmortem reviews). Our analysis focused on understanding and categorizing how failures occurred, and how they were detected, investigated and mitigated. We also captured analytic insights related to the current state of the practice and associated challenges in the form of 11 key observations. For example, we observed that failures can cascade through a system, leading to major outages, and that engineers often do not understand the scaling limits of the systems they are supporting until those limits are exceeded. We argue that the challenges we have identified can lead to improvements to how systems are engineered and supported.
