Making software failures a little less catastrophic

Researchers have implemented a new way to diagnose software failures with a high degree of accuracy and efficiency.

Prof. Baris Kasikci

Few things are more frustrating than losing a complicated project to a software crash, whether it's a video editor that shuts down without saving or a slideshow lost to a suddenly corrupted file. Debugging such failures matters because of their impact on users, but it is notoriously hard in practice because so little information survives a crash.

Prof. Baris Kasikci, along with researchers from Microsoft Research, Arizona State University, and Georgia Institute of Technology, is working to take these frustrations down a notch with a new technique called REPT – REverse debugging with Processor Trace. In their paper, “REPT: Reverse Debugging of Failures in Deployed Software,” they propose a method to recreate the failing program execution to better diagnose the problem at hand. They earned a Jay Lepreau Best Paper Award for the project at the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18).

REPT is a system that enables what the researchers call “reverse debugging” of software failures in live, deployed systems. Reverse debugging reconstructs the execution history of the software, covering not only the moments just before the crash but also thousands of instructions earlier. It does this with high fidelity, allowing developers to identify what went wrong far more accurately than is currently possible.

This data had seemed impossible to recover because of the information lost in a software failure and the system’s concurrent execution of other processes. REPT tackles these challenges by constructing a partial execution order based on timestamps logged by hardware and iteratively performing forward and backward execution of each instruction with error correction.
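To make the idea of iterative forward and backward execution more concrete, here is a minimal toy sketch in Python. It is not REPT's implementation: the two-register machine, the three opcodes, and the names `recover_states`, `infer`, and `_learn` are all hypothetical, and the sketch leaves out memory, concurrency, timestamps, and error correction entirely. It only illustrates how a final crash state plus a recorded instruction trace can be swept backward and forward until no more values can be inferred, with each sweep applying every instruction's semantics in whichever direction yields a new value.

```python
from typing import Dict, List, Optional, Tuple, Union

# Hypothetical toy machine: two registers and three opcodes.
#   ("load_imm", dst, constant)  -> dst = constant
#   ("mov", dst, src)            -> dst = src
#   ("add", dst, src)            -> dst = dst + src
Instr = Tuple[str, str, Union[str, int]]
State = Dict[str, Optional[int]]  # register -> value, None means "unknown"
REGS = ("a", "b")


def _learn(state: State, reg: str, value: Optional[int]) -> bool:
    """Record a newly inferred value; return True if it was new information."""
    if value is not None and state[reg] is None:
        state[reg] = value
        return True
    return False


def infer(instr: Instr, before: State, after: State) -> bool:
    """Propagate known values across one instruction, in both directions."""
    op, dst, src = instr
    learned = False
    # Registers the instruction does not write are unchanged across it.
    for r in REGS:
        if r != dst:
            learned |= _learn(before, r, after[r])
            learned |= _learn(after, r, before[r])
    if op == "load_imm":
        learned |= _learn(after, dst, src)           # forward: constant is known
    elif op == "mov":
        learned |= _learn(after, dst, before[src])   # forward
        learned |= _learn(before, src, after[dst])   # backward
    elif op == "add":
        s = before[src]
        if s is not None:
            if before[dst] is not None:
                learned |= _learn(after, dst, before[dst] + s)  # forward
            if after[dst] is not None:
                learned |= _learn(before, dst, after[dst] - s)  # backward: undo add
    return learned


def recover_states(trace: List[Instr], final_state: State) -> List[State]:
    """states[i] is the state before trace[i]; states[-1] is the crash state."""
    states: List[State] = [{r: None for r in REGS} for _ in trace]
    states.append(dict(final_state))
    changed = True
    while changed:  # repeat sweeps until nothing new is learned
        changed = False
        for i in reversed(range(len(trace))):   # backward sweep
            changed |= infer(trace[i], states[i], states[i + 1])
        for i in range(len(trace)):             # forward sweep
            changed |= infer(trace[i], states[i], states[i + 1])
    return states


trace: List[Instr] = [("load_imm", "b", 3), ("add", "a", "b"), ("load_imm", "b", 0)]
states = recover_states(trace, {"a": 10, "b": 0})
for i, s in enumerate(states):
    print(i, s)
```

In this run the crash state only contains `a = 10` and `b = 0`, yet the sketch recovers that `b` held 3 before it was overwritten and that `a` was 7 before the addition. The initial value of `b` stays unknown, since nothing in the trace constrains it. A real system also has to cope with inferred values that turn out to be wrong, which this toy does not attempt.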

The team has implemented and deployed REPT in Microsoft Windows, which is estimated to run on around a billion computers, and it is leveraged by Microsoft developers as a built-in feature of the Windows Debugger.