Duplicate bug reports and how to handle them

Three red and black striped bugs sitting on a plant stem.

One of my all-time favorite sayings comes from Paulo Coelho’s The Alchemist: “Everything that happens once can never happen again. But everything that happens twice will surely happen a third time.” It’s a delightful mix of impossible and totally true. And it’s a handy saying for dealing with bug reports. Many reports are unique based on some combination of platform, use case, input, and so on. You may not even be able to reliably reproduce some of them. Others, though, pop up over and over again. If you have an engaged user community, this means they will report this bug over and over again.

Reports aren’t bugs and duplicated isn’t duplicate

Not every bug in your software has a corresponding bug report in your tracker. Furthermore, not all of the bug reports in your tracker are bugs in your software. Some are questions, feature requests, upstream or downstream issues, spam, and so on.

It gets even worse. Not all duplicated bug reports are duplicate bug reports. What? For a bug report to truly be a duplicate, it has to be marked as such. Of course, two reports can be substantially the same and describe the same single bug, and you’re well within your rights to call them duplicates of each other. But to treat a bug report as a duplicate (in analysis, searches, etc.), it needs to be assigned a “duplicate” designation.

The dynamics of duplicate bug reports

Predicting the scale of bug reports is a challenge, to say the least. At a basic level, you can consider the number of bug reports for a project to be a function of code, community, and time. Generally speaking, lower-quality code will have more bugs, as will more complex code. The number of bug reports should roughly correlate to the number of bugs in the software. The degree of correlation is a reflection of the community. When the user community is larger and more engaged, more bugs will be found and reported. When the developer community is welcoming and receptive, users will be more comfortable filing bug reports. And, of course, not all bugs will be found immediately. As more time passes, more bug reports will come in.

Duplicate bug reports add even more complexity to this dynamic. Because a duplicated bug report has to be explicitly marked before it becomes a duplicate bug report, the number of reports plays a factor. In theory, you might expect the percentage of duplicate reports to be relatively steady and grow as the number of total reports grows. A larger set of reports makes it harder for people to find an existing report, so they end up filing a new one.

Instead, what seems to happen is that triagers are less able to find reports and so miss marking incoming bug reports as duplicate. (If triagers are unable to find reports, it’s likely true that reporters have the same problem. What’s probably happening is that the percentage of duplicated reports is going up at the same time as the percentage of duplicate reports drops. Tragic!) You can see this in an analysis of Fedora’s Bugzilla I did a few years ago.

Dot plot showing bug report totals and duplicate report percentages
Plot of duplicate report percentage (Y-axis) as a function of total reports (X-axis) for components in Fedora Bugzilla.

The figure above shows each component in Fedora’s Bugzilla with the component’s total (all-time) bug reports as the X coordinate and the percentage of reports marked duplicate as the Y coordinate. At roughly 1,000 total reports, we start to get more signal and less noise. The sample size is small, but the trend seems to be fairly steady around 30–40% until it drops off somewhere north of 6,000 total reports.

Attaching numbers to this gives you a sense of how difficult duplicate bug reports can be to manage.
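If you want to run the same kind of calculation against your own tracker, here is a minimal Python sketch. It is not the script behind the Fedora analysis. The sample data, the component names, and the "DUPLICATE" resolution string are all made up for illustration, so you would need to swap in whatever your tracker's export actually provides.

```python
from collections import Counter

# Hypothetical sample data: one (component, resolution) pair per bug report.
# The "DUPLICATE" string is an assumption; use your tracker's own marker.
reports = [
    ("editor", "DUPLICATE"),
    ("editor", "FIXED"),
    ("editor", "DUPLICATE"),
    ("editor", "WONTFIX"),
    ("installer", "FIXED"),
    ("installer", "DUPLICATE"),
]


def duplicate_percentages(reports, duplicate_marker="DUPLICATE"):
    """Return {component: (total_reports, percent_marked_duplicate)}."""
    totals = Counter()
    duplicates = Counter()
    for component, resolution in reports:
        totals[component] += 1
        if resolution == duplicate_marker:
            duplicates[component] += 1
    return {
        component: (total, 100.0 * duplicates[component] / total)
        for component, total in totals.items()
    }


for component, (total, percent) in sorted(duplicate_percentages(reports).items()):
    print(f"{component}: {total} reports, {percent:.1f}% marked duplicate")
```

Each component's total becomes its X coordinate and the percentage its Y coordinate. Note that this only counts reports that were marked as duplicates. Duplicated reports that nobody marked are invisible to it.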

Preventing duplicate bug reports

I’ve spent about 600 words telling you something that may or may not be interesting in the abstract, but doesn’t give you, a busy open source project leader, anything useful to work with. That changes now.

The natural question that arises is “how can I prevent duplicate bug reports?” After all, they’re a waste of everyone’s time. Now that you understand the complicated dynamics of duplicate bug reports, you recognize that there’s no easy answer. But there are steps you can take.

If your bug tracker supports templates (a consideration I didn’t list in the appendix to Program Management for Open Source Projects, but wish I had), include a link that reporters can use to search for existing reports, and remind them to do so. The Zizmor project’s template is a good example. Some trackers will also attempt to highlight potential existing reports as the reporter enters their report.

I also suggest leaving reports for fixed bugs open until the fix is released. (If I ever give the “Your bug tracker and you” talk again, I will add that.) “Released” can have a variety of definitions depending on how you get your software to the user. In general, though, “committed into the branch that will be used for the next release” does not count as “released.” Even if you specifically direct people to search for closed bug reports, some won’t. Plus, as time goes on and the number of closed reports increases, it will be harder for people to find the match.

I can’t let this section end without mentioning documentation. Just because a report is duplicated, that doesn’t mean it’s a valid report. Duplicated misunderstanding is a double-whammy of time waste. Good, discoverable documentation can help you prevent these sorts of duplicates.

Handling duplicate bug reports

When you see a new bug report come in that duplicates a previous report, you need to indicate that. Close it with a “duplicate” status, add a “duplicate” label, or whatever your tracker supports. But also reply to the reporter. Thank them for reporting the bug and let them know it’s a duplicate. Remember that they’re not trying to waste your time (or their own) and that we’re all just people trying to do the right thing. If you’re kind, the next bug report they file could be important. If you’re unkind, there may not be a next bug report.

If you’ve fixed the underlying bug but the fix isn’t in a release yet, you should make that clear. A “release-pending” label or status will help people see what’s coming. It goes back to the “people will have an easier time finding open reports than closed reports” idea that I mentioned previously. And I suspect that for most projects (especially ones that don’t support multiple releases simultaneously), duplicates will be relatively recent, so you can drop the “search for closed reports, too” instruction entirely.

When it comes to analyzing your bug reports, you’ll typically want to discard duplicate reports. If the question is “how many duplicate reports do we get?” then the duplicates matter. In general, they don’t. Duplicate reports are usually closed more quickly than other bug reports, so including duplicates in your analysis can distort the average and other statistical measures.
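To see how much duplicates can skew the numbers, here is a small Python sketch that computes time-to-close statistics with and without them. The data is invented for illustration, and the (days_to_close, is_duplicate) layout is an assumption about how you might export reports from your tracker.

```python
from statistics import mean, median

# Hypothetical sample data: (days_to_close, is_duplicate) per closed report.
closed_reports = [
    (30, False),
    (45, False),
    (60, False),
    (1, True),
    (2, True),
]


def close_time_stats(reports, include_duplicates=False):
    """Summarize days-to-close, leaving out duplicates unless asked not to."""
    days = [
        days_to_close
        for days_to_close, is_duplicate in reports
        if include_duplicates or not is_duplicate
    ]
    return {"count": len(days), "mean": mean(days), "median": median(days)}


print("without duplicates:", close_time_stats(closed_reports))
print("with duplicates:   ", close_time_stats(closed_reports, include_duplicates=True))
```

In this made-up sample, the two quickly closed duplicates pull the mean from 45 days down to 27.6 and the median from 45 down to 30. Excluding duplicates is the default here because that is usually what you want.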

This post’s featured photo by Erik Karits on Unsplash.

Ben is the Open Source Community Lead at Kusari. He formerly led open source messaging at Docker and was the Fedora Program Manager for five years. Ben is the author of Program Management for Open Source Projects. Ben is an Open Organization Ambassador and frequent conference speaker. His personal website is Funnel Fiasco.
