Findings From the White Hat Cyber Forecasting Tournament
A year and a half ago, we announced the White Hat Cyber Forecasting Tournament: a prediction platform for cybersecurity. The tournament was designed as part of a broader project to research alternative methods of gathering information about cybersecurity risks to increase the available public data in the cybersecurity industry and further its use of metrics.
In recent years, the importance of identifying and analyzing cybersecurity metrics at a federal level has become increasingly mainstream. There are ongoing efforts to pass congressional legislation that would create a Bureau of Cyber Statistics and to stand up frameworks for evaluating national cybersecurity. The Cyber Safety Review Board has just published its second in-depth analysis of a major cyber threat. Numerous private companies are working to collect and standardize metrics for cybersecurity performance. Still, there is a lack of consistent, high-level, and publicly available data. The tournament was our effort to explore a relatively unusual way of filling that gap.
The White Hat Cyber Forecasting Tournament asked over three dozen questions across a broad range of topics, including trends in geopolitics, government and government policy, events and incidents, industry statistics, and ransomware. The tournament ended on June 1, 2023. Here is what we learned. (A full record of results is also available at the bottom of this article.)
Hypothesis
In a series of earlier articles, we explored the concept of prediction markets for cybersecurity purposes and proposed that such a market could do three things:
- Aggregate information, turning private information into collective public wisdom and enabling us to better understand the industry.
- Decrease noise by allowing good information to rise while penalizing incorrect information.
- Serve as a testing ground—to create a space to directly compare information across sources and across time.
In establishing the White Hat Cyber Forecasting Tournament, our hope was that crowd-sourced predictions would be accurate and timely, and would allow us to draw preliminary conclusions about whether and how information drawn from prediction markets could offer lessons about the security of enterprises and systems. We partnered with a well-known prediction platform, Metaculus, to create a beta experiment to test our hypothesis.
Metaculus is an online forecasting platform and aggregation engine working to improve human reasoning and coordination on topics of global importance. By bringing together an international community and keeping score for thousands of forecasters, Metaculus enables the crowd-sourced creation of machine-learning-optimized aggregate predictions that help inform the public and support decision-making. To create the White Hat Forecasting Tournament, we identified key trends we wanted to track, or questions we wanted to ask, and then worked in partnership with Metaculus to scope and word each question according to their best practices and lessons from previous tournaments.
As described above, the goal of the experiment was to identify how, and under what circumstances, open-source, unclassified crowd-forecasting communities can generate accurate forecasts that make it easier for practitioners and policymakers to reach informed decisions that improve cybersecurity.
Tournament Results, by the Numbers
- 43 questions were asked in total: 38 were stand-alone questions, and the remaining five (questions 39 through 43) were five-part questions.
- Of these 43 questions:
  - Six resolved ambiguously; that is, because of gaps or inconsistencies in the available data, no answer could be given.
  - 37 resolved to a specific answer.
- Of the 37 questions that resolved to an answer:
  - 31 were binary (yes/no).
  - Six were continuous, resolving to a single number or date out of a range of possible values. (One of these resolved outside the specified bounds, which means it was scored as if it were a binary question.)
- Of the binary questions, the vast majority (30 of 31) resolved to “no,” and one resolved to “yes.”
  - Predictions for 28 of the 31 binary questions were on the “right side” of 50 percent, meaning the question resolved to “yes” when the prediction was above 50 percent and to “no” when the prediction was below 50 percent.
  - [Chart: examples of binary questions]
- Of the continuous questions, three forecasts were fairly close to the true answer and three were not.
  - [Chart: examples of continuous questions; the diamond symbol marks the true resolution value]
Quantitative Analysis
Assessing whether a forecast is “good” is surprisingly challenging. It is an assessment that relies on both the accuracy of forecasts and the difficulty of the questions. We can use mathematical accuracy scores to assess the accuracy of forecasts, but question difficulty is often subjective. Some questions are obviously easy. Others are clearly difficult. Many seem obvious with the benefit of hindsight but weren’t obvious at the time they were written. Otherwise, they would not have been asked!
One way to assess whether a set of forecasts is “good” is to compare their accuracy to a reference set of forecasts. In this case, we can compare the accuracy of forecasts in the White Hat Cyber Tournament to the accuracy of the more than 3,000 Metaculus questions that have been resolved over the past 8 years, a reference set that is remarkably well calibrated.
When we do this analysis, we find that the White Hat Cyber forecasts are more accurate than Metaculus usually is for binary questions. The community median prediction for the 32 questions scored as binary questions in the White Hat Cyber Tournament had an average accuracy score approximately 50 percent better than the average accuracy of all binary questions on Metaculus. The Metaculus Prediction for the White Hat questions, which uses a sophisticated model to calibrate and weight each user (giving greater weight to those who are systematically better at making predictions), had an average accuracy score approximately 90 percent better than the average score for the Metaculus Prediction across all Metaculus questions.
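The details of the Metaculus Prediction’s model are beyond the scope of this article, but the basic idea of performance weighting can be illustrated with a short sketch. The Python example below is our own illustration, not Metaculus’s algorithm, and the forecasts and track records it uses are hypothetical; it simply compares a plain community median with an aggregate that gives extra weight to forecasters with stronger past scores.

```python
import numpy as np

def weighted_community_forecast(probs, track_records):
    """Aggregate binary forecasts, giving more weight to forecasters
    with stronger historical accuracy.

    probs         -- each forecaster's probability of "yes"
    track_records -- a per-forecaster skill estimate (e.g., average past
                     log score); higher means historically more accurate

    This is an illustrative weighted average, not Metaculus's actual
    aggregation model.
    """
    probs = np.asarray(probs, dtype=float)
    skill = np.asarray(track_records, dtype=float)
    weights = np.exp(skill)          # better track record -> larger weight
    weights /= weights.sum()
    return float(np.dot(weights, probs))

# Hypothetical forecasts on a single binary question:
forecasts = [0.10, 0.25, 0.40, 0.70]   # each forecaster's probability of "yes"
history = [0.9, 0.6, 0.2, -0.3]        # hypothetical past average log scores

print(np.median(forecasts))                              # unweighted community median: 0.325
print(weighted_community_forecast(forecasts, history))   # pulled toward the better-scoring forecasters
```

In this hypothetical example, the weighted aggregate lands below the simple median because the forecasters with the better track records assigned lower probabilities, which is the kind of adjustment that weighting systematically better predictors is meant to produce.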
By contrast, forecasts on the White Hat Cyber continuous questions were less accurate than typical Metaculus performance. For the five continuous questions, the average accuracy for both the community median and the Metaculus Prediction was approximately 50 percent worse than across all continuous Metaculus questions. (Five questions, however, is not a substantial sample size, so these results may be less robust than they appear.)
Can we conclude whether the forecasts were “good”? Ultimately, it depends on our assessment of how difficult the White Hat Cyber questions were. If the binary questions were of comparable difficulty to the average Metaculus question, then the binary forecasts were quite good. If the questions were easier, then the increased accuracy is less impressive. The reverse is true for the continuous questions. The underperformance relative to the reference set may be explained by the questions’ difficulty. But, if the questions were not that difficult, then the accuracy was lacking. Ultimately, the best assessment of whether a forecast is good may not be found in the mathematical analysis at all, but rather in the forecast’s utility to decision-makers.
We were also interested in the background of participants in the tournament. There was some indication that more experienced Metaculus participants did better than newer ones. Around 35 percent of participants in the early portion of the tournament were new Metaculus users, though it was unclear how many joined specifically for the cyber questions as opposed to arriving through regular Metaculus growth. For comparison, in the quarter leading up to and including the launch of the White Hat Cyber Tournament, the number of forecasters on Metaculus grew by 25 percent month over month.
Preliminary Lessons
Lesson 1
More often than not, question outcomes aligned with the status quo: That is to say, for the most part, we tracked non-events. For example, questions about novel crises or major events (“Will there be a successful cyberattack on the 2022 men’s FIFA World Cup?” “Will a popular online identity verification service be breached in 2022?” “Will a DDoS attack of greater than 3.5 Tbps occur in 2022?”) resolved to “no.” Perhaps this was to be expected, but our hypothesis was that experienced practitioners in cybersecurity would have a better handle on the likelihood of such events than would the general public.
Our tentative results suggest, however, that the value of correctly predicting non-events outweighed the prospect of foreseeing an unlikely event. Indeed, even when a significant event had happened at least once before, the correct answer in the tournament often resolved to “no” (“Will it become public that the FBI sought a warrant to launch an operation to disrupt web shells on private computers in 2022?” which had happened once before, in 2021; “Will an impactful penetration of a software supply chain be discovered by a party other than themselves in 2022?” which had happened most recently in the SolarWinds breach).
This seems to indicate, preliminarily, that major cyber events are relatively rare, even if they receive outsized attention in the public eye. It may also indicate that the higher rate of accuracy in this tournament, relative to a baseline average of other Metaculus tournaments, was a result of many black swan questions resolving to “no.” In turn, the high overall accuracy may indicate that forecasts provide a more grounded picture of future cyber developments, rather than one driven by hype, which would be a possible benefit of a prediction market. To put it another way, it may be that major cybersecurity incidents are black swans (that is, events that come as a surprise and have a major effect), and that prediction markets are more useful for predicting non-black-swan trends.
Lesson 2
On several occasions, outcomes fell outside the timelines within which the questions were originally scoped. Some questions resolved much more quickly than anticipated (such as “Will the U.S. Congress pass into law a cyber incident reporting act for entities that own or operate critical infrastructure before June 1, 2022?” which resolved in the window between when the question was written and when it was posted). Others clearly resolved on an open-ended time scale far too long for this particular tournament (for example, “When will the CMMC (Cybersecurity Maturity Model Certification) 2.0 rule-making process conclude?”).
This is not unusual for a prediction platform. In total, approximately 18 percent of the White Hat Tournament’s questions resolved ambiguously, compared to about 8 percent for Metaculus’s other questions. (Again, however, this rests on a small sample size: 18 percent is six questions.)
However, we have enough experience to suspect that the indefiniteness may also be an inherent characteristic of the unpredictable cyber domain rather than simply a generic limitation of prediction platforms. If the cyber domain is indeed the larger factor, that suggests prediction markets on policy-relevant cybersecurity questions require carefully scoped timelines in order to deliver useful forecasts. Any future effort to test prediction markets for cybersecurity will need to work harder, and more systematically, to scope the timelines for the questions it seeks to resolve.
Lesson 3
A number of questions resolved ambiguously because information that we anticipated being available across multiple years was discontinued (for example, “What percent of reported ransomware attacks in Q1 2022 will be STOP/Djvu?” from Emsisoft’s quarterly ransomware reports; “What will the average cost of a ransomware kit be in 2022?” from Microsoft’s Digital Defense reports; and “What percent of global organizations will experience at least 10 successful email-based phishing attacks in 2022?” from Proofpoint’s State of the Phish).
While, again, our sample size is small, this is a notable trend: Even the entities that most consistently publish cybersecurity data on an annual basis do not appear to publish the same statistics from year to year. These are private companies, and they choose which data to publish—just because a certain statistic was reported in one year does not mean that it will be tracked in the next year.
This emphasizes, yet again, the need for a more robust set of data upon which policymakers can rely.
Recommendations for Future Attempts
Recommendation 1: Identify a baseline.
Alongside any future efforts, recruit industry experts to serve as a “control” group. Catalog their impressions of what is and is not expected, so that any future tournament’s conclusions can be compared to what experts believed would happen. This would enable policymakers to determine how much value the forecasts from the prediction platform truly offered.
Recommendation 2: Fewer “black swan” questions.
As explained above, while we hear frequently about the most impactful cyberattacks, they appear to be genuinely rare and difficult to predict. More continuum-based questions could offer more opportunities to capture nuanced predictions about ongoing trends and risks.
In an ideal world, future prediction markets would also have questions fed in from the private sector (including insurance providers and threat analysts) and public sector (policymakers and researchers) to ensure a broader range of inputs and perspectives.
Recommendation 3: Recruit more participants from a broader audience.
While a number of questions received substantial participation and a high volume of forecasts, others, particularly the more technical questions, were rather “thinly” answered. A broader pool of recruited participants, incentivized to forecast on a more regular basis, might help mitigate this problem.
So… Did the Experiment Work?
Yes! And no.
There is no such thing as a failed inquiry—we formed a hypothesis and asked the question: Broadly speaking, can prediction markets help with cybersecurity efforts?
Our answer is a mixed bag. There are some significant indicators of partial success: It is at least plausible that the tournament’s greater accuracy is partially attributable to forecasters’ greater familiarity with the underlying subject matter. That suggests that cybersecurity predictions might have a systematic advantage over predictions in less specialized areas of knowledge. To the extent this is so, our results offer some promise of future utility.
However, there are also some cautionary indicators that a public prediction platform totally dedicated to cybersecurity forecasting will have additional challenges that aren’t easy to mitigate. Data availability looks to be a significant problem, as does the black swan nature of some cyber events. From this perspective, a prediction market will be of less value than we might have hoped.
So, if our fundamental research question is whether or not prediction markets can help with cybersecurity, our answer at this juncture is a tentative “maybe.” For such a limited experiment, we should probably not have expected anything more definitive.
Additional Information
Metaculus uses log scores to assess accuracy. For binary questions, a forecast of 50 percent produces a score of 0, while a fully confident forecast (0 percent or 100 percent) that correctly predicts the outcome produces a score of 1. For continuous questions, a flat prediction (one that treats all outcomes as equally likely) produces a score of 0, and a perfect prediction can produce a log score of around 6. Inaccurate predictions produce negative log scores.
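For readers who want the mechanics, here is a minimal Python sketch of a binary log score consistent with the description above: a base-2 log score measured relative to a 50 percent baseline. It is our illustration rather than Metaculus’s exact scoring code, and the example probabilities are hypothetical.

```python
import math

def binary_log_score(p_yes: float, outcome: bool) -> float:
    """Base-2 log score relative to a 50 percent baseline.

    A 50 percent forecast scores 0; a fully confident, correct forecast
    scores 1; confidently wrong forecasts score negative. Probabilities
    are clamped away from 0 and 1 to avoid infinite penalties.
    """
    p = min(max(p_yes, 0.001), 0.999)        # clamp extreme forecasts
    p_assigned = p if outcome else 1.0 - p   # probability placed on what actually happened
    return math.log2(p_assigned) - math.log2(0.5)

# Hypothetical forecasts, not actual tournament data:
print(binary_log_score(0.50, True))    # 0.0   (no better than a coin flip)
print(binary_log_score(0.10, False))   # ~0.85 (confident "no" that resolves "no")
print(binary_log_score(0.90, False))   # ~-2.3 (confident "yes" that resolves "no")
```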
See the tables below for the average log scores from the community prediction and Metaculus Prediction for the White Hat Cyber Tournament and for all other questions on Metaculus. Scores for each question were computed as the average score over the lifetime of the question. Note that the log scores between binary and continuous questions cannot be compared directly.
Average Log Scores for Binary Questions
| Aggregation | White Hat Cyber (32 Questions) | All Metaculus Questions (~1,500 Questions) | Percent Difference |
| --- | --- | --- | --- |
| Community Prediction | 0.43 | 0.29 | 51% |
| Metaculus Prediction | 0.64 | 0.33 | 92% |
Average Log Scores for Continuous Questions
| Aggregation | White Hat Cyber (5 Questions) | All Metaculus Questions (~1,300 Questions) | Percent Difference |
| --- | --- | --- | --- |
| Community Prediction | 0.70 | 1.44 | -52% |
| Metaculus Prediction | 0.75 | 1.44 | -48% |
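The “Percent Difference” column is the relative change of the White Hat average over the corresponding all-Metaculus average. The short sketch below recomputes it from the rounded values shown in the tables; the published percentages were presumably calculated from unrounded scores, so small discrepancies are expected.

```python
def percent_difference(white_hat_avg: float, all_metaculus_avg: float) -> float:
    """Relative change of the White Hat average over the all-Metaculus baseline, in percent."""
    return 100.0 * (white_hat_avg - all_metaculus_avg) / all_metaculus_avg

# Recomputed from the rounded table values; published figures likely used unrounded scores.
print(round(percent_difference(0.43, 0.29)))   # binary, Community Prediction:     ~48 (table: 51%)
print(round(percent_difference(0.64, 0.33)))   # binary, Metaculus Prediction:     ~94 (table: 92%)
print(round(percent_difference(0.70, 1.44)))   # continuous, Community Prediction: ~-51 (table: -52%)
print(round(percent_difference(0.75, 1.44)))   # continuous, Metaculus Prediction: ~-48 (table: -48%)
```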
We end with a warm thank you to Metaculus—and particularly to Gaia, Tom, Alyssa, Ryan, and Alex—for sponsoring this tournament and helping us build and analyze our questions and data set.