Calculating the probability a model is broken from one bad prediction

Let’s say you have a model that gives you the probability that events will happen. All you know about the model is that it says a certain event has a one in a million chance of happening, and then that event does happen. What are the chances that the model is broken?

I had a discussion with baseball stats guru Tom Tango about this on Twitter: (see ensuing thread)

Tom’s point, which I agree with, is that models are not going to be “right” all the time. There’s only a ~3% chance of rolling double 6’s on a pair of dice, but if you pick up a pair of dice and roll double 6’s you probably don’t think that the dice are unfair! And just ask Nate Silver about his 2016 election model’s prediction that Trump only had a 30% chance of winning; just because that happened doesn’t mean that the prediction was wrong.

But, if the only thing you know about a model is that it gave an event a one in a million chance of happening, and then it happened I think you have to think it’s more likely that the model is wrong. Let’s try to do some math to figure this out.

I think the best way of doing this is using Bayesian inference. Let’s try to break this down. Say

  • P(M) is the probability a model is correct
  • P(E) is the probability that when the model predicted an event had a one in a million change of happening, that event happened.

So what we want to figure out P(M|E). (the probability of M given that E happened) Bayes theorem tells us that

P(M|E) = P(E|M)*P(M)/P(E)

and

  • P(E|M) = the probability that E happened given that the model was correct, which is one in a million.
  • P(M) = this is the prior probability that the model is correct. This depends on what you know about the model, but let’s say you wrote the model yourself and you’re 99% sure there are no bugs 🙂
  • P(E) = ummm…this seems hard to evaluate. Let’s try to break it down; either the model is correct or it isn’t, so

P(E) = P(E|M)*P(M) + P(E|~M)*P(~M)

We already know P(E|M) and P(M) from above, and P(~M) is 1-P(M)=.01. But what is P(E|~M)? If the model is wrong, what’s the probability that the “one in a million” thing happened? This seems to require knowing how likely the event really is to happen, and if we knew that we wouldn’t need a model! I guess we can use an extremely naive estimate – either the event happens or it doesn’t, so P(E|~M) = 0.5. (Edit: on Facebook, Gary pointed out that one way to handle this is to define what’s “sufficiently wrong”, since if the real probability is 1/999,999 we probably wouldn’t call the model incorrect. Then you can use that probability, for example 1/500,000 here, which makes a lot of sense to me!) I am skeptical this is the right way to do it, but, this makes

P(E) = 0.000001 * 0.99 + 0.5 * 0.01 = 0.00500099

and

P(M|E) = 0.00000099/0.00500099 = 0.000198

or .01%, so there’s very little chance the model is right.

A few odds and ends:

  • I tried reading more about Bayesian inference to figure out what to do about P(E) but didn’t find anything helpful. If anyone knows, please comment below!
  • I think the general lesson is that you want your model to make lots of predictions to see if it’s calibrated well. If your model predicts things that are more likely than 50% to happen and is right, you can do the same sort of calculation here to get more confident it’s correct, and build up a buffer against very wrong predictions like this.
  • But probably the best way to do this is to do what 538 does, make lots and lots of predictions, and analyze them to see if they’re well-calibrated. Of course, to do this for events that have probabilities like one in a million, you’d have to make at least a million predictions, which is tough.
  • I think this also drives home that a one in a million thing happening is very very very rare, and we shouldn’t underestimate that. Just as a random reference, perfect games in baseball are very rare and they seem to have about a 1 in 10000 chance of happening – 100 times more likely than one in a million!

State Election Map updates for 2020!

I made some updates to my State Election Map! The most obvious one is including the 2020 data, just in time for the Electoral College vote today. But wait, there’s more!

Each year now includes:

  • Tipping point state – this is the “decisive” state that provides the Electoral College win. More specifically, I use FiveThirtyEight’s definition, which is to sort every state by the election winner’s margin, and add up electoral votes until they get to 270. (with one exception, see note below about Maine/Nebraska) If you subtract this state’s margin from the overall national popular vote you can calculate the electoral college split. This is one way to look at how close the election was.
  • Closest state (by percentage) – Simply the state that was closest to a tie. (by percentage and not raw votes)
  • Minimum number of additional losing-party votes so the losing-party would have won: If you had a magic wand and could conjure up votes for the losing party to make them win, in what states would you add those votes to have to conjure up the minimum number possible? This is another measure of how close the election was; for example, in 1976 Carter won 297-241, and the tipping point state was Wisconsin which was D+1.7%, but a mere 18,490 more Republican votes in Ohio and Hawaii(!) would have let Ford win, which is extremely close.
  • Graph of how each state changed since the last election: This data was already available on a per-state basis (by clicking on the state), but seeing it on a per-year basis can show some interesting trends.

Note about Maine and Nebraska – these states are alone in allocating their electoral votes per congressional district, which is an irritant for visualizations like these 🙂 (and why I hadn’t wanted to add electoral vote data before) My solution was lazy, simple, and only a little wrong: I show the correct EV total for each election, but for the rest of the EV calculations ignore the split votes. Edit: this has been fixed now, see this update.

Some interesting observations of the data:

  • I added the “minimum number of votes to change the result” metric because I remember hearing that about Trump’s win in 2016 – if Clinton had gotten just 77,000 or so more votes in Michigan, Pennsylvania, and Wisconsin, she would have won, which is much closer than the 306-232 electoral vote victory. Which is true! But it turns out that by this metric, Biden’s win in 2020 was even closer – he had the same 306-232 electoral vote victory, but only ~43,000 more votes would have made Trump tie in the electoral college and presumably win given that the House of Representatives then votes on the winner, but each state gets one vote.
  • The split between the popular vote and tipping point state, which was just under 3% in 2016 and talked about a lot, increased to almost 4% in 2020, which is pretty scary! Part of the problem is that Pennsylvania and Wisconsin got a little more Republican-leaning relative to the popular vote. (although Michigan became a little more Democratic-leaning)
  • You can compare the 2020 election to 2016 and get an idea of which states disliked Clinton the most. Obviously this isn’t an exact science, but the clear winner to me, to my surprise, is Vermont! It’s the state with the biggest D shift from 2016 to 2020 (+9.0%), and it doesn’t seem to be because the state is generally trending Democratic. (like Colorado, which had the second highest D shift) Conveniently, the third highest D shift was in Delaware, Biden’s home state.
  • The new graph makes it clear how much Utahns didn’t like Trump – the shift from 2012 (when, admittedly, Romney was running) to 2016 was D+30%! That’s the biggest shift between elections since Hawaii moved D+36.5% between 2004 and 2008. I think there’s research showing that home state effects are larger if your state is smaller, which these seem to be consistent with.

The “minimum number of votes to change the result” calculation turned out to be more interesting than I thought. (I’m going to assume that an R won the election for the sake of concreteness) At first I thought it would be a simple matter of taking the closest state that the R won by percentage, add however many votes it takes to make the D win, and repeat with the next closest state until the overall result changes. But there are two reasons this won’t work:

  • You might end up adding unnecessary votes. If the R won the electoral votes 280-258, and the closest state the R won had 3 EV, and the next closest state the R won had 12 EV, there’s no need to flip the votes in the 3 EV state – the 12 EV state is sufficient by itself!
  • Since a state’s EVs aren’t exactly proportional to its population (much less the number of people who voted), just looking at the closest states by percentage may not be the right order to look at things.

After thinking about this a little, it sounded a lot like a 0-1 knapsack problem, but not quite. The knapsack problem means you’re picking some things to add to a knapsack that can only hold a certain number of pounds, and you want to maximize the value of the items. In this case, we’re trying to pick states so that their “weight” (EVs) is greater than a certain value instead of less than, and we’re trying to minimize the “value” (number of votes to flip) instead of maximize.

I thought about trying to use negative weights and values to turn this into the knapsack problem, but I couldn’t get it to work out. Eventually I realized that we could flip it around – look for the states to keep in the R’s column that have at most 268 EVs, and maximize the votes the R has, which is exactly the same as what we want.

The knapsack problem is NP-complete, and I was worried it would be slow to run, and also I love Rust now (see my Rust posts), so I decided to calculate it offline and just load the results in the UI. I found an example implementation of the knapsack problem in Rust and adapted it for my use (see my implementation here). It turns out with a maximum weight of 268 and only at most 50 states, the whole algorithm runs in less than a second, so this might have been overkill. But it was fun!

clue solver now in Rust with more accurate simulations!

linked to by This Week in Rust!

Here’s a new version of my clue solver with the engine rewritten in Rust and more accurate simulations!

I wrote the original Python engine in 2007 and haven’t really touched it since then (other than to upgrade it to Python 3), so rewriting it in Rust was an interesting experience. Here are some categorized thoughts about Rust, software development, and simulations/math:

Rust

Rust is my new favorite language. (you can read my initial thoughts about it here, and all my Rust posts here) Rewriting the engine in Rust wasn’t particularly tricky, with a few exceptions I’ll talk about below, and it sped up simulations by over 4x without any real effort! (on my desktop machine a simulation went from taking ~2.2 seconds to ~0.5 seconds, although I made some changes after that I’ll talk about in the Simulations section)

  • I’m really digging the feel of Rust; it’s kind of a low-level language (you have very fine control of where copies happen) with the convenience of lots of helpful libraries and nice language features.
  • Having to declare which variables/references are mutable versus not is not only helpful for performance/safety, it also makes for nice documentation on methods. I noticed this because more than one place in the Python code I had written comments saying I had to make a copy to avoid modifying the original. This is similar to const-correctness in C++ which is also nice but less enforced by the language.
  • I declared an enum for all the possible cards. (which is safer than the strings I used in Python; I’m eternally afraid of typos!) But in Rust, an enum is more powerful than a C/C#/Java enum; it can have arbitrary data associated with it. This is a handy feature, but it means that the language is careful about the difference between an enum value and a reference to an enum value when in my case it doesn’t matter, since they’re the same size. I was constantly using card when I really wanted &card or *card. Luckily the compiler error messages are incredibly helpful so it was always clear what I needed to use, but it was still annoying.
  • The ? operator makes handling errors so easy! Part of this project was also rewriting the CGI script in Rust, which meant handling all sorts of errors if the right query parameters aren’t specified and whatnot. But ? makes easy to early return if a suboperation fails and made the CGI code much nicer!
  • Using Visual Studio Code for Rust development using the rust-analyzer extension was pretty nice – it annotates variables with types and method parameters with their names! The only downside was that sometimes when I pasted some code in it would freeze for a few seconds, although this didn’t seem to happen all the time.
  • One note – while porting the simulation algorithm to Rust increased its speed by over 4x, this is for the release version. The debug version in Rust was actually 3x slower than the Python version! So just a reminder to always benchmark in release mode 🙂
  • Rust also does backwards type inference, which is pretty cool. So in a method you can declare a variable with let mut some_variable; then do some stuff with it, then return it, and Rust will figure out it must be the same type as the return type of the method. (and with rust-analyzer it will show the type in the IDE!)
  • I did run into some Rust-specific hurdles when I was declaring my data structures. In Python the ClueEngine class has a list of CluePlayer‘s, and each player has a reference back to the engine. In Rust this is tricky to do because of lifetime management and…stuff. I read some articles about how graphs are usually declared in Rust (a similar problem) and tried to copy that but it never quite worked. At one point my Rust ClueEngine had a RefCell<Vec<Rc<RefCell<PlayerData>>>> which I’m guessing is wrong 🙂 Anyway, even if it had worked using Rc and RefCell means that you lose the safety of compile-time checking and get run-time checking instead, which is scary. It turned out I just had to restructure things a little bit to avoid needing a reference from CluePlayer to ClueEngine and it was well worth it.
  • After I fixed the simulation problem I was able to parallelize it using Rayon (like I did for the population centers project) pretty easily! It resulted in about a 5x speedup from the non-parallel version.

General

One big difference I noticed from the Python code I wrote in 2007 was that I was much more eager to add unit-type tests. I had an OK set of integration tests covering some bugs I had had to fix in Python, but I expanded that a bunch. Test counts aren’t everything, but I had 20 tests in Python and I have 65 engine tests in Rust, plus 40 tests for the CGI server in Rust. (I didn’t have any CGI tests before) These tests gave me a lot of confidence to make performance improvements in Rust, as well as tweak the simulation algorithm.

Simulations

I had suspicions that simulations were wrong before but wasn’t sure what exactly was wrong. Let me walk through a simple example of what the old algorithm said and how I fixed it.

Let’s say you’re at the very beginning of a 6-player game; you’re Player 2. You don’t have any suspect cards, so as far as you’re concerned each suspect has a 16% chance of being the solution. Player 1 makes a suggestion (ProfessorPlum, Knife, Hall) and it goes by all the way to Player 6, who shows a card, but you don’t see what it is.

The key question here is: according to you, has the chance that ProfessorPlum is the solution gone up or stayed the same?

This sounds a lot like the Monty Hall Problem, but it’s not the same because there’s no host manipulating things. The answer that after talking to David, we’re pretty convinced about is that the chance that Player 6 has ProfessorPlum goes up significantly, and the chance that ProfessorPlum is the solution goes up but not as much.

The old algorithm did not show the solution chance going up at all. Here’s roughly how the old algorithm worked:

  • Look at what cards might still be the solution (so omit cards that we know a player has)
  • For each solution combination (Suspect, Weapon, Room):
    • Update the engine’s state to say this combination is the solution.
    • Run N simulations. For each simulation:
      • For each player, randomly assign her cards from the cards we don’t know who has, but skip ones we know she doesn’t have.
      • After this is all done, see if the engine is consistent. If so, increment the count for the current configuration.

The bold part was wrong. I added that to try to reduce the number of inconsistent simulations, but it biases the results. In the situation we constructed above, it means that every simulation will be consistent, and since we weight all solutions equally, the solutions will all have equal probability. Perhaps there’s some way to do a Markov Chain Monte Carlo-like thing where we could ensure the solution is valid but weight it appropriately, but that would take more research.

The fix was pretty simple – just take the bold part out and assign the cards blindly. And that’s pretty much it! Because we’re throwing out more inconsistent simulations now, I increased the number we run from 2,000 to 20,000, but then got most of that performance back by parallelizing the algorithm.

There was another small fix when I realized that we have to extend the “don’t try to be too clever and avoid inconsistent states” even back to when we assign the solution – otherwise if we know either Player 6 or the Solution has ProfessorPlum, once we pick a non-ProfessorPlum solution we will know Player 6 has ProfessorPlum, and we shouldn’t start with that knowledge when assigning cards.

I am a little worried that even 20,000 simulations won’t be enough for more complicated game states, so I might play around with that some more.

Anyway, probability is hard!

ProbabilityToFriendlyString with qualitiative descriptions (and in 8 languages!)

TL;DRProbabilityToFriendlyString now also returns a qualitative description of a probability (“Still possible”, “Flip a coin”, “Good chance”, etc.). See a demo here, and credit to the New York Times for coming up with these!

After the initial release of ProbabilityToFriendlyString, I periodically ported it to more languages, and it’s now done in 8. (JavaScript, Python, C#, Java, Ruby, PHP, Rust, and LabVIEW NXG) I did this as a way of refreshing my knowledge about these languages – Rust was the only one I had never used before, although it had been a long time for some of them! It was also neat to learn how to publish packages to various repositories. I originally was trying to do things in a very idiomatic way in each language, but I’m not sure I really succeeded at that. It also makes a very nice GitHub language bar:

Anyway, when I was obsessively watching the New York Times’ Iowa caucus projects (…yeah) together with the probability there was a qualitative description (the “Eventual winner?” column), and it seemed like a nice addition to the library. The downside is that it’s kind of a pain to update logic and tests for 8 different languages!

I originally was just watching the page and recording a bunch of data so I could derive the rules for what string gets used when. After doing that for a while (and getting frustrated because the whole Iowa mess was making the page not update frequently!) I did some spelunking and found the source, which made it very clear what the rules were. Fun stuff!

I also did this the proper Git way with branches and stuff, and discovered that Fork is a pretty nice GUI client! Previously I used the command-line and GitHub Desktop for my basic needs. At work I use Visual Studio for branching and stuff, and Fork is even more powerful but still pretty easy to use. Kudos!