Did the Astros set a record by challenging a play while up 13 runs? (no)

Yesterday the Astros beat the Angels 16-2. In the eighth inning, when they were only up 15-2, Dusty Baker challenged a play on the field, which I thought was pretty funny – since at least 1957, no visiting team has won the game after being down by 13 in the top of the 9th. (In fact, you have to go down to an 8 run lead before seeing that happen!)

Seeing this made me chuckle, and then wonder if that was a record for manager challenges. It had a good shot at it – manager challenges have only been around since 2014, and it’s not like there are a lot of games where a team is ahead by 13 runs!

So I added some code to my baseball win expectancy finder, and the answer is: almost, but not quite! In 2017 Oakland was ahead of Kansas City 14-0 in the bottom of the eighth inning and challenged a play (and got it overturned, too!) It’s also happened two other times with 13 run leads, although I didn’t bother to check whether the leading team was the one who challenged.
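The check itself is pretty simple once you have parsed play-by-play data. Here’s a minimal Python sketch of the idea – note that the half-inning tuple format here is made up for illustration, not the actual structures my win expectancy finder uses:

```python
# Find half-innings late in a game where the batting team trailed by a huge
# margin. The `games` format is hypothetical: each game is (game_id, list of
# (inning, is_bottom, home_score, visitor_score) at the start of each half).

def big_deficit_half_innings(games, deficit=13, min_inning=8):
    """Yield (game_id, inning) for half-innings from `min_inning` on where
    the batting team was behind by at least `deficit` runs."""
    for game_id, half_innings in games:
        for inning, is_bottom, home_score, visitor_score in half_innings:
            if inning < min_inning:
                continue
            # The home team bats in the bottom half, the visitors in the top.
            behind_by = (visitor_score - home_score) if is_bottom \
                        else (home_score - visitor_score)
            if behind_by >= deficit:
                yield game_id, inning
```

From there it’s just a matter of cross-referencing which of those games had a manager challenge.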

Once again, the lesson learned is that there have been many, many games of baseball played 🙂

Here’s the raw output for those who are interested.

Other articles I’ve written with this baseball data:

Why are so many runs scored in the bottom of the first inning?

After starting to look at some inning-by-inning data from my baseball win expectancy finder for another project, I stumbled across something weird that I can’t explain. Here’s a graph of expected runs scored per inning:

Check out how high the bottom of the first inning is – on average 0.6 runs are scored compared with 0.5 runs in the top of the first. That’s a huge difference! Here’s a graph of the difference:

Holy outlier, Batman! So what’s going on? Here are some ideas:

  • Teams score more in the first inning because the top of the lineup is at bat – this is true! You can see in the top graph that the expected runs scored in the first inning is the highest for both the home and visiting teams. (see this Beyond the Box Score article that discusses this) But that doesn’t come close to explaining why the home team does so much better than the visiting team!
  • Starting pitchers are more likely to have a terrible first inning – This might be true, but I can’t think of any reason why this would affect visiting starting pitchers more than home starting pitchers. I also made a graph of the home advantage for each number of runs scored for the first and third inning (I picked the third inning because that’s the second-greatest difference between home and visitor):

To me, these look almost exactly the same shape, so it’s not like the first inning has way more 6 run innings or anything.

  • This is just random chance – I guess that’s possible, but the effect seems large given that the data includes more than 130,000 games.
  • There’s a bug in my code – I’ve been writing code for 20 years, and let me tell you: this is certainly possible! In fact, I found a bug in handling walkoff innings in the existing runs per inning code after seeing some weird results in this investigation. But it would be weird to have a bug that just affects the bottom of the 1st inning, since it isn’t at the start or end of the game. I also implemented it in both Rust and Python, and the results match. But feel free to check – the Rust version is StatsRunExpectancyPerInningByInningReport in reports.rs, and the Python version is StatsRunExpectancyPerInningByInningReport in parseretrosheet.py.
  • This is different between baseball eras – I don’t know why this would be true, but it was easy enough to test out, and the difference is pretty consistent. (see the raw data)
  • The fact that home teams are usually better in the playoffs biases this – I think this is a tiny bit true, but I reran the numbers with only regular season games (where the better team has no correlation with whether it’s the home or visiting team) and the difference looks almost exactly the same.
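For anyone who wants to sanity-check the aggregation itself, it’s a straightforward group-and-average; here’s a sketch in Python (the half-inning tuple format is hypothetical – the real implementations are StatsRunExpectancyPerInningByInningReport in reports.rs and parseretrosheet.py):

```python
from collections import defaultdict

def expected_runs_per_half_inning(half_innings):
    """half_innings: iterable of (inning, is_bottom, runs_scored), one entry
    per completed half-inning across many games (hypothetical format).
    Returns {(inning, is_bottom): mean runs scored in that half-inning}."""
    totals = defaultdict(lambda: [0, 0])  # (run_sum, half_inning_count)
    for inning, is_bottom, runs in half_innings:
        entry = totals[(inning, is_bottom)]
        entry[0] += runs
        entry[1] += 1
    return {key: run_sum / count for key, (run_sum, count) in totals.items()}
```

The tricky parts are all in the parsing (walkoffs, forfeits, and so on), not in this aggregation – which is exactly where I found the bug mentioned above.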

So, in conclusion, I don’t know! If anyone has any ideas, I’d love to hear them on this post or on Twitter.

Edit: Ryan Pai suggested on Facebook that the visiting pitcher has to wait a while between warming up and pitching in the bottom of the 1st, which is an intriguing theory!

Odds and ends:

  • That top “expected runs per inning” graph has some other neat properties – for example, you can see that the 2nd inning is the lowest scoring inning, presumably because someone near the bottom of the lineup is usually up.
  • Another thing you can see is how robust the home field advantage is. In every inning the home team scores, on average, a little more than the visiting team!
  • The graph only shows 8 innings because in the 9th inning things get complicated. For one thing, the bottom of the 9th inning only happens if the home team is behind or tied, which biases the sample somewhat. Also, if the game is tied and the home team hits a leadoff home run, they win the game but lose the opportunity to score any more runs.
  • You can also notice the strangeness of the bottom of the 1st inning another way. If you look at the chance that the home team will win when the game is tied, their chances are better at the beginning of the bottom of the 9th than the bottom of the 8th, because they have an extra chance to bat. That advantage gets lower the earlier in the game you go, with one exception. In the bottom of the 1st, the home team has a ~59% chance to win, but in the bottom of the 2nd that goes down to ~58%! Apparently, if the home team misses its chance to score runs in the bottom of the 1st, it has missed a big opportunity!
  • The raw report data is here in the GitHub repo.
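Those “win percentage when tied entering the bottom of an inning” numbers boil down to a filter-and-count over game snapshots; here’s a sketch (the snapshot format is hypothetical, not the actual report code):

```python
def home_win_pct_when_tied(snapshots):
    """snapshots: iterable of (home_score, visitor_score, home_won), one per
    game, taken at the start of some fixed bottom half-inning (hypothetical
    format). Returns the home team's win fraction in tied games, or None if
    there were no tied games in the sample."""
    tied_outcomes = [home_won for home, visitor, home_won in snapshots
                     if home == visitor]
    if not tied_outcomes:
        return None
    return sum(tied_outcomes) / len(tied_outcomes)
```

Run this once per inning and you get the ~59% (bottom of the 1st) versus ~58% (bottom of the 2nd) comparison above.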

Technically Wrong: Sexist Apps, Biased Algorithms, and Other Threats of Toxic Tech review

Technically Wrong: Sexist Apps, Biased Algorithms, and Other Threats of Toxic Tech by Sara Wachter-Boettcher

My rating: 3 of 5 stars

I have mixed feelings about this book. I feel like if you’re designing consumer products in the tech industry you should definitely read this book. Or conversely if you use social networks, etc. but haven’t thought about how much influence design decisions can have, you’d probably get a lot out of this book. But I’m kind of in the squishy middle, where I’ve heard a lot about this sort of stuff, but it doesn’t apply to me directly.

Which is not to say I didn’t get anything out of this book. Wachter-Boettcher’s thesis is that the tech industry has convinced itself that it’s a meritocracy of the best and brightest, which means that tech companies:
– don’t make products that are biased, they’re just based on algorithms (as if algorithms can’t be biased!)
– aren’t sexist or racist, they just hire the best people for the job

Sadly, both of these points are entirely wrong, but based on what I read, they are indeed commonly held in the industry.

Odds and ends from the book:
– Wachter-Boettcher makes the point that a lot of design teams can just default to catering to the “average” user, treating other people as “edge cases”. But people aren’t really “average”; she cites the study done in the 1950s on Air Force fighter pilots, where researchers calculated the pilots’ average dimensions for their shoulders, chest, waist, etc. Not a single pilot was in the middle 30 percent on all ten measurements.
– The book is full of examples of companies just not thinking about these “edge cases”. One of the interesting ones is about names. The author has five names (including a hyphenated last name), so she’s had some experience in dealing with systems that can’t handle her name. She describes this as being like a microaggression, which makes sense to me. (I feel the same way whenever I see forms for the kids that ask for a mother and father…) This spins off into a discussion of Facebook’s policy that you have to use your real name. But there are a lot of reasons people don’t want to use their real names – political refugees, victims of stalking, drag queens, etc. Facebook did eventually bend on this policy, but it was still much easier for people to report you for using a fake name than for you to respond to it. Wachter-Boettcher also talks about how Facebook’s messages about this also became more user-friendly, from “Your Name Wasn’t Approved” to “Help Us Confirm Your Name”.
– There’s a discussion about racism on Nextdoor, the social network for your physical neighbors. This was an infamous problem; people would often report suspicious activity whenever they saw a non-white person, sigh. The book says that Nextdoor banned racial profiling, but also spent a long time redesigning the form that people used to report suspicious activity to emphasize clothing, hair, etc. instead of race. They also added rules that you can’t just specify someone’s race, and Nextdoor claims that all of these changes together reduced racial profiling by 75%. This came at the cost that a lot more people abandoned the new form without submitting it; most of the time this reduced “engagement” would be considered a terrible thing.
– Wachter-Boettcher talks about how in 2012 Google started letting you see what it thinks your interests, age, and gender are. (you can still see it here if you haven’t turned off ad personalization) And people realized that Google thought women who were interested in tech stuff (including the author!) were men. This isn’t a big surprise, because Google’s algorithm was trained on what it saw in the past – that more men than women are interested in tech stuff. But now this has a cascading effect where the algorithm thinks it’s even more likely that people interested in tech stuff are men!
– There’s a discussion of ProPublica’s investigation into the COMPAS algorithm that is used to predict recidivism in people convicted of crimes, and how ProPublica found that the algorithm is biased against black people.

View all my reviews

Calculating the probability a model is broken from one bad prediction

Let’s say you have a model that gives you the probability that events will happen. All you know about the model is that it says a certain event has a one in a million chance of happening, and then that event does happen. What are the chances that the model is broken?

I had a discussion with baseball stats guru Tom Tango about this on Twitter: (see ensuing thread)

Tom’s point, which I agree with, is that models are not going to be “right” all the time. There’s only a ~3% chance of rolling double 6’s on a pair of dice, but if you pick up a pair of dice and roll double 6’s you probably don’t think that the dice are unfair! And just ask Nate Silver about his 2016 election model’s prediction that Trump only had a 30% chance of winning; just because that happened doesn’t mean that the prediction was wrong.

But, if the only thing you know about a model is that it gave an event a one in a million chance of happening, and then it happened, I think you have to conclude it’s more likely that the model is wrong. Let’s try to do some math to figure this out.

I think the best way of doing this is using Bayesian inference. Let’s try to break this down. Say

  • P(M) is the probability a model is correct
  • P(E) is the probability that when the model predicted an event had a one in a million chance of happening, that event happened.

So what we want to figure out is P(M|E). (the probability of M given that E happened) Bayes’ theorem tells us that

P(M|E) = P(E|M)*P(M)/P(E)


  • P(E|M) = the probability that E happened given that the model was correct, which is one in a million.
  • P(M) = this is the prior probability that the model is correct. This depends on what you know about the model, but let’s say you wrote the model yourself and you’re 99% sure there are no bugs 🙂
  • P(E) = ummm…this seems hard to evaluate. Let’s try to break it down; either the model is correct or it isn’t, so

P(E) = P(E|M)*P(M) + P(E|~M)*P(~M)

We already know P(E|M) and P(M) from above, and P(~M) is 1-P(M)=.01. But what is P(E|~M)? If the model is wrong, what’s the probability that the “one in a million” thing happened? This seems to require knowing how likely the event really is to happen, and if we knew that we wouldn’t need a model! I guess we can use an extremely naive estimate – either the event happens or it doesn’t, so P(E|~M) = 0.5. (Edit: on Facebook, Gary pointed out that one way to handle this is to define what’s “sufficiently wrong”, since if the real probability is 1/999,999 we probably wouldn’t call the model incorrect. Then you can use that probability, for example 1/500,000 here, which makes a lot of sense to me!) I am skeptical this is the right way to do it, but this makes

P(E) = 0.000001 * 0.99 + 0.5 * 0.01 = 0.00500099


P(M|E) = 0.00000099/0.00500099 = 0.000198

or about 0.02%, so there’s very little chance the model is right.
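The whole calculation fits in a few lines of Python, using the same numbers as above:

```python
def prob_model_correct(p_event_if_correct, prior_correct, p_event_if_wrong):
    """Bayes' theorem: P(M|E) = P(E|M) * P(M) / P(E), where
    P(E) = P(E|M) * P(M) + P(E|~M) * P(~M)."""
    p_event = (p_event_if_correct * prior_correct
               + p_event_if_wrong * (1 - prior_correct))
    return p_event_if_correct * prior_correct / p_event

# One-in-a-million prediction, 99% prior that the model is correct,
# and the naive 0.5 estimate for P(E|~M).
posterior = prob_model_correct(1e-6, 0.99, 0.5)
```

This also makes it easy to play with Gary’s suggestion – just swap in 1/500,000 for the last argument.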

A few odds and ends:

  • I tried reading more about Bayesian inference to figure out what to do about P(E) but didn’t find anything helpful. If anyone knows, please comment below!
  • I think the general lesson is that you want your model to make lots of predictions to see if it’s calibrated well. If your model predicts events with better than 50% probability and they happen, you can do the same sort of calculation as here to get more confident it’s correct, and build up a buffer against very wrong predictions like this one.
  • But probably the best way to do this is to do what 538 does, make lots and lots of predictions, and analyze them to see if they’re well-calibrated. Of course, to do this for events that have probabilities like one in a million, you’d have to make at least a million predictions, which is tough.
  • I think this also drives home that a one in a million thing happening is very very very rare, and we shouldn’t underestimate that. Just as a random reference, perfect games in baseball are very rare and they seem to have about a 1 in 10000 chance of happening – 100 times more likely than one in a million!
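As a sketch of what that kind of calibration check might look like (the bucketing scheme here is my own simple version, not what 538 actually does):

```python
from collections import defaultdict

def calibration_table(predictions, n_bins=10):
    """predictions: iterable of (predicted_probability, happened) pairs.
    Buckets predictions by predicted probability and compares each bucket's
    mean prediction with its observed frequency; for a well-calibrated
    model the two numbers should roughly agree in every bucket."""
    bins = defaultdict(list)
    for p, happened in predictions:
        bins[min(int(p * n_bins), n_bins - 1)].append((p, happened))
    table = {}
    for b, items in sorted(bins.items()):
        mean_predicted = sum(p for p, _ in items) / len(items)
        observed = sum(1 for _, happened in items if happened) / len(items)
        table[b] = (mean_predicted, observed)
    return table
```

Of course, as noted above, this only works for buckets where you have enough predictions – a one in a million bucket needs on the order of a million predictions before the observed frequency means anything.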