After starting to look at some inning-by-inning data from my baseball win expectancy finder for another project, I stumbled across something weird that I can’t explain. Here’s a graph of expected runs scored per inning:
Check out how high the bottom of the first inning is – on average 0.6 runs are scored compared with 0.5 runs in the top of the first. That’s a huge difference! Here’s a graph of the difference:
Holy outlier, Batman! So what’s going on? Here are some ideas:
- Teams score more in the first inning because the top of the lineup is at bat – this is true! You can see in the top graph that expected runs scored in the first inning are the highest for both the home and visiting teams (see this Beyond the Box Score article that discusses this). But that doesn’t come close to explaining why the home team does so much better than the visiting team!
- Starting pitchers are more likely to have a terrible first inning – This might be true, but I can’t think of any reason why this would affect visiting starting pitchers more than home starting pitchers. I also made a graph of the home advantage for each number of runs scored for the first and third inning (I picked the third inning because that’s the second-greatest difference between home and visitor):
To me, these look almost exactly the same shape, so it’s not like the first inning has way more 6-run innings or anything.
- This is just random chance – I guess that’s possible, but the effect seems large given that the data has more than 130,000 games.
- There’s a bug in my code – I’ve been writing code for 20 years, and let me tell you: this is certainly possible! In fact, I found a bug in handling walkoff innings in the existing runs per inning code after seeing some weird results in this investigation. But it would be weird to have a bug that just affects the bottom of the 1st inning, since it isn’t at the start or end of the game. I also implemented it in both Rust and Python, and the results match. But feel free to check – the Rust version is StatsRunExpectancyPerInningByInningReport in reports.rs, and the Python version is StatsRunExpectancyPerInningByInningReport in parseretrosheet.py.
- This is different between baseball eras – I don’t know why this would be true, but it was easy enough to test out, and the difference is pretty consistent. (see the raw data)
- The fact that home teams are usually better in the playoffs biases this – I think this is a tiny bit true, but I reran the numbers with only regular season games (where the better team has no correlation with whether it’s the home or visiting team) and the difference looks almost exactly the same.
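To put a rough number on the "random chance" idea above, here is a back-of-the-envelope sketch in Python. It uses the rounded averages from this post (0.6 and 0.5 runs) and the roughly 130,000 games. The standard deviation of runs in a half-inning is an assumed placeholder, not a value from the data, and the function `two_sample_z` is a name made up for this example. It also treats the top and bottom of the first as independent samples, which is a simplification, since they come from the same games.

```python
import math


def two_sample_z(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """z-score for the difference between two independent sample means."""
    standard_error = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Rounded figures quoted in the post.
HOME_FIRST_INNING_RUNS = 0.6
VISITOR_FIRST_INNING_RUNS = 0.5
GAMES = 130_000

# ASSUMPTION: placeholder spread for runs scored in one half-inning.
# Replace it with the standard deviation measured from the real data.
ASSUMED_SD = 1.0

z = two_sample_z(
    HOME_FIRST_INNING_RUNS,
    VISITOR_FIRST_INNING_RUNS,
    ASSUMED_SD,
    ASSUMED_SD,
    GAMES,
    GAMES,
)
print(f"z-score for the first-inning gap: {z:.1f}")
```

Under those assumptions the z-score comes out far above the usual thresholds for statistical significance, so pure chance looks like a poor explanation unless the real spread is much larger than the placeholder.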
So, in conclusion, I don’t know! If anyone has any ideas, I’d love to hear them on this post or on Twitter.
Edit: Ryan Pai suggested on Facebook that the visiting pitcher has to wait a while between warming up and pitching in the bottom of the 1st, which is an intriguing theory!
Odds and ends:
- That top “expected runs per inning” graph has some other neat properties – for example you can see that the 2nd inning is the lowest scoring inning, presumably because hitters near the bottom of the lineup are usually up.
- Another thing you can see is how robust the home field advantage is. In every inning the home team scores, on average, a little more than the visiting team!
- The graph only shows 8 innings because in the 9th inning things get complicated. For one thing, the bottom of the 9th inning only happens if the home team is behind or tied, which biases the sample somewhat. Also, if the game is tied and the home team hits a leadoff home run, they win the game but lose the opportunity to score any more runs.
- You can also notice the strangeness of the bottom of the 1st inning another way. If you look at the chance that the home team will win when the game is tied, their chances are better at the beginning of the bottom of the 9th than the bottom of the 8th, because they have an extra chance to bat. That advantage gets lower the earlier in the game you go, with one exception. In the bottom of the 1st, the home team has a ~59% chance to win, but in the bottom of the 2nd that goes down to ~58%! The reason is that if the home team misses their chance to score runs in the bottom of the 1st they’ve missed a big opportunity, apparently!
- The raw report data is here in the GitHub repo.
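If you want to play with the idea yourself, here is a minimal Python sketch of how the per-inning averages and the home-minus-visitor difference could be computed. It is not the code from the repo: the input format (a list of `(visitor_runs, home_runs)` line scores), the function names, and the two toy games are all made up for illustration. It only looks at innings 1 through 8, for the reasons given above about the 9th.

```python
from collections import defaultdict


def average_runs_per_half_inning(games, max_inning=8):
    """Average runs per half-inning, keyed by (inning, "top" or "bottom").

    `games` is a hypothetical format: each game is a pair of lists,
    (visitor runs by inning, home runs by inning).
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for visitor_runs, home_runs in games:
        for side, runs_by_inning in (("top", visitor_runs), ("bottom", home_runs)):
            # Slice to skip the 9th inning and beyond.
            for inning, runs in enumerate(runs_by_inning[:max_inning], start=1):
                totals[(inning, side)] += runs
                counts[(inning, side)] += 1
    return {key: totals[key] / counts[key] for key in totals}


def home_advantage_by_inning(averages):
    """Bottom-of-inning average minus top-of-inning average, per inning."""
    innings = sorted({inning for inning, _side in averages})
    return {i: averages[(i, "bottom")] - averages[(i, "top")] for i in innings}


# Two made-up line scores, purely for illustration.
toy_games = [
    ([0, 0, 1, 0, 0, 0, 0, 0, 0], [2, 0, 0, 0, 0, 0, 0, 0]),
    ([1, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 1]),
]

averages = average_runs_per_half_inning(toy_games)
advantage = home_advantage_by_inning(averages)
for inning, diff in advantage.items():
    print(f"inning {inning}: home minus visitor = {diff:+.2f}")
```

With real data, the first-inning entry of `advantage` is the number that stands out in the difference graph.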