Adding 2020 baseball games to the win expectancy finder

You can see the results on the win expectancy finder (although adding one year of data doesn’t change much, it’s the principle of the thing!). Updated apps will be coming soon.

Usually adding a year’s worth of games is a pretty quick task: run the scripts and update a few things on the web page (thankfully I made a list of what to do a few years back). But this year was different because of the rule changes in 2020. On top of that, now that I have two versions of the parsing script (the faster one in Rust and the original in Python), I wanted to keep both of them up to date. It ended up being quite a journey!

The rule changes in 2020 were:

  • in extra innings, each half-inning starts with a runner on second base
  • games in doubleheaders were scheduled for 7 innings instead of 9

This didn’t sound too hard, but it meant the parser now needed a set of rules for each game it parsed. The first rule was pretty easy (I just had to know whether the game was in 2020 or not), but I was worried about figuring out whether a game was part of a doubleheader. Luckily, Retrosheet added the number of innings to their event file format. They also added whether a runner starts on second base in extra innings, which I didn’t discover until later and should probably go back and use!
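As a rough illustration of what a per-game rule set could look like, here is a minimal Python sketch. It is not the code from either script: the names `GameRules`, `rules_for_game`, and `initial_runners` are hypothetical, and it assumes the scheduled number of innings has already been read from the event file (falling back to 9 when the file doesn’t say).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GameRules:
    """Hypothetical per-game rule set; not the structure used by the real scripts."""
    scheduled_innings: int
    runner_on_second_in_extras: bool


def rules_for_game(year: int, innings_in_file: Optional[int] = None) -> GameRules:
    """Build the rules for one game.

    Assumption: innings_in_file is the scheduled length read from the event
    file, or None if the file does not specify it (then default to 9).
    """
    scheduled = innings_in_file if innings_in_file is not None else 9
    # As in the post, the extra-innings runner rule is keyed off the year alone.
    return GameRules(
        scheduled_innings=scheduled,
        runner_on_second_in_extras=(year == 2020),
    )


def initial_runners(rules: GameRules, inning: int) -> Tuple[bool, bool, bool]:
    """Return (first, second, third) occupancy at the start of a half-inning."""
    if rules.runner_on_second_in_extras and inning > rules.scheduled_innings:
        return (False, True, False)
    return (False, False, False)
```

With this shape, a 7-inning doubleheader game from 2020 starts the 8th inning with a runner on second, while a 2019 game starts the 10th with the bases empty.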

Once I got all the games parsing in Rust, the fun began:

  • I took a quick look at the resulting statistics and noticed that the situation at the start of a game (top of the 1st, bases empty, etc.) had around 700 new games, which sounded reasonable, and that fewer than 100 of them had the visiting team winning, which did not! After some thought (and coming back to it the next day), I found and fixed the bug, which had to do with the difference between the final game situation and the last actual game situation; you can see the fix in this commit.
  • Then I made similar changes in Python, and after running the script the results were off by exactly one game (just like last time; what are the odds?). By looking at the differences in the stats file I could see which situations the mystery game went through, and I added a special Report type to find the game that went through those situations. It turns out only one playoff game in 2020 went into extra innings, and the Python script was handling it incorrectly. That was pretty easy to fix!
  • That led me to discover that the Python script wasn’t throwing an exception by default if it failed to parse a game, which is bad, so I fixed that in this commit. (Notice that my Rust style of not putting parentheses around if conditions is starting to slip into my Python style…)
  • Running the Python script showed a major difference in the runsperinningstats file. In fact, the Rust script had never been updating it! The cause was a simple copy/paste error, and I made a later change to use “Self” instead of explicit type names to avoid some of these problems in the future.
    • So how did I never notice this before? The way I validated my Rust script while developing it was to run it and see whether the results differed from what was in git. This has the now-obvious consequence that if the script didn’t write anything at all, it would look like it was working! I guess the lesson I took away from this is: don’t do that 🙂
  • Both scripts print out the number of games parsed at the end, and while debugging some of these problems I noticed that the numbers were slightly different between Python and Rust. There are 7 games that the scripts can’t parse correctly (the event files seem wrong to me), so I list them explicitly in both scripts and skip them. The Rust script was correctly leaving them out of the count, while the Python script was counting them.
  • The way to actually update everything is getting ridiculous – as I mentioned above I have it written down, but there are 16 steps to run! I really need to make this easier…
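Two of the debugging steps above lend themselves to a small sketch: finding which situations differ between two stats outputs (how the off-by-one game was tracked down), and making sure a script really wrote its output instead of trusting “nothing changed in git”. The code below is a minimal illustration under stated assumptions, not the code from either script. The `Stats` shape (situation mapped to a pair of games and wins) and the function names `diff_stats` and `run_and_require_output` are hypothetical.

```python
import os
from typing import Callable, Dict, Tuple

# Hypothetical in-memory form of a stats file: situation -> (games, wins).
Stats = Dict[str, Tuple[int, int]]


def diff_stats(old: Stats, new: Stats) -> Dict[str, Tuple[int, int]]:
    """Return the situations whose counts differ, as (delta_games, delta_wins)."""
    changes = {}
    for key in sorted(set(old) | set(new)):
        old_games, old_wins = old.get(key, (0, 0))
        new_games, new_wins = new.get(key, (0, 0))
        if (old_games, old_wins) != (new_games, new_wins):
            changes[key] = (new_games - old_games, new_wins - old_wins)
    return changes


def run_and_require_output(run: Callable[[str], None], out_path: str) -> None:
    """Run a script step and fail loudly if it did not write out_path.

    Any stale copy is deleted first, so a step that silently writes nothing
    cannot pass just because an old file is still sitting there.
    """
    if os.path.exists(out_path):
        os.remove(out_path)
    run(out_path)
    if not os.path.exists(out_path):
        raise RuntimeError(f"{out_path} was not written")
```

If two implementations disagree by a single game, `diff_stats` would show a delta of one game in exactly the situations that game passed through, which narrows the search considerably.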
