(This post was linked to by This Week in Rust!)
This project came about because we’ve been playing a lot of a Wheel of Fortune game on David’s phone. (The kids have gotten into it, too!) And it’s nice to be able to cheat once in a while 🙂
We had been using various online crossword-clue solvers, which worked OK. But some of them want the clue for the entry, which doesn’t make sense for Wheel of Fortune. And in Wheel of Fortune you often know letters that aren’t in the word; at the very least, you know that if there’s one “t” there aren’t any more “t”s.
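That extra information maps to a simple matching rule: a candidate word has to fit the pattern of revealed letters and blanks, and no blank may be one of the excluded letters (revealed letters count as excluded too, since the game reveals every occurrence of a guessed letter at once). Here’s a minimal sketch of that rule in Rust; the function name and signature are hypothetical, not the project’s actual code:

```rust
/// Hypothetical sketch: does `word` fit a Wheel-of-Fortune-style
/// `pattern` ('.' marks an unrevealed letter), given `excluded` letters?
/// Revealed letters should be included in `excluded` as well, since the
/// game reveals every occurrence of a guessed letter at once.
fn matches(word: &str, pattern: &str, excluded: &str) -> bool {
    if word.chars().count() != pattern.chars().count() {
        return false;
    }
    word.chars().zip(pattern.chars()).all(|(w, p)| {
        if p == '.' {
            !excluded.contains(w) // blank: letter must not be ruled out
        } else {
            w == p // revealed position: must match exactly
        }
    })
}
```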
The first step was figuring out a good word list to use. My first plan was to use a dictionary (hello
/usr/dict/words!), but this would be missing any sort of proper nouns like places and names and whatnot, which are pretty big omissions! So I sat on the idea for a while, until I stumbled upon a blog post that mentioned the Google Books n-grams. And I was like, hey, I looked into these 14 years ago but wasn’t able to use them! (post showing interest, post where I ordered them, post where I realized I couldn’t use them, post where I moved on) But now they’re available for download for free and the licensing terms are more reasonable.
And, boy, they are huge. I downloaded all the 1-grams (just single words), and the resulting text files total 26GB and have just under 50 million unique words! (The most common one is “the”, used almost 126 billion times!) With that much data, it was a good excuse to write a Rust script to process them, since performance mattered a lot. For fun, I wrote a Python script afterwards to compare performance, and here’s what I got:
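Processing mostly means summing counts per word, since each 1-gram appears on one line per year. A rough Rust sketch of that aggregation, assuming the older tab-separated `word<TAB>year<TAB>match_count<TAB>volume_count` format (this is not the actual script, and merging capitalization variants via lowercasing is my assumption):

```rust
use std::collections::HashMap;
use std::io::BufRead;

/// Sum counts per word across years, assuming tab-separated lines of
/// "word<TAB>year<TAB>match_count<TAB>volume_count".
fn total_counts<R: BufRead>(reader: R) -> HashMap<String, u64> {
    let mut totals: HashMap<String, u64> = HashMap::new();
    for line in reader.lines() {
        let line = match line { Ok(l) => l, Err(_) => continue };
        let mut fields = line.split('\t');
        let word = match fields.next() {
            Some(w) if !w.is_empty() => w.to_lowercase(),
            _ => continue,
        };
        // fields now sits at the year; nth(1) skips it and yields match_count
        let count: u64 = match fields.nth(1).and_then(|c| c.parse().ok()) {
            Some(c) => c,
            None => continue, // skip malformed lines
        };
        *totals.entry(word).or_insert(0) += count;
    }
    totals
}
```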
| | Time in seconds |
| --- | --- |
| Rust (release mode) (original code) | 163 (2:43) |
| Rust (release mode) (after using BufRead::read_line – see this commit, thanks @djco!) | |
| Rust (debug mode) | 2192 (36:32) |
So the Rust version is over 6x faster than Python! Not a big surprise, but honestly the Rust version was just about as easy for me to write. Even the Python version is not too bad for a one-time run, but of course I didn’t get good results the first time I ran either of these scripts, so being able to iterate in a reasonable time is nice! If I had needed more performance I could have parallelized things with rayon, but it wasn’t necessary. And here’s the usual reminder that Rust in debug mode is very, very slow, especially compared to release mode, which is about 13x faster here!
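The BufRead::read_line change mentioned in the table is the classic buffer-reuse pattern: `lines()` allocates a fresh `String` for every line, while `read_line()` appends into a caller-owned buffer that can be cleared and reused. A sketch of the pattern (not the actual commit):

```rust
use std::io::BufRead;

/// Count lines using one reusable buffer instead of allocating a new
/// String per line, as `BufRead::lines()` would.
fn count_lines<R: BufRead>(mut reader: R) -> u64 {
    let mut buf = String::new();
    let mut n = 0;
    // read_line returns Ok(0) at end of input
    while reader.read_line(&mut buf).unwrap_or(0) > 0 {
        // ...process &buf here instead of a freshly allocated String...
        n += 1;
        buf.clear(); // keep the allocation, drop the contents
    }
    n
}
```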
The data is fairly messy, between typos and bad OCR, so after some inspection I decided to cut off the words with counts less than 10,000. This clearly leaves in some words that aren’t real around that boundary (“gelele”, “altaria”, “mavligit”), but there are also some real ones (“thunderheart”, “antiradicalism”) around there, and I wanted to err on the inclusive side. This also cuts the final file size down from 126MB (~9.3M words) to 8.5MB (~570k words), which makes the query script run a lot faster. Here’s the final word list – you need to click View Raw to actually see it.
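The cutoff itself is just a filter over the aggregated counts. A sketch, assuming the counts live in a HashMap (the real script’s data layout may differ):

```rust
use std::collections::HashMap;

/// Keep only words whose total count meets the threshold (10,000 in the
/// post), sorted so the output word list is deterministic.
fn apply_cutoff(totals: &HashMap<String, u64>, min_count: u64) -> Vec<String> {
    let mut kept: Vec<String> = totals
        .iter()
        .filter(|&(_, &count)| count >= min_count)
        .map(|(word, _)| word.clone())
        .collect();
    kept.sort();
    kept
}
```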
I knew performance of the query script would be even more important (since it’s what runs when you press “Search” on the page), so I wrote that in Rust as well, and on my laptop it runs in under 0.1 seconds, which is good enough!
I wrote the frontend in vanilla TypeScript. It seemed simple enough that I didn’t think I needed the power of React, and I’ve been on a bit of an anti-framework kick lately.
I’m in the technical preview for GitHub Copilot, and writing Rust code with it was particularly nice because there are a lot of idioms I can’t remember when I go a while between writing Rust code. (it was also pretty handy for the fairly straightforward TypeScript code)
Anyway, it was a fun project, and has become somewhat of a game in itself because Vanessa likes it when I make up a pattern and excluded letters and she tries to read the results to see if there are words she knows 🙂