March Madness 2021: Simulating a Bracket, Part 3

Welcome to Part 3! If you missed Parts 1 + 2, you can catch up by starting with the story here:
In the series finale I’ll get into the storage limitations I ran into with the free tier of MongoDB, and the little trick I implemented so the end user can still see a full simulated box score for every game in their simulated bracket.
How to store over 2,000,000 basketball game simulations
In Part 1 I mentioned that we needed to do a little over 2M simulations to have 1,000 trials for each of the possible matchups in the 68-team NCAA basketball tournament. One can imagine this could take up a lot of space! One of the original conditions of this project was to use my already-working website API at api.tarpey.dev, which uses a free-tier MongoDB with 512MB of storage.
I had the ambitious idea to save every player’s individual stats for every simulation, so they could be viewed on-demand by the end user. But sure enough, this ended up putting me in the world of 10GB+ of storage required…the MongoDB Atlas free tier only allows for 512MB! It looked like I would have to give up the individual player stats dream…until I cooked up another way.
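To put rough numbers on that gap, here is a back-of-the-envelope sketch. It assumes that "every possible matchup" means every pair of the 68 teams, and it treats the 10GB+ estimate as exactly 10GB, so the result is an optimistic ceiling rather than a measured figure.

```python
from math import comb

TEAMS = 68
TRIALS_PER_MATCHUP = 1_000

# Assumption: any two teams in the field could meet somewhere in the bracket.
matchups = comb(TEAMS, 2)                     # 2,278 pairs
total_sims = matchups * TRIALS_PER_MATCHUP    # 2,278,000 simulations

FREE_TIER_BYTES = 512 * 1024**2       # 512MB free tier
FULL_STORAGE_BYTES = 10 * 1024**3     # "10GB+" from above, taken as a lower bound

# Integer math: how many full simulations fit if each one costs the same space?
sims_that_fit = FREE_TIER_BYTES * total_sims // FULL_STORAGE_BYTES
sims_per_matchup_that_fit = sims_that_fit // matchups

print(f"{matchups:,} matchups, {total_sims:,} simulations")
print(f"at most {sims_per_matchup_that_fit} of {TRIALS_PER_MATCHUP:,} simulations per matchup fit")
```

Even this ceiling assumes the database holds nothing else, and mine also has to serve the rest of the website, so keeping all 1,000 simulations per matchup was never going to fit.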

The end user can select from four “flavors” of brackets, ranging from Vanilla (which will build a bracket out of games near the median of each 1,000 simulations) to MAX SPICE (where any of the 1,000 simulations could be selected at any node of the bracket). Instead of saving the entire pool of 1,000 simulations for each matchup, we could instead do something like this:
- Save enough simulations to the database (about 20) so we have at least a few different box scores to choose from in each quantile.
- Before throwing away the other 980 simulations, use them to build a supplemental document of distribution info. This document could look at the 1,000 simulations and say “if the user chooses mild, team X has a 35% chance to win, team Y 65%.” Repeat for each flavor and matchup.
- These supplemental documents can also save the range of scores that are eligible to be chosen for each flavor (for example, if the user wants to see a Mild box score for two evenly matched teams X vs. Y, maybe we should return a simulation where the final margin of victory was somewhere between -3 and +3 points…but if the user wants to see a SPICY box score for X vs. Y, we can use a simulation where the margin of victory was anywhere between -40 and +40 points!)
Here’s an example of one of these “supplemental documents”:
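The steps above can be sketched in a few lines of Python. This is a minimal illustration and not the exact code behind the model: the function name, the sign convention (margin = home score minus away score), and the "median" and "mild" window widths are assumptions. Only the 80% window for Medium and the full range for MAX SPICE come from this series.

```python
import random

# Central share (in percent) of the sorted simulations each flavor draws from.
# "medium" = 80 and "max" = 100 follow the article; the other two are guesses.
FLAVOR_PERCENT = {"median": 10, "mild": 50, "medium": 80, "max": 100}


def summarize_matchup(margins, away_key, home_key, season):
    """Build a supplemental document from simulated (home - away) margins."""
    ordered = sorted(margins)
    n = len(ordered)
    doc = {"away_key": away_key, "home_key": home_key, "season": season}
    for flavor, percent in FLAVOR_PERCENT.items():
        # Trim the same number of simulations from each tail.
        trim = n * (100 - percent) // 200
        window = ordered[trim:n - trim]
        doc[f"{flavor}_margin_bottom"] = window[0]
        doc[f"{flavor}_margin_top"] = window[-1]
        # Share of the eligible simulations that the home team won.
        doc[f"home_win_chance_{flavor}"] = sum(m > 0 for m in window) / len(window)
    return doc


# Stand-in for 1,000 real simulations: random margins, never exactly 0
# because a basketball game cannot end in a tie.
rng = random.Random(2021)
fake_margins = [round(rng.gauss(-4, 12)) or -1 for _ in range(1000)]
summary = summarize_matchup(fake_margins, "BOISE", "WKENT", "2021")
print(summary["medium_margin_bottom"], summary["medium_margin_top"])
```

Once a summary like this exists, only a small sample of the full simulations needs to be kept, as long as each flavor's margin window still contains a few of them.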
{
"away_key": "BOISE",
"home_key": "WKENT",
"home_win_chance_max": 0.38,
"home_win_chance_median": 0,
"home_win_chance_medium": 0.35074626865671643,
"home_win_chance_mild": 0.2814814814814815,
"max_margin_bottom": -43,
"max_margin_top": 38,
"median_margin_bottom": -8,
"median_margin_top": -1,
"medium_margin_bottom": -21,
"medium_margin_top": 14,
"mild_margin_bottom": -14,
"mild_margin_top": 6,
"season": "2021"
}
With all of this information saved, we can:
- Use the home_win_chance fields (which are based on all 1,000 simulations) to decide which team wins each matchup at runtime when the user requests a bracket
- Use the margin_top and margin_bottom fields to query the database and identify a valid box score for each of the 67 matchups in the bracket
- Throw away the other 980 simulations and save a bunch of space! (Although I would love to do more analysis on the full 1,000…or more…but we’ll save that for next year!)
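As a rough sketch of the first two bullets, the snippet below picks a winner from a supplemental document and then builds a MongoDB-style filter for a matching box score. The function names and the "home_margin" field on the stored simulations are hypothetical, and the sign convention (negative margin means the away team won) is inferred from the example document above.

```python
import random

# A few fields copied from the example supplemental document.
supplemental = {
    "away_key": "BOISE",
    "home_key": "WKENT",
    "season": "2021",
    "home_win_chance_median": 0,
    "home_win_chance_mild": 0.2814814814814815,
    "median_margin_bottom": -8,
    "median_margin_top": -1,
    "mild_margin_bottom": -14,
    "mild_margin_top": 6,
}


def pick_winner(doc, flavor, roll):
    """Return the winning team's key; roll is a uniform draw in [0, 1)."""
    home_wins = roll < doc[f"home_win_chance_{flavor}"]
    return doc["home_key"] if home_wins else doc["away_key"]


def box_score_filter(doc, flavor, winner):
    """MongoDB-style filter for a stored simulation that matches the winner."""
    bottom = doc[f"{flavor}_margin_bottom"]
    top = doc[f"{flavor}_margin_top"]
    if winner == doc["home_key"]:
        bottom = max(bottom, 1)   # home win: margin must be positive
    else:
        top = min(top, -1)        # away win: margin must be negative
    return {
        "season": doc["season"],
        "away_key": doc["away_key"],
        "home_key": doc["home_key"],
        "home_margin": {"$gte": bottom, "$lte": top},  # hypothetical field name
    }


winner = pick_winner(supplemental, "mild", random.random())
print(winner, box_score_filter(supplemental, "mild", winner))
```

Clamping the margin range to the winner's side keeps the box score consistent with the result that was already drawn from the win chance.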
Constructing a single query that would only touch the database once to retrieve a box score for 67 different matchups was quite complicated, but I can’t recommend the ODMantic package enough for working with MongoDB on an async, ODM basis…it made things a lot easier! I wrote about doing this in a previous post, and I used a similar method for building the query of all the box scores in the user’s bracket.
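One generic way to get a single round trip is to combine the per-matchup filters with MongoDB's $or operator and then keep one returned simulation per matchup on the application side. The sketch below uses plain dictionaries instead of ODMantic and is not necessarily how my query is built: the helper names, the "home_margin" field, the "TEAMA"/"TEAMB" keys, and the fake result rows are all placeholders.

```python
import random
from collections import defaultdict


def build_bracket_query(matchup_filters):
    """Combine per-matchup filters into one MongoDB filter document."""
    return {"$or": list(matchup_filters)}


def one_box_score_per_matchup(documents, rng):
    """Group returned simulations by matchup and keep one at random."""
    grouped = defaultdict(list)
    for doc in documents:
        grouped[(doc["away_key"], doc["home_key"])].append(doc)
    return {pair: rng.choice(sims) for pair, sims in grouped.items()}


# Two of the 67 per-matchup filters (team keys and margins are illustrative).
filters = [
    {"away_key": "BOISE", "home_key": "WKENT", "home_margin": {"$gte": -14, "$lte": -1}},
    {"away_key": "TEAMA", "home_key": "TEAMB", "home_margin": {"$gte": 1, "$lte": 9}},
]
query = build_bracket_query(filters)

# Pretend these rows came back from the single database call.
returned = [
    {"away_key": "BOISE", "home_key": "WKENT", "home_margin": -5},
    {"away_key": "BOISE", "home_key": "WKENT", "home_margin": -12},
    {"away_key": "TEAMA", "home_key": "TEAMB", "home_margin": 4},
]
picked = one_box_score_per_matchup(returned, random.Random(0))
print(len(query["$or"]), sorted(picked))
```

The trade-off is that the database returns every eligible simulation rather than exactly one per matchup, which stays cheap when only about 20 simulations are stored per matchup.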
Final Model
In the end, here’s how this year’s model ended up, relative to our goals from Part 1:
- I wanted to write the code in Python so I could easily implement this new model on my current website at tarpey.dev/autobracket. I ended up decoupling the front-end and back-end of my website to create a nicer interface. All of the Python driving the model is now at api.tarpey.dev, and the front-end at tarpey.dev is being rebuilt with React and Next.js (which I’m really enjoying so far!).
- I wanted to create a model that was realistic enough to be worthy of simulating an actual basketball game. You can be the judge of this, but I’m really liking the results I’m getting from the brackets I’ve generated so far! They’re a nice mix of expected outcomes and possible upsets, so I think we’ve done a decent job capturing the magic of March.
- I wanted to make simplifying assumptions where it was likely that the additional complexity wouldn’t really make a difference in the outcome. Mission accomplished (and we left room to improve next year).
- I wanted realistic box scores to “fall out” of the model. The only thing that gives me pause here is that some outliers are a little less realistic than I would like…a follow-up for next year’s model will be to “narrow” the distribution of outcomes a little bit. Generally speaking, I think the “Medium” flavor (which uses the middle 80% of simulations in the distribution) is the sweet spot if you’re using the model!
- I needed to be able to run a BUNCH of simulations. Mission accomplished, with a caveat! Despite the amazing performance gain from implementing NumPy arrays for running simulations in parallel, I think the way I was slinging data around could have been more efficient. (I didn’t mention this yet, but I used Google Cloud Tasks to queue up all of the simulations this year and send them to the API at Google Cloud Run for processing…maybe a good subject for a future post!)
So, can I make a bracket now?
Starting March 15th (the Monday morning after Selection Sunday), you’ll be able to generate valid brackets based on my model! Just visit the link below and make as many brackets as you like. If you run into any bugs or issues, feel free to reach out to me.
I’ll be collecting data for each bracket that’s made, and once the dust has settled on this year’s March Madness I hope to do some analysis to see how the model performed. Stay tuned…
Thanks for reading (and checking out my model)! As always, feel free to share your thoughts, questions, and suggestions.