March Madness 2021: Simulating a Bracket, Part 1
It's March Madness season… COVID disrupted last year's tournament, but this year we're full steam ahead!
In this Part 1 post I'll just give some background and outline some of the challenges I had to overcome when building this. Later on in Part 2 we'll get more technical and I'll provide some math/code walkthroughs, and in Part 3 we'll get to storage challenges… and maybe even some front end stuff!?
Background
I really didn't care about college basketball at all until I became attached to my alma mater's team, the UConn Huskies. In the 3.5 years I spent in undergrad, we ended up winning the men's national title twice (2011 and 2014)… and don't even get me started on how dominant the women's team is.
The first time I went all-in on trying to construct the perfect bracket was 2013… here's a picture of my wall that year (which gave others in the dorm something to talk about!):

I still have my old Excel models from past years too! What started as a simple "gut check" model (pick a win probability like 80–20 depending on the matchup and hit it with a uniform random number) eventually evolved into a manufactured formula based on stats from the excellent kenpom.com.
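Here's roughly what that gut-check logic looks like in Python. The function name, teams, and 80% probability are illustrative placeholders, not values from my old spreadsheets:

```python
import random

def gut_check_winner(team_a, team_b, p_a_wins, n_sims=1000, seed=42):
    """Simulate a matchup n_sims times with a fixed win probability.

    p_a_wins is a judgment call (e.g. 0.8 for a heavy favorite).
    Each trial draws a uniform random number and compares it to the
    win probability, just like the old Excel approach.
    """
    rng = random.Random(seed)
    a_wins = sum(rng.random() < p_a_wins for _ in range(n_sims))
    return {team_a: a_wins, team_b: n_sims - a_wins}

results = gut_check_winner("UConn", "Kentucky", 0.8)
print(results)
```

Over many trials the favorite wins about 80% of the time, but the underdog still comes out on top in plenty of individual simulations.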
This year, I wanted to take the next step and really try to model each game as a standalone simulated entity, complete with a full box score for both teams. Part of the beauty of sports and esports is that no matter the relative strength of the two teams going into the game, crazy and unintuitive things can and will happen! (I've got the Big East tournament on the TV behind me, and so far 5 of the 6 games played have been upsets… case in point.) I wanted to capture some of that magic in my model somehow, and I think the best way to do this is with some form of Monte Carlo simulation. With enough simulations, a good model should converge on the favored outcome at the median, but there could also be some crazy worlds where the underdog wins in a blowout!
Goals
Let's outline some of the goals of this project:
- I wanted to write the code in Python so I could easily implement this new model on my current website at tarpey.dev/autobracket. This is where the last web-based version of the model lived two years ago. (I'm in the process of updating both the front end and back end of the site, so check back in a few days for the final app; it will look much different!)
- I wanted to create a model that was realistic enough to be worthy of simulating an actual basketball game. For example, I've targeted the simulation to land somewhere near the average number of possessions you would expect from each of the two teams in the matchup, based on the tempo metric at kenpom.com. I agree with a lot of basketball nerds that this is a pretty important metric at the heart of a basketball game.
- I wanted to make simplifying assumptions where the additional complexity likely wouldn't change the outcome. A good example here is free throws: I haven't yet implemented precise modeling of each individual free throw. (They are currently modeled individually as make or miss, but in a real basketball game you might grab the rebound off your own last missed free throw. The simplifying assumption in this model is that the ball always goes over to the other team after free throws occur.)
- I wanted realistic box scores to "fall out" of the model. The approach I took here is to construct the model as a possession-by-possession simulation. The key step is deciding which 10 players (5 from each team) are on the floor for any given possession. The model uses each player's share of team minutes from the season when randomly sampling the 10 players (so if you played twice as many minutes as your teammate this season, for example, my model is twice as likely to put you on the floor for any given possession). The 10 players are resampled for every new possession!
- I needed to be able to run a BUNCH of simulations. The power of Monte Carlo is getting to a large enough sample size that you have actionable simulation data (in the actuarial world this is especially valuable for measuring risk, e.g. "how much money do we lose in the worst 5% of possible outcomes?").
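To illustrate the minutes-weighted sampling idea from the list above, here's a rough NumPy sketch. The roster names and minute totals are made up; the real model pulls season minutes for every rostered player:

```python
import numpy as np

# Hypothetical rotation for one team: names and season minutes.
players = np.array(["A", "B", "C", "D", "E", "F", "G", "H"])
minutes = np.array([1100, 1000, 950, 900, 700, 400, 300, 150])

rng = np.random.default_rng(seed=2021)

def sample_lineup(rng):
    """Draw 5 distinct players for one possession, weighted by
    each player's share of total team minutes."""
    weights = minutes / minutes.sum()
    return rng.choice(players, size=5, replace=False, p=weights)

# A fresh draw happens for every new possession.
print(sample_lineup(rng))
```

Repeat the draw for the opposing team and you have the 10 players on the floor for that possession; a player with double the minutes is twice as likely to be selected on any single pick.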
If you want to read more about simulating basketball games, the article below served as some early inspiration for me! (Thanks Oluwatobi Popoola.)
Initial Prototype
When I first built the concept, I used a single Pandas DataFrame (representing the eventual final box score of the game) for each simulation. This was a good learning experience, but I eventually ran into a hard limitation: I couldn't run as many simulations as I wanted to.
Each single Pandas DataFrame simulation initially ran in about 10 seconds. That doesn't seem so bad until you consider what the end goal will require:
- We want to run 1,000 simulations per possible matchup in the NCAA tournament
- 68 teams will make the tournament, and depending on how each matchup goes, any of them could play any other team at a given stage of the tournament, meaning there are (67 + 66 + 65 + … + 2 + 1) = 2,278 different possible matchups we need to simulate BEFORE we can even start making brackets
- 1,000 × 2,278 = 2,278,000 simulations… × 10 seconds = about 263 days of simulation!
We don't have that kind of time! I want to kick off my simulations Selection Sunday night after the brackets come out, and share a working webpage with the world on Monday to make brackets with.
A rewrite was in order… luckily, I had just listened to an episode of the Python Bytes podcast titled A Visual Introduction to NumPy. It linked to this amazing post by Jay Alammar that inspired me to attempt to convert the simulation program. Maybe instead of operating on multiple 10-row DataFrames in parallel, we could just have one massive DataFrame that's 10,000 rows or so (10 players × 1,000 concurrent games). The trick would be converting every essential variable (think things like the game clock, who has possession of the ball, and all of the different random number generators that determine a successful rebound/shot/steal/assist, etc.; you get the idea) from a scalar value into a NumPy array.
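Here's a toy sketch of that scalar-to-array conversion, advancing 1,000 games through a single possession at once. The variable names and probabilities are illustrative, not the real model's:

```python
import numpy as np

n_games = 1_000
rng = np.random.default_rng(seed=7)

# Scalars become arrays: one entry per concurrent game.
game_clock = np.full(n_games, 2400.0)      # seconds remaining per game
possession = np.zeros(n_games, dtype=int)  # 0 = home has the ball, 1 = away
score = np.zeros((n_games, 2), dtype=int)  # [home, away] points per game

# One vectorized "possession" step for all 1,000 games at once:
made = rng.random(n_games) < 0.5            # shot outcome per game (toy 50%)
points = np.where(made, 2, 0)               # crude: every make is worth 2
score[np.arange(n_games), possession] += points
game_clock -= rng.uniform(10, 24, n_games)  # seconds used this possession
possession = 1 - possession                 # ball goes to the other team

print(score.sum())  # total points scored across all 1,000 games so far
```

Instead of looping over 1,000 separate simulations, every random draw and every state update touches all of the games in one NumPy operation, which is where the speedup comes from.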
Could we really manage it all? And would it give us the performance boost we needed? (Spoiler: the answer to both is yes!)
In Part 2 I'll talk about how I was able to start thinking in arrays instead of just DataFrames, which enabled a roughly 30x performance boost! I'll also get into exactly how each possession works (the code and the math behind it).
Thanks for reading! As always, feel free to share any thoughts or perspectives you have, technical or otherwise!