March Madness 2021: Simulating a Bracket, Part 1

It’s March Madness season…COVID disrupted last year’s tournament, but this year we’re full steam ahead!

In this Part 1 post I’ll just give some background and outline some of the challenges I had to overcome when building this. Later on in Part 2 we’ll get more technical and I’ll provide some math/code walkthroughs, and in Part 3 we’ll get to storage challenges…and maybe even some front end stuff!?

Background

I really didn’t care about college basketball at all until I became attached to my alma mater’s team, the UConn Huskies. In the 3.5 years I spent in undergrad, we ended up winning the men’s national title twice (2011 and 2014)…and don’t even get me started on how dominant the women’s team is.

The first time I went all-in on trying to construct the perfect bracket was 2013…here’s a picture of my wall that year (which gave others in the dorm something to talk about!):

Brackets all over the wall of my dorm at the University of Connecticut during March Madness 2013. Some Walgreens signage is also intermixed because I was a part-time employee there during school...and I never let the old signage go to waste.
My March Madness wall from 2013.

I still have my old Excel models from past years too! What started as a simple ā€œgut checkā€ model (pick a win probability like 80–20 depending on the matchup and hit it with a uniform random number) eventually evolved into a manufactured formula based on stats from the excellent kenpom.com.

This year, I wanted to take the next step and really try to model each game as a standalone simulated entity, complete with a full box score for both teams. Part of the beauty of sports and esports is that no matter the relative strength of the two teams going into the game, crazy and unintuitive things can and will happen! (I’ve got the Big East tournament on the TV behind me, and so far 5 of the 6 games played have been upsets…case in point.) I wanted to capture some of that magic in my model somehow, and I think the best way to do this is with some form of Monte Carlo simulation. With enough simulations, a good model should converge on the favored outcome at the median, but there could also be some crazy worlds where the underdog wins in a blowout!

Goals

Let’s outline some of the goals of this project:

  • I wanted to write the code in Python so I could easily implement this new model on my current website at tarpey.dev/autobracket. This is where the last web-based version of the model was two years ago. (I’m in the process of updating both the front and backend of the site, so check back in a few days for the final appā€Šā€”ā€Šit will look much different!)
  • I wanted to create a model that was realistic enough to be worthy of simulating an actual basketball game. For example, I’ve targeted the simulation to be somewhere near the avg. number of possessions you would expect from each of the two teams in the matchup, based on the tempo metric at kenpom.com. I agree with a lot of basketball nerds that this is a pretty important metric at the heart of a basketball game.
  • I wanted to make simplifying assumptions where it was likely that the additional complexity wouldn’t really make a difference in the outcome. A good example here is free throws. I haven’t yet implemented precise modeling of each individual free throw. (They are currently modeled individually as make or miss, but in a real basketball game you might be able to get your own missed free throw back if it’s the last free throw. The simplifying assumption in this model is that the ball always goes over to the other team after free throws occur.)
  • I wanted realistic box scores to ā€œfall outā€ of the model. The approach I took here is to construct the model as a possession-by-possession simulation. The key assumption here is deciding which 10 players (5 from each team) are currently playing in any given possession. The model uses each player’s distribution of team minutes from the season when randomly sampling the 10 players (so if you played twice as much as your teammate this season for example, my model is twice as likely to put you on the floor for any given possession). The 10 players are resampled for every new possession!
  • I needed to be able to run a BUNCH of simulations. The power of Monte Carlo is getting to a large enough sample size that you have actionable simulation data (in the actuarial world this is especially valuable for measuring risk, e.g. ā€œhow much money do we lose in the worst 5% of possible outcomesā€).

If you want to read more about simulating basketball games, the article below served as some early inspiration for me! (Thanks Oluwatobi Popoola.)

Initial Prototype

Initially when building the concept, I used a single Pandas DataFrame (representing the eventual final box score of the game) for each simulation. This was a good learning experience for me, as I eventually ran into the limitation of not being able to run as many simulations as I wanted to.

Each single Pandas DataFrame simulation initially ran in about 10 seconds. That doesn’t seem so bad until you consider what the end goal will require:

  • We want to run 1,000 simulations per possible matchup in the NCAA tournament
  • 68 teams will make the tournament, and depending on how each matchup goes, any of them could play any other team at a given stage of the tournament, meaning there are (67 + 66 + 65 + … + 2 + 1) = 2,278 different possible matchups we need to simulate BEFORE we can even start making brackets
  • 1,000 x 2,278 is 2,278,000 simulations… x 10 seconds = about 263 days of simulation!

We don’t have that kind of time! I want to kick off my simulations Selection Sunday night after the brackets come out, and share a working webpage with the world on Monday to make brackets with.

A rewrite was in order…luckily, I had just listened to an episode of the Python Bites podcast titled A Visual Introduction to NumPy. It linked to this amazing post by Jay Alammar that inspired me to attempt to convert the simulation program. Maybe instead of operating on multiple 10-row DataFrames in parallel, we could just have one massive DataFrame that’s 10,000 rows or so (10 players x 1,000 concurrent games). The trick would be converting every essential variable (think things like game clock, who has possession of the ball, all of the different random number generators that determine a successful rebound/shot/steal/assist etc…you get the idea) from a scalar value into a NumPy array.

Could we really manage it all? And would it give us the performance boost we needed? (Spoiler: the answer to both is yes!)

In Part 2 I’ll talk about how I was able to start thinking in arrays instead of just DataFrames, which enabled a roughly ~30x performance boost! I’ll also get into exactly how each possession works (the code and the math behind it).

Thanks for readingā€Šā€”ā€Šas always, feel free to share any thoughts or perspectives you have, technical or otherwise!