March Madness 2021: Simulating a Bracket, Part 2

Some of my March Madness brackets from 2017.

Welcome to Part 2! If you missed Part 1, check out the story below for some background on what we’re doing here:

In this post we’ll get more technical. I’ll explain how I implemented this year’s model using arrays, the exponential distribution, and a whole lot of random numbers…

Model Flow

A basketball game consists of a string of “possessions”. As a team, you’re in one of two worlds:

  • You don’t have the ball. This means you’re gonna try to get the ball by stealing it, forcing a turnover, or blocking a shot. You’re also going to try not to commit a foul that could give the other team free throws.
  • You have the ball. This means you’re gonna try to avoid all of the above outcomes and shoot the ball (and hopefully make the shot!). If you do make the shot, there’s a chance one of your teammates assisted it. If you miss the shot, the game will likely enter a rebound situation.

These two bullets pretty much describe the heart of the model! We simply loop through these possibilities, one possession at a time, simulating outcomes and updating stats until the game clock runs out. Of course, anyone familiar with basketball will realize I left out some nuance here, and that’s intentional. Things like fouls committed by the offense or out-of-bounds plays aren’t modeled precisely, because they fall in the “complex and not necessarily valuable” bucket we talked about in Part 1. (We also don’t really have data that granular, so it would be hard to model anyway!)

One item we haven’t really talked about yet is just how long each possession lasts in a basketball game. One can imagine there’s a lot of variance in this number, but we do know that a single trip on the shot clock can last at most 30 seconds. Thanks to the statistics at kenpom.com, we also know approximately how fast or slow each team has played its games this season. As an example:

  • Connecticut’s AdjT stat so far this year is 66.1, and Georgia’s is 73.6. So if these two teams played a game this year, we would expect 139.7 (about 140) possessions to occur in a 40-minute game.
  • A 40-minute game is 2,400 seconds…2,400 / 140 tells us our mean possession time in the model should be about 17 seconds.
  • To introduce some volatility to this number, I chose to use a normal distribution with standard deviation = 4. The properties of the normal distribution tell us that a randomly generated shot clock using these parameters will be between 5 and 29 seconds almost 100% of the time. This seems good enough to me (although there is definitely an opportunity to do some more science around distribution of possession length in the future).
  • I actually ended up setting 15 seconds as the mean. In early simulation runs, I noticed that 17 seconds ended up getting me to a lower number of possessions than I was looking for…and I eventually realized that this is because a possession is only counted when the ball goes over to the other team! As an example, it’s possible for a team to use up 30 seconds, miss a shot, recover their own missed shot, and then use up even more time. Our 17 second calculation assumed that the ball always changes hands at the end of that 17 seconds. Rather than over-complicate the distribution, I just moved the mean to the left a bit to compensate for these extra-long possessions.
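The post doesn’t include this snippet, but the bullets above can be sketched as follows (the clipping to a 1–30 second range is my own assumption, added so no draw is negative or longer than the shot clock):

```python
import numpy as np

rng = np.random.default_rng()

# mean moved from 17 down to 15 to compensate for extra-long
# possessions (e.g. offensive rebounds) that don't change hands
MEAN_POSSESSION_SECONDS = 15
STD_POSSESSION_SECONDS = 4

def draw_possession_lengths(sample_size):
    """Draw one possession length (in seconds) per simulated game."""
    lengths = rng.normal(
        MEAN_POSSESSION_SECONDS, STD_POSSESSION_SECONDS, size=sample_size
    )
    # assumption: clip to a sane range so no possession is negative
    # or exceeds the 30-second shot clock
    return np.clip(lengths, 1, 30)

possession_length = draw_possession_lengths(1000)
```

One array element per simulated game keeps the whole simulation vectorized, which matters later when everything becomes an array.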

Selecting Players

The “home base” for our model is a DataFrame we’re calling the matchup_df. Here we’ll be able to track simulated stats for each player in each game using one of my favorite Pandas functions, pd.DataFrame.update().

For each possession, we need to select 10 players (5 from each team) to play out all of the possible scenarios I described above. Once we do this, we can use our trusty update statement to credit each player on the floor with the time they spent playing the game:

# pick 10 players for the current possession based on average time share
# 5 from the away team
away_team_sample = rng.choice(
    away_minute_weights[:, 0],
    size=5,
    replace=False,
    p=away_minute_weights[:, 1],
)

# 5 from the home team
home_team_sample = rng.choice(
    home_minute_weights[:, 0],
    size=5,
    replace=False,
    p=home_minute_weights[:, 1],
)

# this list comprehension ended up not being necessary, since we're assuming
# that every game in the "n" number of simulations has the same 10 players
# on the floor at any given time.
on_floor_all_sims = np.array(
    [
        np.concatenate(
            (
                np.isin(
                    away_minute_weights[:, 0],
                    away_team_sample,
                )
                * 1,
                np.isin(
                    home_minute_weights[:, 0],
                    home_team_sample,
                )
                * 1,
            ),
        )
        for x in range(sample_size)
    ]
).flatten()

matchup_df["player_on_floor"] = on_floor_all_sims
# copy here to avoid a later SettingWithCopyWarning
on_floor_df = matchup_df.loc[matchup_df.player_on_floor == 1].copy()

# add the possession length to the time played for each individual on the floor and update.
# numpy array is expanded 10x so each player of the 10 on the floor can get their time
on_floor_df["sim_seconds"] += np.repeat(possession_length, 10)
matchup_df.update(on_floor_df)

The ā€œupdateā€ statement allows us to update the relevant simulated stats in our ā€œhome baseā€ matchup_df at any time! The Pandas documentation describes it best:

The DataFrame’s length does not increase as a result of the update, only values at matching index/column labels are updated.
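A toy example (not from the model itself) makes that behavior concrete:

```python
import pandas as pd

# "home base" frame: one row per player, stats start at zero
df = pd.DataFrame({"sim_seconds": [0.0, 0.0, 0.0]}, index=["A", "B", "C"])

# a sliced copy containing only the players we want to touch
on_floor = df.loc[["A", "C"]].copy()
on_floor["sim_seconds"] += 17.0

# update() writes back only at matching index/column labels;
# player "B" is untouched and no rows are added
df.update(on_floor)
print(df["sim_seconds"].tolist())  # [17.0, 0.0, 17.0]
```

Only rows A and C change; B keeps its original value, and the frame stays the same length.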

The Exponential Distribution

I wanted to use each player’s total season stats to come up with a “rate parameter” for how often they score points, collect rebounds, steal the ball, etc. This idea was a perfect fit for the properties of the exponential distribution…we could simply look at (seconds played / choose your stat) to come up with the average amount of time it takes a player to do that particular thing. We can then compare this with the amount of time they just spent on the floor in our simulation to calculate the probability that the given event will occur in that possession!
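As a quick worked example with made-up numbers: a player with 50 steals in 500 minutes averages one steal every 600 seconds of game time, so the exponential CDF gives the chance of a steal in a 17-second possession:

```python
import numpy as np

# hypothetical season line: 50 steals in 500 minutes
seconds_per_steal = 500 * 60 / 50          # theta = 600 seconds per steal
possession_seconds = 17

# exponential CDF: P(at least one steal in t seconds) = 1 - e^(-t / theta)
p_steal = 1 - np.exp(-possession_seconds / seconds_per_steal)
print(round(p_steal, 4))  # 0.0279
```

So a brief possession gives this player roughly a 2.8% chance of a steal, and the probability grows the longer they’re on the floor.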

The snapshot of code above ended up being a nice template for pretty much all of the stats I needed to model. The logic for some other stats has extra nuance in certain important cases (for example, the shooting logic has extra code to decide whether a 2-point or a 3-point shot was taken).

An extremely recent addition (like, this morning) to the model was the idea of “relative strength” (the defensive_strengths modifier in the steal code below). I was getting really skewed game outcomes in situations where a team that played a weak schedule (lower-tier teams) all year and generated really nice-looking stats was consistently beating a team that played a strong schedule (higher-tier teams) all year but had stats that weren’t quite as flashy (probably because their games were relatively harder!). Luckily, Kenpom has a stat for everything…we can simply compare strength of schedule between two teams and make an appropriate adjustment to the random number generator for each event.
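The exact adjustment isn’t shown in this post; conceptually it could be a small shift proportional to the strength-of-schedule gap, something like this hypothetical sketch (the function name and scale factor are invented for illustration):

```python
import numpy as np

def strength_modifier(own_sos, opp_sos, scale=0.005):
    """Hypothetical: shift event probabilities by a small amount
    proportional to the strength-of-schedule gap between two teams."""
    return scale * (np.asarray(own_sos) - np.asarray(opp_sos))

# a team that played the tougher schedule gets a small boost;
# one modifier per simulated game, to stay vectorized
mods = strength_modifier([10.0, -5.0], [2.0, 3.0])
# later: rng.random(size=sample_size) - mods, as in the steal code
```

Subtracting the modifier from the uniform random draw effectively raises or lowers each event’s success threshold by a few percentage points.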

So, is everything an array now?

Pretty much! In the code example above, possession_length is an array of shot clocks, one for each game. (sample_size is the variable that determines how many games are being simulated simultaneously, i.e. how long each array needs to be.) There’s also a large “tracking” array that holds state data for each game (basically a bunch of 1s and 0s that flip on/off as necessary). Here’s the steal logic as an example of how these arrays flow through the model:

def steal_distribution(possession_length, defensive_teams, on_floor_df):
    steal_fields = [
        "Steals",
        "Minutes",
        "sim_steals",
    ]
    # dropping the PlayerID level allows us to slice out only the teams on defense in each sim.
    defensive_index = [(sim, team) for sim, team in enumerate(defensive_teams)]
    steal_probs = on_floor_df.loc[
        on_floor_df.index.droplevel("PlayerID").isin(defensive_index), steal_fields
    ]
    # let's try truly modeling as an exponential distribution.
    # first calculate theta (the scale parameter, 1 / lambda),
    # aka the average number of seconds between two events.
    # factor of 2 because you can only steal
    # when you're playing on defense!
    steal_probs["steal_chance_exp_theta"] = 1 / (
        2 * steal_probs["Steals"] / steal_probs["Minutes"] / 60
    )
    # now use theta to calculate the CDF per player, per possession.
    # x = the number of seconds just played in this possession.
    # numpy array is expanded 5x so each player of the 5 on this side can get their time
    steal_probs["no_steal_chance"] = np.e ** (
        -1 * np.repeat(possession_length, 5) / steal_probs["steal_chance_exp_theta"]
    )
    # we could also randomly sample an exponential for each player using numpy
    # and see if it's less than the number of possession_seconds. might be
    # interesting to see how this changes the simulation.
    steal_probs["steal_chance"] = 1 - steal_probs["no_steal_chance"]

    return steal_probs

# aggregate probabilities for the five players to come up with the team's chance to steal
team_steal_chances = (
    1 - steal_probs.groupby(level=0).prod()["no_steal_chance"].to_numpy()
)

# RNG to determine if a steal took place in each game.
# defensive strengths is an array of modifiers that we can use
# to raise/lower the chance of success depending on the relative strength
# of each team
steal_turnover_success = rng.random(size=sample_size) - defensive_strengths

# if there's a successful steal, credit the steal and turnover, then flip possession.
# games with steals don't do anything else until the loop restarts for a new possession.
successful_steals = steal_turnover_success < team_steal_chances
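The post doesn’t show the possession flip itself, but with everything as arrays it might look something like this sketch (all variable names besides successful_steals are invented):

```python
import numpy as np

rng = np.random.default_rng()
sample_size = 4

# toy state: which team has the ball in each simulated game (0 = away, 1 = home)
possession_state = np.array([0, 0, 1, 1])
team_steal_chances = np.array([0.9, 0.1, 0.9, 0.1])

# games where the RNG landed under the steal chance
successful_steals = rng.random(size=sample_size) < team_steal_chances

# flip possession only in the games with a steal;
# every other game continues its possession untouched
possession_state = np.where(
    successful_steals, 1 - possession_state, possession_state
)
```

One np.where call updates every game at once, which is the whole point of the array-based design: no per-game branching, just masks.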

For the prototype I described in Part 1, all of these variables were simple scalar values! By expanding them to arrays where each value in the array represents a different game, we’re able to operate on many more games in parallel, in a much more scalable fashion.

Next time…

In Part 3 we’ll talk final output. I’ll get into the storage limitations I ran into with the free tier of MongoDB, and the little trick I implemented so the end user can still see a full simulated box score for every game in their simulated bracket. (We might even talk React…it’s still ugly, but I’m learning.) And I’ll link you to the final webpage where you can generate your own bracket for this year!

Thanks for reading! As always, feel free to share any questions, thoughts, or perspectives you have, technical or otherwise!