Thursday, September 26, 2024

A Ballpark's "True-Talent" Batting Average (First Pass)

In the Angels' 2023 home opener, Toronto's George Springer led off the game against pitcher Patrick Sandoval. What was the expected batting average for the at-bat? That is, what was the percentage chance that an official at-bat (a hit, out, or reach-on-error) would end in a hit?

Combining regular and postseason numbers, Springer ended up hitting .257 that year. (That's probably not quite his "true-talent" batting average, which is what he would hit if he could somehow have a million at-bats in a single season, thereby eliminating luck as a contributing factor. And it's not even his adjusted batting average, which adjusts for the difficulty of his opponents and ballparks relative to the league average. But we have to start somewhere, and it's with raw hits per at-bat.) So it would be reasonable to assume that there was about a 25.7% chance the at-bat would end in a hit.

Sandoval, the pitcher, ended up with almost the same batting average allowed: .256. This confirms there is probably a 25.6% or 25.7% chance of a hit, right?

Well, that depends on the major league batting average, the average for all MLB. It was .248 last year, which means Springer and Sandoval were both slightly more hit-prone than average (good for Springer, bad for Sandoval). If Springer hit .257 against average pitching, and Sandoval allowed .256 against average hitting (again, we're assuming the true talents of both men in average environments, for now), and average is .248, then the batting average for the match-up would be higher than either of their individual averages.

There's a simple calculation for that: the batter's BA plus the pitcher's BA, minus the league BA. In this case, where the numbers are close to average, you can use this simple add/subtract formula or the odds ratio method and they both get you the same answer: .265.

.257 + .256 - .248 = .265
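As a rough check on that claim, here is a minimal Python sketch of both methods. The function names are my own placeholders, and the inputs are the rounded three-digit averages quoted above, so treat it as an illustration rather than the exact spreadsheet calculation.

```python
def xba_additive(bat_ba, pit_ba, lg_ba):
    """Expected BA: batter's BA plus pitcher's BA allowed, minus league BA."""
    return bat_ba + pit_ba - lg_ba


def xba_odds_ratio(bat_ba, pit_ba, lg_ba):
    """Expected BA from hit-to-out odds: batter odds x pitcher odds / league odds."""
    def odds(p):
        return p / (1 - p)

    matchup_odds = odds(bat_ba) * odds(pit_ba) / odds(lg_ba)
    # Convert the matchup odds back into a probability.
    return matchup_odds / (1 + matchup_odds)


# Springer (.257) vs. Sandoval (.256 allowed) in a .248 league
additive = xba_additive(0.257, 0.256, 0.248)
odds_based = xba_odds_ratio(0.257, 0.256, 0.248)
print(f"additive: {additive:.3f}  odds ratio: {odds_based:.3f}")
```

The two methods drift apart as the batter or pitcher moves further from the league average, which is why they agree so closely here.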

That's the expected batting average, or xBA, for the Springer/Sandoval matchup. Since Springer flied out, he got zero hits, which was .265 fewer hits than expected, or -.265 hits above average (HAA). If he'd gotten a hit, it would've been 1 - .265 = .735 more hits than expected, or .735 HAA.

Here's the rest of the 1st inning:

Batter   Pitcher batBA pitBA mlbBA  xBA Result   AB H   HAA
Springer Sandoval .257  .256  .248 .265 Flyout    1 0 -.265
Bichette Sandoval .309  .256  .248 .317 Double    1 1  .683
Guerrero Sandoval .263  .256  .248 .270 Groundout 1 0 -.270
Chapman  Sandoval .238  .256  .248 .246 Strikeout 1 0 -.246
Ward     Bassitt  .253  .235  .248 .240 Walk      0 0
Trout    Bassitt  .263  .235  .248 .250 Home Run  1 1  .750
Ohtani   Bassitt  .304  .235  .248 .291 Strikeout 1 0 -.291
Renfroe  Bassitt  .233  .235  .248 .220 Error     1 0 -.220
Lamb     Bassitt  .216  .235  .248 .202 Lineout   1 0 -.202
Drury    Bassitt  .262  .235  .248 .249 Flyout    1 0 -.249
                                        Total     9 2 -.308

In the first inning of the first game at Angel Stadium in 2023 there were 9 at-bats and 2 hits, which was about 0.3 fewer hits than expected, or -0.3 HAA.
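The inning total can be re-derived with a short Python sketch. The data structure and function name are my own, and the inputs are the rounded averages from the table, so the total lands near -.31 rather than exactly on the table's -.308 (which presumably came from unrounded averages).

```python
LEAGUE_BA = 0.248

# (batter, batter BA, pitcher BA allowed, at-bats, hits), from the table above
first_inning = [
    ("Springer", 0.257, 0.256, 1, 0),
    ("Bichette", 0.309, 0.256, 1, 1),
    ("Guerrero", 0.263, 0.256, 1, 0),
    ("Chapman", 0.238, 0.256, 1, 0),
    ("Ward", 0.253, 0.235, 0, 0),      # walk: not an official at-bat
    ("Trout", 0.263, 0.235, 1, 1),
    ("Ohtani", 0.304, 0.235, 1, 0),
    ("Renfroe", 0.233, 0.235, 1, 0),   # reached on error: an at-bat, no hit
    ("Lamb", 0.216, 0.235, 1, 0),
    ("Drury", 0.262, 0.235, 1, 0),
]


def hits_above_average(bat_ba, pit_ba, at_bats, hits, lg_ba=LEAGUE_BA):
    """Actual hits minus expected hits (xBA times at-bats)."""
    xba = bat_ba + pit_ba - lg_ba
    return hits - at_bats * xba


total_ab = sum(row[3] for row in first_inning)
total_hits = sum(row[4] for row in first_inning)
total_haa = sum(hits_above_average(*row[1:]) for row in first_inning)
print(total_ab, total_hits, round(total_haa, 3))
```

Multiplying xBA by the at-bat count is what makes the walk drop out: zero at-bats means zero expected hits and zero HAA.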

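For the entire 2023 season, Angel Stadium had 5,515 at-bats, 1,341 hits (a .243 average), and -11 HAA. To get the park's adjusted hits, multiply its at-bats by the league average and add its HAA. (The total below uses the unrounded league average, about .2483; plugging in the rounded .248 gives 1,357.)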

5,515 x .248 - 11 = 1,358

Divide the adjusted hits back into the at-bats to get the adjusted batting average: .246 for the Big A, slightly below average, but a little higher than its simple batting average of .243.
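Here is the same calculation as a small Python sketch. The function name is hypothetical, and I'm assuming the league average is taken unrounded from the 2023 MLB totals used in this series (41,501 hits in 167,165 at-bats), since that is what reproduces the 1,358 figure.

```python
LEAGUE_BA = 41_501 / 167_165  # 2023 MLB hits / at-bats, about .2483


def adjusted_hits_and_ba(at_bats, haa, lg_ba=LEAGUE_BA):
    """Adjusted hits = league-average hits for these at-bats, plus the park's HAA."""
    adjusted_hits = at_bats * lg_ba + haa
    return adjusted_hits, adjusted_hits / at_bats


angel_hits, angel_ba = adjusted_hits_and_ba(5_515, -11)
print(round(angel_hits), f"{angel_ba:.3f}")
```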

Here are the numbers for all 33 ballparks that hosted MLB games in 2023, sorted by adjusted BA:

Ballpark                   AB HAA adjBA
Estadio Alfredo Harp Helú 149  14  .342
Coors Field             5,705 127  .270
Fenway Park             5,613  77  .262
Kauffman Stadium        5,564  65  .260
LoanDepot Park          5,482  45  .256
Globe Life Field        5,939  37  .255
Nationals Park          5,490  34  .254
London Stadium            139   1  .253
Busch Stadium           5,491  26  .253
Guaranteed Rate Field   5,556  18  .251
Comerica Park           5,535  16  .251
Minute Maid Park        5,903  10  .250
Target Field            5,715  10  .250
Yankee Stadium          5,369   3  .249
Wrigley Field           5,492   2  .249
Chase Field             5,947   0  .248
American Family Field   5,485  -4  .248
PNC Park                5,450  -5  .247
Truist Park             5,647  -7  .247
Angel Stadium           5,515 -11  .246
Oracle Park             5,412 -13  .246
Camden Yards            5,553 -21  .244
Progressive Field       5,468 -27  .243
Tropicana Field         5,555 -29  .243
Petco Park              5,205 -29  .243
Great American Ballpark 5,510 -34  .242
Citizens Bank Park      6,011 -40  .242
Dodger Stadium          5,546 -37  .242
Rogers Centre           5,463 -38  .241
Oakland Coliseum        5,438 -47  .240
T-Mobile Park           5,427 -69  .236
Citi Field              5,327 -69  .235
Bowman Field               64  -2  .224

Now that I have initial estimates of adjusted BA for all 1,890 ballpark-seasons from 1950 through 2023, I can take a first crack at their "true-talent" BA. For that I use the six surrounding years of each ballpark-season -- the three years before and the three years after -- to try to "project" that park's HAA, and then compare the projection to its actual HAA. I actually made two versions of this, one where I use all six surrounding years and one where I pretend not to know what happened in the three years after (which isn't pretending for the 2021-23 seasons). So I'll demonstrate the latter using Angel Stadium 2023 as my example again:

Year Ballpark         AB HAA
2020 Angel Stadium 2,118  49
2021 Angel Stadium 5,520  -4
2022 Angel Stadium 5,444  27
Total             13,082  73
Prorated           5,515  31
2023 Angel Stadium 5,515 -11

There were 13,082 at-bats and 73 HAA at Angel Stadium in the three years before 2023. Prorate that to the 5,515 at-bats Angel Stadium had in 2023 and I get 31 HAA, which is the projected HAA for Angel Stadium in 2023. The Big A actually had -11 HAA in 2023, an "error" of 42 hits.
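The proration is just a rescaling by at-bats. A minimal Python sketch, using the totals row from the table above (the function name is my own):

```python
def prorate(total_haa, total_ab, target_ab):
    """Scale an HAA total to the number of at-bats in the target season."""
    return total_haa * target_ab / total_ab


projected_haa = prorate(73, 13_082, 5_515)   # 2020-22 totals scaled to 2023
actual_haa = -11
error = abs(actual_haa - projected_haa)
print(round(projected_haa), round(error))
```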

Add up all the differences between actual and projected HAA for all 1,848 ballpark-seasons from 1953 to 2023, and I get 49,405. The idea is to lower this number as much as possible, and I can do that by weighting seasons by their proximity to the target year: if I'm projecting 2023, then 2021 gets less weight than 2022, and 2020 gets less weight than 2021. I can also regress to the mean by adding a number of average (0 HAA) at-bats, which provides ballast (and a sanity check for some of those "special guest" parks that only hosted a game or two).

If I'm only looking at the three previous years, I got the best results with a 25% "decay" rate -- meaning the year before the target year is weighted at 75%, two years before is weighted at 75% of that (56%), and three years before is weighted at 75% of that (42%). And I added 2,900 at-bats of regression, which is a huge amount -- more than half a season's worth for most ballparks.

Year Ballpark      weight   wAB wHAA
2020 Angel Stadium    .42   894   21
2021 Angel Stadium    .56 3,105   -2
2022 Angel Stadium    .75 4,083   21
Regression to the Mean    2,900    0
Total                    10,982   39
Prorated                  5,515   20
2023 Angel Stadium   1.00 5,515  -11

That gets the projected HAA down to 20, the error down to 31, and the total error for all ballpark-seasons (1953-2023) down to 47,362.
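A hedged Python sketch of the weighted projection follows. The function and its parameter names are my own invention, not the spreadsheet's, and because it starts from the rounded season HAA figures it comes out around 19.4 rather than exactly 20. Setting the decay and the regression to zero recovers the plain unweighted projection, which is a handy way to see what each knob does.

```python
def project_haa(prior_seasons, target_ab, decay=0.25, regression_ab=2_900):
    """Project HAA for a target season from earlier seasons.

    prior_seasons: list of (years_before_target, at_bats, haa) tuples.
    Each season is weighted by (1 - decay) ** years_before_target, and
    regression_ab league-average at-bats (0 HAA) are added as ballast.
    """
    weighted_ab = float(regression_ab)
    weighted_haa = 0.0
    for years_back, at_bats, haa in prior_seasons:
        weight = (1 - decay) ** years_back
        weighted_ab += weight * at_bats
        weighted_haa += weight * haa
    return weighted_haa * target_ab / weighted_ab


# Angel Stadium 2020-22, projecting 2023 (5,515 at-bats, actual HAA of -11)
angel_stadium = [(3, 2_118, 49), (2, 5_520, -4), (1, 5_444, 27)]
weighted = project_haa(angel_stadium, 5_515)
unweighted = project_haa(angel_stadium, 5_515, decay=0.0, regression_ab=0)
print(round(weighted, 1), round(unweighted, 1))
```

Finding the best decay rate and regression amount would then be a matter of running this over every ballpark-season and keeping the pair of values with the smallest total error.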

There's actually a third version of true-talent BA, where I don't know what happened in the years after OR in the target year, in which case it really is a projection, as if I were writing this in the 2022-23 offseason. Then my work is done: the prorated, weighted HAA (20 for Angel Stadium 2023) is the true-talent HAA. If Angel Stadium has 5,515 at-bats and the MLB average is .248, I multiply those two, add the 20 HAA, divide by the at-bats, and get a projected .252 true-talent batting average, or true BA, for Angel Stadium 2023.

Now, back to the version where I know what happened in the target year. To get true HAA, I add the target year's numbers into the total before prorating:

Year Ballpark      weight   wAB wHAA
2020 Angel Stadium    .42   894   21
2021 Angel Stadium    .56 3,105   -2
2022 Angel Stadium    .75 4,083   21
2023 Angel Stadium   1.00 5,515  -11
Regression to the Mean    2,900    0
Total                    16,497   28
Prorated                  5,515    9

The "true talent" of Angel Stadium 2023, as best as I can currently determine it, was 9 HAA in 5,515 at-bats, which works out to a true BA of (5,515 x .248 + 9) / 5,515 = .250.
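In code, the only change from a pure projection is that the target year joins the pool at full weight before prorating. A minimal Python sketch, with hypothetical function names and the rounded figures from the tables above as inputs:

```python
def true_haa(seasons, target_ab, decay=0.25, regression_ab=2_900):
    """Weighted, regressed HAA prorated to the target season's at-bats.

    seasons: list of (years_before_target, at_bats, haa); the target year
    itself is included with years_before_target = 0, i.e. a weight of 1.
    """
    weighted_ab = float(regression_ab)   # regression at-bats carry 0 HAA
    weighted_haa = 0.0
    for years_back, at_bats, haa in seasons:
        weight = (1 - decay) ** years_back
        weighted_ab += weight * at_bats
        weighted_haa += weight * haa
    return weighted_haa * target_ab / weighted_ab


def true_ba(haa, at_bats, lg_ba):
    """League-average hits plus HAA, divided back into at-bats."""
    return (at_bats * lg_ba + haa) / at_bats


angel_stadium = [(3, 2_118, 49), (2, 5_520, -4), (1, 5_444, 27), (0, 5_515, -11)]
haa_2023 = true_haa(angel_stadium, 5_515)
ba_2023 = true_ba(haa_2023, 5_515, 0.248)
print(round(haa_2023), f"{ba_2023:.3f}")
```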

Instead of four seasons of unweighted data from which to project true BA (25% of which is the target year), I now have the equivalent of about 3.27 years: 0.42 for year-3, 0.56 for year-2, 0.75 for year-1, a full year for the target year, and about 0.54 years' worth of regression. That makes the target year about 30% of the projection, the three previous years about 53%, and the remaining 17% regression to the mean. (That's all based on an average ballpark-season of 5,331 at-bats. Obviously, partial seasons like 2020, part-time parks, and parks with multiple full-time tenants, like Dodger Stadium 1962-65, change that math.)
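Those shares can be sanity-checked with a few lines of Python. The variable names are mine, and because this uses the unrounded weights (.75, .5625, .421875) the results differ from the figures above in the last digit or so.

```python
AVG_SEASON_AB = 5_331   # average ballpark-season, per the text above
DECAY = 0.25
REGRESSION_AB = 2_900

prior_years = sum((1 - DECAY) ** k for k in (1, 2, 3))   # year-1 through year-3
regression_years = REGRESSION_AB / AVG_SEASON_AB
total_years = 1 + prior_years + regression_years         # target year counts as 1

target_share = 1 / total_years
prior_share = prior_years / total_years
regression_share = regression_years / total_years
print(round(total_years, 2),
      f"{target_share:.1%} {prior_share:.1%} {regression_share:.1%}")
```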

So to recap, if I didn't know what happened in 2023 (but somehow still knew the MLB average would be .248), I would project Angel Stadium to have a .252 true BA. Angel Stadium actually had a .246 (adjusted) BA in 2023. Knowing that, and not knowing what will happen in the 2024-26 seasons, I estimate that Angel Stadium had a .250 true BA in 2023.

Here are the 33 ballparks of 2023 again, now sorted by true BA:

Ballpark                   AB HAA adjBA trueHAA trueBA
Coors Field             5,705 127  .270      92   .264
Kauffman Stadium        5,564  65  .260      49   .257
Fenway Park             5,613  77  .262      47   .257
Estadio Alfredo Harp Helú 149  14  .342       1   .253
Great American Ballpark 5,510 -34  .242      16   .251
Chase Field             5,947   0  .248      15   .251
Globe Life Field        5,939  37  .255      14   .251
PNC Park                5,450  -5  .247      12   .250
LoanDepot Park          5,482  45  .256      11   .250
Angel Stadium           5,515 -11  .246       9   .250
Wrigley Field           5,492   2  .249       6   .249
Nationals Park          5,490  34  .254       5   .249
Busch Stadium           5,491  26  .253       3   .249
Truist Park             5,647  -7  .247       3   .249
London Stadium            139   1  .253       0   .248
Guaranteed Rate Field   5,556  18  .251       0   .248
Minute Maid Park        5,903  10  .250      -2   .248
Oracle Park             5,412 -13  .246      -2   .248
Comerica Park           5,535  16  .251      -2   .248
Camden Yards            5,553 -21  .244      -5   .247
Target Field            5,715  10  .250      -7   .247
Bowman Field               64  -2  .224       0   .247
Dodger Stadium          5,546 -37  .242      -8   .247
Citizens Bank Park      6,011 -40  .242      -9   .247
American Family Field   5,485  -4  .248      -9   .247
Progressive Field       5,468 -27  .243     -15   .245
Yankee Stadium          5,369   3  .249     -20   .245
Rogers Centre           5,463 -38  .241     -22   .244
Oakland Coliseum        5,438 -47  .240     -25   .244
Tropicana Field         5,555 -29  .243     -33   .242
Petco Park              5,205 -29  .243     -34   .242
Citi Field              5,327 -69  .235     -44   .240
T-Mobile Park           5,427 -69  .236     -53   .238

Estadio Alfredo Harp Helú in Mexico City, with only two games played there (and none before last year), has its .342 adjusted BA regressed all the way down to a .253 true BA, which still makes it the 4th-most hit-conducive park in baseball last year, by my estimates.

Great American Ballpark played like a pitcher's park in 2023 after several years in a row of HAA well into the positive. Once the previous seasons are accounted for, GABP jumps nine points from its adjusted BA (.242) to its true BA (.251).

Now for the version where I know what happened in the three years before, the target year, AND the three years after. The parameters are different. Because I have six surrounding years of data for most parks instead of only three, the "decay" rate is a little higher, 35%, which means each individual surrounding season carries less weight compared to the target season. Year-1 and year+1 are weighted at 65%, year-2 and year+2 at 42%, and year-3 and year+3 at 27%. I also need much less regression, only 1,100 at-bats, or about one-fifth of a season for most parks. That gives me the equivalent of about 3.9 seasons' worth of data, 26% of which is the target year, 69% the surrounding years, and 5% regression.

I'll demonstrate using 1998 Coors Field, which was the park's fourth season (and its first hosting the All-Star Game), and in the middle of the seven years of pre-humidor Coors:

Year Ballpark     weight   wAB wHAA
1995 Coors Field     .27 1,467   50
1996 Coors Field     .42 2,482  108
1997 Coors Field     .65 3,778   95
1998 Coors Field    1.00 5,793  168
1999 Coors Field     .65 3,863  133
2000 Coors Field     .42 2,469   93
2001 Coors Field     .27 1,579   49
Regression to the Mean   1,100    0
Total                   22,532  698
Prorated                 5,793  179

Coors Field 1998 had 179 true HAA. The MLB average in '98 was .266, so Coors Field's true BA was (5,793 x .266 + 179) / 5,793 = .297. That's the 6th-highest figure of 1950-2023, behind only the other five Coors Field seasons of 1995-2000.
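For the two-sided version, the weights are symmetric around the target year. The Python sketch below first shows the weights, then reruns the Coors Field 1998 arithmetic from the already-weighted rows of the table above; the names are my own, and the row values are the rounded ones shown, so the column sums differ from the table's totals by a hit or two.

```python
DECAY = 0.35
REGRESSION_AB = 1_100

# Weight for each year offset from the target year (-3 .. +3); shown for
# reference only, since the rows below are already weighted.
weights = {offset: (1 - DECAY) ** abs(offset) for offset in range(-3, 4)}

# (year, weighted at-bats, weighted HAA), copied from the table above
coors_rows = [
    (1995, 1_467, 50),
    (1996, 2_482, 108),
    (1997, 3_778, 95),
    (1998, 5_793, 168),   # target year, weight 1.00
    (1999, 3_863, 133),
    (2000, 2_469, 93),
    (2001, 1_579, 49),
]

target_ab = 5_793
total_ab = sum(ab for _, ab, _ in coors_rows) + REGRESSION_AB
total_haa = sum(haa for _, _, haa in coors_rows)   # regression adds 0 HAA
true_haa_1998 = total_haa * target_ab / total_ab
true_ba_1998 = (target_ab * 0.266 + true_haa_1998) / target_ab
print(round(true_haa_1998), f"{true_ba_1998:.3f}")
```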

Now let's back up a couple years and look at 1996, Coors Field's second season, to demonstrate again how true BA changes from one version to another as I get more context. Coors Field '95, its inaugural season, had 5,343 at-bats, 183 HAA, and a .301 adjusted BA (the latter two figures were both the highest since at least 1950 up to that point). If I'm projecting 1996, I weight those numbers at 75% (4,007 at-bats and 137 HAA), then add the 2,900 at-bats of regression. That's 137 HAA in 6,907 at-bats, which prorates to 117 HAA in the 5,875 at-bats Coors had in '96. Multiply the 5,875 at-bats by the .270 MLB average in '96 and add the 117 HAA, and I get a .289 projected true BA for Coors Field 1996.

Coors Field actually had 256 HAA and a .313 adjusted BA, both of which are the highest figures of 1950-2023 (except for a few guest parks with higher adjusted BAs, like Estadio Monterrey that same year with a .326 adjusted BA in 220 at-bats). Add the '96 numbers to the weighted '95 and regression totals, and I get 393 HAA in 12,782 at-bats, which prorates to 181 HAA in 5,875 at-bats, and a .301 true BA.

After three more years, I have weighted numbers for 1997 (3,778 AB and 95 HAA), 1998 (2,448 AB and 71 HAA), and 1999 (1,632 AB and 56 HAA), as well as new weights for 1995 (3,473 AB and 119 HAA). In this case it doesn't change the results much; the 1997-99 seasons only confirmed what we already conjectured about Coors Field from the 1995-96 seasons. Add all those to the unweighted numbers for '96, plus 1,100 at-bats of regression, and I get 18,306 AB and 598 HAA, which prorates to 192 HAA and a .302 true BA, both of which are (as you probably guessed by now) the highest figures of 1950-2023. And that's my best estimate (for now) of ballparks' true-talent batting average.
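The three Coors Field 1996 estimates can be lined up side by side in a short Python sketch. The helper name is mine, and I only reproduce the HAA figures here: the true BA figures depend on the unrounded 1996 league average, which presumably explains why plugging in a flat .270 can land a point off.

```python
def prorate(haa, at_bats, target_ab):
    """Scale a pooled HAA total to the target season's at-bats."""
    return haa * target_ab / at_bats


TARGET_AB, TARGET_HAA = 5_875, 256   # Coors Field 1996, actual

# Version 1: pure projection from 1995 at 75%, plus 2,900 regression at-bats
w95_ab, w95_haa = 0.75 * 5_343, 0.75 * 183
projected = prorate(w95_haa, w95_ab + 2_900, TARGET_AB)

# Version 2: same pool, with the 1996 season itself added at full weight
with_target = prorate(w95_haa + TARGET_HAA,
                      w95_ab + 2_900 + TARGET_AB, TARGET_AB)

# Version 3: weighted (AB, HAA) for 1995, 1997, 1998, 1999 as quoted above,
# plus 1996 at full weight and 1,100 regression at-bats
surrounding = [(3_473, 119), (3_778, 95), (2_448, 71), (1_632, 56)]
pool_ab = sum(ab for ab, _ in surrounding) + TARGET_AB + 1_100
pool_haa = sum(haa for _, haa in surrounding) + TARGET_HAA
with_all_years = prorate(pool_haa, pool_ab, TARGET_AB)

print(round(projected), round(with_target), round(with_all_years))
```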

At the other end of the spectrum, Safeco Field 2000 had the fewest true HAA, -87, and Candlestick Park '68 had the lowest true BA, .225.

Next: true-talent batting average for PITCHERS.

Tuesday, September 3, 2024

Ballpark True Batting Average Part 1 (Update)

I decided to drop the platoon matchup. In 2022 switch hitters hit .231 as a group (against a dismal-enough MLB average of .242) and it doesn't seem fair to give bad hitters a handicap (or good hitters a penalty). I could adjust the averages up or down so each group collectively hits the MLB average for the sake of platoon environments, but that might open up other cans of worms and seems like more trouble than it's worth anyway.

So, here are the formulas:

xBA (expected batting avg.) = batterBA + pitcherBA - mlbBA

HAA (hits above average) = H - xBA

trueH = AB x mlbBA + HAA

trueBA = trueH / AB

When Shohei Ohtani (.304 BA) led off the 8th inning of the Angels' 2023 home opener against Toronto's Erik Swanson (.213 BA allowed), the xBA was .304 + .213 - .248 = .269.

Ohtani doubled, so the at-bat generated 1 - .269 = .731 hits above average.

Angel Stadium had -11 hits above average in 2023, in 5,515 at-bats, so it had 5,515 x .248 - 11 = 1,358 "true" hits.

And therefore the Big A's trueBA was 1,358 / 5,515 = .246.

Dropping the platoon matchup hardly changed the parks' trueBA at all: the highest of the thirty regular parks was still Coors Field, .270, while the lowest was still Citi Field, .235.

In 2022, the highest was Coors Field again, at .261, while the lowest was Petco Park, .229.

Monday, September 2, 2024

"True" Batting Average of Ballparks, Part 1

A process, inspired by SRS (Simple Rating System), to find a batter's "true" batting average (and a pitcher's, and a ballpark's, while we're at it).

If a batter gets a hit in a league that has a .250 batting average, we can say that he earned .75 hits above average (HAA). 1 hit minus .25 expected hits = .75 HAA. If he has a hitless at-bat, he earned -.25 HAA. But not all at-bats are equal in difficulty; in reality the batter's "expected" batting average varies depending on the pitcher, ballpark, and other factors.

From the 2023 events csv, take the following columns: gid, inning, top_bot, ballpark, batteam, pitteam, batter, pitcher, bathand, pithand, ab, single, double, triple, and home run. To these add a 16th column: 'H', which sums the single, double, triple, and home run columns. Delete all the rows from the All-Star Game.

Now add a ballparks sheet, with the first two columns from the ballparks csv: PARKID and NAME. Label the next three column heads AB, H, and BA. Find the AB and H sum for each ballpark, sort by highest AB, and then calculate the batting avg. for all the parks that hosted an MLB game in 2023. There were 30 regular parks, each with over 5,000 at-bats, plus three special "guest" parks -- in London, Mexico City, and Williamsport, PA -- that each hosted a game or two.

Now create a batters sheet from the first three columns from the biofile csv (id, lastname, and usename), and calculate the AB, H, and BA for 2023 batters. 656 batters had at least one at-bat in 2023, led by Marcus Semien with 746 regular and postseason at-bats.

Next copy and paste that sheet onto a new sheet called pitchers, and do the same thing for pitchers. 864 pitchers faced at least one at-bat in '23, led by Zac Gallen with 915.

And finally, a sheet for MLB totals: 41,501 hits in 167,165 at-bats, a .248 average.
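For anyone who would rather script the setup than build sheets, here is a rough Python sketch of the same tallies using only the standard library. Everything about the sample is an assumption for illustration: the four rows are a stand-in for the events csv, "hr" is my guess at a column name for home runs, and the park and batter labels are plain names rather than the ids the real files use.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for the events csv (hypothetical layout, four plate appearances)
SAMPLE = """ballpark,batter,ab,single,double,triple,hr
Angel Stadium,Springer,1,0,0,0,0
Angel Stadium,Bichette,1,0,1,0,0
Angel Stadium,Ward,0,0,0,0,0
Angel Stadium,Trout,1,0,0,0,1
"""

HIT_COLUMNS = ("single", "double", "triple", "hr")


def tally_by(rows, key):
    """Sum at-bats and hits for each distinct value in the given column."""
    totals = defaultdict(lambda: {"AB": 0, "H": 0})
    for row in rows:
        totals[row[key]]["AB"] += int(row["ab"])
        totals[row[key]]["H"] += sum(int(row[col]) for col in HIT_COLUMNS)
    return dict(totals)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
parks = tally_by(rows, "ballpark")
batters = tally_by(rows, "batter")

# Sort parks by at-bats, highest first, and show each park's batting average
for name, t in sorted(parks.items(), key=lambda kv: kv[1]["AB"], reverse=True):
    print(name, t["AB"], t["H"], f"{t['H'] / t['AB']:.3f}")
```

Calling tally_by with "pitcher" as the key (on a file that has that column) would give the pitchers sheet, and summing across all rows gives the MLB totals.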

Now that I'm done setting up, my first step will be to find the expected batting average (xBA) of each at-bat based on the raw batting averages of the hitter and pitcher involved.

In the 8th inning of the Angels' 2023 home opener against the Blue Jays, Shohei Ohtani, who ended up with a .304 batting average, led off against Erik Swanson, who ended up with a .213 batting average allowed.

The simplest way to find the xBA for the at-bat is to add their batting averages and subtract the MLB average: .304 + .213 - .248 = .269.

Another way is to multiply and divide: .304 x .213 / .248 = .261.

The most sophisticated option is the odds ratio method: the batter's odds times the pitcher's odds divided by the league-average odds equals the odds the at-bat will result in a hit.

(.304 / .696) x (.213 / .787) / (.248 / .752) = .358 hits to 1 out.

.358 / (.358 + 1) = .264 xBA

But the odds ratio method won't work on this initial iteration, because some batters and pitchers with only an at-bat or two have a 1.000 BA, which will result in a divide-by-zero error. (Once I've gone around the block once and gotten a regressed-to-the-mean true-talent batting average estimate for every batter and pitcher, I can use it.)
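To make the failure concrete, here is a small Python sketch of the three methods for the Ohtani/Swanson at-bat, followed by what happens when a 1.000 batting average goes through the odds ratio. The function names are placeholders of mine.

```python
def odds(p):
    """Hits per out; undefined (division by zero) when p is exactly 1.0."""
    return p / (1 - p)


def xba_odds_ratio(bat_ba, pit_ba, lg_ba):
    matchup_odds = odds(bat_ba) * odds(pit_ba) / odds(lg_ba)
    return matchup_odds / (1 + matchup_odds)


bat, pit, lg = 0.304, 0.213, 0.248
additive = bat + pit - lg            # add/subtract method
multiplicative = bat * pit / lg      # multiply/divide method
odds_based = xba_odds_ratio(bat, pit, lg)
print(f"{additive:.3f} {multiplicative:.3f} {odds_based:.3f}")

# A batter who is 1-for-1 has a 1.000 BA, so there are no outs to divide by.
try:
    xba_odds_ratio(1.000, pit, lg)
    perfect_batter_failed = False
except ZeroDivisionError:
    perfect_batter_failed = True
print(perfect_batter_failed)
```

The add/subtract method has no such problem, which is one reason it works for a first pass on raw averages.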

Besides, I want to include the platoon (dis)advantage in my calculations. Using the simple add/subtract method, I take the platoon BA, add the batter's BA and the pitcher's BA, and subtract the batter's league-average BA and the pitcher's league-average BA. (The league-average BA is the same for everybody: .248.)

Left-handed batters hit .251 against right-handed pitchers in 2023, so the xBA for the Ohtani/Swanson matchup would be: .251 + .304 + .213 - .248 - .248 = .271.

Since Ohtani hit a double in this particular at-bat, the at-bat produced 1 - .271 = .729 HAA for Angel Stadium. That's right, step 1 is finding true(r) batting averages for ballparks.

Overall, the Big A had -17 HAA in 5,515 at-bats. To find the park's true batting average (trueBA), multiply its at-bats by the league-average BA, add its HAA, and divide by its at-bats.

(5,515 x .248 - 17) / 5,515 = .245

Coors Field, unsurprisingly, had the highest trueBA of the 30 regular parks, at .270, while Citi Field had the lowest, .235.

Now a reasonable next step might be to repeat the process, using this first estimate of ballparks' trueBA to find the trueBA for either batters or pitchers.

But before I do that I'm going to tweak this first estimate a little more, which means I first need to get an initial trueBA for all ballparks from all seasons for which we have data...

The Hack Wilson Project

In Aaron Judge's first plate appearance of the 2024 season, March 28th in Houston, he came up with a runner on 1st and one out. In the 1...