A process, inspired by SRS (Simple Rating System), to find a batter's "true" batting average (and a pitcher's, and a ballpark's, while we're at it).
If a batter gets a hit in a league that has a .250 batting average, we can say that he earned .75 hits above average (HAA). 1 hit minus .25 expected hits = .75 HAA. If he has a hitless at-bat, he earned -.25 HAA. But not all at-bats are equal in difficulty; in reality the batter's "expected" batting average varies depending on the pitcher, ballpark, and other factors.
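As a minimal sketch of that bookkeeping (the function name `hits_above_average` is my own label for illustration, not something from the data files), HAA is just actual hits minus expected hits:

```python
def hits_above_average(hits, at_bats, expected_ba):
    """Actual hits minus the hits expected at the given batting average."""
    return hits - at_bats * expected_ba


# One at-bat in a .250 league: a hit, then a hitless at-bat.
hit_haa = hits_above_average(1, 1, 0.250)
out_haa = hits_above_average(0, 1, 0.250)
print(hit_haa, out_haa)
```

The same function works for a whole season: pass total hits, total at-bats, and whatever expected average applies.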
From the 2023 events csv, take the following columns: gid, inning, top_bot, ballpark, batteam, pitteam, batter, pitcher, bathand, pithand, ab, single, double, triple, and home run. To these add a 16th column: 'H', which sums the single, double, triple, and home run columns. Delete all the rows from the All-Star Game.
Now add a ballparks sheet, with the first two columns from the ballparks csv: PARKID and NAME. Label the next three column heads AB, H, and BA. Find the AB and H sum for each ballpark, sort by highest AB, and then calculate the batting avg. for all the parks that hosted an MLB game in 2023. There were 30 regular parks, each with over 5,000 at-bats, plus three special "guest" parks -- in London, Mexico City, and Williamsport, PA -- that each hosted a game or two.
Now create a batters sheet from the first three columns of the biofile csv (id, lastname, and usename), and calculate the AB, H, and BA for 2023 batters. 656 batters had at least one at-bat in 2023, led by Marcus Semien with 746 regular and postseason at-bats.
Next copy and paste that sheet onto a new sheet called pitchers, and do the same thing for pitchers. 864 pitchers faced at least one at-bat in '23, led by Zac Gallen with 915.
And finally, a sheet for MLB totals: 41,501 hits in 167,165 at-bats, a .248 average.
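For anyone who would rather script the setup than build it in a spreadsheet, here is a rough sketch of the same AB/H/BA tallies. The rows below are made-up stand-ins for the trimmed events sheet (the IDs and values are hypothetical, not real 2023 data), and `totals_by` is my own helper name:

```python
from collections import defaultdict

# Hypothetical rows standing in for the trimmed events sheet (not real data).
events = [
    {"ballpark": "PARK1", "batter": "bat01", "pitcher": "pit01", "ab": 1, "H": 1},
    {"ballpark": "PARK1", "batter": "bat02", "pitcher": "pit01", "ab": 1, "H": 0},
    {"ballpark": "PARK1", "batter": "bat01", "pitcher": "pit02", "ab": 1, "H": 0},
    {"ballpark": "PARK2", "batter": "bat02", "pitcher": "pit02", "ab": 1, "H": 1},
    {"ballpark": "PARK2", "batter": "bat01", "pitcher": "pit02", "ab": 0, "H": 0},
]


def totals_by(rows, key):
    """Sum AB and H for each value of `key`, add BA, sort by most at-bats."""
    sums = defaultdict(lambda: {"AB": 0, "H": 0})
    for row in rows:
        sums[row[key]]["AB"] += row["ab"]
        sums[row[key]]["H"] += row["H"]
    # Skip anyone with zero at-bats so BA is always defined.
    table = {
        k: {"AB": s["AB"], "H": s["H"], "BA": s["H"] / s["AB"]}
        for k, s in sums.items()
        if s["AB"] > 0
    }
    return dict(sorted(table.items(), key=lambda kv: kv[1]["AB"], reverse=True))


parks = totals_by(events, "ballpark")
batters = totals_by(events, "batter")
pitchers = totals_by(events, "pitcher")

# League totals: one overall batting average.
league_ab = sum(row["ab"] for row in events)
league_h = sum(row["H"] for row in events)
league_ba = league_h / league_ab
```

One function covers all three sheets because the only thing that changes is the column being grouped on.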
Now that I'm done setting up, my first step will be to find the expected batting average (xBA) of each at-bat based on the raw batting averages of the hitter and pitcher involved.
In the 8th inning of the Angels' 2023 home opener against the Blue Jays, Shohei Ohtani, who ended up with a .304 batting average, led off against Erik Swanson, who ended up with a .213 batting average allowed.
The simplest way to find the xBA for the at-bat is to add their batting averages and subtract the MLB average: .304 + .213 - .248 = .269.
Another way is to multiply and divide: .304 x .213 / .248 = .261.
The most sophisticated option is the odds ratio method: the batter's odds times the pitcher's odds divided by the league-average odds equals the odds the at-bat will result in a hit.
(.304 / .696) x (.213 / .787) / (.248 / .752) = .358 hits to 1 out.
.358 / (.358 + 1) = .264 xBA
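Here are the three options side by side as a short sketch, using the rounded Ohtani/Swanson figures from above. The function names are mine, chosen only for illustration:

```python
def xba_additive(batter_ba, pitcher_ba, league_ba):
    """Add the two averages and subtract the league average."""
    return batter_ba + pitcher_ba - league_ba


def xba_multiplicative(batter_ba, pitcher_ba, league_ba):
    """Multiply the two averages and divide by the league average."""
    return batter_ba * pitcher_ba / league_ba


def odds(p):
    """Convert a probability into odds (hits per out)."""
    return p / (1 - p)


def xba_odds_ratio(batter_ba, pitcher_ba, league_ba):
    """Combine the odds, then convert the result back to a probability."""
    matchup_odds = odds(batter_ba) * odds(pitcher_ba) / odds(league_ba)
    return matchup_odds / (matchup_odds + 1)


ohtani, swanson, league = 0.304, 0.213, 0.248
for method in (xba_additive, xba_multiplicative, xba_odds_ratio):
    print(method.__name__, round(method(ohtani, swanson, league), 3))
```

All three land within a few points of each other for a typical matchup; they diverge more as the inputs move toward the extremes.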
But the odds ratio method won't work on this initial iteration, because some batters and pitchers with only an at-bat or two have a 1.000 BA, which will result in a divide-by-zero error. (Once I've gone around the block once and gotten a regressed-to-the-mean true-talent batting average estimate for every batter and pitcher, I can use it.)
Besides, I want to include the platoon (dis)advantage in my calculations. Using the simple add/subtract method, I take the platoon BA, add the batter's BA and the pitcher's BA, and subtract the batter's league-average BA and the pitcher's league-average BA. (The league-average BA is the same for everybody: .248.)
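Left-handed batters hit .251 against right-handed pitchers in 2023, so the xBA for the Ohtani/Swanson matchup would be: .251 + .304 + .213 - .248 - .248 = .271. (The rounded figures shown here sum to .272; the spreadsheet presumably carries more decimal places, such as a league average of .2483, which gives .271.)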
Since Ohtani hit a double in this particular at-bat, the at-bat produced 1 - .271 = .729 HAA for Angel Stadium. That's right, step 1 is finding true(r) batting averages for ballparks.
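A small sketch of the platoon version, for this one at-bat. I'm assuming the league average is taken unrounded from the MLB totals (41,501 hits in 167,165 at-bats); the other three inputs are the rounded figures quoted above, and `xba_platoon` is a name I made up:

```python
# League average computed from the MLB totals rather than the rounded .248.
LEAGUE_BA = 41_501 / 167_165


def xba_platoon(platoon_ba, batter_ba, pitcher_ba, league_ba=LEAGUE_BA):
    """Add the three averages, then subtract the league average twice."""
    return platoon_ba + batter_ba + pitcher_ba - 2 * league_ba


# Left-handed batter vs. right-handed pitcher, Ohtani vs. Swanson.
xba = xba_platoon(0.251, 0.304, 0.213)

# Ohtani doubled, so the at-bat is worth 1 hit minus the expected hits.
haa = 1 - xba
print(round(xba, 3), round(haa, 3))
```

The league average is subtracted twice because three averages are being added and each one already contains the league baseline.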
Overall, the Big A had -17 HAA in 5,515 at-bats. To find the park's true batting average (trueBA), multiply its at-bats by the league-average BA, add its HAA, and divide by its at-bats.
(5,515 x .248 - 17) / 5,515 = .245
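The same calculation as a sketch in code, with a hypothetical helper name (`true_ba`) and the Angel Stadium numbers from above:

```python
def true_ba(at_bats, haa, league_ba):
    """League-average hits plus hits above average, per at-bat."""
    return (at_bats * league_ba + haa) / at_bats


big_a = true_ba(5515, -17, 0.248)
print(round(big_a, 3))
```

This is equivalent to the league average plus HAA per at-bat, so a park with zero HAA comes out exactly league average.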
Coors Field, unsurprisingly, had the highest trueBA of the 30 regular parks, at .270, while Citi Field had the lowest, .235.
Now a reasonable next step might be to repeat the process, using this first estimate of ballparks' trueBA to find the trueBA for either batters or pitchers.
But before I do that I'm going to tweak this first estimate a little more, which means I first need to get an initial trueBA for all ballparks from all seasons for which we have data...