See this project on GitHub pages
The data used to make these calculations is obtained from 2 APIs: MLB Gameday and MLB Stats API. The data is used to create a list of batting averages and on-base percentages that the batters are expected to have against the current pitcher. This list will be refered to as the xlineup
and represented by a list of nine ordered pairs, where the values in each pair are Out%
(the percent of plate appearances resulting in outs) and BB%
(the percent of plate appearance resulting in walks). For simplicity, a HBP is considered a type of BB The notation represents the m statistic of the nth batter. The number of outs remaining and the current or due-up batter are also important pieces of data.
The probability of a perfect game is the probability that all remaining batters get out. Therefore, with outs
representing the number of outs recorded, the probability of a perfect game is the product of the probabilities of each batter:
The probability of a no-hitter is a bit more complex. There is a chance of a no-hitter with 0 baserunners, with 1 baserunners, with 2 baserunners, etc. For a no-hitter with n baserunners, there are 27^n permutations for the order of outs and baserunners. To account for every possibility, the program uses a 2D-array. The row number represents the number of outs and the column number represents the number of baserunners. The table is limited to 32 columns, and he first column and row have an index of 0. Although more baserunners are theoretically possible, the probability of this is so small that it is negligible. The notation array[X][Y]
represents the cell in row X and column Y. The value in array[X][Y]
is the probability that the game reaches X outs with no hits and Y baserunners. The array is initialized by setting the cell representing the current situation equal to 1. Then, values are passed from cell to cell. Iteratively set cells using the following formulas:
Below is an example of an array with 2 rows and 2 columns, where Out%
=0.7 and BB%
=0.1:
[ P = 1 ] | [ P = 1*0.1 ] | |
[ P = 1*0.7 ] | [ P = 0.7*0.1+0.1*0.7 ] |
After this process fills the entire array, array[27]
will contain the probabilities of reaching the end of the game without allowing a hit for each number of baserunners. Simply sum this row to get the probability of a no-hitter.
The first step in creating the xlineup
is to calculate the batters' and pitcher's expected batting average and on-base percentage for future plate appearances. With small sample sizes, a player's statistics can be wildly different than their true performance level. For example, if a pitcher does not allow a hit in his first appearance of the season, his .000 batting average against would imply that a no-hitter his following performance is certain. For a more realistic value, a player's season statistics are regressed to their projected season statistics.
Regression to the mean is used to explain how larger sample sizes will be closer to an expected value than some small samples. This can be done by adding a fixed number of average values to the sample. The larger the real sample, the more reliable it is. Since some players have a higher average performance than others, the expected value for a player is their projected season statistics.
Fangraphs has multiple projection systems available. This program uses the Steamer projections. One issue is that the pitching projections do not have a batting average against nor a on-base percentage against. Using the provided stats WHIP
and BB/9
, I created estimations for AVG
and OBP
with strong correlations. My initial estimations were estOBP = WHIP/(WHIP+3)
and estAVG = (WHIP-BB9/9)/((WHIP-BB9/9)+3)
, but I found these did not quite have a slope of 1 when plotted against their actual statistics, so I multiply these each by a constant: estOBP = 0.9512*WHIP/(WHIP+3)
and estAVG = 0.9652*(WHIP-BB9/9)/((WHIP-BB9/9)+3)
.
The next step in creating the xlineup
is to find the batter's predicted satistics against a certain pitcher. A simple way to do this is using Bill James's Log5 formula. P = (xy/z) / (xy/z+(1-x)(1-y)/(1-z))
, where x is the batter's stats, y is the pitcher's stats, and z is the league average stat. This formula is used to get the estimated AVG and estimated OBP in a given matchup.
The probability of an out and the probability of a walk (or hit-by-pich) must be calculated from the AVG and OBP. The P('out'), or Out%
, is simple; 1-OBP
. The P('walk'), or BB%
, is a little more complex. This is OBP-P('hit')
. Notice that AVG does not equal P('hit'); AVG=P('hit' | 'at-bat')
, while we want P('hit' | 'plate appearance')
. I found that P('hit')=AVG*(1-OBP)/(1-AVG)
. So BB% = OBP-AVG*(1-OBP)/(1-AVG)
. A chance of an error 1-FLD%
is added to the BB%
, where FLD%
is the league-wide fielding percentage.
The following aspects are not currently considered in the calculations, but could be considered in the future:
- Double plays
- Caught stealing/stolen bases
- Extra innings/scoring runs
- Pitcher fatigue
- Team specific fielding percentage
- Count-adjusted statistics
- Other head-to-head formulas
- "Hot/Cold" (e.g. if a pitcher hasn't allowed a hit through 7 innings, they are likely performing better than their average, but the calculation uses their season averages)
I also plan to add more detailed information to the user interface, including:
- Pitcher name
- Opponent
- Score
- Inning
- Baserunners
- Outs
- Current/due-up batter