Exploring the Lahman Database Pt. I

Play Ball!

I think some of the appeal that baseball has always had for me comes from the data and stats that are intertwined in the game. I grew up playing little league and supporting the Seattle Mariners (sorry Jays, west coast proximity factor). I went to my first MLB game in 2001 at Safeco Field and instantly fell in love with the magic of the big leagues. More than just the hot dogs and garlic fries at the ballpark, I found my attention constantly being drawn to the player’s stats - batting averages, doubles, triples, home runs - displayed prominently on the Jumbotron scoreboard, and the giant LED pitch speed clocks scattered around the park. In 2011, I watched Moneyball with my dad, and like many others, was fascinated by the analytical approach that teams were beginning to take to gain a competitive edge in professional baseball. Although my support for the Mariners has not changed since that first day at the ballpark, the way I began to view stats did. I stumbled upon the world of sabermetrics and began to talk about players in terms of on-base percentage (OBP) and wins above replacement (WAR), instead of the traditional counting stats like hits and RBIs.

Now that I am beginning to learn R, my interest for exploring these stats and creating informative visualizations has been amplified. I figured it would be useful exercise to derive some of the modern baseball statistics like Weighted Runs Created Plus (wRC+) and FIP (Fielding Independant Pitching) from counting stats like hits and walks. There are great packages like baseballr that scrape this information for use but I wanted to have a go at it from first principles for the benefit of my understanding, so that’s what I’ll present here. The lahman package in R does not yet contain data from the 2017 so instead I’ll manually grab the csv files I need from the appropriate github repo. Necessary constants and weights will be obtained from FanGraphs. I haven’t had any experience scraping HTML tables yet so this will be done using the less glamorous export to csv functions on the website. Once I have all of the necessary tables, I use dplyr and its excellent relational table functions to extract and combine all of the information I need. Then comes the big mutate code block where I calculate all of the statistics. For an explanation of some of the more complicated ones, see below. Here is a sample of the lahman ‘Batting’ database.

Lahman Batting Database

## # A tibble: 20 x 22
##    playerID yearID stint teamID lgID      G    AB     R     H  `2B`  `3B`
##    <chr>     <dbl> <dbl> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 blackch~   2017     1 COL    NL      159   644   137   213    35    14
##  2 altuvjo~   2017     1 HOU    AL      153   590   112   204    39     4
##  3 gordode~   2017     1 MIA    NL      158   653   114   201    20     9
##  4 inciaen~   2017     1 ATL    NL      157   662    93   201    27     5
##  5 hosmeer~   2017     1 KCA    AL      162   603    98   192    31     1
##  6 andruel~   2017     1 TEX    AL      158   643   100   191    44     4
##  7 ozunama~   2017     1 MIA    NL      159   613    93   191    30     2
##  8 abreujo~   2017     1 CHA    AL      156   621    95   189    43     6
##  9 lemahdj~   2017     1 COL    NL      155   609    95   189    28     4
## 10 arenano~   2017     1 COL    NL      159   606   100   187    43     7
## 11 ramirjo~   2017     1 CLE    AL      152   585   107   186    56     6
## 12 schoojo~   2017     1 BAL    AL      160   622    92   182    35     0
## 13 vottojo~   2017     1 CIN    NL      162   559   106   179    34     1
## 14 lindofr~   2017     1 CLE    AL      159   651    99   178    44     4
## 15 cainlo01   2017     1 KCA    AL      155   584    86   175    27     5
## 16 murphda~   2017     1 WAS    NL      144   534    94   172    43     3
## 17 garciav~   2017     1 CHA    AL      136   518    75   171    27     5
## 18 jonesad~   2017     1 BAL    AL      147   597    82   170    28     1
## 19 yelicch~   2017     1 MIA    NL      156   602   100   170    36     2
## 20 merriwh~   2017     1 KCA    AL      145   587    80   169    32     6
## # ... with 11 more variables: HR <dbl>, RBI <dbl>, SB <dbl>, CS <dbl>,
## #   BB <dbl>, SO <dbl>, IBB <dbl>, HBP <dbl>, SH <dbl>, SF <dbl>,
## #   GIDP <dbl>
batting_processed <- batting %>%
  left_join(teams %>% select(teamID, yearID, name, BPF), by = c("teamID", "yearID")) %>%
  left_join(wrc_chart, by = c("yearID", "lgID")) %>%
  left_join(woba_weights, by = c("yearID" = "Season")) %>%
  mutate(PA = AB + BB + IBB + SH + SF + HBP,
         TB = H + `2B` + 2 * `3B` + 3 * HR,
         BA = ifelse(AB > 0, round(H/AB, 3), NA), 
         OBP = ifelse(AB > 0, round((H + BB + IBB + HBP)/PA, 3), NA), #ifelse statement deals with AB = 0 case
         SA = ifelse(AB > 0, round(TB/AB, 3), NA),
         OPS = OBP + SA,
         slash = ifelse(!is.na(BA), paste(BA, OBP, SA, sep = "/"), NA),
         BABIP = ifelse(AB > 0, round((H-HR)/(AB-SO-HR+SF), 3), NA),
         wOBA = ifelse(AB > 0, (wBB*(BB-IBB) + wHBP*HBP + w1B*(H-`2B`-`3B`-HR) + w2B*`2B` + w3B*`3B` + wHR*HR) /(AB + BB - IBB + SF + HBP), NA),
         wRAA = round(((wOBA - wOBALeague)/wOBAScale) * PA, 0), 
         WRC = round((wOBA - wOBALeague)/wOBAScale + RPALeague*PA, 0),
         BPF = BPF/100, #BPF needs to be a fraction of 1 for calculations.
         WRCplus = round((((wRAA/PA + RPALeague) + (RPALeague - BPF*RPALeague))/(wRCLeague/PALeague)*100), 0),
         wOBA = round(wOBA, 3)) %>%
  select(-(name:RPALeague)) #drop weight variables when we're finished


batting_processed <- batting_processed %>%   #add in readable names from players database
  left_join(players %>% select(playerID, nameFirst, nameLast), by = "playerID") %>%
  mutate(Name = paste(nameLast, nameFirst, sep = ", ")) %>%
  select(-nameFirst, -nameLast)

Let me unpack the code chunk above a little bit. The batting database comes with a number of counting statistics from which all other statistics can be calculated. These are:

  • AB: At Bats
  • H: Total Hits
  • 2B
  • 3B
  • HR
  • BB: Walks (Base on Balls)
  • IBB: Intentional Walks
  • HBP: Hit by Pitch
  • SH: Sacrifice Hits
  • SF: Sacrifice Flies

From these we are able to calculate statistics that give us a better idea of a player’s value to their team. For a full explanation, each term links to the FanGraphs glossary page which includes the equation for its calculation as well as examples.

  • BABIP: Batting Average on Balls in Play - When we compare this value to a player’s batting average overall, we get a sense of how lucky or unlucky a player has been. I.e. when playing against a team with better defensive players this will be lower because they would make outs on plays that may normally drop in for a hit.

  • wOBA: Weighted on-base average gives empirical weights to each possible outcome a batter can have instead of assigning 1 to a single, 2 to a double, etc.

  • wRAA: Weighted Runs Above Average is a stat that attempts to quanitfy how many more runs a player contributed to their team than an average player would. This value can be negative or positive and uses the same weighting values as wOBA. The underlying principle behind sabermetrics is that if we can effectively quantify how many extra runs a player is worth, then we can infer how many extra wins a player can contribute to a team, the ultimate goal.

  • wRC+: Takes wRAA values and adjusts them for the current years’ run scoring environment as well as the hitter friendliness of the home park that they play in. It is also adjusted so that a WRC+ of 100 is league average, and someone with a wRC+ of 130 is considered to have had a 30% greater season at the plate than average.

One problem that I am still working on dealing with in this process is players that spent time with multiple teams during one season. These stints are each given a separate entry in the batting database and will need to be aggregated while accounting for different ball park factors etc. when calculating things like WRC+. With these stats we can start to imagine visualizations that we can create that contain much more information than something like home run or stolen base leaderboards.

I’ll present a couple of examples to give an idea of some of the directions we can explore with this dataset. The first one is the WRCplus profile of a trio of Seattle Mariner greats across their careers. Edgar and Griffey were perennial power and gap hitters, whereas Ichiro was more of a singles hitter and speedster. As we can see, the WRC+ metric seems to favour the power hitter profile, something worth exploring in later posts.

This next graph highlights the BABIP metric for the 2008 season, which if you recall can be thought of as a proxy for how lucky/unlucky a player has been in a season. I’ve added labels to the top 5 luckiest & unluckiest players based on a ratio of BABIP/BA. Players above the regression line were generally luckier than those below.

This ended up being mostly about Batting stats, so I will make a separate post addressing some of the popular modern pitching statistics and how to calculate them using publically available data. Future posts will dive a bit deeper into some player analysis, and predictive modelling for the upcoming 2019 season. As always, code & data will be available on my Github if you would like to explore for yourself.


Recent posts

Exploring the Lahman Database Pt. I

Wind Turbines in the US

My First Post!

My First Shiny App


Archives

2018 (3)
0001 (1)