Exploratory data analysis and baseball (aka Moneyball)

January 3, 2017


Executive Summary

In any professional sport, how well teams spend their money often means the difference between a championship and a flop. Baseball is no different; it is the sport that introduced the concepts of professionalism and moneyball.

For those who are not used to the term, moneyball describes baseball operations in which a team analyzes the market for baseball players, buying those who are undervalued and selling those who are overvalued. Contrary to a common misconception, it is not about on-base percentage (a measure of how often a batter reaches base for any reason other than a fielding error, fielder’s choice, dropped/uncaught third strike, fielder’s obstruction, or catcher’s interference), but about exploring methods of rating players.
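For readers who want to see the arithmetic behind on-base percentage, here is a minimal sketch of the standard formula, (H + BB + HBP) / (AB + BB + HBP + SF). The function name and the season line are made up for illustration; they do not come from either data set.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """Standard OBP formula: (H + BB + HBP) / (AB + BB + HBP + SF)."""
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    if opportunities == 0:
        # A batter with no qualifying appearances has no meaningful OBP.
        return 0.0
    return (hits + walks + hit_by_pitch) / opportunities


# Hypothetical season line, invented for this example (not a real player).
obp = on_base_percentage(hits=150, walks=70, hit_by_pitch=5,
                         at_bats=500, sacrifice_flies=5)
print(f"{obp:.3f}")  # 225 / 580, prints 0.388
```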

The term is most commonly used to refer to the strategy of the front office of the 2002 Oakland Athletics, who, with approximately US$44 million in salary, were competitive with larger-market teams such as the New York Yankees, who spent over US$125 million in payroll that same season. It derives its name from the 2003 book by Michael Lewis about the team’s analytical, evidence-based, sabermetric approach. There is also a 2011 motion picture of the same name, based on the book and starring Brad Pitt and Jonah Hill, through which the term became mainstream.

The data

I will be using data from two very useful databases on baseball teams, players and seasons. One is curated by Sean Lahman and is available at http://www.seanlahman.com/baseball-archive/statistics/. The other is from the nutshell package, which contains the data sets used as examples in the book “R in a Nutshell” by Joseph Adler. More information about the package is available at https://cran.r-project.org/web/packages/nutshell/index.html.

I picked two data sets instead of one because I wanted to perform the analysis on different sources. The decision also paid off in speed and practicality. The Lahman data set contains pitching, hitting and fielding performance data, along with other tables, from 1871 through 2015. As we can see, it is thorough and up to date. The nutshell data set, on the other hand, is better suited to learning (at least in my opinion) and comprises statistical data from 2000 to 2008 for every Major League Baseball team.

For those who are not familiar with baseball, a few points of explanation are important:

  • Major League Baseball is a professional baseball league, where teams pay players to play baseball (I know it sounds silly and redundant, but I have to be sure everybody knows what we are talking about here).
  • The goal of each team is to win as many games as possible out of a 162-game season. This earns a ticket to the postseason and a chance to play in the World Series, where the champion is decided.
  • Teams win games by scoring more runs than their opponent. A run is scored when a player advances around first, second and third base and returns safely to home plate (in other words, completes a circuit of the infield).
  • In principle, better players are expensive, so teams that want good players need to spend more money.
  • Teams that spend the most frequently win the most (not always, but often enough that it is fair to consider it a case of cause and effect).


I provide the analysis of both data sets in a Markdown page that can be accessed at @marcelo_tibau/exploratory-and-baseball

An application

One of the reasons I chose the nutshell data set is that it is used as a case study in the book “R in a Nutshell” by Joseph Adler. Inspired by this case, I developed a simple app that uses a linear model to predict the number of runs scored by a team. For those curious to see it, a demo of the app can be found at @baseball-prediction
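To give a feel for what a linear model for runs looks like, here is a minimal sketch in plain Python. It is not the code behind the app, nor the model from the book. It assumes ordinary least squares with two predictors (singles and home runs), and it fits toy numbers generated from a known rule, so the recovered coefficients can be checked by eye. The function names and all values are hypothetical.

```python
def fit_linear_model(rows, targets):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: one list of feature values per team season; an intercept
    column is added automatically.
    Returns [intercept, coef_1, ..., coef_k].
    """
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, targets))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict_runs(coefficients, features):
    """Apply the fitted model to one team's season totals."""
    return coefficients[0] + sum(c * f for c, f in zip(coefficients[1:], features))


# Toy team-season totals as (singles, home_runs). Invented numbers, not real data.
toy_features = [(900, 150), (950, 180), (1000, 120),
                (1020, 200), (880, 210), (970, 160)]
# Toy targets built from a known rule: runs = 50 + 0.5 * singles + 1.4 * home_runs.
toy_runs = [50 + 0.5 * s + 1.4 * hr for s, hr in toy_features]

coefs = fit_linear_model(toy_features, toy_runs)
print([round(c, 3) for c in coefs])
print(round(predict_runs(coefs, (940, 170)), 1))
```

Because the toy targets follow the rule exactly, the fit should recover an intercept near 50 and slopes near 0.5 and 1.4. Real season data is noisy, so a model fitted to it would only approximate the run totals.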

