exploratory data analysis

Exploratory analysis and regression model helping fight Zika

March 2, 2017

Alerta Zika! was a collaborative event to explore the potential of data and technology to improve responses to the Zika virus (more information here). The Inter-American Development Bank organized it with the support of several partners, including Rio de Janeiro City Hall and some major universities based in the city. On December 2 and 3, 2016, about 10 registered teams explored epidemiological, environmental and social factors to understand and explain the progress of the disease. It was a day and a half of hard work to build on the efforts to fight Zika in Rio de Janeiro. We gained access to a dataset with all the cases of Zika, Dengue and Chikungunya registered in the city of Rio de Janeiro during 2015 and 2016. To get to know the data, our team started to ‘play’ with the dataset and check the variables. In doing so, we became curious about Zika’s evolution pattern and its role during the outbreak in the early months of 2016.

Our hypothesis was that the disease's propagation pattern and its correlations with time, city areas and weather could be used as an indicator of where and when the disease spreads, helping city officials decide how best to allocate resources. We therefore set as our goal the creation of a map of Rio de Janeiro showing the historical evolution of Zika over time and temperature. During several conversations with representatives from the municipal health department, we wondered whether a social development indicator could provide insights into the spreading pattern. We decided to include the HDI (Human Development Index), known in Brazil as IDH, as a social parameter.

We then defined as our target variables the coordinates (latitude and longitude), the dates on which the cases occurred, the temperature over the seasons and Rio's social development indicators for income, education level and longevity. Our preliminary tasks were to create a grid covering the map of Rio de Janeiro and a data frame aggregating the variables taken from the different datasets. Our first goal in the exploratory analysis was to explore the shape of the distributions. The grid helped us check where the cases were located, exploring areas of about 400 meters (the mosquito's range), as well as cluster the patients’ cases into broader areas. It also allowed us to check how far the disease spread throughout the city and to identify the areas where most of the cases took place.
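As a rough illustration of the grid idea, the sketch below assigns each case to a square cell and counts the cases per cell. It is not the code we ran during the event: the 400-meter cell size is our reading of the mosquito range mentioned above, the grid origin and sample coordinates are made up, and the names `grid_cell` and `count_cases_per_cell` are hypothetical.

```python
import math
from collections import Counter

CELL_METERS = 400.0              # assumed cell size, based on the mosquito range
METERS_PER_DEG_LAT = 111_320.0   # approximate length of one degree of latitude

def grid_cell(lat, lon, origin_lat, origin_lon, cell_m=CELL_METERS):
    """Return the (row, col) of the grid cell that contains a point."""
    # One degree of longitude shrinks with the cosine of the latitude.
    meters_per_deg_lon = METERS_PER_DEG_LAT * math.cos(math.radians(origin_lat))
    row = math.floor((lat - origin_lat) * METERS_PER_DEG_LAT / cell_m)
    col = math.floor((lon - origin_lon) * meters_per_deg_lon / cell_m)
    return row, col

def count_cases_per_cell(cases, origin_lat, origin_lon):
    """Count cases per grid cell; `cases` is an iterable of (lat, lon) pairs."""
    return Counter(grid_cell(lat, lon, origin_lat, origin_lon) for lat, lon in cases)

# Made-up coordinates near Rio de Janeiro, for illustration only.
origin = (-23.10, -43.80)  # hypothetical south-west corner of the grid
sample_cases = [(-22.9000, -43.2000), (-22.9001, -43.2001), (-22.8500, -43.3000)]

counts = count_cases_per_cell(sample_cases, *origin)
hotspots = counts.most_common(1)  # the cell(s) with the most cases
```

Counting per cell makes it easy to rank areas by number of cases, and neighboring cells can be merged to look at broader areas.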

Performing a time-series analysis, we were able to identify a correlation between temperature and the number of cases. At this point, it is valuable to understand the mosquito's life cycle. Aedes aegypti flourishes at temperatures between 23 and 28 degrees Celsius (about 73 to 82 degrees Fahrenheit). A few degrees below or above this range does not necessarily kill the mosquito, but it makes the environment less favorable to its development and so slows its life cycle. From the egg to the moment the mosquito can transmit the virus to a person, there is a period of 20 to 25 days, so the previous month's mosquito is responsible for the current month's patient. As can be seen in the plots below, which compare the disease cases across the city by month with the daily temperature variation of the previous month, the outbreak during March and April (plots 3 and 4) follows the ideal conditions observed throughout February and March, when temperatures stayed within the 23 to 28 degree Celsius range on most days. The red circles correspond to the areas with the majority of cases.

Plot 1

Plot 2

Plot 3

Plot 4

Plot 5

Plot 6

Plot 7

This led us to our first meaningful insight: the temperature from the previous month seems to affect the number of cases in the current month.
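One way to check this kind of lagged relationship is to count, for each month, the days inside the 23 to 28 degree Celsius range and correlate that count with the cases of the following month. The sketch below does this with a plain Pearson correlation. The monthly numbers are invented for illustration and the function names are hypothetical; this is not the exact procedure we used.

```python
import math

def favorable_days(daily_mean_temps, low=23.0, high=28.0):
    """Count the days whose mean temperature falls within [low, high] Celsius."""
    return sum(1 for t in daily_mean_temps if low <= t <= high)

def lagged_pairs(favorable_by_month, cases_by_month, lag=1):
    """Pair each month's cases with the favorable-day count `lag` months earlier.

    `lag` must be at least 1.
    """
    return list(zip(favorable_by_month[:-lag], cases_by_month[lag:]))

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented monthly values, for illustration only.
favorable_by_month = [10, 25, 27, 12, 5]   # days inside the range per month
cases_by_month = [50, 80, 300, 340, 120]   # registered cases per month

pairs = lagged_pairs(favorable_by_month, cases_by_month, lag=1)
r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
```

A value of `r` close to 1 on real data would support the insight, although correlation alone does not establish that temperature causes the cases.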

As we shifted our attention to the social indicator data at hand, we identified a curious pattern. Some critical areas during the outbreak shared a low IDH coefficient. The plot below provides visual support. The size of the orange circles relates to the level of IDH: the smaller the circle, the lower the IDH, and vice versa.

Plot 8

The highlighted areas on the plots above correspond to Maré (a neighborhood), the far-north zone and the far-west zone of the city. These areas have lower social indicators compared to other parts of the municipality.

Income does not seem to be a social factor affecting the outbreak, as can be seen in the plot below, which compares the south zone, the wealthiest part of the city, with the cases in March and April. There is, however, a peculiarity to consider. This particular area has a huge economic disparity: the most exclusive addresses are within walking distance of some favelas (slums usually located on the hills around the area), where the IDH is similar to that of the areas in the previous plot.

Plot 9

From this observation, we drew a second meaningful insight: the social indicator (IDH) seems to be an influencing factor in the areas with the highest number of cases during the outbreak.

As we went further in our analysis, another curious pattern caught our attention. Even as the temperature dropped out of the 23 to 28 degree Celsius range, some areas kept appearing among those with the most cases (as can be seen in the comparisons below, from May to July).

Plot 10

What these areas have in common is that they are all surrounded by woods and forests. This common factor provided the third meaningful insight we delivered: some recurrent disease focus areas seem to develop around or close to wooded and forested areas.

Exploratory analysis is usually a good start for predictive modeling because it helps to understand the datasets a little further and to summarize their main characteristics. Exploring the data and formulating hypotheses that could lead to new data collection and experiments is a major part of extracting usable information from data, suggesting hypotheses and supporting the selection of appropriate statistical tools and techniques. Our main goal at the data expedition was to take a first step toward understanding past behavior in order to prepare the ‘seeds’ for a future ‘crop’. Our third-place award was a source of pride for us and seems to indicate that this goal was accomplished.

After the Data Expedition

We continued exploring the data and aggregating other variables. Our goal was to build a predictive model that could add to the initial exploratory analysis. The new variables were population per neighborhood and rainfall. We also added more temperature data covering the final months of 2016 and early January 2017.

Our first choice was a simple linear regression using the population per neighborhood to predict the number of cases. Below are some code chunks in R and statistical readings (we intend to show more information in a markdown file, a type of file in which text, code and plots can be shown together).

The model:

A quick view of the dataset:

“bairro” stands for neighborhood; “casos_zika” for Zika cases; and “populacao” for population.
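For readers who prefer Python, here is a minimal sketch of the same kind of model: an ordinary least squares fit of “casos_zika” on “populacao”. It is not the R code we ran. The four data points are invented, and the helper names `fit_simple_ols` and `r_squared` are hypothetical.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Invented neighborhoods, for illustration only.
populacao = [10000, 20000, 30000, 40000]   # population per neighborhood
casos_zika = [15, 20, 35, 40]              # Zika cases per neighborhood

intercept, slope = fit_simple_ols(populacao, casos_zika)
r2 = r_squared(populacao, casos_zika, intercept, slope)
predicted_cases = intercept + slope * 25000   # hypothetical neighborhood
```

The slope reads as the expected number of additional cases per additional resident, and the R-squared value corresponds to the "Multiple R-squared" line in R's model summary.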

Some statistical readings from the RStudio console:

Diagnostic Plots

In plot 1 (Residuals vs. Fitted), part of the residuals are spread equally around a horizontal line, but there are also outliers. In plot 2 (Normal Q-Q), the residuals seem to be normally distributed, at least to some extent.

Plot 1

Plot 2

In plot 3 (Scale-Location), complementing plot 1, some residuals are spread equally along the range of the predictor, showing some homoscedasticity. Plot 4 (Residuals vs. Leverage) identified observations #120 and #23 as influential.

Plot 3

Plot 4
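The quantities behind plot 4 can also be computed directly. The sketch below, for a simple regression with one predictor, returns the leverage, the standardized residual and Cook's distance of each observation using the standard textbook formulas. The data is invented, `regression_diagnostics` is a hypothetical helper, and we are not claiming this reproduces how R labels points in its diagnostic plots.

```python
import math

def regression_diagnostics(x, y):
    """Leverage, standardized residual and Cook's distance per observation
    for the simple regression y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual standard error with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))
    rows = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - mean_x) ** 2 / sxx      # leverage
        r = e / (s * math.sqrt(1 - h))            # standardized residual
        cook = (r * r / 2) * h / (1 - h)          # 2 = number of coefficients
        rows.append({"leverage": h, "std_residual": r, "cooks_distance": cook})
    return rows

# Invented data, for illustration only.
populacao = [10000, 20000, 30000, 40000]
casos_zika = [15, 20, 35, 40]

diagnostics = regression_diagnostics(populacao, casos_zika)
total_leverage = sum(row["leverage"] for row in diagnostics)
```

Observations with both high leverage and a large standardized residual get a large Cook's distance, which is the usual reason a point stands out in a Residuals vs. Leverage plot.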

Based on the thesis that the mosquito has a faster cycle when the temperature stays between 23 and 28 degrees Celsius, we wanted to check whether rainfall also helps its proliferation. We then tried to identify the relationship between these two variables and the number of Zika cases. Our second choice was to use a multiple regression model to meet this goal. This analysis was performed in Python.
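A minimal sketch of such a multiple regression is shown below, written with the standard library only so it runs anywhere. It is an assumption-laden illustration and not our notebook code: the two predictors (days per week inside the temperature range and weekly rainfall in millimeters) are our reading of the variables, the weekly values are synthetic and exactly linear, and the function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_multiple_ols(rows, y):
    """Return [intercept, b1, b2, ...] from the normal equations X'X b = X'y."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Synthetic weekly data: (favorable temperature days, rainfall in mm).
features = [(1, 10), (3, 40), (5, 20), (7, 60), (6, 80), (2, 0)]
weekly_cases = [20, 55, 65, 105, 105, 25]   # built as 5 + 10*days + 0.5*rain

coefs = fit_multiple_ols(features, weekly_cases)   # [intercept, b_days, b_rain]
```

On real data the coefficients will not be this clean, and a library such as statsmodels or scikit-learn would normally replace the hand-written solver.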

The chart below shows that the months in which temperatures most often fell within the threshold are those between December 2015 and April 2016. We also noticed the trend occurring again in December 2016 and early January 2017.

Green and red lines: temperature threshold

Pink and grey lines: min and max temperature

The next sequence of plots shows a similar positive trend among the curves for the number of cases per week, the temperature and the rainfall. The analysis was based on the neighborhoods of Campo Grande (1), Santa Cruz (2) and Guaratiba (3), which were severely affected during the 2016 outbreak.

Blue line: cases per week

Red line: temperature under the threshold

Yellow line: rainfall

(1)

(2)

(3)

We decided to use the multiple regression model to build a predictive application. We tested the model through a series of plots comparing the actual data with the values the model predicted for a test dataset.
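Besides plots, the comparison between real and predicted values can be summarized with error metrics. The sketch below computes the mean absolute error and the root mean squared error on a held-out set. The coefficients and the test observations are hypothetical placeholders, not results from our model.

```python
import math

def predict(coefs, features):
    """Apply a fitted linear model: coefs = [intercept, b1, b2, ...]."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], features))

def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in number of cases."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error(actual, predicted):
    """Like the MAE, but penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical coefficients: intercept, favorable days, rainfall (mm).
coefs = [5.0, 10.0, 0.5]

# Hypothetical held-out weeks: features and the cases actually registered.
test_features = [(4, 30), (2, 10)]
real_cases = [66, 26]

predicted_cases = [predict(coefs, f) for f in test_features]
mae = mean_absolute_error(real_cases, predicted_cases)
rmse = root_mean_squared_error(real_cases, predicted_cases)
```

Keeping the test weeks out of the fitting step is what makes these numbers an honest estimate of how the model behaves on new data.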

Testing values Vs. Predicted values for Rio de Janeiro

Green dots: testing values

Gray dots: predicted values

Real Cases Vs. Predicted Cases for Rio de Janeiro

Green lines: Real Cases

Gray lines: Predicted Cases

Real Cases Vs. Predicted Cases Comparison for Rio de Janeiro – December 2015 & 2016


Green: Real Cases

Gray: Predicted Cases

In this particular case (plot above), the actual number of Zika cases in December 2016 was not available to us, so we could only predict the number of cases.

Analysis per neighborhood: Campo Grande.

Statistical Readings from Jupyter Notebook console:

Green line: real cases

Gray line: predicted cases

Analysis per neighborhood: Santa Cruz.

Statistical Readings

Green line: real cases

Gray line: predicted cases

Analysis per neighborhood: Guaratiba.

Statistical Readings

Green lines: testing values

Gray lines: predicted values

We created a prototype to apply the model. It’s a website with information about the number of Zika cases per month and graphics showing the actual cases and the predicted ones per neighborhood.

For those who would like to check it out, it’s available here.

Exploratory data analysis and baseball (aka Moneyball)

January 3, 2017

Executive Summary

In any professional sport, how well teams spend their money can mean the difference between a championship and a flop. It’s no different in baseball, the sport that introduced the concepts of professionalism and moneyball.

For those who are not used to the term, moneyball describes baseball operations in which a team endeavors to analyze the market for baseball players, buying those who are undervalued and selling those who are overvalued. Contrary to a common misconception, it is not about on-base percentage (a measure of how often a batter reaches base for any reason other than a fielding error, fielder’s choice, dropped/uncaught third strike, fielder’s obstruction, or catcher’s interference), but about exploring methods of rating players.
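For reference, on-base percentage is conventionally computed as hits plus walks plus hit-by-pitch, divided by at-bats plus walks plus hit-by-pitch plus sacrifice flies. The short sketch below applies that formula to an invented stat line; the function name is hypothetical.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    denominator = at_bats + walks + hit_by_pitch + sacrifice_flies
    if denominator == 0:
        return 0.0   # no plate appearances that count toward OBP
    return (hits + walks + hit_by_pitch) / denominator

# Invented season line, for illustration only.
obp = on_base_percentage(hits=150, walks=60, hit_by_pitch=5,
                         at_bats=500, sacrifice_flies=5)
```

A statistic like this is just one of many possible ways of rating players, which is the broader point of moneyball.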

The term is most commonly used to refer to the strategy of the front office of the 2002 Oakland Athletics, who, with approximately US$44 million in salary, were competitive with larger-market teams such as the New York Yankees, who spent over US$125 million on payroll that same season. It derives its name from the 2003 book by Michael Lewis about the team’s analytical, evidence-based, sabermetric approach. There is also a 2011 motion picture of the same name, based on the book and starring Brad Pitt and Jonah Hill, through which the term became mainstream.

The data

I will be using data from two very useful databases on baseball teams, players and seasons. One is curated by Sean Lahman and is available at http://www.seanlahman.com/baseball-archive/statistics/. The other is from the nutshell package, which contains data sets used as examples in the book “R in a Nutshell” by Joseph Adler. More information about the package is available at https://cran.r-project.org/web/packages/nutshell/index.html.

The reason for picking two different datasets instead of one is that I wanted to perform the analysis on different sources. The decision also proved right in terms of speed and practicality. The Lahman data set contains data on pitching, hitting and fielding performance, along with other tables, from 1871 through 2015. As we can see, it is thorough and up to date. The nutshell data set, on the other hand, is better suited to learning (at least in my opinion) and comprises statistical data from 2000 to 2008 for every Major League Baseball team.

For those who are not familiar with baseball, a few points of explanation are important:

Major League Baseball is a professional baseball league, where teams pay players to play baseball (I know it sounds silly and redundant, but I have to be sure everybody knows what we are talking about here).
The goal of each team is to win as many games as possible out of a 162-game season. This earns a ticket to the postseason and a chance to play in the World Series, where the champion is decided.
Teams win games by scoring more runs than their opponent. A run is scored when a player advances around first, second and third base and returns safely to home plate (in other words, completes a circuit of the infield).
In principle, better players are expensive, so teams that want good players need to spend more money.
Teams that spend the most frequently win the most (not always, but often enough that it is fair to consider it a case of cause and effect).

Analysis

I provide the analysis of both data sets in a Markdown page that can be accessed at @marcelo_tibau/exploratory-and-baseball

An application

One of the reasons I chose the nutshell data set is that it is used as a case study in the book “R in a Nutshell” by Joseph Adler. Inspired by this case, I developed a simple app that predicts the number of runs scored by a team based on a linear model. For those curious to see it, a demo of the app can be found at @baseball-prediction