Data Cleaning: FiveThirtyEight

Overview

This project is part of the bigger project of which I thoroughly walked through steps in data cleaning & wrangling process. I then did some descriptive explorations of the dataset. In this project, I used the pols-month.csv, unemployment.csv, and snp.csv data from FiveThirtyEight. This project utilized R version 4.3.1, and R packages tidyverse and dplyr.

library(tidyverse)
library(dplyr)

STEP 1: Clean dataset pols-month.csv

Directions: Use separate() to break up the variable mon into integer variables year, month, and day; replace month number with month name; create a president variable taking values gop and dem, and remove prez_dem and prez_gop; and remove the day variable.

pols_month <- 
  read_csv("./fivethirtyeight_datasets/pols-month.csv") %>% 
  separate(mon, 
         into = c("year","month_num","day"), 
         convert = TRUE) %>% 
  mutate("month" = month.name[month_num]) %>%
  mutate(president = recode(prez_gop, "0" = "dem", "1" = "gop", "2" = "gop")) %>%
  select(year, month, everything(), -day, -prez_gop, -prez_dem, -month_num) 
pols_month
## # A tibble: 822 × 9
##     year month     gov_gop sen_gop rep_gop gov_dem sen_dem rep_dem president
##    <int> <chr>       <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>    
##  1  1947 January        23      51     253      23      45     198 dem      
##  2  1947 February       23      51     253      23      45     198 dem      
##  3  1947 March          23      51     253      23      45     198 dem      
##  4  1947 April          23      51     253      23      45     198 dem      
##  5  1947 May            23      51     253      23      45     198 dem      
##  6  1947 June           23      51     253      23      45     198 dem      
##  7  1947 July           23      51     253      23      45     198 dem      
##  8  1947 August         23      51     253      23      45     198 dem      
##  9  1947 September      23      51     253      23      45     198 dem      
## 10  1947 October        23      51     253      23      45     198 dem      
## # ℹ 812 more rows

STEP 2: Clean dataset snp.csv

Directions: Use a similar process to the above. For consistency across datasets, arrange according to year and month, and organize so that year and month are the leading columns.

snp <- 
  read_csv("./fivethirtyeight_datasets/snp.csv") %>%
  separate(date, into = c("month", "day", "year"), sep = "/", convert = TRUE) %>%
  arrange(year, month) %>%
  mutate("month" = month.name[month]) %>%
  select(year, month, close) 
snp
## # A tibble: 787 × 3
##     year month     close
##    <int> <chr>     <dbl>
##  1     0 January   1394.
##  2     0 February  1366.
##  3     0 March     1499.
##  4     0 April     1452.
##  5     0 May       1421.
##  6     0 June      1455.
##  7     0 July      1431.
##  8     0 August    1518.
##  9     0 September 1437.
## 10     0 October   1429.
## # ℹ 777 more rows

STEP 3: Tidy dataset unemployment.csv

Directions: Tidy the unemployment data so that it can be merged with the previous datasets. This process will involve switching from “wide” to “long” format; ensuring that key variables have the same name; and ensuring that key variables take the same values.

unemployment <- 
  read_csv("./fivethirtyeight_datasets/unemployment.csv") %>%
  pivot_longer(Jan:Dec, names_to = "month_abb", values_to = "unemployment") %>%
  mutate("month" = factor(month_abb, levels = month.abb, labels = month.name)) %>%
  rename(year = Year) %>%
  select(year, month, unemployment)
unemployment
## # A tibble: 816 × 3
##     year month     unemployment
##    <dbl> <fct>            <dbl>
##  1  1948 January            3.4
##  2  1948 February           3.8
##  3  1948 March              4  
##  4  1948 April              3.9
##  5  1948 May                3.5
##  6  1948 June               3.6
##  7  1948 July               3.6
##  8  1948 August             3.9
##  9  1948 September          3.8
## 10  1948 October            3.7
## # ℹ 806 more rows

STEP 4: Merge datasets

Directions: Join the datasets by merging snp into pols, and merging unemployment into the result.

data_q1 <- full_join(pols_month, snp, by = c("year","month")) %>%
  full_join(., unemployment, by = c("year","month"))
summary(data_q1)
##       year         month              gov_gop         sen_gop    
##  Min.   :   0   Length:1615        Min.   :12.00   Min.   :32.0  
##  1st Qu.:  68   Class :character   1st Qu.:18.00   1st Qu.:42.0  
##  Median :1948   Mode  :character   Median :22.00   Median :46.0  
##  Mean   :1044                      Mean   :22.48   Mean   :46.1  
##  3rd Qu.:1982                      3rd Qu.:28.00   3rd Qu.:51.0  
##  Max.   :2015                      Max.   :34.00   Max.   :56.0  
##                                    NA's   :793     NA's   :793   
##     rep_gop         gov_dem        sen_dem         rep_dem   
##  Min.   :141.0   Min.   :17.0   Min.   :44.00   Min.   :188  
##  1st Qu.:176.0   1st Qu.:22.0   1st Qu.:48.00   1st Qu.:211  
##  Median :195.0   Median :28.0   Median :53.00   Median :250  
##  Mean   :194.9   Mean   :27.2   Mean   :54.41   Mean   :245  
##  3rd Qu.:222.0   3rd Qu.:32.0   3rd Qu.:58.00   3rd Qu.:268  
##  Max.   :253.0   Max.   :41.0   Max.   :71.00   Max.   :301  
##  NA's   :793     NA's   :793    NA's   :793     NA's   :793  
##   president             close          unemployment  
##  Length:1615        Min.   :  17.05   Min.   : 2.50  
##  Class :character   1st Qu.:  83.73   1st Qu.: 4.70  
##  Mode  :character   Median : 138.53   Median : 5.60  
##                     Mean   : 474.89   Mean   : 5.83  
##                     3rd Qu.: 941.79   3rd Qu.: 6.90  
##                     Max.   :2107.39   Max.   :10.80  
##                     NA's   :828       NA's   :805

pols_month There are 822 observations and 9 variables in the cleaned pols_month dataset, ranges between years of 1947 and 2015. Variables are: year, month, gov_gop, sen_gop, rep_gop, gov_dem, sen_dem, rep_dem, president.

snp There are 787 observations and 3 variables in the cleaned snp dataset, ranges between years of 0 and 99. Variables are: year, month, close.

unemployment There are 816 observations and 3 variables in the cleaned unemployment dataset, ranges between years of 1948 and 2015. Variables are: year, month, unemployment.

merged dataset data_q1 There are a total of 1615 observations and 11 variables in the final merged dataset, ranges between years of 0 and 2015. Variables are: year, month, gov_gop, sen_gop, rep_gop, gov_dem, sen_dem, rep_dem, president, close, unemployment. Note that I used the full_join function with the combination of year and month as unique identifiers here. This way, all data entries from all tables will be included into the final merged dataset.

STEP 5: Save cleaned file

We can save our cleaned dataset as follows:

write.csv(data_q1, file = "data_q1.csv")