Overview
This project is part of the bigger project of which I thoroughly
walked through steps in data cleaning & wrangling process. I then
did some descriptive explorations of the dataset. In this project, I
used the pols-month.csv, unemployment.csv, and snp.csv data from FiveThirtyEight.
This project utilized R version 4.3.1, and R packages
tidyverse
and dplyr
.
STEP 1: Clean dataset pols-month.csv
Directions: Use separate()
to break up the variable mon
into integer variables year, month, and day;
replace month number with month name; create a president
variable taking values gop and dem, and remove
prez_dem and prez_gop; and remove the day
variable.
pols_month <-
read_csv("./fivethirtyeight_datasets/pols-month.csv") %>%
separate(mon,
into = c("year","month_num","day"),
convert = TRUE) %>%
mutate("month" = month.name[month_num]) %>%
mutate(president = recode(prez_gop, "0" = "dem", "1" = "gop", "2" = "gop")) %>%
select(year, month, everything(), -day, -prez_gop, -prez_dem, -month_num)
pols_month
## # A tibble: 822 × 9
## year month gov_gop sen_gop rep_gop gov_dem sen_dem rep_dem president
## <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 1947 January 23 51 253 23 45 198 dem
## 2 1947 February 23 51 253 23 45 198 dem
## 3 1947 March 23 51 253 23 45 198 dem
## 4 1947 April 23 51 253 23 45 198 dem
## 5 1947 May 23 51 253 23 45 198 dem
## 6 1947 June 23 51 253 23 45 198 dem
## 7 1947 July 23 51 253 23 45 198 dem
## 8 1947 August 23 51 253 23 45 198 dem
## 9 1947 September 23 51 253 23 45 198 dem
## 10 1947 October 23 51 253 23 45 198 dem
## # ℹ 812 more rows
STEP 2: Clean dataset snp.csv
Directions: Use a similar process to the above. For consistency across datasets, arrange according to year and month, and organize so that year and month are the leading columns.
snp <-
read_csv("./fivethirtyeight_datasets/snp.csv") %>%
separate(date, into = c("month", "day", "year"), sep = "/", convert = TRUE) %>%
arrange(year, month) %>%
mutate("month" = month.name[month]) %>%
select(year, month, close)
snp
## # A tibble: 787 × 3
## year month close
## <int> <chr> <dbl>
## 1 0 January 1394.
## 2 0 February 1366.
## 3 0 March 1499.
## 4 0 April 1452.
## 5 0 May 1421.
## 6 0 June 1455.
## 7 0 July 1431.
## 8 0 August 1518.
## 9 0 September 1437.
## 10 0 October 1429.
## # ℹ 777 more rows
STEP 3: Tidy dataset unemployment.csv
Directions: Tidy the unemployment data so that it can be merged with
the previous datasets. This process will involve switching from
“wide”
to “long”
format; ensuring that key
variables have the same name; and ensuring that key variables take the
same values.
unemployment <-
read_csv("./fivethirtyeight_datasets/unemployment.csv") %>%
pivot_longer(Jan:Dec, names_to = "month_abb", values_to = "unemployment") %>%
mutate("month" = factor(month_abb, levels = month.abb, labels = month.name)) %>%
rename(year = Year) %>%
select(year, month, unemployment)
unemployment
## # A tibble: 816 × 3
## year month unemployment
## <dbl> <fct> <dbl>
## 1 1948 January 3.4
## 2 1948 February 3.8
## 3 1948 March 4
## 4 1948 April 3.9
## 5 1948 May 3.5
## 6 1948 June 3.6
## 7 1948 July 3.6
## 8 1948 August 3.9
## 9 1948 September 3.8
## 10 1948 October 3.7
## # ℹ 806 more rows
STEP 4: Merge datasets
Directions: Join the datasets by merging snp into pols, and merging unemployment into the result.
data_q1 <- full_join(pols_month, snp, by = c("year","month")) %>%
full_join(., unemployment, by = c("year","month"))
summary(data_q1)
## year month gov_gop sen_gop
## Min. : 0 Length:1615 Min. :12.00 Min. :32.0
## 1st Qu.: 68 Class :character 1st Qu.:18.00 1st Qu.:42.0
## Median :1948 Mode :character Median :22.00 Median :46.0
## Mean :1044 Mean :22.48 Mean :46.1
## 3rd Qu.:1982 3rd Qu.:28.00 3rd Qu.:51.0
## Max. :2015 Max. :34.00 Max. :56.0
## NA's :793 NA's :793
## rep_gop gov_dem sen_dem rep_dem
## Min. :141.0 Min. :17.0 Min. :44.00 Min. :188
## 1st Qu.:176.0 1st Qu.:22.0 1st Qu.:48.00 1st Qu.:211
## Median :195.0 Median :28.0 Median :53.00 Median :250
## Mean :194.9 Mean :27.2 Mean :54.41 Mean :245
## 3rd Qu.:222.0 3rd Qu.:32.0 3rd Qu.:58.00 3rd Qu.:268
## Max. :253.0 Max. :41.0 Max. :71.00 Max. :301
## NA's :793 NA's :793 NA's :793 NA's :793
## president close unemployment
## Length:1615 Min. : 17.05 Min. : 2.50
## Class :character 1st Qu.: 83.73 1st Qu.: 4.70
## Mode :character Median : 138.53 Median : 5.60
## Mean : 474.89 Mean : 5.83
## 3rd Qu.: 941.79 3rd Qu.: 6.90
## Max. :2107.39 Max. :10.80
## NA's :828 NA's :805
pols_month There are 822 observations and 9 variables in the cleaned pols_month dataset, ranges between years of 1947 and 2015. Variables are: year, month, gov_gop, sen_gop, rep_gop, gov_dem, sen_dem, rep_dem, president.
snp There are 787 observations and 3 variables in the cleaned snp dataset, ranges between years of 0 and 99. Variables are: year, month, close.
unemployment There are 816 observations and 3 variables in the cleaned unemployment dataset, ranges between years of 1948 and 2015. Variables are: year, month, unemployment.
merged dataset data_q1
There are a
total of 1615 observations and 11 variables in the final merged dataset,
ranges between years of 0 and 2015. Variables are: year, month,
gov_gop, sen_gop, rep_gop, gov_dem, sen_dem, rep_dem, president, close,
unemployment. Note that I used the full_join
function
with the combination of year and month as unique
identifiers here. This way, all data entries from all tables will be
included into the final merged dataset.