Overview
This project is part of the bigger project of which I thoroughly
walked through steps in data cleaning & wrangling process. I then
did some descriptive explorations of the dataset. In this project, I
used the data from Mr. Trash
Wheel. This project utilized R version 4.3.1, and R packages
tidyverse
and dplyr
.
STEP 1: read & clean Mr. Trash Wheel sheet
Directions:
- specify the sheet in the Excel file and to omit non-data entries
(rows with notes / figures; columns containing notes) using arguments in
read_excel
- use reasonable variable names
- omit rows that do not include dumpster-specific data
- The data include a column for the (approximate) number of homes powered. This calculation is described in the Homes powered note, but not applied to every row in the dataset. Update the data to include a new homes_powered variable based on this calculation.
mr_trash_wheel <-
readxl::read_excel("202309\ Trash\ Wheel\ Collection\ Data.xlsx",
sheet = "Mr. Trash Wheel", col_names = TRUE, range = "A2:N586") %>%
janitor::clean_names() %>%
mutate("day" = format(date, format="%d")) %>%
mutate("year" = format(year, format="%Y")) %>%
select(dumpster, year, month, day, everything(), -date) %>%
mutate("homes_powered" = (weight_tons*500)/30) %>%
mutate("dumpster" = paste("mr", dumpster, sep="_"))
head(mr_trash_wheel)
## # A tibble: 6 × 14
## dumpster year month day weight_tons volume_cubic_yards plastic_bottles
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 mr_1 2014 May 16 4.31 18 1450
## 2 mr_2 2014 May 16 2.74 13 1120
## 3 mr_3 2014 May 16 3.45 15 2450
## 4 mr_4 2014 May 17 3.1 15 2380
## 5 mr_5 2014 May 17 4.06 18 980
## 6 mr_6 2014 May 20 2.71 13 1430
## # ℹ 7 more variables: polystyrene <dbl>, cigarette_butts <dbl>,
## # glass_bottles <dbl>, plastic_bags <dbl>, wrappers <dbl>,
## # sports_balls <dbl>, homes_powered <dbl>
I changed the date variable in this sheet to day as there are year and month variables. I also changed the format of year to character. The Mr. Trash Wheel sheet contains 14 variables and 584 observations. Variables in this sheet include: dumpster, year, month, day, weight_tons, volume_cubic_yards, plastic_bottles, polystyrene, cigarette_butts, glass_bottles, plastic_bags, wrappers, sports_balls, homes_powered. Variable dumpster is the primary key that contains values unique to each observation.
STEP 2: read & clean Professor Trash Wheel sheet
professor_trash_wheel <-
readxl::read_excel("202309\ Trash\ Wheel\ Collection\ Data.xlsx",
sheet = "Professor Trash Wheel", col_names = TRUE, range = "A2:M108") %>%
janitor::clean_names() %>%
mutate("day" = format(date, format="%d")) %>%
mutate("year" = format(year, format="%Y")) %>%
select(dumpster, year, month, day, everything(), -date) %>%
mutate("homes_powered" = (weight_tons*500)/30) %>%
mutate("dumpster" = paste("professor", dumpster, sep="_"))
head(professor_trash_wheel)
## # A tibble: 6 × 13
## dumpster year month day weight_tons volume_cubic_yards plastic_bottles
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 professor_1 2017 January 02 1.79 15 1950
## 2 professor_2 2017 January 30 1.58 15 9540
## 3 professor_3 2017 Februa… 26 2.32 18 8350
## 4 professor_4 2017 Februa… 26 3.72 15 8590
## 5 professor_5 2017 Februa… 28 1.45 15 7830
## 6 professor_6 2017 March 30 1.71 15 8210
## # ℹ 6 more variables: polystyrene <dbl>, cigarette_butts <dbl>,
## # glass_bottles <dbl>, plastic_bags <dbl>, wrappers <dbl>,
## # homes_powered <dbl>
I changed the date variable in this sheet to day as there are year and month variables. I also changed the format of year to character. The Professor Trash Wheel sheet contains 13 variables and 106 observations. Variables in this sheet include: dumpster, year, month, day, weight_tons, volume_cubic_yards, plastic_bottles, polystyrene, cigarette_butts, glass_bottles, plastic_bags, wrappers, homes_powered. Variable dumpster is the primary key that contains values unique to each observation.
STEP 3: read & clean Gwynnda Trash Wheel sheet
gwynnda_trash_wheel <-
readxl::read_excel("202309\ Trash\ Wheel\ Collection\ Data.xlsx",
sheet = "Gwynnda Trash Wheel", col_names = TRUE, range = "A2:L157") %>%
janitor::clean_names() %>%
mutate("day" = format(date, format="%d")) %>%
mutate("year" = format(year, format="%Y")) %>%
select(dumpster, year, month, day, everything(), -date) %>%
mutate("homes_powered" = (weight_tons*500)/30) %>%
mutate("dumpster" = paste("gwynnda", dumpster, sep="_"))
head(gwynnda_trash_wheel)
## # A tibble: 6 × 12
## dumpster year month day weight_tons volume_cubic_yards plastic_bottles
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 gwynnda_1 2021 July 03 0.93 15 1200
## 2 gwynnda_2 2021 July 07 2.26 15 2000
## 3 gwynnda_3 2021 July 07 1.62 15 1800
## 4 gwynnda_4 2021 July 16 1.76 15 1000
## 5 gwynnda_5 2021 July 30 1.53 15 2100
## 6 gwynnda_6 2021 August 11 2.06 15 2400
## # ℹ 5 more variables: polystyrene <dbl>, cigarette_butts <dbl>,
## # plastic_bags <dbl>, wrappers <dbl>, homes_powered <dbl>
I changed the date variable in this sheet to day as there are year and month variables. I also changed the format of year to character. The Professor Trash Wheel sheet contains 12 variables and 155 observations. Variables in this sheet include: dumpster, year, month, day, weight_tons, volume_cubic_yards, plastic_bottles, polystyrene, cigarette_butts, plastic_bags, wrappers, homes_powered. Variable dumpster is the primary key that contains values unique to each observation.
STEP 4: merge the three sheets
data_q2 <-
full_join(mr_trash_wheel, professor_trash_wheel) %>%
full_join(., gwynnda_trash_wheel)
summary(data_q2)
## dumpster year month day
## Length:845 Length:845 Length:845 Length:845
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## weight_tons volume_cubic_yards plastic_bottles polystyrene
## Min. :0.610 Min. : 5.00 Min. : 0 Min. : 0
## 1st Qu.:2.490 1st Qu.:15.00 1st Qu.:1000 1st Qu.: 280
## Median :3.070 Median :15.00 Median :1980 Median : 950
## Mean :3.009 Mean :15.13 Mean :2296 Mean : 1631
## 3rd Qu.:3.540 3rd Qu.:15.00 3rd Qu.:2900 3rd Qu.: 2400
## Max. :5.620 Max. :20.00 Max. :9830 Max. :11528
## NA's :1 NA's :1
## cigarette_butts glass_bottles plastic_bags wrappers
## Min. : 0 Min. : 0.00 Min. : 0 Min. : 180
## 1st Qu.: 3200 1st Qu.: 10.00 1st Qu.: 280 1st Qu.: 840
## Median : 5500 Median : 18.00 Median : 680 Median : 1380
## Mean : 15592 Mean : 20.89 Mean : 1082 Mean : 2330
## 3rd Qu.: 16000 3rd Qu.: 28.00 3rd Qu.: 1400 3rd Qu.: 2635
## Max. :310000 Max. :110.00 Max. :13450 Max. :20100
## NA's :1 NA's :156 NA's :1 NA's :118
## sports_balls homes_powered
## Min. : 0.00 Min. :10.17
## 1st Qu.: 6.00 1st Qu.:41.50
## Median :11.00 Median :51.17
## Mean :13.17 Mean :50.16
## 3rd Qu.:18.25 3rd Qu.:59.00
## Max. :56.00 Max. :93.67
## NA's :261
Note that I used the full_join
function here. This way,
all data entries from all tables will be included into the final merged
dataset. The merged data contains 14 variables and 845 observations.
Variables in this sheet include: dumpster, year, month, day,
weight_tons, volume_cubic_yards, plastic_bottles, polystyrene,
cigarette_butts, glass_bottles, plastic_bags, wrappers, sports_balls,
homes_powered. Variable dumpster is the primary
key that contains values unique to each observation.
The total weight of trash collected by Professor Trash Wheel is 216.26. The total number of cigarette butts collected by Gwynnda in July of 2021 is 16300.