Overview
This project is part of a larger project in which I walked through each
step of the data cleaning and wrangling process and then ran some
descriptive explorations of the dataset. Here, I used data collected in
an observational study, retrieved from HERE, to understand the
trajectory of Alzheimer’s disease (AD) biomarkers. The analysis used R
version 4.3.1 with the tidyverse (which includes dplyr) and janitor
packages.
STEP 1: Clean Baseline Dataset
Directions: Import, clean, and tidy the dataset of baseline demographics. Ensure that sex and APOE4 carrier status are appropriately encoded (i.e. not numeric), and remove any participants who do not meet the stated inclusion criteria (i.e. no MCI at baseline).
mci_baseline_description <- read_csv("./data_mci/MCI_baseline.csv") %>% colnames()
mci_baseline_description #checking variable descriptions
## [1] "...1"
## [2] "Age at the study baseline"
## [3] "1 = Male, 0 = Female"
## [4] "Years of education"
## [5] "1 = APOE4 carrier, 0 = APOE4 non-carrier"
## [6] "Age at the onset of MCI; missing if a subject remains MCI free during the follow-up period"
mci_baseline <-
  #use second row as header
  read_csv("./data_mci/MCI_baseline.csv", skip = 1) %>%
  #clean variable names to snake_case
  janitor::clean_names() %>%
  #recode sex & apoe4 as character variables
  mutate(sex = recode(sex, "1" = "Male", "0" = "Female")) %>%
  mutate(apoe4 = recode(apoe4, "1" = "APOE4 carrier", "0" = "APOE4 non-carrier")) %>%
  #convert age_at_onset to numeric first; "." marks participants who stayed MCI free
  mutate(age_at_onset = as.numeric(na_if(age_at_onset, "."))) %>%
  #keep participants who were MCI free at baseline (numeric, not string, comparison)
  filter(is.na(age_at_onset) | age_at_onset > current_age)
summary(mci_baseline)
## id current_age sex education
## Min. : 1.0 Min. :56.00 Length:479 Min. :12.0
## 1st Qu.:121.5 1st Qu.:63.15 Class :character 1st Qu.:16.0
## Median :242.0 Median :64.90 Mode :character Median :16.0
## Mean :242.0 Mean :65.03 Mean :16.4
## 3rd Qu.:362.5 3rd Qu.:67.00 3rd Qu.:18.0
## Max. :483.0 Max. :72.90 Max. :20.0
##
## apoe4 age_at_onset
## Length:479 Min. :61.20
## Class :character 1st Qu.:68.20
## Mode :character Median :70.20
## Mean :70.41
## 3rd Qu.:73.40
## Max. :77.20
## NA's :386
On import, the default header of the raw dataset turns out to hold descriptions of each variable rather than variable names. Printing the dataset shows that the actual header is the second row, which after cleaning reads: id, current_age, sex, education, apoe4, age_at_onset.
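The effect of `skip = 1` can be seen in a minimal sketch on a made-up, inline CSV. The two-row header and the values here are invented for illustration and are not the contents of MCI_baseline.csv; the sketch also assumes readr 2.0 or later, where `I()` marks literal data.

```r
library(readr)

# Made-up CSV: the first row holds descriptions, the second row the real names
toy_csv <- I("Study ID,Age at the study baseline\nid,current_age\n1,63.1\n2,65.6\n")

# Without skip, the description row is taken as the header
names_without_skip <- names(read_csv(toy_csv, show_col_types = FALSE))

# With skip = 1, the second row becomes the header
toy <- read_csv(toy_csv, skip = 1, show_col_types = FALSE)
names_with_skip <- names(toy)
```

Comparing `names_without_skip` with `names_with_skip` shows why the import above skips the first row.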
After applying the inclusion criteria, 479 participants remained, and 93 of them developed MCI during follow-up. The average baseline age is 65.03 years, with a range of 56 to 72.9. 30% of women are APOE4 carriers.
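Summary figures of this kind could be computed along the following lines. This is a sketch on a toy data frame with made-up values, not the code that produced the numbers above; `baseline_toy` and the result names are hypothetical.

```r
library(dplyr)

# Toy stand-in for the cleaned baseline data (values are invented)
baseline_toy <- tibble(
  id = 1:5,
  current_age = c(60, 62, 64, 66, 68),
  sex = c("Female", "Female", "Male", "Female", "Male"),
  apoe4 = c("APOE4 carrier", "APOE4 non-carrier", "APOE4 carrier",
            "APOE4 non-carrier", "APOE4 non-carrier"),
  age_at_onset = c(65, NA, 70, NA, NA)
)

# Number of participants, and how many have a recorded age at MCI onset
n_participants <- nrow(baseline_toy)
n_developed_mci <- sum(!is.na(baseline_toy$age_at_onset))

# Mean baseline age
mean_age <- mean(baseline_toy$current_age)

# Proportion of women who are APOE4 carriers
prop_female_carriers <- baseline_toy %>%
  filter(sex == "Female") %>%
  summarize(prop = mean(apoe4 == "APOE4 carrier")) %>%
  pull(prop)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which keeps the carrier calculation to a single line.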
STEP 2: Clean Amyloid Dataset
Directions: Import, clean, and tidy the dataset of longitudinally observed biomarker values; comment on the steps of the import process and the features of the dataset.
mci_amyloid_description <- read_csv("./data_mci/mci_amyloid.csv") %>% colnames()
mci_amyloid_description #checking variable descriptions
## [1] "Study ID"
## [2] "Time (in years) elapsed since the study baseline to the visit where biomarker Amyloid _ 42/40 ratio was measured"
## [3] "NA...3"
## [4] "NA...4"
## [5] "NA...5"
## [6] "NA...6"
mci_amyloid <-
  #use second row as header
  read_csv("./data_mci/mci_amyloid.csv", skip = 1) %>%
  #clean variable names to snake_case
  janitor::clean_names() %>%
  #rename variables to match baseline dataset
  rename(id = study_id, time_0 = baseline) %>%
  #convert the amyloid ratio columns from character to numeric
  mutate(across(time_0:time_8, as.numeric)) %>%
  #tidy dataset: one row per participant per visit
  pivot_longer(time_0:time_8, names_to = "years_elapsed_since_baseline", values_to = "amyroid_ratio", names_prefix = "time_")
summary(mci_amyloid)
## id years_elapsed_since_baseline amyroid_ratio
## Min. : 1.0 Length:2435 Min. :0.09938
## 1st Qu.:125.0 Class :character 1st Qu.:0.10752
## Median :248.0 Mode :character Median :0.10967
## Mean :248.6 Mean :0.10969
## 3rd Qu.:372.0 3rd Qu.:0.11187
## Max. :495.0 Max. :0.11871
## NA's :172
The tidied dataset has 2435 observations, 487 unique participants, and 3
variables: id, years_elapsed_since_baseline, and amyroid_ratio. As with
the baseline file, the default header of the raw dataset holds
descriptions rather than variable names, and printing the dataset shows
that the actual header is the second row. To tidy the dataset, I used
the pivot_longer function to turn the separate columns for each
follow-up time into rows.
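The wide-to-long reshaping can be illustrated on a small made-up table. The values and the `wide_toy` and `long_toy` names are hypothetical; the column names mirror those used in the import code.

```r
library(tidyr)
library(tibble)

# Toy wide table: one row per participant, one column per visit
wide_toy <- tibble(
  id = c(1, 2),
  time_0 = c(0.110, 0.112),
  time_2 = c(0.109, NA)
)

# Long table: one row per participant per visit; the "time_" prefix is dropped
long_toy <- pivot_longer(
  wide_toy,
  time_0:time_2,
  names_to = "years_elapsed_since_baseline",
  names_prefix = "time_",
  values_to = "amyroid_ratio"
)
```

Each participant contributes one row per visit column, so 2 participants with 2 visits give 4 rows, in the same way that 487 participants with 5 visits give 2435 rows.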
There are 8 participants who appear only in the baseline dataset and
not in the amyloid dataset, and 16 participants who appear only in the
amyloid dataset and not in the baseline dataset. To ensure that only
participants with entries in both datasets are included in the final
dataset, I will merge the two datasets using the inner_join function.
STEP 3: Merge the Two Datasets
Before merging, it is good practice to use anti_join to check whether
there are entries in one dataset that have no corresponding match in
the other, based on the specified key columns. This is especially
important when the datasets are linked by unique IDs.
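A minimal sketch of this check on toy data follows. The tibbles and values are made up, and `by = "id"` is spelled out here, whereas the message in the output below suggests the original call let dplyr pick the key automatically. The output below comes from the real datasets, not from this sketch.

```r
library(dplyr)

# Toy stand-ins for the two cleaned datasets (values are invented)
baseline_toy <- tibble(id = c(1, 2, 3, 4), current_age = c(60, 62, 64, 66))
amyloid_toy <- tibble(id = c(2, 3, 5), amyroid_ratio = c(0.110, 0.108, 0.112))

# Rows of the first table with no matching id in the second
only_in_baseline <- anti_join(baseline_toy, amyloid_toy, by = "id")
only_in_amyloid <- anti_join(amyloid_toy, baseline_toy, by = "id")
```

Running the check in both directions matters because `anti_join` only reports unmatched rows from its first argument.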
## Joining with `by = join_by(id)`
## # A tibble: 8 × 6
## id current_age sex education apoe4 age_at_onset
## <dbl> <dbl> <chr> <dbl> <chr> <dbl>
## 1 14 58.4 Female 20 APOE4 non-carrier 66.2
## 2 49 64.7 Male 16 APOE4 non-carrier 68.4
## 3 92 68.6 Female 20 APOE4 non-carrier NA
## 4 179 68.1 Male 16 APOE4 non-carrier NA
## 5 268 61.4 Female 18 APOE4 carrier 67.5
## 6 304 63.8 Female 16 APOE4 non-carrier NA
## 7 389 59.3 Female 16 APOE4 non-carrier NA
## 8 412 67 Male 16 APOE4 carrier NA
We see that there are 8 entries in mci_baseline that do not have a match in mci_amyloid. This hints at loss to follow-up and may raise concerns about selection bias. With this noted, let’s continue with the merge.
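The merge itself can be sketched with inner_join on toy data. The tibbles and the name `merged_toy` are hypothetical, and this is an illustration of the technique rather than the exact call used for the summary that follows, which comes from the real datasets.

```r
library(dplyr)

# Toy baseline table: one row per participant
baseline_toy <- tibble(id = c(1, 2, 3), sex = c("Female", "Male", "Female"))

# Toy long amyloid table: one row per participant per visit
amyloid_toy <- tibble(
  id = c(2, 2, 3, 3, 4, 4),
  years_elapsed_since_baseline = c("0", "2", "0", "2", "0", "2"),
  amyroid_ratio = c(0.110, 0.109, 0.112, NA, 0.108, 0.107)
)

# Keep only participants present in both tables
merged_toy <- inner_join(baseline_toy, amyloid_toy, by = "id")

n_rows <- nrow(merged_toy)
n_participants <- n_distinct(merged_toy$id)
```

Participant 1 (baseline only) and participant 4 (amyloid only) are dropped, and each remaining participant keeps one row per visit.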
## id current_age sex education
## Min. : 1.0 Min. :56.00 Length:2355 Min. :12.00
## 1st Qu.:122.0 1st Qu.:63.20 Class :character 1st Qu.:16.00
## Median :242.0 Median :64.90 Mode :character Median :16.00
## Mean :242.5 Mean :65.05 Mean :16.38
## 3rd Qu.:363.0 3rd Qu.:67.00 3rd Qu.:18.00
## Max. :483.0 Max. :72.90 Max. :20.00
##
## apoe4 age_at_onset years_elapsed_since_baseline
## Length:2355 Min. :61.20 Length:2355
## Class :character 1st Qu.:68.40 Class :character
## Mode :character Median :70.35 Mode :character
## Mean :70.51
## 3rd Qu.:73.60
## Max. :77.20
## NA's :1905
## amyroid_ratio
## Min. :0.09938
## 1st Qu.:0.10752
## Median :0.10967
## Mean :0.10969
## 3rd Qu.:0.11188
## Max. :0.11871
## NA's :167
There are 2355 observations, 471 unique participants, and 8 variables in the final merged dataset for MCI. The variables are id, current_age, sex, education, apoe4, age_at_onset, years_elapsed_since_baseline, and amyroid_ratio.
An Aβ42/40 ratio <0.150 suggests a higher risk of AD pathology compared to higher values. Two years after baseline, 89.6% of participants have an Aβ42/40 ratio <0.150. The proportion increased to 91.08% at 4 years, 91.72% at 6 years, and 92.36% at 8 years after baseline. Among females, the proportion with an Aβ42/40 ratio <0.150 at 8 years after baseline is 91.71%; among males, it is 92.86%.
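Proportions like these could be computed with a grouped summary. The sketch below uses made-up values and hypothetical names (`merged_toy`, `threshold`, `prop_below`). It also assumes that missing ratios are dropped from the denominator, which the text above does not state, so the original figures may have been computed differently.

```r
library(dplyr)

# Toy merged data: 4 participants, visits at 2 and 8 years (values are invented)
merged_toy <- tibble(
  id = rep(1:4, each = 2),
  sex = rep(c("Female", "Male", "Female", "Male"), each = 2),
  years_elapsed_since_baseline = rep(c("2", "8"), times = 4),
  amyroid_ratio = c(0.11, 0.12, 0.16, 0.14, 0.10, NA, 0.155, 0.151)
)

threshold <- 0.150

# Proportion below the threshold at each time point
# (assumption: NA ratios are excluded via na.rm = TRUE)
prop_by_year <- merged_toy %>%
  group_by(years_elapsed_since_baseline) %>%
  summarize(prop_below = mean(amyroid_ratio < threshold, na.rm = TRUE))

# Same proportion at 8 years, split by sex
prop_by_sex_year8 <- merged_toy %>%
  filter(years_elapsed_since_baseline == "8") %>%
  group_by(sex) %>%
  summarize(prop_below = mean(amyroid_ratio < threshold, na.rm = TRUE))
```

Dropping `na.rm = TRUE` would make any group with a missing ratio return `NA`, so the handling of missing values is worth stating alongside the reported percentages.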