Overview

This project is part of a larger project in which I walk through each step of the data cleaning and wrangling process and then carry out some descriptive exploration of the dataset. Here, I use data collected in an observational study to understand the trajectory of Alzheimer’s disease (AD) biomarkers, retrieved from HERE. The project uses R version 4.3.1 and the R packages tidyverse (which includes dplyr) and janitor.

library(tidyverse)
library(dplyr)

STEP 1: Clean Baseline Dataset

Directions: Import, clean, and tidy the dataset of baseline demographics. Ensure that sex and APOE4 carrier status are appropriately encoded (i.e. not numeric), and remove any participants who do not meet the stated inclusion criteria (i.e. no MCI at baseline).

mci_baseline_description <- read_csv("./data_mci/MCI_baseline.csv") %>% colnames()
mci_baseline_description #checking variable descriptions
## [1] "...1"                                                                                      
## [2] "Age at the study baseline"                                                                 
## [3] "1 = Male, 0 = Female"                                                                      
## [4] "Years of education"                                                                        
## [5] "1 = APOE4 carrier, 0 = APOE4 non-carrier"                                                  
## [6] "Age at the onset of MCI; missing if a subject remains MCI free during the follow-up period"
mci_baseline <- 
  #use second row as header
  read_csv("./data_mci/MCI_baseline.csv", skip = 1) %>% 
  #clean variable names to snake_case
  janitor::clean_names() %>% 
  #recode sex & apoe4 as character variables
  mutate(sex = recode(sex, "1" = "Male", "0" = "Female")) %>% 
  mutate(apoe4 = recode(apoe4, "1" = "APOE4 carrier", "0" = "APOE4 non-carrier")) %>%
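  #convert age_at_onset to numeric; "." marks participants who stayed MCI free
  mutate(age_at_onset = as.numeric(na_if(age_at_onset, "."))) %>%
  #keep participants with no MCI at baseline (no onset, or onset after baseline)
  filter(is.na(age_at_onset) | age_at_onset > current_age)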
summary(mci_baseline)
##        id         current_age        sex              education   
##  Min.   :  1.0   Min.   :56.00   Length:479         Min.   :12.0  
##  1st Qu.:121.5   1st Qu.:63.15   Class :character   1st Qu.:16.0  
##  Median :242.0   Median :64.90   Mode  :character   Median :16.0  
##  Mean   :242.0   Mean   :65.03                      Mean   :16.4  
##  3rd Qu.:362.5   3rd Qu.:67.00                      3rd Qu.:18.0  
##  Max.   :483.0   Max.   :72.90                      Max.   :20.0  
##                                                                   
##     apoe4            age_at_onset  
##  Length:479         Min.   :61.20  
##  Class :character   1st Qu.:68.20  
##  Mode  :character   Median :70.20  
##                     Mean   :70.41  
##                     3rd Qu.:73.40  
##                     Max.   :77.20  
##                     NA's   :386

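When importing the dataset, we notice that the default header of the raw file contains descriptions of each variable rather than variable names. Printing the dataset shows that the actual header is in the second row, which after cleaning with janitor::clean_names() gives: id, current_age, sex, education, apoe4, age_at_onset. Passing skip = 1 to read_csv skips the description row so that the second row is used as the header.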
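To make the effect of skip = 1 concrete, here is a minimal sketch on a made-up two-line-header file. The file contents and the name toy are hypothetical and inlined as text, so this is only an illustration of the idea, not the actual MCI data.

```r
library(readr)

# Hypothetical file with a description row above the real header (made-up values)
raw_text <- I("Participant ID,Age at the study baseline\nid,current_age\n1,63.1\n2,65.6\n")

# Default import: the description row is used as the header
names(read_csv(raw_text, show_col_types = FALSE))

# skip = 1 drops the description row, so the second row becomes the header
toy <- read_csv(raw_text, skip = 1, show_col_types = FALSE)
names(toy)
```

Checking colnames() with and without skip, as done above for the real file, is a quick way to confirm which row holds the variable names.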

After excluding participants who do not meet the inclusion criteria, 479 participants remain, and 93 of them developed MCI during follow-up. The average baseline age is 65.03 years, with a range of 56 to 72.9. 30% of women are APOE4 carriers.
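The figures above can be computed with a few summarise calls. The sketch below uses a small made-up table (toy_baseline, with invented values) that mirrors the columns of mci_baseline. It assumes that a non-missing age_at_onset means the participant developed MCI, and it is not necessarily the exact code used to produce the numbers reported here.

```r
library(dplyr)

# Toy stand-in for the cleaned baseline table (all values are made up)
toy_baseline <- tibble(
  id = 1:6,
  current_age = c(60.1, 62.4, 65.0, 66.3, 68.2, 70.0),
  sex = c("Female", "Female", "Male", "Female", "Male", "Female"),
  apoe4 = c("APOE4 carrier", "APOE4 non-carrier", "APOE4 carrier",
            "APOE4 non-carrier", "APOE4 non-carrier", "APOE4 carrier"),
  age_at_onset = c(NA, 67.5, NA, NA, 71.2, NA)
)

# Sample size, number who developed MCI, and baseline age summary
toy_summary <- toy_baseline %>%
  summarise(
    n_participants = n(),
    n_developed_mci = sum(!is.na(age_at_onset)),
    mean_age = mean(current_age),
    min_age = min(current_age),
    max_age = max(current_age)
  )

# Proportion of women who are APOE4 carriers
prop_female_carriers <- toy_baseline %>%
  filter(sex == "Female") %>%
  summarise(prop = mean(apoe4 == "APOE4 carrier")) %>%
  pull(prop)
```

Replacing toy_baseline with mci_baseline should give the corresponding values for the real data, and round(mean_age, 2) keeps the reported mean readable.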

STEP 2: Clean Amyloid Dataset

Directions: Import, clean, and tidy the dataset of longitudinally observed biomarker values; comment on the steps of the import process and the features of the dataset.

mci_amyloid_description <- read_csv("./data_mci/mci_amyloid.csv") %>% colnames()
mci_amyloid_description #checking variable descriptions
## [1] "Study ID"                                                                                                        
## [2] "Time (in years) elapsed since the study baseline to the visit where biomarker Amyloid _ 42/40 ratio was measured"
## [3] "NA...3"                                                                                                          
## [4] "NA...4"                                                                                                          
## [5] "NA...5"                                                                                                          
## [6] "NA...6"
mci_amyloid <- 
  #use second row as header
  read_csv("./data_mci/mci_amyloid.csv", skip = 1) %>% 
  #clean variable names to snake_case
  janitor::clean_names() %>%  
  #rename variables to match baseline dataset
  rename(id = study_id, time_0 = baseline) %>% 
  #change character variables of amyloid ratios to numeric
  mutate(time_0 = as.numeric(time_0), time_2 = as.numeric(time_2), time_4 = as.numeric(time_4),
         time_6 = as.numeric(time_6), time_8 = as.numeric(time_8)) %>% 
  #tidy dataset
  pivot_longer(., time_0:time_8, names_to = "years_elapsed_since_baseline", values_to = "amyroid_ratio", names_prefix = "time_")  
summary(mci_amyloid)
##        id        years_elapsed_since_baseline amyroid_ratio    
##  Min.   :  1.0   Length:2435                  Min.   :0.09938  
##  1st Qu.:125.0   Class :character             1st Qu.:0.10752  
##  Median :248.0   Mode  :character             Median :0.10967  
##  Mean   :248.6                                Mean   :0.10969  
##  3rd Qu.:372.0                                3rd Qu.:0.11187  
##  Max.   :495.0                                Max.   :0.11871  
##                                               NA's   :172

The tidied dataset has a total of 2435 observations, 487 unique participants, and 3 variables: id, years_elapsed_since_baseline, and amyroid_ratio. When importing the dataset, we notice that the default header of the raw file contains descriptions of the variables rather than variable names. Printing the dataset shows that the actual header is in the second row, which after cleaning gives study_id, baseline, time_2, time_4, time_6, and time_8. To tidy the dataset, I used the pivot_longer function to turn the separate columns for each follow-up time into rows, so that each row holds one participant at one time point.
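As a small illustration of the wide-to-long step, here is a sketch on made-up data. The table toy_wide and its values are hypothetical, and across() is used here as a more compact alternative to converting each column by hand, so this is a variation on the approach rather than the exact code above.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one row per participant, one column per visit (made-up values)
toy_wide <- tibble(
  id = c(1, 2),
  time_0 = c("0.1105", "0.1107"),
  time_2 = c("0.1101", NA_character_),
  time_4 = c("0.1093", "0.1098")
)

toy_long <- toy_wide %>%
  # convert every time_* column from character to numeric in one step
  mutate(across(starts_with("time_"), as.numeric)) %>%
  # one row per participant per visit; strip the "time_" prefix from the names
  pivot_longer(
    cols = starts_with("time_"),
    names_to = "years_elapsed_since_baseline",
    names_prefix = "time_",
    values_to = "amyroid_ratio"
  )
```

Two participants with three visits each give six rows. Note that years_elapsed_since_baseline stays a character column after pivoting, which matches the summary output shown above.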

There are 8 participants who appear only in the baseline dataset and not in the amyloid dataset, and 16 participants who appear only in the amyloid dataset and not in the baseline dataset. To ensure that only participants with entries in both datasets are included in the final dataset, I will merge the two datasets using the inner_join function.
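Counts like these can be obtained by running anti_join in both directions. The sketch below uses two tiny made-up tables (toy_baseline and toy_amyloid are hypothetical names and values), so it only illustrates the pattern.

```r
library(dplyr)

# Toy tables sharing an id key (made-up values)
toy_baseline <- tibble(id = c(1, 2, 3, 4))
toy_amyloid <- tibble(
  id = c(2, 3, 4, 5, 6),
  amyroid_ratio = c(0.110, 0.109, 0.111, 0.108, 0.112)
)

# Participants present in baseline but missing from amyloid
baseline_only <- anti_join(toy_baseline, toy_amyloid, by = "id")

# Participants present in amyloid but missing from baseline;
# distinct(id) guards against counting a participant once per visit
amyloid_only <- toy_amyloid %>%
  distinct(id) %>%
  anti_join(toy_baseline, by = "id")

# Participants present in both tables
both <- inner_join(toy_baseline, toy_amyloid, by = "id")

nrow(baseline_only)
nrow(amyloid_only)
nrow(both)
```

The distinct(id) step matters for a long table such as the tidied amyloid data, where each participant appears once per visit.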

STEP 3: Merge the Two Datasets

Before merging, it is good practice to use anti_join to check whether there are entries in one dataset that have no corresponding match in the other, based on the specified key columns. This is especially important when the datasets are linked by unique IDs.

not_matched <- anti_join(mci_baseline, mci_amyloid)
## Joining with `by = join_by(id)`
not_matched
## # A tibble: 8 × 6
##      id current_age sex    education apoe4             age_at_onset
##   <dbl>       <dbl> <chr>      <dbl> <chr>                    <dbl>
## 1    14        58.4 Female        20 APOE4 non-carrier         66.2
## 2    49        64.7 Male          16 APOE4 non-carrier         68.4
## 3    92        68.6 Female        20 APOE4 non-carrier         NA  
## 4   179        68.1 Male          16 APOE4 non-carrier         NA  
## 5   268        61.4 Female        18 APOE4 carrier             67.5
## 6   304        63.8 Female        16 APOE4 non-carrier         NA  
## 7   389        59.3 Female        16 APOE4 non-carrier         NA  
## 8   412        67   Male          16 APOE4 carrier             NA

We see that there are 8 entries in mci_baseline that do not have a match in mci_amyloid. This hints at loss to follow-up and may raise concerns about selection bias. Keeping this in mind, let’s continue with the merge.

data_q3 <- inner_join(mci_baseline, mci_amyloid, by = "id")
summary(data_q3)
##        id         current_age        sex              education    
##  Min.   :  1.0   Min.   :56.00   Length:2355        Min.   :12.00  
##  1st Qu.:122.0   1st Qu.:63.20   Class :character   1st Qu.:16.00  
##  Median :242.0   Median :64.90   Mode  :character   Median :16.00  
##  Mean   :242.5   Mean   :65.05                      Mean   :16.38  
##  3rd Qu.:363.0   3rd Qu.:67.00                      3rd Qu.:18.00  
##  Max.   :483.0   Max.   :72.90                      Max.   :20.00  
##                                                                    
##     apoe4            age_at_onset   years_elapsed_since_baseline
##  Length:2355        Min.   :61.20   Length:2355                 
##  Class :character   1st Qu.:68.40   Class :character            
##  Mode  :character   Median :70.35   Mode  :character            
##                     Mean   :70.51                               
##                     3rd Qu.:73.60                               
##                     Max.   :77.20                               
##                     NA's   :1905                                
##  amyroid_ratio    
##  Min.   :0.09938  
##  1st Qu.:0.10752  
##  Median :0.10967  
##  Mean   :0.10969  
##  3rd Qu.:0.11188  
##  Max.   :0.11871  
##  NA's   :167

There are 2355 observations, 471 unique participants, and 8 variables in the final merged dataset for MCI. The variables are id, current_age, sex, education, apoe4, age_at_onset, years_elapsed_since_baseline, and amyroid_ratio. An Aβ42/40 ratio <0.150 suggests a higher risk of AD pathology compared to higher values. 2 years after baseline, 89.6% of participants have an Aβ42/40 ratio <0.150. 4 years after baseline, the proportion increases to 91.08%; at 6 years, to 91.72%; and at 8 years, to 92.36%. Among females, the proportion with an Aβ42/40 ratio <0.150 8 years after baseline is 91.71%; among males, it is 92.86%.
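Proportions of this kind can be computed by grouping on the follow-up time (and on sex) and taking the mean of a logical condition. The sketch below uses a made-up table, toy_merged, with invented values. How missing ratios are handled is an assumption on my part: here they are dropped with na.rm = TRUE, which may differ from how the percentages above were calculated.

```r
library(dplyr)

# Toy stand-in for the merged long table (all values are made up)
toy_merged <- tibble(
  id = rep(1:4, each = 2),
  sex = rep(c("Female", "Male", "Female", "Male"), each = 2),
  years_elapsed_since_baseline = rep(c("2", "8"), times = 4),
  amyroid_ratio = c(0.110, 0.108, 0.155, 0.149, 0.112, NA, 0.151, 0.152)
)

threshold <- 0.150

# Proportion below the threshold at each follow-up time
# (assumption: missing ratios are excluded from the denominator)
prop_by_time <- toy_merged %>%
  group_by(years_elapsed_since_baseline) %>%
  summarise(prop_below = mean(amyroid_ratio < threshold, na.rm = TRUE),
            .groups = "drop")

# Same proportion at year 8, split by sex
prop_by_sex_year8 <- toy_merged %>%
  filter(years_elapsed_since_baseline == "8") %>%
  group_by(sex) %>%
  summarise(prop_below = mean(amyroid_ratio < threshold, na.rm = TRUE),
            .groups = "drop")
```

Dropping na.rm = TRUE would return NA for any group with a missing ratio, so the choice of denominator is worth stating explicitly when reporting results like these.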

STEP 4: Save cleaned file

We can save our cleaned dataset as follows:

#row.names = FALSE avoids writing an extra unnamed index column
write.csv(data_q3, file = "data_q3.csv", row.names = FALSE)