library(readxl)
library(dplyr)
library(here)
library(stringr)
library(Hmisc)
Capstone 1 (day 1): Import and cleaning data
In this exercise, you’ll practice reading Excel sheets and cleaning the data using basic R functions.
Setup
Open R.project
in the folder cloned from repository.
Create a new script file and save it under the name exercise_day1.R
(you can create rmd or qmd file if you are comfortable with them)
Install the required packages if you haven’t already:
install.packages(c("readxl", "dplyr", "here", "stringr", "Hmisc"))
Then load the libraries using library()
1. Read ADM and ENR data
1.1. Define the File Path
Set the path to the Excel file 2-10-2020-_03TS_V1_Data.xls
stored in the data
folder. You might use here
function from here
package.
❓ Question: What folder is the file located in? Why do we use here()
instead of typing the full path?
✅ Why Use here()?
Ensures your code works across different computers.
Makes file paths dynamic and reproducible.
Works best when you use an RStudio Project (.Rproj file), which sets the root directory.
Example:
<- here("capstone_exercises","data", "2-10-2020-_03TS_V1_Data.xls") # Replace with your actual file name if needed data_path
This points to: <your_project_root>/capstone_exercises/data/2-10-2020-_03TS_V1_Data.xls
1.2. Read the ENR and ADM Sheets
Use read_excel
function to import the excel sheets. Use setNames
+ tolower
to make all variables names become lowercases.
Example:
# Read the ENR sheet
<- read_excel(file_path, sheet = "ENR") %>%
enr setNames(tolower(names(.)))
Write code to import “ADM” sheet to adm
data frame too.
Use summary
, skimr
, str
to inspect our data. Determine whether any cleaning is necessary.
1.3. Clean the Data
For ENR
Most of the time, we receive data with undesired values like Y/N/UNKNOWN
instead of yes, no, or NA
. We need to clean them.
Using mutate
, case_when
and factor
- Replace
"Y"
with"yes"
,"N"
with"no"
- Replace
"UNKNOWN"
withNA
- Convert
"yes"
/"no"
columns to factors labeled0
and1
Example: clean variable age16
<- enr %>%
enr mutate(
age16 = case_when(age16 == "Y" ~ "yes",
== "N" ~ "no",
age16 == "UNKNOWN" ~ NA_character_),
age16 age16 = factor(age16, levels = c("no", "yes"), labels = c(0, 1))
)
You might apply this manually to all other variables in enr and adm.
However, if you need to change many variables, it’s more efficient to use across
function that mutates all of them automatically, as shown below. This approach saves time, ensures that no variables are missed, and eliminates the need to modify the code when new variables are added to the data.
<- enr %>%
enr mutate(across(
where(is.character),
~ case_when(
== "Y" ~ "yes",
.x == "N" ~ "no",
.x == "UNKNOWN" ~ NA_character_,
.x TRUE ~ .x
)%>%
)) mutate(across(
where(~ all(.x %in% c("yes", "no", NA), na.rm = TRUE)),
~ factor(.x, levels = c("no", "yes"), labels = c(0, 1))
))
For ADM
(Repeat the same steps)
Check Your Work
Use summary
, skimr
, str
to explore your cleaned data.
2. Read and Process Allocation Data
In this section, you’ll import an Excel file 03TS_Randlist.xlsx
that contains randomization information and prepare it for analysis.
2.1 Define the File Path
2.2 Read the Allocation Sheet
Now import the "Allocation"
sheet and convert all column names to lowercase.
2.3 Filter and Clean the Data
Next, keep only the first 272 rows (patients) using filter
function.
Then rename the treatment arm column from r.arm
to arm
using rename
.
Create cleaned patient usubjid
columns from pat.id
that matches the format of usubjid
from enr
using mutate
, str_replace
and paste
.
Replace the values in arm
according to the dictionary
sheet in the 03TS_Randlist.xlsx
using mutate
and case_when
<- allocation_data %>%
randolist filter(row_number() %in% c(1:272)) %>%
rename(arm = r.arm) %>%
mutate(
pat.id = str_replace(pat.id, ".*-", ""),
usubjid = paste("003", pat.id, sep = "-"),
arm = case_when(
== "14 ampoules: TETANUS ANTITOXIN (IM) + 2 prefilled-syringes : TETAGAM®P (intrathecal))" ~ "equine and intrathecal",
arm == "14 ampoules: TETANUS ANTITOXIN (IM)" ~ "equine and sham",
arm == "12 prefilled-syringes : TETAGAM®P (IM) + 2 prefilled-syringes : TETAGAM®P (intrathecal))" ~ "human and intrathecal",
arm == "12 prefilled-syringes : TETAGAM®P (IM)" ~ "human and sham"
arm
) )
Use head(randolist)
to check your processed data.
3. Protocol Violations, Exclusions, and Withdrawals
In this section, you will work with a file that contains information about patients who should be excluded from the study or had protocol violations.
3.1 Define the File Path
3.2 Read Each Sheet
Import each relevant sheet from the Excel file and standardize column names.
# Read and standardize all relevant sheets
<- read_excel(violations_file_path, sheet = "Pilot") %>%
pilot setNames(tolower(names(.)))
<- read_excel(violations_file_path, sheet = "MV before IM") %>%
mv_before_im setNames(tolower(names(.)))
<- read_excel(violations_file_path, sheet = "MV before IT") %>%
mv_before_it setNames(tolower(names(.)))
<- read_excel(violations_file_path, sheet = "Protocol violation") %>%
protocol_violation setNames(tolower(names(.)))
<- read_excel(violations_file_path, sheet = "Withdrawals") %>%
withdrawals setNames(tolower(names(.)))
<- read_excel(violations_file_path, sheet = "IT Per protocol") %>%
it_per_protocol setNames(tolower(names(.)))
<- read_excel(violations_file_path, sheet = "IM per protocol") %>%
im_per_protocol setNames(tolower(names(.)))
<- read_excel(violations_file_path, sheet = "IM ITT") %>%
im_itt setNames(tolower(names(.)))
# Rename column for consistency
<- im_itt %>% rename(usubjid = "id") im_itt
Challenge: If you are familiar with writing new functions and using map, try writing code to automatically import all sheets from an Excel file.
Use str()
or names()
to explore the structure of these data frames.
3.3 Identify and Exclude Patients
Now combine excluded patient IDs from the pilot
and withdrawals
sheets using unique
and c
. Filter out excluded patients from the main datasets using filter
.
# Combine USUBJIDs for exclusion
<- unique(c(pilot$usubjid, withdrawals$usubjid))
excluded_subjects
# Filter ENR and ADM to remove excluded subjects
<- enr %>% filter(!usubjid %in% excluded_subjects)
enr <- adm %>% filter(!usubjid %in% excluded_subjects) adm
4. Create and Label Baseline Data
In this section, you’ll merge enrollment, admission, and allocation data, then create new variables and assign labels.
4.1 Merge Datasets
Join enr
, adm
, and randolist
to create a complete baseline dataset byusubjid
using left_join
.
<- enr %>%
baseline_data select(-c(entry, studyid, siteid, subjid, event )) %>%
left_join(adm, by = "usubjid") %>%
left_join(randolist %>% select(usubjid, arm), by = "usubjid")
❓ Question: Why might we want to remove the entry
, studyid
, siteid
, subjid
and event
column before merging? How about enteredtime
?
4.2 Add BMI and Label Variables
Now we use the upData()
function to add a new variable (BMI) and label the columns for easier reference in summaries and tables.
Example:
<- upData(
baseline_data_raw
baseline_data,bmi = weight / ((height / 100) ^ 2),
# New variable derived from weight and height
labels = c(
age = "Age (years)",
icudays = "Days in ICU"
) )
🔍 Try this: Relabel all other variables by adding them to the example above. Run summary(baseline_data_raw)
to explore the labeled dataset.
🧠 Reflection: Why is labeling variables important in medical data analysis?