Introduction to R, Day 1

Ronald Geskus and Thinh Ong Phuc

Learning goals

Become familiar and confident with the R program, using RStudio

Not: using R for statistics, bioinformatics or mathematical modeling

Day 1
- Basic structure of the R program and language
  - Basic computations and selections
  - Objects and functions
- Use the RStudio working environment
- Import, inspect and manage a data set
  - Select variables (columns) and rows (observations)
  - Missing data, factors, dates

Day 2
- Functions, R graphics, finding further information

The R program

R: What is it and why use it?

On R Project website: “a language and environment for statistical computing and graphics”
Free: no money charged and open source
Runs on all major operating systems
Very flexible and powerful
- Reproducibility: see Friday course
- Community: large, collaborative user base
Steep learning curve(?)

The R phenotypes

R program: very basic functionality via menus

RStudio: user-friendly shell around R

Most analyses are performed by writing code in a file.

Classical format: script file with “.R” extension.

Newer formats: R Markdown (“.Rmd”) and Quarto (“.qmd”)

Graphical user interfaces, e.g. BlueSky, R-Instat, R Commander

Basics of the R Language

R as a pocket calculator I

Many mathematical operations (+, -, *, /, ^ or **) pre-defined

Add numbers 2 and 7:

Multiply 2 and 7:

Divide 7 by 2:

Compute $2^7$:
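The four computations above can be typed directly at the R console. A minimal sketch; the comments give the value that R prints:

```r
2 + 7    # addition: 9
2 * 7    # multiplication: 14
7 / 2    # division: 3.5
2^7      # exponentiation: 128 (2**7 gives the same result)
```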

R as a pocket calculator II

Some operations based on functions: sqrt, log, log2, log10, sum, prod

Compute the square root of 2:

Sum of numbers 1 to 5:

Take the 10-log of 1000 (y with $10^y=1000$):

Some constants ($\pi$, …) pre-defined:
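As a short sketch, the function-based computations above look as follows. Each function takes its input between parentheses:

```r
sqrt(2)        # square root of 2, about 1.414214
sum(1:5)       # 1 + 2 + 3 + 4 + 5 = 15
log10(1000)    # 3, because 10^3 = 1000
pi             # pre-defined constant, about 3.141593
```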

Assignment

Assign value 2 to object x and show content of x

Without assignment, x keeps old value
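A small sketch of both points: a computation with x does not change x unless the result is assigned back to it.

```r
x <- 2        # assign the value 2 to the object x
x             # typing the name shows the content: 2

x + 5         # computes 7, but the result is not stored
x             # x keeps its old value: 2

x <- x + 5    # assign the result to x to store it
x             # now 7
```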

Datasets

Most often rectangular format
- Columns: variables
- Rows: observations
Example: titanic3 data set, with passenger characteristics and survival status after the sinking of the Titanic ocean liner

For now, we work with a subset of the titanic data set:

Data: selections within a single column

Each column is a vector of values
Single column from data set selected via “$”, e.g. titanic$age
Elements of vector selected via numbers within square brackets [ ]

Select age of passengers 6 and 10

Select age of passengers 6 to 10

Flexibility of seq function
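The selections above can be sketched as follows. The vector age holds invented values for 10 passengers (an assumption for illustration, not the actual titanic3 data); with the course data you would write titanic$age instead.

```r
# Hypothetical ages of 10 passengers (invented values)
age <- c(29, 2, 30, 25, 48, 63, 39, 53, 71, 47)

age[c(6, 10)]              # passengers 6 and 10: 63 47
age[6:10]                  # passengers 6 to 10: 63 39 53 71 47
age[seq(6, 10)]            # same selection, written with seq

# seq is more flexible than ":" because the step size can be chosen
seq(1, 10, by = 3)         # 1 4 7 10
age[seq(1, 10, by = 3)]    # passengers 1, 4, 7 and 10: 29 25 39 47
```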

Data: selection of rows and columns

Via square brackets and a comma: dataset[rows, cols]

Columns: selection by name of columns
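A minimal sketch of the dataset[rows, cols] pattern. The data frame passengers and its values are invented for illustration and are not the titanic3 data.

```r
# Hypothetical mini data set (invented values)
passengers <- data.frame(
  sex  = c("female", "male", "female", "male"),
  age  = c(29, 2, 30, 25),
  fare = c(211.3, 151.6, 151.6, 26.0)
)

passengers[2, 3]                  # row 2, column 3: 151.6
passengers[1:2, ]                 # rows 1 and 2, all columns
passengers[, c("sex", "fare")]    # all rows, columns selected by name
passengers[c(1, 4), "age"]        # age of passengers 1 and 4: 29 25
```

Leaving the position before or after the comma empty selects all rows or all columns.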

Exercise 1

Create a vector named vec that consists of the numbers from 11 to 30
Select the 7th element of the vector
Select all elements except the 15th. Hint: use a minus sign
Select the 2nd and 5th element of the vector

Select only the odd valued elements of vec.

Remark: Exercise e (selecting the odd valued elements) is not an easy exercise. We advise you to do this in two steps. First create a vector index that selects the odd numbers in vec, using the function seq. You may need to consult the help page of seq via help(seq).

Exercise 2

Use the elementary functions /, -, ^ and the function sum to calculate the mean \[\bar{x}=\sum_{i=1}^n x_i/n\] and the standard deviation \[\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2/(n-1)}\] of the fare paid by the titanic passengers. Note: $n=10$.
Verify your answer by using the built-in R functions mean and sd.

Functions (I)

A function is how we tell the computer to do work for us
Think of it like asking someone to complete a task for you

Example

Thinh told Ronald: “Please give feedback on my measles model manuscript before next Friday”.
How does this translate into R?

give_feedback(type = "manuscript", which = "measles model", deadline = "next Friday")

Functions (II)

Everything we do in R boils down to running a function:

Generates some output given some input:

goal( input1, input2, ...)
- goal: what do we want; function name
- input: what information does R need for this; arguments of the function
Output: a value (often assigned to an object for further use); a figure; a help page, …

E.g. help(mean): mean is argument of function help. Output is the appearance of a page with information on the mean function
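As a sketch of the goal(input1, input2, ...) pattern, here is a call to the base R function round. The object name rounded is chosen for illustration only.

```r
# Arguments can be matched by position or by name
round(3.14159, 2)                 # by position: 3.14
round(x = 3.14159, digits = 2)    # by name: 3.14

# The output is often assigned to an object for further use
rounded <- round(x = 3.14159, digits = 2)
rounded * 2                       # 6.28
```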

Many calculation functions are vectorized

Example: change unit of age to decade (10 years)
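A minimal sketch of vectorization, using invented ages (with the course data this would be titanic$age). The division is applied to every element at once, without writing a loop.

```r
# Hypothetical ages in years (invented values)
age <- c(29, 2, 30, 25, 48)

age / 10           # age in decades: 2.9 0.2 3.0 2.5 4.8
sqrt(c(1, 4, 9))   # functions are often vectorized as well: 1 2 3
```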

What does this code give:

More general: iteration via apply functions (Tuesday) and the purrr package (Thursday)

Objects

Everything in R is an object (data, functions, results from analysis, …)
Names for user created objects: patients, Data, abc, LetsHaveFun
Names are case-sensitive: Data is not the same as data

Not allowed:
- spaces
- characters with special meaning in R: “@”, “$”, “+” etc.
- a number as first character (numbers are allowed elsewhere in the name)
- reserved words of the R language: for, if, while …
Better avoid names that are R functions:

sort, c, mean, t, data, q

“_” and “.” are allowed (e.g. sorted.results_file)

Many prefer “_” over “.” to subdivide names

Remember

Objects and functions

Everything that exists in R is an object (data, functions, …)
Everything that happens in R is a function call

Modes

Data (variables) come in different types. The most important ones are:
- numeric: numbers
- logical: values TRUE and FALSE
- character: text strings

You can change the mode of some data objects
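A short sketch of how to check and change the mode of a value, using the base R functions mode, as.numeric and as.character:

```r
mode(3.5)           # "numeric"
mode(TRUE)          # "logical"
mode("male")        # "character"

# Changing the mode
as.numeric("42")    # character to numeric: 42
as.character(42)    # numeric to character: "42"
```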

Modes: logical (I)

TRUE and FALSE can be used for selection

Logical statements arise from comparing values:
- Smaller/larger than: <, >, <=, >=
- Exact equality: ==
- Not equal to: !=
Converted to integer if a numeric value is required:

TRUE equals 1, FALSE equals 0
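Because TRUE counts as 1 and FALSE as 0, logicals can be summed or averaged. A sketch with invented fares (not the titanic3 data):

```r
# Hypothetical fares (invented values)
fare <- c(211.3, 151.6, 26.0, 7.9, 13.5)

fare > 100          # TRUE TRUE FALSE FALSE FALSE
sum(fare > 100)     # number of fares above 100: 2
mean(fare > 100)    # proportion of fares above 100: 0.4
TRUE + TRUE         # 2
```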

Modes: logical (II)

We can calculate with logicals. Main operators are:

&: all must be true “AND”
|: at least one must be true “OR”
!: negation “NOT”

Logical statements about fare

Select fares between 10 and 40
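A sketch of the logical operators, again with invented fares. It assumes that "between 10 and 40" includes both end points; use < and > instead if they should be excluded.

```r
# Hypothetical fares (invented values)
fare <- c(211.3, 151.6, 26.0, 7.9, 13.5)

fare >= 10 & fare <= 40          # AND: FALSE FALSE TRUE FALSE TRUE
fare < 10 | fare > 100           # OR:  TRUE TRUE FALSE TRUE FALSE
!(fare > 100)                    # NOT: FALSE FALSE TRUE TRUE TRUE

# Use the logical vector within square brackets to select
fare[fare >= 10 & fare <= 40]    # 26.0 13.5
```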

RStudio

R and RStudio

R is a programming language for statistical computing and data visualization.

RStudio is software designed to make working with R easier by helping you create, edit, and manage R code and projects. More formally, it is known as an Integrated Development Environment (IDE).

Create Project in RStudio

An R project is a directory with an .Rproj file, which signals RStudio to manage the project settings accordingly.

The process of creating an R project is as follows

In the menu, choose File > New Project…
Click Existing Directory
Click Browse and select your project directory
Click Create Project

Project structure

A minimal R project structure will have the following format

└── my_project
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    └── analysis.R

Where

data folder contains the data to be analyzed (raw and processed)
output stores code output (plots, tables, etc.)
analysis.R is the file containing R code. There can be multiple .R files in one project.

RStudio interface

Consists of 4 main panes

Upper left: shows the content of source code files and of data sets in the R session
Console: shows the executed code lines and their output
Environment: shows the currently defined variables
Files/Plots/Packages/Help/…:
- Files: file navigation
- Plots: shows plot output
- Packages: shows all installed packages and the packages being used (packages in use have a ✓)
- Help: shows documentation for functions and packages

Packages

Regular R users transform recurring tasks into self-made functions
Can make it into a package, i.e. a collection of functions (and data):
- survival
- ggplot2
- tidyverse
- WHO TB data
- AMR
- pharmaverse
- SUSENAS
- vietnameseConverter
- sudoku
- rStrava
Give package developers appropriate credit and cite their packages

Packages, where to find them?

Comprehensive R Archive Network (22127 packages on March 3, 2025)
GitHub (31311 packages on March 3, 2025)
Bioconductor
Review

Packages need to be installed on your computer
The R program itself is also based on packages. Some are loaded at startup: base, graphics, stats, utils, …

E.g. the base package contains the sqrt function
Other packages need to be loaded before use; the best way is via the library function
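A minimal sketch of loading a package. It uses tools, a package that ships with R but is not loaded at startup, so the example runs without an internet connection. The install.packages line is commented out because it downloads from CRAN.

```r
# One-time installation of a package from CRAN (needs internet):
# install.packages("readxl")

# Load a package in each session before using its functions
library(tools)
file_ext("data/raw/Titanic3.xlsx")    # file extension: "xlsx"

# A single function can also be called without loading: package::function
tools::file_ext("analysis.R")         # "R"
```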

Data: import, inspection and management

Data import

R: see R Data Import/Export
RStudio: Environment pane: Import Dataset; or via File menu: Import Dataset

text and csv (base or readr package)
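A sketch of csv import with base R. To keep it self-contained, it first writes a small invented data set to a temporary file; the object names csv_file and demo are chosen for illustration. In practice you would pass the path of your own file to read.csv.

```r
# Write a small invented data set to a temporary csv file
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(sex = c("female", "male"), age = c(29, 2)),
          file = csv_file, row.names = FALSE)

# Read it back with base R; readr::read_csv is the readr counterpart
demo <- read.csv(csv_file)
demo
str(demo)
```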

Excel (readxl package)

   library(readxl)
   titanic <- read_excel("data/raw/Titanic3.xlsx")

SPSS/Stata/SAS (haven package)
see Importing Data with RStudio
Imported as a “tibble” (enhancement of data.frame)
Database: MS Access, SQLite, …

Help

To learn more about a function or data set, you can often use the help function
help(mean) gives

Description: Generic function for the (trimmed) arithmetic mean.

Usage: mean(x, …)

Default S3 method:

mean(x, trim = 0, na.rm = FALSE, ...)

Arguments:

x: an R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.

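trim: the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.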

na.rm: a logical evaluating to ‘TRUE’ or ‘FALSE’ indicating whether NA values should be stripped before the computation proceeds.

Value: If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one.

Help

Outline of a help page of a function is always the same:
- Description: what does the function do
- Usage: what arguments/input does the function expect
- Arguments: description of the individual arguments
- Value: what is the result of a function call
- often: Details, References, See Also
  - Examples. You can use the example function to run the examples: example(mean)

Missing data

Special value: NA (short for “not available”)
The function is.na checks for missingness

Most functions exclude missings by default

Not always what you want
- table excludes missings; we can include them via argument useNA="always"
- quantile gives an error and mean returns NA if there are missings; specify argument na.rm=TRUE
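A short sketch of working with missing values, using an invented age vector with two NA entries:

```r
# Hypothetical ages with two missing values (invented values)
age <- c(29, NA, 30, 25, NA)

is.na(age)                      # FALSE TRUE FALSE FALSE TRUE
sum(is.na(age))                 # number of missing values: 2

mean(age)                       # NA
mean(age, na.rm = TRUE)         # mean of the observed values: 28

table(age)                      # the NA values are not shown
table(age, useNA = "always")    # adds a category for NA
```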

Data inspection and basic variable summaries

View: opens a spreadsheet-style data viewer

In RStudio: click on the name of a data object in the Environment tab
dim: gives the number of rows and columns
head: shows the first rows of a data frame
tail: similar to head but shows the last rows
str: compact display of the internal structure of an R object

Variable summaries

summary: concise summary of each variable
mean: mean value
quantile: quantiles
IQR: inter-quartile range
table: tabulation of one or two categorical variables
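The inspection and summary functions above can be tried on a small data frame. The data frame passengers and its values are invented for illustration; with the course data you would use titanic.

```r
# Hypothetical mini data set (invented values)
passengers <- data.frame(
  sex  = c("female", "male", "female", "male"),
  age  = c(29, 2, 30, 25),
  fare = c(211.3, 151.6, 151.6, 26.0)
)

dim(passengers)        # number of rows and columns: 4 3
head(passengers, 2)    # first two rows
tail(passengers, 2)    # last two rows
str(passengers)        # structure: one line per variable

summary(passengers)                        # summary of each variable
mean(passengers$age)                       # 21.5
quantile(passengers$age, c(0.25, 0.75))    # first and third quartile
IQR(passengers$age)                        # inter-quartile range
table(passengers$sex)                      # female: 2, male: 2
```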

Do exercises 3 and 4

Data found here: Titanic3.xlsx

Selection by name

Create a named vector:

Select by name:

Same for data sets (use dimnames because there are two dimensions):
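A sketch of selection by name. The names and values are invented for illustration only.

```r
# Named vector (hypothetical names and fares)
fare <- c(Allen = 211.3, Allison = 151.6, Anderson = 26.6)
fare["Allison"]                  # element named Allison: 151.6
fare[c("Allen", "Anderson")]     # two elements by name

# Data set: names for both dimensions (rows and columns)
passengers <- data.frame(age = c(29, 2, 25), fare = c(211.3, 151.6, 26.6))
rownames(passengers) <- c("Allen", "Allison", "Anderson")
dimnames(passengers)                    # list with row names and column names
passengers["Allison", "fare"]           # 151.6
passengers[c("Allen", "Anderson"), ]    # two rows by name, all columns
```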

More efficient row selections: “subset”

subset function: subset(my.data, subset=...),

with ... a logical condition:

subset(titanic, pclass %in% c("1st","2nd"))$survived

Note: %in% is another logical construct; it checks whether each value on the left occurs in the vector on the right

subset argument in functions with formula structure:

xtabs(~survived, data=titanic, subset=(sex=="male"))

Warning

Don’t use “$” for column selection if the function has a data argument. Avoid code such as:

xtabs(~titanic$survived, data=titanic, subset=(titanic$sex=="male"))

More efficient column selections

subset function: subset(my.data, select=...), with ... a selection of columns names:
```
head(subset(titanic, select= c(sex,fare)))
```
We can select a sequence of names via : function:
```
head(subset(titanic, select= sex:fare))
```
Or use - to exclude columns:
```
head(subset(titanic, select= -(sex:fare)))
```

The with function avoids repeating the data set name:

with(titanic, table(sex, survived))
## instead of table(titanic$sex, titanic$survived)

Give extra structure to categorical variables: factors

Define the ordering of the “levels” (useful in regression models)

Default is by alphabetical/numeric order
```
table(titanic$sex)
```

Change the order

titanic$sex <- factor(titanic$sex)
levels(titanic$sex)
titanic$sex <- factor(titanic$sex, levels=c("male","female"))
table(titanic$sex)

To change naming of the levels

titanic$sex <- factor(titanic$sex, labels=c("M","F"))

Good introduction to factors

Dates

Numeric value (units since time origin) with character representation
Origin:
- SPSS: October 14, 1582 (seconds)
- R: January 1st, 1970 (days)
- Stata: January 1st, 1960 (days)
R is very flexible in conversion between textual date representations
as.Date: create date variable

format: change display format
See page 5 of this reference card
Packages anytime and lubridate useful
A Comprehensive Introduction to Handling Date & Time
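A minimal sketch of as.Date and format in base R. The dates are arbitrary examples; the format codes (%d, %m, %Y) describe day, month and four-digit year.

```r
# Create a date from text; format describes how the text is written
d <- as.Date("03/03/2025", format = "%d/%m/%Y")
d                                    # printed as 2025-03-03

# Change the display format
format(d, "%Y")                      # "2025"
format(d, "%d-%m-%Y")                # "03-03-2025"

# Dates are stored as the number of days since January 1st, 1970
as.numeric(as.Date("1970-01-10"))    # 9

# So we can calculate with them
as.Date("2025-03-10") - d            # time difference of 7 days
```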

Do exercises 5, 6 and 7
