Introduction to R, Day 1

Ronald Geskus and Thinh Ong Phuc

Learning goals

Become familiar and confident with the R program, using RStudio

Not: using R for statistics, bioinformatics or mathematical modeling

  • Day 1
    • Basic structure of the R program and language
      • Basic computations and selections
      • Objects and functions
    • Use the RStudio working environment
    • Import, inspect and manage a data set
      • Select variables (columns) and rows (observations)
      • Missing data, factors, dates
  • Day 2
    • Functions, R graphics, finding further information

The R program

R: What is it and why use it?

  • On R Project website: “a language and environment for statistical computing and graphics”

  • Free: no money charged and open source

  • Runs on all major operating systems

  • Very flexible and powerful

    • Reproducibility: see Friday course
    • Community: large, collaborative user base
  • Steep learning curve(?)

The R phenotypes

  • R program: very basic functionality via menus

  • RStudio: user-friendly shell around R
  • Most analyses are performed by writing code in a file.

    Classical format: script file with “.R” extension.

    Newer formats: R Markdown (“.Rmd”) and Quarto (“.qmd”)

Basics of the R Language

R as a pocket calculator I

Many mathematical operations (+, -, *, /, ^ or **) pre-defined

Add numbers 2 and 7:

Multiply 2 and 7:

Divide 7 by 2:

Compute \(2^7\):
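
The four computations above can be typed directly at the console. A minimal sketch, with the expected result noted in each comment:

```r
2 + 7    # addition: 9
2 * 7    # multiplication: 14
7 / 2    # division: 3.5
2^7      # power: 128 (2**7 gives the same result)
```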

R as a pocket calculator II

Some operations based on functions: sqrt, log, log2, log10, sum, prod

Compute the square root of 2:

Sum of numbers 1 to 5:

Take the 10-log of 1000 (y with \(10^y=1000\)):

Some constants (\(\pi\), …) pre-defined:
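
As a sketch of the function-based operations listed on this slide (only base R functions are used):

```r
sqrt(2)        # square root of 2, approximately 1.414
sum(1:5)       # 1 + 2 + 3 + 4 + 5 = 15
prod(1:5)      # 1 * 2 * 3 * 4 * 5 = 120
log10(1000)    # 3, because 10^3 = 1000
pi             # pre-defined constant, approximately 3.1416
```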

Assignment

Assign value 2 to object x and show content of x

Without assignment, x keeps old value
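
A short sketch of both points: `<-` stores a value in an object, and a computation without assignment leaves the object unchanged.

```r
x <- 2       # assign the value 2 to object x
x            # show the content of x: 2
x + 5        # prints 7, but the result is not stored
x            # x still holds the old value: 2
x <- x + 5   # only an assignment replaces the old value
x            # now 7
```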

Datasets

  • Most often rectangular format
    • Columns: variables
    • Rows: observations
  • Example: titanic3 data set; passenger characteristics and survival status after the sinking of the ocean liner Titanic

For now, we work with subset of titanic data set:

Data: selections within a single column

  • Each column is a vector of values
  • Single column from data set selected via “$”, e.g. titanic$age
  • Elements of vector selected via numbers within square brackets [ ]

Select age of passengers 6 and 10

Select age of passengers 6 to 10

Flexibility of seq function
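
The selections above might look as follows. This is a sketch only: the data frame below is a stand-in with made-up values, not the course's actual titanic subset.

```r
# Stand-in for the 10-row course subset; all values are made up
titanic <- data.frame(
  age  = c(29, 2, 30, 25, 48, 63, 39, 53, 71, 47),
  fare = c(211.3, 151.6, 151.6, 26.0, 26.6, 78.0, 0.0, 51.5, 49.5, 227.5)
)

titanic$age[c(6, 10)]             # age of passengers 6 and 10
titanic$age[6:10]                 # age of passengers 6 to 10
titanic$age[seq(6, 10)]           # same selection, written with seq
titanic$age[seq(1, 10, by = 3)]   # passengers 1, 4, 7 and 10
seq(0, 1, length.out = 5)         # seq can also fix the number of values
```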

Data: selection of rows and columns

  • Via square brackets and a comma: dataset[rows, cols]

Columns: selection by name of columns
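
A minimal sketch of the `dataset[rows, cols]` pattern, again using a small made-up stand-in for the titanic data:

```r
# Stand-in data; values are made up for illustration
titanic <- data.frame(
  sex  = c("female", "male", "female", "male", "male"),
  age  = c(29, 2, 30, 25, 48),
  fare = c(211.3, 151.6, 151.6, 26.0, 26.6)
)

first_rows <- titanic[1:3, ]                # rows 1 to 3, all columns
age_fare   <- titanic[, c("age", "fare")]   # all rows, two columns by name
titanic[c(2, 5), "sex"]                     # rows 2 and 5 of column sex
```

Leaving the part before or after the comma empty selects all rows or all columns, respectively.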

Exercise 1

  1. Create a vector named vec that consists of the numbers from 11 to 30
  2. Select the 7th element of the vector
  3. Select all elements except the 15th. Hint: use a minus sign
  4. Select the 2nd and 5th element of the vector
  5. Select only the odd valued elements of vec.

Remark: The last item is not an easy exercise. We advise you to do it in two steps. First create a vector index that selects the odd numbers in vec, using the function seq. You may need to consult the help page of seq via help(seq).

Exercise 2

Use the elementary functions /, -, ^ and the function sum to calculate the mean \[\bar{x}=\sum_{i=1}^n x_i/n\] and the standard deviation \[\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2/(n-1)}\] of the fare paid by the titanic passengers. Note: \(n=10\).
Verify your answer by using the built-in R functions mean and sd.

Functions (I)

  • A function is how we tell the computer to do work for us

  • Think of it like asking someone to complete a task for you

Example

Thinh told Ronald: “Please give feedback on my measles model manuscript before next Friday”.
How does this translate into R?

give_feedback(type = "manuscript", which = "measles model", deadline = "next Friday")

Functions (II)

  • Everything we do in R boils down to running a function:

    Generates some output given some input:

    goal( input1, input2, ...)

    • goal: what do we want; function name
    • input: what information does R need for this; arguments of the function
  • Output: a value (often assigned to an object for further use); a figure; a help page, …

    E.g. help(mean): mean is argument of function help. Output is the appearance of a page with information on the mean function
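
As a sketch of the `goal(input1, input2, ...)` pattern with real base R functions (the choice of `round` and `log` is only for illustration):

```r
# goal: round a number; inputs: the number and the number of digits
round(3.14159, digits = 2)             # 3.14

# The output can be assigned to an object for further use
rounded <- round(3.14159, digits = 2)
rounded * 2                            # 6.28

# Arguments can be given by name
log(8, base = 2)                       # 3, because 2^3 = 8

# help(round)   # output of this call is a help page, not a value
```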

Many calculation functions are vectorized

  • Example: change unit of age to decade (10 years)
  • What does this code give:
  • More general: iteration via apply functions (Tuesday) and the purrr package (Thursday)
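
A minimal sketch of vectorization, using a made-up age vector rather than the course data: the operation is applied to every element without writing a loop.

```r
age <- c(29, 2, 30, 25, 48)   # made-up ages in years

age / 10       # age in decades, computed for every element at once
sqrt(age)      # functions such as sqrt are vectorized as well
age - mean(age)   # each age minus the overall mean
```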

Objects

  • Everything in R is an object (data, functions, results from analysis, …)
  • Names for user created objects: patients, Data, abc, LetsHaveFun
  • Names are case-sensitive: Data is not the same as data
  • Not allowed:

    • space
    • characters with special meaning in R: “@”, “$”, “+” etc.
    • numbers allowed but not as first character
    • specific R programming constructs: for, if, while
  • Better avoid names that are R functions:

    sort, c, mean, t, data, q

  • “_” and “.” are allowed (e.g. sorted.results_file)

    Many prefer “_” over “.” to subdivide names

Remember

Objects and functions

  1. Everything that exists in R is an object (data, functions, …)

  2. Everything that happens in R is a function call

Modes

  • Data (variables) come in different types. The most important ones are:
    • numeric:
    • logical: values TRUE and FALSE
    • character:
    • You can change the mode of some data objects
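
As a sketch, the function `mode` reports the type of an object, and the `as.*` functions convert between types where that makes sense:

```r
x <- c(1.5, 2, 10)

mode(x)             # "numeric"
mode(x > 2)         # "logical"
mode("abc")         # "character"

as.character(x)     # "1.5" "2" "10"
as.numeric("3.5")   # 3.5
as.numeric(TRUE)    # 1
```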

Modes: logical (I)

  • TRUE and FALSE can be used for selection
  • Logical statements arise from comparing values:

    • Smaller/larger than: <, >, <=, >=
    • Exact equality: ==
    • Not equal to: !=
  • Converted to integer if a numeric value is required:

    TRUE equals 1, FALSE equals 0
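
A short sketch with a made-up fare vector, showing comparison, selection, and the conversion of logicals to numbers:

```r
fare <- c(211.3, 151.6, 26.0, 7.9, 0.0)   # made-up fares

fare > 50          # TRUE TRUE FALSE FALSE FALSE
fare == 0          # FALSE FALSE FALSE FALSE TRUE
fare[fare > 50]    # selection: 211.3 151.6

sum(fare > 50)     # number of fares above 50: 2
mean(fare > 50)    # proportion of fares above 50: 0.4
```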

Modes: logical (II)

We can calculate with logicals. Main operators are:

  • &: all must be true “AND”
  • |: at least one must be true “OR”
  • !: negation “NOT”

Logical statements about fare

Select fares between 10 and 40
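
A possible way to write this selection, assuming "between" includes the boundaries 10 and 40 and using a made-up fare vector:

```r
fare <- c(211.3, 151.6, 26.0, 7.9, 0.0)   # made-up fares

in_range <- fare >= 10 & fare <= 40   # both conditions must hold
in_range                              # FALSE FALSE TRUE FALSE FALSE
fare[in_range]                        # 26

# Equivalent formulation with OR and NOT
fare[!(fare < 10 | fare > 40)]        # 26
```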

RStudio

R and RStudio

R is a programming language for statistical computing and data visualization.

RStudio is software designed to make working with R easier by helping you create, edit, and manage R code and projects. More formally, it is known as an Integrated Development Environment (IDE).

Create Project in RStudio

An R project is a directory with an .Rproj file, signaling RStudio to manage the project settings accordingly.

The process of creating an R project is as follows:

  • In the menu File > New Project…

  • Choose Existing Directory

  • Click Browse and select your project directory

  • Click Create Project

Project structure

A minimal R project structure will have the following format

└── my_project
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    └── analysis.R 

Where

  • data folder contains the data to be analyzed

  • output stores code output (plots, figures, etc.)

  • analysis.R is the file containing R code. There can be multiple .R files in one project.

RStudio interface


Consists of 4 main panes

  • Upper left: shows the content of source code files and of datasets in the R session

  • Console: shows the executed code lines and their output

  • Environment: shows the currently defined objects

  • Files/Plots/Packages/Help/…:

    • Files: for file navigation

    • Plots: shows plot output

    • Packages: shows all installed packages and the packages being used (packages in use have a ✓)

    • Help: shows documentation for functions or packages

Packages

Packages, where to find them?

  • Need to be installed on your computer

  • R program itself is also based on packages. Some are loaded at startup: base, graphics, stats, utils

    E.g. base package contains sqrt function

  • Other packages need to be loaded before use; the best way is via the library function
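
A minimal sketch of loading a package. The package `splines` is used here only because it ships with every R installation but is not attached at startup; the package name in the commented `install.packages` line is just an example.

```r
search()                  # packages currently attached, e.g. "package:base"

base::sqrt(2)             # sqrt lives in the base package
library(splines)          # attach a package that is not loaded at startup
"package:splines" %in% search()   # TRUE once the package is attached

# Packages from CRAN must be installed once before they can be loaded:
# install.packages("lubridate")   # needs an internet connection
# library(lubridate)
```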

Data: import, inspection and management

Data import

  1. R: see R Data Import/Export

  2. RStudio: Environment pane: Import Dataset; or via File menu: Import Dataset
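
Importing can also be done in code. The sketch below writes a tiny made-up table to a temporary CSV file and reads it back, so that it runs without any course files. The commented Excel line assumes the readxl package is installed and uses a hypothetical file path.

```r
# Write a small made-up table to a temporary CSV file
csv_file <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, fare = c(7.25, 71.28, 8.05)),
          csv_file, row.names = FALSE)

# Import the CSV file into a data frame
passengers <- read.csv(csv_file)
str(passengers)

# Excel files: one option is read_excel from the readxl package
# titanic3 <- readxl::read_excel("data/raw/Titanic3.xlsx")   # hypothetical path
```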

Help

  • To know more about a function or dataset you can often use the help function
  • help(mean) gives

Description: Generic function for the (trimmed) arithmetic mean.

Usage: mean(x, …)

Default S3 method:

mean(x, trim = 0, na.rm = FALSE, ...)

Arguments:

x: an R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.

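trim: the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.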

na.rm: a logical evaluating to ‘TRUE’ or ‘FALSE’ indicating whether NA values should be stripped before the computation proceeds.

Value: If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one.

Help

  • Outline of a help page of a function is always the same:
    • Description: what does the function do
    • Usage: what arguments/input does the function expect
    • Arguments: description of the individual arguments
    • Value: what is the result of a function call
    • Often: Details, References, See Also
    • Examples. You can use the example function to run the examples: example(mean)

Missing data

  • Special value: NA (short for “not available”). The function is.na checks for missingness
  • Most functions exclude missings by default

    Not always what you want

    • table excludes missings; we can include them via argument useNA="always"
  • quantile gives an error and mean (and some others) returns NA if there are missings; specify argument na.rm=TRUE
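
A short sketch of how missing values behave, using made-up vectors:

```r
age <- c(29, NA, 30, 25, NA)                      # made-up ages
sex <- c("female", "male", NA, "male", "female")  # made-up values

is.na(age)                    # FALSE TRUE FALSE FALSE TRUE
sum(is.na(age))               # number of missing values: 2

mean(age)                     # the missing value propagates: NA
mean(age, na.rm = TRUE)       # mean of the observed values: 28
quantile(age, na.rm = TRUE)   # without na.rm = TRUE this call stops with an error

table(sex)                    # missings are silently dropped
table(sex, useNA = "always")  # adds a column counting the NA values
```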

Data inspection and basic variable summaries

  • View: opens a spreadsheet-style data viewer

    In RStudio: click on the name of a data object in the Environment tab

  • dim: gives the number of rows and columns

  • head: shows the first rows of a data frame

  • tail: similar to head but shows the last rows

  • str: compact display of the internal structure of an R object

Variable summaries

  • summary: concise summary of each variable
  • mean: mean value
  • quantile: quantiles
  • IQR: inter-quartile range
  • table: tabulation of one or two categorical variables
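
The inspection and summary functions from the last two slides, sketched on a small made-up stand-in for the titanic data:

```r
# Stand-in data; values are made up for illustration
titanic <- data.frame(
  survived = c(1, 0, 0, 1, 0, 1),
  sex      = c("female", "male", "male", "female", "male", "female"),
  fare     = c(211.3, 7.9, 26.0, 151.6, 8.1, 78.0)
)

dim(titanic)            # number of rows and columns: 6 3
head(titanic, n = 3)    # first three rows
tail(titanic, n = 2)    # last two rows
str(titanic)            # compact overview of the structure
# View(titanic)         # spreadsheet-style viewer (interactive use only)

summary(titanic)        # summary of each variable
mean(titanic$fare)      # mean fare
quantile(titanic$fare)  # minimum, quartiles and maximum
IQR(titanic$fare)       # inter-quartile range
table(titanic$sex, titanic$survived)   # cross-tabulation of two variables
```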

Make exercises 3 and 4

Data found here: Titanic3.xlsx

Selection by name

Create a named vector:

Select by name:

  • Same for data sets (use dimnames because there are two dimensions):
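
A minimal sketch of selection by name. The passenger names and values are made up for illustration.

```r
# Named vector: each element gets a name
fare <- c(Anna = 211.3, Ben = 26.0, Cara = 7.9)
names(fare)                 # "Anna" "Ben" "Cara"
fare["Ben"]                 # select one element by name
fare[c("Anna", "Cara")]     # select several elements by name

# Data set: row names and column names together form the dimnames
passengers <- data.frame(
  age  = c(29, 25, 48),
  fare = c(211.3, 26.0, 7.9),
  row.names = c("Anna", "Ben", "Cara")
)
dimnames(passengers)        # list with the row names and the column names
passengers["Ben", "fare"]   # row and column selected by name: 26
```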

More efficient row selections: “subset”

  • subset function: subset(my.data, subset=...),

    with ... a logical condition:

subset(titanic, pclass %in% c("1st","2nd"))$survived

Note: %in% is another logical construct; it tests for each element whether it occurs in a given set of values

  • subset argument in functions with formula structure:
xtabs(~survived, data=titanic, subset=(sex=="male"))    

Warning

  • Don’t use “$” for column selection if the function has a data argument; the columns are then looked up in the data set automatically. Avoid:
xtabs(~titanic$survived, data=titanic, subset=(titanic$sex=="male"))

More efficient column selections

  • subset function: subset(my.data, select=...), with ... a selection of columns names:

    head(subset(titanic, select= c(sex,fare)))

    We can select a sequence of names via the : operator:

    head(subset(titanic, select= sex:fare))

    Or use - to exclude columns:

    head(subset(titanic, select= -(sex:fare)))
  • with function to avoid repeating the data set name:

    with(titanic, table(sex, survived))
    ## instead of table(titanic$sex, titanic$survived)

Give extra structure to categorical variables: factors

  • Define the ordering of the “levels” (useful in regression models)

    • Default is by alphabetical/numeric order

      table(titanic$sex)
    • Change the order

      titanic$sex <- factor(titanic$sex)
      levels(titanic$sex)
      titanic$sex <- factor(titanic$sex, levels=c("male","female"))
      table(titanic$sex)
  • To change naming of the levels

    titanic$sex <- factor(titanic$sex, labels=c("M","F"))
  • Good introduction to factors

Dates

  • Numeric value (units since time origin) with character representation

  • Origin:

    • SPSS: October 14, 1582 (seconds)
    • R: January 1st, 1970 (days)
    • Stata: January 1st, 1960 (days)
  • R is very flexible in conversion between textual date representations

  • as.Date: create date variable

    format: change display format

  • See page 5 of this reference card

  • Packages anytime and lubridate useful

  • A Comprehensive Introduction to Handling Date & Time
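
A short base R sketch of `as.Date` and `format`. The dates are those of the Titanic's departure and sinking; the format strings are standard `strptime` codes.

```r
sinking <- as.Date("1912-04-15")        # default input format is year-month-day
as.numeric(as.Date("1970-01-10"))       # days since the origin 1970-01-01: 9
as.numeric(sinking)                     # negative, because the date lies before the origin

# Read a date written in another textual format
departure <- as.Date("10/04/1912", format = "%d/%m/%Y")

# Change the display format
format(sinking, "%d/%m/%Y")             # "15/04/1912"

# Dates are numeric underneath, so we can calculate with them
sinking - departure                     # time difference of 5 days
```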

Make exercises 5, 6 and 7
