Introduction to R 2025, Day 2

Ronald Geskus and Thinh Ong Phuc

Learning goals

  • Day 1
    • Basic structure of the R program and language
    • Use the RStudio working environment
    • Import, inspect and manage a data set
      • Select variables (columns) and rows (observations)
      • Missing data, factors, dates
  • Day 2
    • Functions
    • Visualisations using graphics in base R
    • Object specific functions
    • Finding further information

Functions

Basic format

  • All actions are performed via functions: sqrt, mean, help, library, t.test, plot

  • Input: required and optional arguments, given within parentheses and separated by commas

    • required: need to be supplied

    • optional: have default values

      Beware of the order of arguments; required ones come first. Compare

      log(1000, 10) 
      log(10, 1000) 
  • Some functions have a special “argument”: ...
    Meaning: anything that makes sense. Example: the c and paste functions, which accept any number of values

  • Output: result of calculations (typically assigned to R object), graphics, help window, etc.

  • You can use functions within other functions, e.g. mean(c(3,6,8))
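The points above about argument order, default values and ... can be tried out directly. This is a minimal sketch; the object names a, b, named, natural and sentence are chosen here for illustration only.

```r
# Positional matching: the first argument is x, the second is base
a <- log(1000, 10)   # log of 1000 with base 10
b <- log(10, 1000)   # log of 10 with base 1000, a different question

# Named arguments can be given in any order
named <- log(base = 10, x = 1000)

# Leaving out the optional argument uses its default (natural logarithm)
natural <- log(1000)

# The ... argument of paste accepts any number of values
sentence <- paste("Day", 2, "of", "the", "course")

# Functions within functions
mean(c(3, 6, 8))
```

Naming the arguments is the safer habit once a call has more than one or two of them.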

Functions: the inside

  • The code of a function can be seen by typing its name without the parentheses
  • General structure:
FunctionName <- function(args) {
    lines of R code (collection of functions)
}

  • The last value is returned as output
  • You can write your own functions
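As an illustration of the general structure, here is a small self-written function. The name se_mean and its na.rm argument are hypothetical choices made for this sketch, not part of any package.

```r
# Standard error of the mean; na.rm is an optional argument with a default
se_mean <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]   # drop missing values if requested
  sd(x) / sqrt(length(x))        # the last value is returned as output
}

result <- se_mean(c(3, 6, 8, NA))
result

# Typing the name without parentheses prints the function code
se_mean
```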

Basic data visualisation

R graphics systems:

  • Base graphics (package graphics)
  • lattice package. Less popular nowadays. Not covered here.
  • ggplot2 package: Wednesday morning
  • Note: these systems cannot be mixed
  • Each of these systems has many packages with extensions

Basic plot in base graphics

Plot fare paid as a function of age

plot(titanic$age, titanic$fare)

Exercise 1: What would you like to change to make it good enough for publication?

  • Formula based specification
    • general form: dependent ~ independent
    plot(fare ~ age, data=titanic)
    • independent: variable names separated by operators (mostly + and *)
    • functions that take a formula (as first argument) have data and subset arguments
    • basis for specification of regression models

Some possible changes: change labels along axes, rotate numbers along y-axis, add regression line, rescale fare
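A sketch of the formula interface combined with some of these changes. It assumes the built-in cars data set (speed, dist) as a stand-in for titanic (age, fare), which is not assumed to be available here; the axis labels and the subset condition are illustrative only.

```r
# Formula interface: dependent ~ independent, with data and subset arguments
plot(dist ~ speed, data = cars,
     subset = speed > 5,                  # use only part of the rows
     xlab = "Speed (mph)",                # label along x-axis
     ylab = "Stopping distance (ft)",     # label along y-axis
     las = 1)                             # horizontal numbers along y-axis

# The same formula is the basis for a regression model
fit <- lm(dist ~ speed, data = cars)

# Low-level function: add the regression line to the existing plot
abline(fit, col = "red", lwd = 2)
```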

Arguments and parameters that can be changed

  • Within function only
    • xlab, ylab, main
    • type
    • xlim, ylim, log
  • Also via par function
    • cex.axis, cex.lab, las
    • cex, pch
    • col, lty, lwd
  • Only via par function
    • mar, mfcol, mfrow
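A minimal sketch of how these settings combine, again assuming the built-in cars data as a stand-in for titanic. Saving and restoring the old settings via old_par is a common habit rather than a requirement.

```r
# Settings that can only be changed via par: two plots side by side, smaller margins
old_par <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), las = 1)

# Settings given within the plot function itself
plot(dist ~ speed, data = cars, pch = 16, col = "blue",
     main = "Linear axes")
plot(dist ~ speed, data = cars, log = "y", type = "b",
     main = "Log scale on y-axis")

# Restore the previous settings
par(old_par)
```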

Some important functions for plotting

  • High level plot functions
    • plot
    • hist
    • boxplot
  • Low level plot functions (add to existing plot)
    • points
    • lines, abline
    • text, legend, title, arrows, segments
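The difference between the two kinds of functions can be sketched as follows, with the built-in cars data as an assumed stand-in: a high-level function starts a new plot, and low-level functions add to it.

```r
# High-level: creates a new plot (the result is stored for later inspection)
h <- hist(cars$speed, xlab = "Speed (mph)", main = "Histogram of speed")

# Low-level: add a vertical line at the mean and a legend to the existing plot
abline(v = mean(cars$speed), lty = 2, lwd = 2)
legend("topright", legend = "mean speed", lty = 2, lwd = 2)
```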

Other plots

  1. Plot fare paid as a function of age
plot(titanic$age, titanic$fare)

Exercise 2. Make a histogram of age

Exercise 3.a. Make a boxplot of fare by passenger class.

Exercise 3.b. There are high values for fare that distort the plot. Therefore, show only fares up to 300 British pounds.

Export: two types of formats

  • Vector format (pdf, eps, wmf, emf, svg)
    • digital image consisting of independent geometric objects (segments, polygons, curves, etc.)
    • can be enlarged without losing resolution
  • Raster format (png, jpeg, tiff, bmp)
    • rectangular grid of pixels, possibly with color
    • resolution impaired if image is enlarged
  • Graphics can be saved via the menu in the graphics/plots window, or a specific graphics file type can be created directly via code
pdf("foo.pdf",...)
...
dev.off()

or win.metafile("foo.wmf",...) (Windows only) or png("foo.png",...), each ending with dev.off()
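A complete version of the pdf pattern might look like this. The file location in the temporary folder and the width and height values are assumptions made for the sketch, and cars again stands in for titanic.

```r
# Open a pdf graphics device; width and height are in inches
out_file <- file.path(tempdir(), "foo.pdf")
pdf(out_file, width = 6, height = 4)

# Everything plotted now goes into the file, not to the screen
plot(dist ~ speed, data = cars)

# Close the device; only then is the file complete
dev.off()

file.exists(out_file)
```

Forgetting dev.off() is a common reason for an empty or unreadable file.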

Some further topics about R

Structure of R program

  • Everything created and stored is an object: data, functions, and results of analyses if assigned within the R session
  • Environment: a collection of objects that is accessible in the R session
    1. Workspace: objects we create/import in RStudio: Global Environment in Environment pane

    2. Packages with existing functions and data sets: base, stats, graphics

    3. When a package is loaded in the R session, a new environment is created

      Listing of environments shown in RStudio via drop-down list under Global Environment in Environment pane

  • Hierarchical structure of environments; needed for dealing with masked objects due to duplicate names
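The hierarchy and the masking of objects can be inspected from the console. This is a sketch only: redefining mean is done here purely to demonstrate masking, and the names masked_result and base_result are hypothetical.

```r
# Environments that R searches, in order; the workspace comes first
search()

# Where does the function mean live?
environmentName(environment(mean))

# An object with the same name in the workspace masks the one from base
mean <- function(x) "my own mean"
masked_result <- mean(c(3, 6, 8))        # uses our own version
base_result <- base::mean(c(3, 6, 8))    # package::name picks a specific one

# Remove our version, so that the one from base is found again
rm(mean)
```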

R resembles operating system

R                             Operating system
objects                       files
Workspace                     current folder
environments                  folders in “path” variable
Environment pane in RStudio   Explorer window

Object specific functionality (I)

  • Output of function depends on class of object

Examples:

  • summary:
    summary.data.frame, summary.factor, summary.Date
    default: summary.default

  • plot: plot(y ~ x, data=..., ...)

     Plot depends on class of `x`: scatterplot is default, but
     boxplot if `x` is of class factor and plot by year if `x` is of class Date
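A small sketch of this behaviour. The vectors below are made-up values for illustration, not taken from the titanic data.

```r
# Three objects of different classes (hypothetical values)
x_num  <- c(7.25, 71.28, 8.05, 53.10)
x_fac  <- factor(c("1st", "3rd", "3rd", "2nd"))
x_date <- as.Date(c("1912-04-10", "1912-04-15"))

class(x_num)
class(x_fac)
class(x_date)

# The same function gives different output per class
summary(x_num)    # quantiles and mean (summary.default)
summary(x_fac)    # counts per level (summary.factor)
summary(x_date)   # summary in date format (summary.Date)

# A factor on the right-hand side of the formula gives boxplots
plot(x_num ~ x_fac)
```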

Object specific functionality (II)

Why important to know?

  • Finding information in help file
  • Very efficient in use. E.g. plot is a generic function, with specific methods depending on the class of the object that is the argument of the function.
    We only need to remember plot
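To find the help page that belongs to a specific class, it can help to list the available methods first. A minimal sketch; the exact list depends on which packages are loaded.

```r
# All methods of the generic function summary that R currently knows
summary_methods <- methods(summary)
summary_methods

# Help for the method that is used on a data.frame
help("summary.data.frame")
```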

Another object mode: list

  • General mode to store information
  • Modes cannot be mixed in vectors
  • Modes can be mixed in a list:

Two components, with names teacher and room

  • Data sets are of mode list, but with the special class data.frame
mode(titanic)
class(titanic)
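The points about vectors, lists and data frames can be sketched as follows. The component names teacher and room come from the text above; the room number is a hypothetical value, and the built-in iris data set stands in for titanic.

```r
# Modes cannot be mixed in a vector: everything is converted to character
mixed_vector <- c(1, "a", TRUE)
mode(mixed_vector)

# Modes can be mixed in a list (room number is a made-up value)
course <- list(teacher = c("Ronald", "Thinh"), room = 101)
names(course)
course$teacher      # select a component by name
course[["room"]]    # alternative way to select a component

# A data set is a list with a special class
mode(iris)
class(iris)
```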

Finding Information

Exercise 4

Find information on

  • descriptive statistics
  • change several categorical variables into factor
  • recognize nonstandard date formats

The apply functions

  • apply: apply a function over all columns or all rows of a matrix or data.frame (a data.frame is first converted to a matrix)

  • lapply: apply a function over a list

  • sapply: similar to lapply but more user-friendly if output can be coerced into a vector

  • tapply: can be used to split a vector in subgroups and apply a function to each of the subgroups
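The four functions side by side, as a minimal sketch. The built-in iris data set is assumed as a stand-in for titanic, and the result names are illustrative.

```r
# apply: function over rows (1) or columns (2) of a matrix
m <- as.matrix(iris[, 1:4])
col_means <- apply(m, 2, mean)

# lapply: function over each component of a list; always returns a list
ranges <- lapply(iris[, 1:4], range)

# sapply: same idea, but the output is simplified to a vector if possible
is_num <- sapply(iris, is.numeric)

# tapply: split a vector into subgroups and apply a function to each
median_by_species <- tapply(iris$Sepal.Length, iris$Species, median)

col_means
is_num
median_by_species
```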

Exercise 5

  • Check the mode of each of the variables in the titanic data set using sapply and mode
  • Make a summary of fare by passenger class using the tapply function