Become familiar and confident with the R program, using RStudio
Not: using R for statistics, bioinformatics or mathematical modeling
On R Project website: “a language and environment for statistical computing and graphics”
Free: no money charged and open source
Runs on all major operating systems
Very flexible and powerful
Steep learning curve(?)
Most analyses are performed by writing code in a file.
Classical format: script file with “.R” extension.
Newer formats: R Markdown (“.Rmd”) and Quarto (“.qmd”)
Many mathematical operations (+, -, *, /, ^ or **) pre-defined
Add numbers 2 and 7:
Multiply 2 and 7:
Divide 7 by 2:
Compute \(2^7\):
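The four calculations above can be typed directly in the Console. A minimal sketch; the comments show the values R prints:

```r
# Basic arithmetic with R's pre-defined operators
2 + 7    # add 2 and 7: 9
2 * 7    # multiply 2 and 7: 14
7 / 2    # divide 7 by 2: 3.5
2^7      # 2 to the power 7: 128
2**7     # ** is an alternative notation for ^: 128
```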
Some operations based on functions: sqrt, log, log2, log10, sum, prod
Compute the square root of 2:
Sum of numbers 1 to 5:
Take the 10-log of 1000 (y with \(10^y=1000\)):
Some constants (\(\pi\), …) pre-defined:
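A short sketch of the function-based operations listed above. Note that log without further arguments is the natural logarithm; the base argument shown here is one way to obtain other bases:

```r
# Operations that are written as function calls
sqrt(2)               # square root of 2: 1.414214
sum(1:5)              # 1 + 2 + 3 + 4 + 5 = 15
prod(1:5)             # 1 * 2 * 3 * 4 * 5 = 120
log10(1000)           # 3, because 10^3 = 1000
log(1000, base = 10)  # same result via the base argument of log
log2(8)               # 3, because 2^3 = 8
pi                    # pre-defined constant: 3.141593
```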
Assign value 2 to object x and show content of x
Without assignment, x keeps old value
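A small sketch of assignment with the `<-` operator. It illustrates that a calculation on x does not change x unless the result is assigned again:

```r
x <- 2       # assign the value 2 to the object x
x            # show the content of x: 2
x + 5        # prints 7, but the result is not stored
x            # x still holds the old value: 2
x <- x + 5   # only an assignment changes x
x            # now 7
```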
For now, we work with subset of titanic data set:
Select a variable with “$”, e.g. titanic$age
Select elements with [ ]
Select age of passengers 6 and 10
Select age of passengers 6 to 10
Flexibility of seq function
Data frames: dataset[rows, cols]
Columns: selection by name of columns
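The selections above can be sketched as follows. The course data are not reproduced here, so this example assumes a small stand-in data frame with made-up values (not the real Titanic records); the column id is hypothetical, age and fare follow the names used in the text:

```r
# Hypothetical stand-in for the titanic subset: 10 passengers, made-up values
titanic <- data.frame(
  id   = 1:10,
  age  = c(22, 35, 4, 58, 41, 19, 63, 30, 27, 50),
  fare = c(7, 52, 21, 80, 13, 8, 30, 26, 10, 75)
)

titanic$age                       # the whole age column
titanic$age[c(6, 10)]             # age of passengers 6 and 10
titanic$age[6:10]                 # age of passengers 6 to 10

# seq gives more control over the index than the : operator
titanic$age[seq(from = 6, to = 10, by = 2)]   # passengers 6, 8 and 10
seq(from = 0, to = 1, length.out = 5)         # 0.00 0.25 0.50 0.75 1.00

# dataset[rows, cols]: rows by position, columns by name
titanic[1:3, c("age", "fare")]
titanic[, "fare"]                 # empty row index: all rows
```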
Create a vector vec that consists of the numbers from 11 to 30
Remark: Exercise e. is not an easy exercise. We advise you to do this in two steps. First create a vector index that selects the odd numbers in vec, using the function seq. You may need to consult the help page of seq via help(seq).
Use the elementary functions /, -, ^ and the function sum to calculate the mean \[\bar{x}=\sum_{i=1}^n x_i/n\] and the standard deviation \[\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2/(n-1)}\] of the fare paid by the titanic passengers. Note: \(n=10\).
Verify your answer by using the built-in R functions mean and sd.
A function is how we tell the computer to do work for us
Think of it like asking someone to complete a task for you
Example
Thinh told Ronald: “Please give feedback on my measles model manuscript before next Friday”.
How does this translate into R?
give_feedback(type = "manuscript", which = "measles model", deadline = "next Friday")
Everything we do in R boils down to running a function:
Generates some output given some input:
goal(input1, input2, ...)
Output: a value (often assigned to an object for further use); a figure; a help page, …
E.g. help(mean): mean is the argument of the function help. The output is a help page with information on the mean function
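The pattern goal(input1, input2, ...) looks like this for two built-in functions. A minimal sketch; round and seq are chosen only as illustrations:

```r
# The function name states the goal, the arguments are the inputs
round(3.14159, digits = 2)               # output printed in the Console: 3.14

# Output assigned to an object for further use
rounded <- round(3.14159, digits = 2)
rounded * 2                              # 6.28

# Arguments can be matched by name or by position
seq(from = 1, to = 9, by = 2)            # 1 3 5 7 9
seq(1, 9, 2)                             # same call, arguments by position

# help(mean)   # output is a help page (opens in the Help pane in RStudio)
```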
Examples: patients, Data, abc, LetsHaveFun
Case-sensitive: Data is not the same as data
Not allowed:
Special characters such as “@”, “$”, “+” etc.
Reserved words: for, if, while …
Better avoid names that are R functions:
sort, c, mean, t, data, q
“_” and “.” are allowed (e.g. sorted.results_file)
Many prefer “_” over “.” to subdivide names
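A brief sketch of these naming rules. All object names below are made up for illustration; the commented-out lines would fail if they were run:

```r
patients <- 3
Patients <- 5                 # a different object: names are case-sensitive
patients == Patients          # FALSE

sorted.results_file <- "results.csv"   # "_" and "." are allowed in names
n_patients <- 10                       # "_" used to subdivide the name

# Not allowed (kept as comments because they give errors):
# for <- 1          # "for" is a reserved word
# fare+tax <- 1     # read as the calculation fare + tax, not as one name

# Allowed but better avoided: the name of an existing R function
# mean <- 3         # would work, but makes code that uses mean() confusing
```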
Objects and functions
Everything that exists in R is an object (data, functions, …)
Everything that happens in R is a function call
Logical values: TRUE and FALSE
TRUE and FALSE can be used for selection
Logical statements arise from comparing values:
<, >, <=, >=, ==, !=
Converted to integer if a numeric value is required:
TRUE equals 1, FALSE equals 0
We can calculate with logicals. Main operators are:
&: all must be true “AND”
|: at least one must be true “OR”
!: negation “NOT”
Logical statements about fare
Select fares between 10 and 40
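The logical operators can be tried on a vector of fares. A sketch assuming made-up fare values (not the course data), with "between 10 and 40" read as including both limits:

```r
# Hypothetical fares of 10 passengers (made-up values)
fare <- c(7, 52, 21, 80, 13, 8, 30, 26, 10, 75)

fare > 40                       # one TRUE/FALSE per passenger
sum(fare > 40)                  # TRUE counts as 1, FALSE as 0: 3 fares above 40

fare >= 10 & fare <= 40         # AND: both conditions must be true
fare[fare >= 10 & fare <= 40]   # selection with a logical vector: 21 13 30 26 10

fare < 10 | fare > 70           # OR: at least one condition must be true
!(fare > 40)                    # NOT: same as fare <= 40
```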
R is a programming language for statistical computing and data visualization.
RStudio is software designed to make working with R easier by helping you create, edit, and manage R code and projects. More formally, it is known as an Integrated Development Environment (IDE).
An R project is a directory with an .Rproj file, signaling RStudio to manage the project settings accordingly.
The process of creating an R project is as follows:
In the menu File > New Project…
Click Existing Directory > Browse
Click Browse and click on your project directory
Click Create Project
A minimal R project structure will have the following format
└── my_project
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    └── analysis.R
where:
data folder contains the data to be analyzed
output stores code output (plots, figures, etc.)
analysis.R is the file containing R code. There can be multiple .R files in one project.
The RStudio window consists of 4 main panes:
Upper left (Source): shows the content of source code files and of datasets in the R session
Console: shows the executed code lines and their output
Environment: shows the currently defined variables
Files/Plots/Packages/Help/…:
Files: for file navigation
Plots: shows plot output
Packages: shows all installed packages and the packages in use (packages in use have a ✓)
Help: shows documentation for functions or packages
Regular R users transform recurring tasks into self-made functions
Can make it into a package, i.e. a collection of functions (and data):
Give package developers appropriate credit and cite their packages
Need to be installed on your computer
The R program itself is also based on packages. Some are loaded at startup: base, graphics, stats, utils …
E.g. the base package contains the sqrt function
Other packages need to be loaded before use; the best way is via the library function
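A sketch of the package workflow described above. The tools package is used only because it ships with R but is not loaded at startup; the install.packages line is commented out because it needs an internet connection, and readxl is just an example package name:

```r
# One-time installation from CRAN (commented out: needs internet access)
# install.packages("readxl")

# Functions from packages loaded at startup are available immediately
sqrt(2)          # sqrt lives in the base package
base::sqrt(2)    # the same call, with the package named explicitly

# Other packages are loaded per session with library
library(tools)
file_ext("analysis.R")   # function from tools: "R"

search()         # lists the currently attached packages
citation()       # how to cite R; citation("pkg") does the same for a package
```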
R: see R Data Import/Export
RStudio: Environment pane: Import Dataset; or via File menu: Import Dataset
Text and CSV (base R or readr package)
Excel (readxl package)
SPSS/Stata/SAS (haven package)
Imported as a “tibble” (enhancement of data.frame)
Database: MS Access, SQLite, …
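A minimal sketch of importing a CSV file with base R. To keep it self-contained, the example first writes a tiny temporary file with made-up values; the file names in the commented lines are assumptions about where your own files are stored:

```r
# Write a small CSV file to a temporary location (made-up values)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("age,fare", "22,7", "35,52", "4,21"), csv_file)

# Import it as a data.frame; the first line is used for the column names
passengers <- read.csv(csv_file)
str(passengers)

# Other formats need their package to be installed first, e.g.
# titanic <- readxl::read_excel("Titanic3.xlsx")   # Excel, returns a tibble
# spss_data <- haven::read_sav("my_file.sav")      # SPSS (hypothetical file)
```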
help function
help(mean) gives
Description: Generic function for the (trimmed) arithmetic mean.
Usage: mean(x, ...)
Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
Arguments:
x: an R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.
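trim: the fraction (0 to 0.5) of observations to be trimmed from each end of x before the mean is computed.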
na.rm: a logical evaluating to ‘TRUE’ or ‘FALSE’ indicating whether NA values should be stripped before the computation proceeds.
Value: If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one.
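The arguments listed on the help page can be tried on a small vector. A sketch with made-up numbers that shows the effect of trim:

```r
x <- c(1, 2, 3, 100)    # made-up values with one extreme observation

mean(x)                 # trim = 0 (the default): arithmetic mean, 26.5
mean(x, trim = 0.25)    # drops 25% of the values at each end, mean of 2 and 3: 2.5

args(mean)              # quick look at the arguments without opening the help page
```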
example function to run the examples: example(mean)
Missing values are coded as NA (short for “not available”)
The function is.na checks for missingness
Most functions exclude missings by default
Not always what you want
table excludes missings; we can include them via argument useNA="always"
mean (and some others) return NA and quantile gives an error if there are missings; specify argument na.rm=TRUE
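How missing values behave can be checked on a small vector. A sketch with made-up ages, two of them missing:

```r
age <- c(22, NA, 4, 58, NA)      # made-up values, NA marks a missing age

is.na(age)                       # FALSE TRUE FALSE FALSE TRUE
sum(is.na(age))                  # number of missing values: 2

mean(age)                        # NA, because of the missing values
mean(age, na.rm = TRUE)          # missings removed first: 28

table(age)                       # NA is not shown
table(age, useNA = "always")     # adds a count for NA

# quantile(age)                  # would stop with an error
quantile(age, na.rm = TRUE)      # works after removing the missings
```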
View: opens a spreadsheet-style data viewer
In RStudio: click on the name of a data object in the Environment tab
dim: gives the number of rows and columns
head: shows the first rows of a data frame
tail: similar to head but shows the last rows
str: compact display of the internal structure of an R object
Variable summaries
summary: concise summary of each variable
mean: mean value
quantile: quantiles
IQR: inter-quartile range
table: tabulation of one or two categorical variables
Data found here: Titanic3.xlsx
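The inspection and summary functions can be tried on any data frame. A sketch assuming a small stand-in data frame with made-up values (not the real Titanic3.xlsx data); the columns id and pclass are hypothetical:

```r
# Hypothetical stand-in data frame (made-up values)
titanic <- data.frame(
  id     = 1:10,
  age    = c(22, 35, 4, 58, 41, 19, 63, 30, 27, 50),
  fare   = c(7, 52, 21, 80, 13, 8, 30, 26, 10, 75),
  pclass = c(3, 1, 2, 1, 3, 3, 2, 2, 3, 1)
)

dim(titanic)            # number of rows and columns: 10 4
head(titanic, n = 3)    # first 3 rows
tail(titanic, n = 3)    # last 3 rows
str(titanic)            # compact overview of the structure

summary(titanic)        # summary of each variable
mean(titanic$fare)      # mean fare
quantile(titanic$fare)  # minimum, quartiles and maximum
IQR(titanic$fare)       # distance between 3rd and 1st quartile
table(titanic$pclass)   # number of passengers per class

# View(titanic)         # spreadsheet-style viewer (interactive use only)
```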
Create a named vector:
Select by name:
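A short sketch of a named vector and selection by name. The names and values are made up for illustration:

```r
# Create a named vector: each element gets a label
fare <- c(anna = 7, ben = 52, carla = 21)
names(fare)                  # "anna" "ben" "carla"

# Select by name instead of by position
fare["ben"]                  # element labelled ben: 52
fare[c("anna", "carla")]     # several names at once
```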
Selection by name also works for data frames (dimnames because there are two dimensions):
subset function: subset(my.data, subset=...), with ... a logical condition:
Note: %in% is another logical construct
subset argument in functions with formula structure:
subset function: subset(my.data, select=...), with ... a selection of column names:
We can select a sequence of names via the : operator:
Or use - to exclude columns:
with function to avoid repeating the data set name:
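The selection tools above can be sketched on a small stand-in data frame. The values are made up (not the course data), the columns id and pclass are hypothetical, and lm is used only as one example of a function with a formula and a subset argument:

```r
# Hypothetical stand-in data frame (made-up values)
titanic <- data.frame(
  id     = 1:10,
  age    = c(22, 35, 4, 58, 41, 19, 63, 30, 27, 50),
  fare   = c(7, 52, 21, 80, 13, 8, 30, 26, 10, 75),
  pclass = c(3, 1, 2, 1, 3, 3, 2, 2, 3, 1)
)

# Rows and columns selected by their names (dimnames)
titanic[c("1", "3"), c("age", "fare")]

# Rows that satisfy a logical condition
subset(titanic, subset = age > 50)
subset(titanic, subset = pclass %in% c(1, 2))   # %in%: value is one of a set

# subset argument in a function with a formula
fit <- lm(fare ~ age, data = titanic, subset = age > 18)

# Columns by name, as a sequence of names, or by exclusion
subset(titanic, select = c(age, fare))
subset(titanic, select = age:pclass)
subset(titanic, select = -id)

# with: refer to the columns without repeating "titanic$"
with(titanic, mean(fare[age > 50]))   # mean fare of passengers older than 50: 55
```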
Define the ordering of the “levels” (useful in regression models)
Default is by alphabetical/numeric order
Change the order
Change the naming of the levels
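These steps can be sketched with the factor function. The variable size and its values are made up; the levels and labels arguments are one common way to set the order and the naming:

```r
size <- c("small", "large", "medium", "small")   # made-up categorical values

# Default: levels in alphabetical order
f_default <- factor(size)
levels(f_default)        # "large" "medium" "small"

# Change the order by stating the levels explicitly
f_ordered <- factor(size, levels = c("small", "medium", "large"))
levels(f_ordered)        # "small" "medium" "large"

# Change the naming of the levels via labels (same order as levels)
f_renamed <- factor(size,
                    levels = c("small", "medium", "large"),
                    labels = c("S", "M", "L"))
f_renamed                # S L M S
```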
Numeric value (units since time origin) with character representation
Origin:
R is very flexible in conversion between textual date representations
as.Date: create date variable
format: change display format
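A sketch of the date handling described above. The dates are chosen for illustration; the format codes (%d, %m, %Y) are those documented in help(strptime):

```r
# A date is stored as a number: days since the origin 1970-01-01
as.numeric(as.Date("1970-01-01"))      # 0
as.numeric(as.Date("1970-01-10"))      # 9
as.Date(0, origin = "1970-01-01")      # back from number to date

# as.Date: create a date variable from text
d <- as.Date("1912-04-15")                           # default format: year-month-day
d2 <- as.Date("15/04/1912", format = "%d/%m/%Y")     # other formats must be described
d == d2                                              # TRUE

# format: change the display format (the result is text)
format(d, "%d/%m/%Y")    # "15/04/1912"
format(d, "%Y/%m/%d")    # "1912/04/15"

# Because dates are numbers, we can calculate with them
d + 30                   # 30 days later: "1912-05-15"
```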
See page 5 of this reference card