Become familiar and confident with the R program, using RStudio
Not: using R for statistics, bioinformatics or mathematical modeling
On R Project website: “a language and environment for statistical computing and graphics”
Free: no money charged and open source
Runs on all major operating systems
Very flexible and powerful
Steep learning curve(?)
Most analyses are performed by writing code in a file.
Classical format: script file with “.R” extension.
Newer formats: R Markdown (“.Rmd”) and Quarto (“.qmd”)
Many mathematical operations (+, -, *, /, ^ or **) pre-defined
Add numbers 2 and 7:
Multiply 2 and 7:
Divide 7 by 2:
Compute \(2^7\):
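The four prompts above can be typed directly at the console. A minimal sketch:

```r
2 + 7   # addition: 9
2 * 7   # multiplication: 14
7 / 2   # division: 3.5
2^7     # exponentiation: 128 (2**7 gives the same result)
```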
Some operations based on functions: sqrt, log, log2, log10, sum, prod
Compute the square root of 2:
Sum of numbers 1 to 5:
Take the base-10 logarithm of 1000 (the y with \(10^y=1000\)):
Some constants (\(\pi\), …) pre-defined:
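A short sketch of the function-based operations and a pre-defined constant, using only base R:

```r
sqrt(2)       # square root of 2
sum(1:5)      # 1 + 2 + 3 + 4 + 5 = 15
log10(1000)   # 3, because 10^3 = 1000
pi            # pre-defined constant
```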
Assign value 2 to object x and show content of x
Without assignment, x keeps old value
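A minimal sketch of assignment. The second half illustrates that a computation without assignment leaves x unchanged:

```r
x <- 2        # assign the value 2 to the object x
x             # show the content of x: 2
x + 5         # computes 7, but the result is not stored
x             # still 2, because nothing was assigned
x <- x + 5    # only an assignment changes x
x             # now 7
```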
For now, we work with a subset of the titanic data set:
Select a column via “$”, e.g. titanic$age
Select elements via [ ]
Select age of passengers 6 and 10
Select age of passengers 6 to 10
Flexibility of the seq function
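The selections above could look as follows. This is a sketch only: the data frame below is a hypothetical stand-in for the course data, with ages invented for illustration.

```r
# Hypothetical stand-in for the course data: ages invented for illustration
titanic <- data.frame(age = c(29, 2, 30, 25, 48, 63, 39, 53, 71, 47))

titanic$age[c(6, 10)]             # passengers 6 and 10
titanic$age[6:10]                 # passengers 6 to 10
titanic$age[seq(6, 10)]           # same selection via seq
titanic$age[seq(1, 10, by = 2)]   # every second passenger: 1, 3, 5, 7, 9
```

The by argument is what makes seq more flexible than the : operator.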
dataset[rows, cols]
Columns: selection by name of columns
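A sketch of the dataset[rows, cols] pattern. The data frame is a hypothetical stand-in and all its values are invented for illustration.

```r
# Hypothetical stand-in; all values invented for illustration
titanic <- data.frame(
  age  = c(29, 2, 30, 25, 48),
  sex  = c("female", "male", "female", "male", "female"),
  fare = c(211.3, 151.6, 151.6, 26.6, 77.9)
)

titanic[2, ]                     # row 2, all columns
titanic[, "age"]                 # all rows, column age (returned as a vector)
titanic[1:3, c("age", "fare")]   # rows 1 to 3, columns age and fare
```

Leaving the part before or after the comma empty means "all rows" or "all columns".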
Create a vector vec that consists of the numbers from 11 to 30.
Remark: Exercise e. is not an easy exercise. We advise you to do this in two steps. First create a vector index that selects the odd numbers in vec, using the function seq. You may need to consult the help page of seq via help(seq).
Use the elementary functions /, -, ^ and the function sum to calculate the mean \[\bar{x}=\sum_{i=1}^n x_i/n\] and the standard deviation \[\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2/(n-1)}\] of the fare paid by the titanic passengers. Note: \(n=10\).
Verify your answer by using the built-in R functions mean and sd.
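To see how the two formulas map onto R code, here is a sketch on an invented vector (not the exercise data). The square root is written as a power of 0.5 so that only the listed operators are needed.

```r
# Invented values, not the exercise data
x <- c(4, 8, 15, 16, 23, 42)
n <- length(x)

x_bar <- sum(x) / n                            # mean
s <- (sum((x - x_bar)^2) / (n - 1))^0.5        # standard deviation

all.equal(x_bar, mean(x))   # compare with the built-in mean
all.equal(s, sd(x))         # compare with the built-in sd
```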
A function is how we tell the computer to do work for us
Think of it like asking someone to complete a task for you
Example
Thinh told Ronald: “Please give feedback on my measles model manuscript before next Friday”.
How does this translate into R?
give_feedback(type = "manuscript", which = "measles model", deadline = "next Friday")
Everything we do in R boils down to running a function:
Generates some output given some input:
goal(input1, input2, ...)
Output: a value (often assigned to an object for further use); a figure; a help page, …
E.g. help(mean): mean is the argument of the function help. The output is a page with information on the mean function
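A small sketch of the input/output pattern with two base R functions. The object name y is chosen only for illustration.

```r
y <- sqrt(16)                # input: 16; output: a value, assigned to y
y                            # 4
round(3.14159, digits = 2)   # arguments can be passed by name: 3.14
```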
patients, Data, abc, LetsHaveFun
Data is not the same as data
Not allowed:
“@”, “$”, “+” etc.
for, if, while …
Better avoid names that are R functions:
sort, c, mean, t, data, q
“_” and “.” are allowed (e.g. sorted.results_file)
Many prefer “_” over “.” to subdivide names
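A brief sketch of these naming rules. All object names below are made up for illustration.

```r
patients <- 10
Patients <- 20                 # a different object: names are case-sensitive
patients == Patients           # FALSE

sorted_results <- c(1, 2, 3)   # "_" subdivides the name
sorted.results <- c(1, 2, 3)   # "." is allowed as well
```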
Objects and functions
Everything that exists in R is an object (data, functions, …)
Everything that happens in R is a function call
TRUE and FALSE
TRUE and FALSE can be used for selection
Logical statements arise from comparing values:
<, >, <=, >=
==
!=
Converted to integer if a numeric value is required:
TRUE equals 1, FALSE equals 0
We can calculate with logicals. Main operators are:
&: all must be true “AND”
|: at least one must be true “OR”
!: negation “NOT”
Logical statements about fare
Select fares between 10 and 40
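A sketch of logical statements used for selection. The fare values are invented for illustration and are not the course data.

```r
# Invented fares for illustration
fare <- c(7.25, 71.28, 13.00, 30.50, 8.05, 151.55)

fare > 10                     # one TRUE/FALSE per fare
fare > 10 & fare < 40         # both conditions must hold
fare[fare > 10 & fare < 40]   # select fares between 10 and 40
sum(fare > 10)                # TRUE counts as 1: number of fares above 10
!(fare > 10)                  # negation
```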
R is a programming language for statistical computing and data visualization.
RStudio is software designed to make working with R easier by helping you create, edit, and manage R code and projects. More formally, it is known as an Integrated Development Environment (IDE).
An R project is a directory containing an .Rproj file, which signals RStudio to manage the project settings accordingly.
The process of creating an R project is as follows:
In the menu: File > New Project…
Click Existing Directory > Browse
Click Browse and select your project directory
Click Create Project
A minimal R project structure will have the following format
└── my_project
    ├── output
    ├── data
    │   ├── raw
    │   └── processed
    └── analysis.R
Where
data folder contains the data to be analyzed
output stores code output (plots, figures, etc.)
analysis.R is the file containing R code. There can be multiple .R files under one project.
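The same skeleton could also be created from R itself. The sketch below builds it inside a temporary directory, which is an assumption made only to keep the example self-contained. In practice you would use your own project location.

```r
# Build the skeleton in a temporary directory (assumption for this sketch)
root <- file.path(tempdir(), "my_project")

dir.create(file.path(root, "data", "raw"), recursive = TRUE)
dir.create(file.path(root, "data", "processed"))
dir.create(file.path(root, "output"))
file.create(file.path(root, "analysis.R"))

# List everything that was created
list.files(root, recursive = TRUE, include.dirs = TRUE)
```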
RStudio consists of 4 main panes
Upper left: shows the content of source code files and of datasets in the R session
Console: shows the executed code lines and their output
Environment: shows the currently defined variables
Files/Plots/Packages/Help/…:
Files: for file navigation
Plots: shows plot output
Packages: shows all installed packages and the packages being used (packages in use have a ✓)
Help: shows documentation for functions or packages
Regular R users transform recurring tasks into self-made functions
These can be bundled into a package, i.e. a collection of functions (and data):
Give package developers appropriate credit and cite their packages
Packages need to be installed on your computer
The R program itself is also based on packages. Some are loaded at startup: base, graphics, stats, utils, …
E.g. the base package contains the sqrt function
Other packages need to be loaded before use; the best way is via the library function
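A minimal sketch of loading a package. It uses tools, a package that ships with R but is not attached at startup, so nothing has to be installed. The file names are made up for illustration.

```r
# install.packages("readxl")   # one-time installation from CRAN (not run here)

library(tools)                 # attach the package for this session
file_ext("analysis.R")         # "R"

tools::file_ext("notes.qmd")   # package::function works without library()
```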
R: see R Data Import/Export
RStudio: Environment pane: Import Dataset; or via File menu: Import Dataset
text and csv (base or readr package)
Excel (readxl package)
SPSS/Stata/SAS (haven package)
Imported as a “tibble” (enhancement of data.frame)
Database: MS Access, SQLite, …
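A sketch of importing a csv file with base R. To keep it self-contained, the example first writes a small temporary file with invented values. The commented lines only indicate the analogous package functions, and "file.sav" is a hypothetical file name.

```r
# Write a small example file first (values invented for illustration)
csv_file <- tempfile(fileext = ".csv")
writeLines(c("age,fare", "29,211.3", "2,151.6", "30,151.6"), csv_file)

passengers <- read.csv(csv_file)   # base R reader, returns a data.frame
str(passengers)

# With the packages installed, other formats follow the same pattern (not run):
# readr::read_csv(csv_file)            # returns a tibble
# readxl::read_excel("Titanic3.xlsx")
# haven::read_sav("file.sav")
```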
help function
help(mean) gives
Description: Generic function for the (trimmed) arithmetic mean.
Usage: mean(x, ...)
Default S3 method:
mean(x, trim = 0, na.rm = FALSE, ...)
Arguments:
x: an R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects.
trim:
na.rm: a logical evaluating to ‘TRUE’ or ‘FALSE’ indicating whether NA values should be stripped before the computation proceeds.
Value: If trim is zero (the default), the arithmetic mean of the values in x is computed, as a numeric or complex vector of length one.
example function to run the examples: example(mean)
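A sketch of consulting the help system and then using an argument found on the help page. The help calls are wrapped in interactive() so the snippet also runs as a script. The vector x is invented for illustration.

```r
if (interactive()) {
  help(mean)      # open the help page
  ?mean           # shorthand for help(mean)
  example(mean)   # run the examples from the help page
}

x <- c(1, 2, 3, 100)
mean(x)                # 26.5
mean(x, trim = 0.25)   # drops the lowest and highest value first: 2.5
```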
NA (short for “not available”)
The function is.na checks for missingness
Most functions exclude missings by default
Not always what you want
table excludes missings; we can include them via argument useNA="always"
quantile gives an error and mean (and some others) returns NA if there are missings; specify argument na.rm=TRUE
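A sketch of how missing values behave in a few common functions. The vectors are invented for illustration.

```r
age <- c(29, NA, 30, 25)        # invented values with one missing entry
sex <- c("female", NA, "male", "female")

is.na(age)                      # FALSE TRUE FALSE FALSE
mean(age)                       # NA
mean(age, na.rm = TRUE)         # 28

table(sex)                      # the missing value is left out
table(sex, useNA = "always")    # adds a count for NA
```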
View: opens a spreadsheet-style data viewer
In RStudio: click on the name of a data object in the Environment tab
dim: gives the number of rows and columns
head: shows the first rows of a data frame
tail: similar to head but shows the last rows
str: compact display of the internal structure of an R object
Variable summaries
summary: concise summary of each variable
mean: mean value
quantile: quantiles
IQR: inter-quartile range
table: tabulation of one or two categorical variables
Data found here: Titanic3.xlsx
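A sketch of these inspection functions. Because the course file is not available here, the built-in iris data set is used as a stand-in.

```r
dim(iris)                     # 150 rows, 5 columns
head(iris, 3)                 # first 3 rows
tail(iris, 3)                 # last 3 rows
str(iris)                     # structure: variable types and first values

summary(iris)                 # summary of each variable
mean(iris$Sepal.Length)
quantile(iris$Sepal.Length)
IQR(iris$Sepal.Length)
table(iris$Species)           # 50 flowers per species

# View(iris)                  # spreadsheet-style viewer, for interactive sessions
```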
Create a named vector:
Select by name:
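A sketch of a named vector. The names and values are invented for illustration.

```r
ages <- c(Allen = 29, Brown = 48, Carter = 63)   # invented names and values

ages["Brown"]                 # select one element by name
ages[c("Allen", "Carter")]    # select several elements by name
names(ages)                   # the names themselves
```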
Rows and columns can also have names (dimnames because there are two dimensions):
subset function: subset(my.data, subset=...), with ... a logical condition:
Note: %in% is another logical construct
subset argument in functions with formula structure:
subset function: subset(my.data, select=...), with ... a selection of column names:
We can select a sequence of names via the : operator:
Or use - to exclude columns:
with function to avoid repeating the data set name:
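A sketch of the row and column selections described above. The data frame my.data is a hypothetical stand-in with invented values, and the lm call is just one example of a function with a formula and a subset argument.

```r
# Hypothetical stand-in; all values invented for illustration
my.data <- data.frame(
  age    = c(29, 2, 30, 25, 48),
  sex    = c("female", "male", "female", "male", "female"),
  pclass = c(1, 1, 2, 3, 3),
  fare   = c(211.3, 151.6, 26.0, 7.9, 8.1)
)

subset(my.data, subset = age > 26)              # rows by logical condition
subset(my.data, subset = pclass %in% c(1, 3))   # %in% tests membership
subset(my.data, select = c(age, fare))          # columns by name
subset(my.data, select = age:pclass)            # a sequence of columns
subset(my.data, select = -fare)                 # everything except fare

with(my.data, mean(fare[sex == "female"]))      # no need to write my.data$

lm(fare ~ age, data = my.data, subset = pclass < 3)   # subset argument
```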
Define the ordering of the “levels” (useful in regression models)
Default is by alphabetical/numeric order
Change the order
Change the naming of the levels
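A sketch of ordering and renaming factor levels. The values and the new labels are invented for illustration.

```r
size <- c("small", "large", "medium", "small")   # invented values

f <- factor(size)
levels(f)        # default alphabetical order: "large" "medium" "small"

f <- factor(size, levels = c("small", "medium", "large"))
levels(f)        # chosen order: "small" "medium" "large"

levels(f) <- c("S", "M", "L")   # rename the levels, keeping their order
f
```

In a regression model, the first level is typically used as the reference category, which is why the ordering matters.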
Numeric value (units since time origin) with character representation
Origin:
R is very flexible in conversion between textual date representations
as.Date: create date variable
format: change display format
See page 5 of this reference card
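A sketch of creating and formatting dates with base R. The date is chosen only as an example, and numeric format codes are used so the output does not depend on the locale.

```r
d <- as.Date("1912-04-15")           # ISO format is recognised by default
as.numeric(d)                        # days since the origin (negative: before it)
as.numeric(as.Date("1970-01-01"))    # 0: the origin for Date objects

format(d, "%d/%m/%Y")                # "15/04/1912"
as.Date("15/04/1912", format = "%d/%m/%Y") == d   # TRUE: same date, other notation

d + 1                                # arithmetic works in days: "1912-04-16"
```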