Advanced R topics

Tuyen Huynh - OUCRU Mathematical Modelling group
Hai Ho - OUCRU Emerging Infections group

Topic overview

Morning:

  • Into the tidyverse
  • Data ingestion
  • Pivot tibbles into long and wide formats
  • How to handle NAs
  • Select columns

Afternoon:

  • Transforming tibbles
    • Column manipulation
    • Type assignment
  • Joining tibbles
  • Vectorisation in R with map() (instead of for-loops)

Appendix

  • If-else statements
  • For-loops
  • Piping in R
  • Summarising tibbles (push to day 5)

Into the tidyverse

Into the tidyverse

  • tidyverse is a collection of R packages built for data science
  • It provides alternatives to base R functionality
  • All packages share the same principle of tidy data

Into the tidyverse

Into the tidyverse

  • The main “unit” in the tidyverse is a tibble
  • A tibble is similar to a data.frame in base R
  • Differences:
    • a tibble looks different when printed out
    • seamless integration with every function in the tidyverse
    • and more…

Into the tidyverse

Difference between a data.frame and a tibble

mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
rownames_to_column(mtcars, var = "model") |> as_tibble()
# A tibble: 32 × 12
   model         mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
 2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
 3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
 4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
 5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
 6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
 7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
 8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
 9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
# ℹ 22 more rows

Tidy data

  • A tibble is a table of rows and columns, following the principle of tidy data
  • Each column is a variable
  • Each row is an observation
  • Each cell contains 1 value only

Tidy data

Which of these is a tidy data table?

# A tibble: 2 × 4
  country code  `2015` `2016`
  <chr>   <chr>  <dbl>  <dbl>
1 Aruba   ABW   28419. 28450.
2 Albania ALB    3953.  4124.

or

# A tibble: 4 × 4
  country code   year    gdp
  <chr>   <chr> <dbl>  <dbl>
1 Aruba   ABW    2015 28419.
2 Aruba   ABW    2016 28450.
3 Albania ALB    2015  3953.
4 Albania ALB    2016  4124.

Tidy data

  • R object type: data.frame or tibble
  • Data structure: tidy or untidy data

Why tidyverse?

Takeaway points

  1. tidyverse is not a replacement for base R. It is just a collection of R packages
  2. You don’t have to use the tidyverse when using R
  3. The tidyverse has most functions needed for data science. Still, there are times when base R is needed, and it might be better/easier/faster

Into the tidyverse

tidyverse functions that we will be using:

  • dplyr::select() to select columns from a tibble
  • dplyr::mutate() to modify and create columns in a tibble
  • dplyr::*_join() to join tibbles in a specified method
  • tidyr::pivot_longer() to transform a tibble from wide to long format (fewer columns, more rows)
  • tidyr::pivot_wider() to transform a tibble from long to wide format (fewer rows, more columns)
  • purrr::map() to perform vectorised application of a function over vectors in R

Piping in R

  • Typically in R, we perform a sequence of operations on a dataset, changing it as we go
  • R has a functional style, which means the structure is typically: new_data <- function(data, extra_arguments)
    • function describes your action, what you want to do
    • data is the data that you execute the action on
    • extra_arguments are (optional) settings that change how the action is performed
    • new_data is the output, what you get after performing the action

Piping in R

  • Example code
plot_dat <- ungroup(summarise(group_by(mtcars, gear), mean_mpg = mean(mpg)))

ggplot(plot_dat, aes(x = gear, y = mean_mpg)) +
  geom_col()
  • How easy is it to understand/follow this code?
  • Can we improve the readability?

Piping in R

  • The example code can be rewritten as:
grouped_by_gear <- group_by(mtcars, gear)
mean_mpg_by_gear  <- summarise(grouped_by_gear, mean_mpg = mean(mpg))
ungrouped_data <- ungroup(mean_mpg_by_gear)

ggplot(
  data = ungrouped_data,
  aes(x = gear, y = mean_mpg)
) +
  geom_col()
  • Is this a better way to write it? Can we improve it even further?
  • If we are performing a sequence of actions, each using the output of the previous action, we can use a pipe

Piping with |>

  • The pipe is a powerful tool to express a sequence of actions (functions)
  • It helps you write code that is easier to read and understand
  • In R, you can pipe between functions using the |> operator
  • RStudio shortcut: Cmd+Shift+M or Ctrl+Shift+M

|> vs. %>%

  • |> is built into R since version 4.1.0. It behaves largely the same as %>%, but the two are not identical (see the sketch below)
  • %>% comes from the magrittr package, which is used throughout the tidyverse
  • In this presentation, |> will be used
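A minimal sketch of the overlap and one well-known difference (assuming the tidyverse, which attaches %>%, is loaded):

# Both pipes pass the left-hand side as the first argument of the call on the right
mtcars %>% head(2)
mtcars |> head(2)

# One difference: %>% uses `.` as its placeholder, while |> uses `_`
# (available from R 4.2.0) and only for named arguments
mtcars %>% lm(mpg ~ wt, data = .)
mtcars |> lm(mpg ~ wt, data = _)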

Piping with |>

  • The pipe passes the data on its left-hand side (LHS) to the function on its right-hand side (RHS) as the first argument of that function
  • The structure:
new_data <- data |> function(extra_arguments)

Piping with |>

For example:

group_by(mtcars, gear)

is exactly the same as

mtcars |> group_by(gear)

Piping with |>

Another example:

summarise(group_by(mtcars, gear), mean_mpg = mean(mpg))

is exactly the same as

mtcars |> group_by(gear) |> summarise(mean_mpg = mean(mpg))

Piping with |>

  • Using pipe, we can now rewrite the code as:
plot_dat <- mtcars |> 
  group_by(gear) |> 
  summarise(mean_mpg = mean(mpg)) |> 
  ungroup()

plot_dat |> 
  ggplot(aes(x = gear, y = mean_mpg)) +
  geom_col()
  • Is it now easier/faster for you to understand what’s happening?

Piping with |>

Quick comparison

grouped_by_gear <- group_by(mtcars, gear)
mean_mpg_by_gear  <- summarise(grouped_by_gear, mean_mpg = mean(mpg))
ungrouped_data <- ungroup(mean_mpg_by_gear)

ggplot(
  data = ungrouped_data,
  aes(x = gear, y = mean_mpg)
) +
  geom_col()

vs. 

plot_dat <- mtcars |> 
  group_by(gear) |> 
  summarise(mean_mpg = mean(mpg)) |> 
  ungroup()

plot_dat |> 
  ggplot(aes(x = gear, y = mean_mpg)) +
  geom_col()

Data manipulation with tidyverse

Data ingestion

Data ingestion

  • Data comes in different formats (.xlsx, .csv, .tsv)
  • readr is a package for reading rectangular text data files
  • readxl is a package for reading Excel files
  • Both packages automatically read the data into a tibble

Data ingestion

# install.packages("tidyverse")
library(tidyverse)

covid_cases <- read_csv("data/covid_cases.csv")

# check if `covid_cases` is a tibble
# (`read_csv()` is part of the `tidyverse` so it automatically converts the dataset into a `tibble`)
is_tibble(covid_cases)
[1] TRUE
covid_cases
# A tibble: 92 × 211
   date       cases_chn cases_kor cases_aus cases_jpn cases_mys cases_phl
   <date>         <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 2020-01-20         0         0         0         0         0         0
 2 2020-01-21        31         0         0         0         0         0
 3 2020-01-23       262         0         0         0         0         0
 4 2020-01-24       259         1         0         0         0         0
 5 2020-01-25       467         0         3         2         0         0
 6 2020-01-26       688         0         1         0         3         0
 7 2020-01-27       776         2         0         1         1         0
 8 2020-01-28      1776         0         1         2         0         0
 9 2020-01-29      1460         0         2         1         0         0
10 2020-01-30      1739         0         0         4         3         1
# ℹ 82 more rows
# ℹ 204 more variables: cases_sgp <dbl>, cases_nzl <dbl>, cases_vnm <dbl>,
#   cases_brn <dbl>, cases_khm <dbl>, cases_mng <dbl>, cases_fji <dbl>,
#   cases_lao <dbl>, cases_png <dbl>, cases_gum <dbl>, cases_pyf <dbl>,
#   cases_ncl <dbl>, cases_mnp <dbl>, cases_idn <dbl>, cases_tha <dbl>,
#   cases_ind <dbl>, cases_lka <dbl>, cases_mdv <dbl>, cases_bgd <dbl>,
#   cases_btn <dbl>, cases_npl <dbl>, cases_mmr <dbl>, cases_tls <dbl>, …

Data ingestion

  • You can quickly skim the data using the skimr package
# install.packages("skimr")
skimr::skim(covid_cases)
Data summary
Name covid_cases
Number of rows 92
Number of columns 211
_______________________
Column type frequency:
Date 1
numeric 210
________________________
Group variables None

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2020-01-20 2020-04-21 2020-03-06 92

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cases_chn 0 1.00 912.74 2206.54 0 65.50 128.0 950.50 19461 ▇▁▁▁▁
cases_kor 0 1.00 116.11 172.75 0 1.00 50.0 116.75 813 ▇▁▁▁▁
cases_aus 0 1.00 72.01 132.25 -9 0.00 5.5 79.25 650 ▇▁▁▁▁
cases_jpn 0 1.00 120.84 189.92 -8 3.75 26.5 127.25 743 ▇▁▁▁▁
cases_mys 0 1.00 58.96 78.41 0 0.00 2.5 130.00 315 ▇▂▂▁▁
cases_phl 1 0.99 70.98 115.50 0 0.00 1.0 100.00 538 ▇▁▁▁▁
cases_sgp 0 1.00 87.11 216.70 0 2.00 7.5 51.25 1426 ▇▁▁▁▁
cases_nzl 0 1.00 12.03 22.49 0 0.00 0.0 13.25 87 ▇▁▁▁▁
cases_vnm 0 1.00 2.91 4.53 0 0.00 1.0 4.00 19 ▇▂▁▁▁
cases_brn 1 0.99 1.52 3.49 0 0.00 0.0 1.00 17 ▇▁▁▁▁
cases_khm 0 1.00 1.33 3.94 -2 0.00 0.0 1.00 31 ▇▁▁▁▁
cases_mng 0 1.00 0.36 1.47 0 0.00 0.0 0.00 13 ▇▁▁▁▁
cases_fji 0 1.00 0.20 0.65 0 0.00 0.0 0.00 5 ▇▁▁▁▁
cases_lao 0 1.00 0.21 0.64 0 0.00 0.0 0.00 3 ▇▁▁▁▁
cases_png 0 1.00 0.08 0.54 0 0.00 0.0 0.00 5 ▇▁▁▁▁
cases_gum 0 1.00 1.45 3.24 -3 0.00 0.0 1.25 19 ▇▂▁▁▁
cases_pyf 0 1.00 0.61 1.50 0 0.00 0.0 0.00 8 ▇▁▁▁▁
cases_ncl 0 1.00 0.20 0.67 0 0.00 0.0 0.00 4 ▇▁▁▁▁
cases_mnp 0 1.00 0.15 0.63 0 0.00 0.0 0.00 4 ▇▁▁▁▁
cases_idn 0 1.00 73.48 132.25 0 0.00 0.0 106.25 729 ▇▁▁▁▁
cases_tha 0 1.00 30.53 59.56 0 0.00 1.5 33.00 310 ▇▁▁▁▁
cases_ind 0 1.00 202.18 400.26 0 0.00 1.0 89.00 1553 ▇▁▁▁▁
cases_lka 0 1.00 3.30 6.12 0 0.00 0.0 5.00 33 ▇▁▁▁▁
cases_mdv 0 1.00 0.73 2.59 0 0.00 0.0 0.00 17 ▇▁▁▁▁
cases_bgd 0 1.00 32.04 88.77 0 0.00 0.0 3.50 492 ▇▁▁▁▁
cases_btn 0 1.00 0.07 0.25 0 0.00 0.0 0.00 1 ▇▁▁▁▁
cases_npl 0 1.00 0.34 1.58 0 0.00 0.0 0.00 14 ▇▁▁▁▁
cases_mmr 1 0.99 1.08 2.77 0 0.00 0.0 0.00 13 ▇▁▁▁▁
cases_tls 0 1.00 0.25 1.38 0 0.00 0.0 0.00 12 ▇▁▁▁▁
cases_usa 1 0.99 8056.92 12354.68 0 0.00 6.0 17440.50 35386 ▇▁▁▁▂
cases_can 0 1.00 384.60 600.99 0 0.00 5.0 778.75 1902 ▇▁▁▂▁
cases_bra 0 1.00 420.15 768.31 0 0.00 0.5 490.75 3257 ▇▁▁▁▁
cases_chl 0 1.00 114.21 205.81 -381 0.00 0.0 236.50 1158 ▁▇▂▁▁
cases_ecu 0 1.00 110.09 278.70 0 0.00 0.0 104.50 2196 ▇▁▁▁▁
cases_mex 0 1.00 89.79 165.05 0 0.00 0.0 119.50 764 ▇▁▁▁▁
cases_dom 1 0.99 54.55 106.97 0 0.00 0.0 52.50 393 ▇▁▁▁▁
cases_pan 0 1.00 48.55 79.45 0 0.00 0.0 100.50 279 ▇▁▁▁▁
cases_per 0 1.00 169.87 446.62 0 0.00 0.0 67.25 2784 ▇▁▁▁▁
cases_arg 0 1.00 32.17 50.89 0 0.00 0.0 68.75 277 ▇▂▁▁▁
cases_col 0 1.00 41.22 68.59 0 0.00 0.0 70.75 274 ▇▁▁▁▁
cases_cri 0 1.00 7.17 11.09 0 0.00 0.0 16.00 41 ▇▁▂▁▁
cases_ury 0 1.00 5.74 11.99 -29 0.00 0.0 9.00 55 ▁▇▃▁▁
cases_cub 1 0.99 11.95 20.15 0 0.00 0.0 17.50 63 ▇▁▁▁▁
cases_hnd 0 1.00 5.18 10.92 0 0.00 0.0 4.00 47 ▇▁▁▁▁
cases_ven 1 0.99 2.81 6.94 0 0.00 0.0 0.00 34 ▇▁▁▁▁
cases_bol 1 0.99 6.20 11.81 0 0.00 0.0 7.50 54 ▇▁▁▁▁
cases_tto 0 1.00 1.24 4.51 0 0.00 0.0 1.00 41 ▇▁▁▁▁
cases_pry 0 1.00 2.26 4.36 -3 0.00 0.0 3.25 25 ▇▂▁▁▁
cases_gtm 0 1.00 3.14 6.55 0 0.00 0.0 3.00 32 ▇▁▁▁▁
cases_jam 0 1.00 2.13 5.38 0 0.00 0.0 2.00 32 ▇▁▁▁▁
cases_slv 0 1.00 2.37 4.40 0 0.00 0.0 2.00 17 ▇▁▁▁▁
cases_brb 0 1.00 0.82 2.20 0 0.00 0.0 0.00 12 ▇▁▁▁▁
cases_guy 0 1.00 0.71 1.96 -5 0.00 0.0 0.00 10 ▁▇▁▁▁
cases_hti 0 1.00 0.51 1.41 0 0.00 0.0 0.00 9 ▇▁▁▁▁
cases_lca 0 1.00 0.16 0.70 0 0.00 0.0 0.00 5 ▇▁▁▁▁
cases_dma 0 1.00 0.17 0.75 0 0.00 0.0 0.00 5 ▇▁▁▁▁
cases_grd 0 1.00 0.14 0.74 -1 0.00 0.0 0.00 6 ▇▁▁▁▁
cases_sur 0 1.00 0.11 0.50 0 0.00 0.0 0.00 4 ▇▁▁▁▁
cases_kna 0 1.00 0.16 0.77 -2 0.00 0.0 0.00 6 ▁▇▁▁▁
cases_atg 0 1.00 0.25 1.07 0 0.00 0.0 0.00 8 ▇▁▁▁▁
cases_nic 0 1.00 0.10 0.36 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_blz 0 1.00 0.20 0.62 0 0.00 0.0 0.00 4 ▇▁▁▁▁
cases_vct 0 1.00 0.13 0.62 0 0.00 0.0 0.00 4 ▇▁▁▁▁
cases_pri 0 1.00 13.61 26.70 0 0.00 0.0 8.00 110 ▇▁▁▁▁
cases_mtq 0 1.00 1.77 4.24 0 0.00 0.0 1.00 27 ▇▁▁▁▁
cases_glp 0 1.00 1.61 3.73 0 0.00 0.0 1.00 20 ▇▁▁▁▁
cases_abw 0 1.00 1.05 2.64 0 0.00 0.0 0.25 18 ▇▁▁▁▁
cases_guf 0 1.00 1.05 2.08 0 0.00 0.0 1.00 12 ▇▁▁▁▁
cases_vir 0 1.00 0.58 1.72 -2 0.00 0.0 0.00 11 ▇▁▁▁▁
cases_bmu 0 1.00 0.93 3.01 0 0.00 0.0 0.00 24 ▇▁▁▁▁
cases_cym 0 1.00 0.66 1.82 0 0.00 0.0 0.00 8 ▇▁▁▁▁
cases_sxm 0 1.00 0.73 2.28 0 0.00 0.0 0.00 14 ▇▁▁▁▁
cases_maf 0 1.00 0.40 1.64 -6 0.00 0.0 0.00 12 ▁▇▁▁▁
cases_cuw 0 1.00 0.15 0.57 0 0.00 0.0 0.00 4 ▇▁▁▁▁
cases_blm 0 1.00 0.07 0.32 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_msr 0 1.00 0.12 0.47 0 0.00 0.0 0.00 3 ▇▁▁▁▁
cases_tca 0 1.00 0.12 0.44 0 0.00 0.0 0.00 3 ▇▁▁▁▁
cases_aia 0 1.00 0.03 0.23 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_vgb 0 1.00 0.04 0.25 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_bes 0 1.00 0.05 0.27 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_flk 0 1.00 0.12 0.71 0 0.00 0.0 0.00 6 ▇▁▁▁▁
cases_spm 0 1.00 0.01 0.10 0 0.00 0.0 0.00 1 ▇▁▁▁▁
cases_esp 0 1.00 2176.20 2853.62 0 0.00 64.0 4334.00 9222 ▇▁▂▁▁
cases_ita 0 1.00 1969.87 2143.42 0 0.00 773.5 3864.75 6557 ▇▁▂▂▁
cases_deu 0 1.00 1559.32 2122.93 0 0.00 88.5 2934.50 7324 ▇▂▁▁▁
cases_fra 0 1.00 1233.84 1723.43 0 0.00 115.5 2152.00 7500 ▇▂▂▁▁
cases_gbr 0 1.00 1355.95 2097.38 0 0.00 40.5 2564.25 8719 ▇▁▂▁▁
cases_tur 0 1.00 988.91 1644.14 0 0.00 0.0 1745.25 5138 ▇▁▁▁▁
cases_bel 0 1.00 434.60 624.56 0 0.00 4.0 967.75 2454 ▇▁▂▁▁
cases_che 0 1.00 302.85 416.18 0 0.00 13.5 577.75 1774 ▇▁▂▁▁
cases_nld 0 1.00 363.10 467.53 0 0.00 45.0 846.75 1335 ▇▁▁▂▂
cases_prt 0 1.00 226.77 333.58 0 0.00 0.0 473.50 1516 ▇▁▂▁▁
cases_aut 0 1.00 160.68 248.51 0 0.00 11.5 242.50 1141 ▇▂▁▁▁
cases_rus 0 1.00 573.51 1498.00 0 0.00 0.0 222.00 9910 ▇▁▁▁▁
cases_isr 0 1.00 150.90 231.99 0 0.00 2.0 272.50 932 ▇▂▁▁▁
cases_swe 0 1.00 160.62 216.14 0 0.00 25.0 304.00 726 ▇▁▁▁▁
cases_irl 0 1.00 170.13 281.64 0 0.00 1.0 257.50 1169 ▇▂▁▁▁
cases_nor 0 1.00 77.32 101.16 0 0.00 15.5 132.25 425 ▇▂▂▁▁
cases_dnk 0 1.00 81.68 108.94 0 0.00 5.0 158.50 390 ▇▂▂▁▁
cases_pol 0 1.00 104.27 153.94 0 0.00 0.5 200.75 545 ▇▁▁▁▁
cases_cze 0 1.00 75.15 105.36 0 0.00 4.5 130.75 408 ▇▂▁▁▁
cases_rou 0 1.00 97.13 147.03 0 0.00 1.0 190.50 523 ▇▁▁▁▁
cases_lux 0 1.00 38.67 62.26 0 0.00 0.0 64.50 234 ▇▁▁▁▁
cases_srb 0 1.00 72.07 123.95 0 0.00 0.0 81.25 445 ▇▁▁▁▁
cases_fin 0 1.00 42.04 61.39 0 0.00 0.5 76.50 267 ▇▃▁▁▁
cases_ukr 0 1.00 66.58 124.35 0 0.00 0.0 68.25 500 ▇▁▁▁▁
cases_grc 0 1.00 24.40 35.87 0 0.00 0.0 49.00 139 ▇▁▂▁▁
cases_isl 0 1.00 19.27 29.50 0 0.00 0.0 26.25 99 ▇▁▁▁▁
cases_hrv 0 1.00 20.45 29.25 0 0.00 1.0 41.25 96 ▇▁▂▁▁
cases_mda 0 1.00 27.70 49.17 0 0.00 0.0 29.00 222 ▇▁▁▁▁
cases_est 0 1.00 16.68 27.84 0 0.00 0.0 26.75 134 ▇▂▁▁▁
cases_hun 0 1.00 22.80 37.95 0 0.00 0.0 39.00 210 ▇▂▁▁▁
cases_svn 0 1.00 14.51 19.96 0 0.00 0.5 28.00 84 ▇▂▂▁▁
cases_blr 0 1.00 68.09 208.36 0 0.00 0.0 5.75 1485 ▇▁▁▁▁
cases_ltu 0 1.00 14.67 25.02 0 0.00 0.0 28.50 122 ▇▂▁▁▁
cases_arm 0 1.00 14.55 23.22 0 0.00 0.0 26.50 96 ▇▁▂▁▁
cases_aze 0 1.00 15.61 26.14 0 0.00 0.0 25.00 105 ▇▁▁▁▁
cases_bih 0 1.00 14.13 21.62 0 0.00 0.0 30.00 86 ▇▁▁▁▁
cases_kaz 0 1.00 20.13 46.24 0 0.00 0.0 18.50 306 ▇▁▁▁▁
cases_svk 0 1.00 12.75 22.27 0 0.00 0.0 14.00 114 ▇▁▁▁▁
cases_mkd 1 0.99 13.46 22.99 0 0.00 0.0 22.00 117 ▇▂▁▁▁
cases_bgr 0 1.00 10.10 14.01 -3 0.00 0.0 18.25 53 ▇▂▂▂▁
cases_uzb 0 1.00 18.01 38.09 0 0.00 0.0 15.50 172 ▇▁▁▁▁
cases_and 0 1.00 7.80 13.08 0 0.00 0.0 13.00 58 ▇▂▁▁▁
cases_lva 0 1.00 8.03 12.94 0 0.00 0.0 13.25 71 ▇▂▁▁▁
cases_cyp 0 1.00 8.39 12.89 0 0.00 0.0 16.25 58 ▇▂▁▁▁
cases_alb 0 1.00 6.62 12.01 0 0.00 0.0 9.25 56 ▇▁▁▁▁
cases_smr 0 1.00 5.02 8.92 0 0.00 0.0 7.00 36 ▇▁▁▁▁
cases_mlt 0 1.00 4.68 8.22 0 0.00 0.0 7.25 52 ▇▁▁▁▁
cases_kgz 0 1.00 6.41 13.01 0 0.00 0.0 5.00 69 ▇▂▁▁▁
cases_mne 0 1.00 3.39 6.89 0 0.00 0.0 3.25 37 ▇▁▁▁▁
cases_geo 0 1.00 4.43 7.29 0 0.00 0.0 6.00 34 ▇▁▁▁▁
cases_lie 0 1.00 0.89 2.50 0 0.00 0.0 1.00 18 ▇▁▁▁▁
cases_mco 0 1.00 0.74 6.38 -30 0.00 0.0 0.00 39 ▁▁▇▁▁
cases_vat 0 1.00 0.10 0.42 0 0.00 0.0 0.00 3 ▇▁▁▁▁
cases_fro 0 1.00 2.01 5.24 -1 0.00 0.0 1.00 36 ▇▁▁▁▁
cases_ggy 0 1.00 2.60 5.37 0 0.00 0.0 2.25 23 ▇▁▁▁▁
cases_jey 0 1.00 2.71 6.61 -2 0.00 0.0 2.00 37 ▇▁▁▁▁
cases_imn 0 1.00 3.23 7.49 0 0.00 0.0 1.25 43 ▇▁▁▁▁
cases_gib 0 1.00 1.45 3.79 0 0.00 0.0 0.00 20 ▇▁▁▁▁
cases_grl 0 1.00 0.12 0.47 0 0.00 0.0 0.00 3 ▇▁▁▁▁
cases_irn 0 1.00 907.66 1003.20 0 0.00 593.0 1582.00 3186 ▇▂▂▂▂
cases_pak 0 1.00 100.17 170.27 0 0.00 0.0 162.75 798 ▇▂▁▁▁
cases_sau 0 1.00 113.96 236.68 -1 0.00 0.0 113.50 1132 ▇▁▁▁▁
cases_qat 0 1.00 65.38 125.28 0 0.00 1.0 60.25 567 ▇▁▁▁▁
cases_egy 1 0.99 36.63 55.04 0 0.00 0.0 46.50 189 ▇▂▁▁▁
cases_irq 0 1.00 17.11 23.84 0 0.00 4.5 30.25 91 ▇▂▁▁▁
cases_are 0 1.00 78.97 145.82 0 0.00 0.0 55.50 484 ▇▁▁▁▁
cases_mar 0 1.00 33.11 58.20 0 0.00 0.0 51.25 281 ▇▂▁▁▁
cases_bhr 0 1.00 20.73 38.78 0 0.00 2.5 23.75 226 ▇▁▁▁▁
cases_lbn 0 1.00 7.36 10.72 0 0.00 0.5 11.00 43 ▇▂▁▁▁
cases_tun 0 1.00 9.61 15.28 0 0.00 0.0 15.75 59 ▇▂▁▁▁
cases_jor 0 1.00 4.62 8.13 0 0.00 0.0 6.00 40 ▇▁▁▁▁
cases_kwt 1 0.99 21.92 40.54 0 0.00 0.0 19.00 164 ▇▁▁▁▁
cases_omn 0 1.00 16.39 32.02 0 0.00 0.0 16.50 144 ▇▁▁▁▁
cases_afg 0 1.00 11.15 21.14 0 0.00 0.0 10.00 105 ▇▁▁▁▁
cases_dji 0 1.00 9.20 27.84 0 0.00 0.0 1.25 156 ▇▁▁▁▁
cases_syr 0 1.00 0.42 1.31 0 0.00 0.0 0.00 6 ▇▁▁▁▁
cases_lby 0 1.00 0.55 1.89 0 0.00 0.0 0.00 13 ▇▁▁▁▁
cases_sdn 0 1.00 1.16 4.70 -1 0.00 0.0 0.00 33 ▇▁▁▁▁
cases_som 0 1.00 2.58 9.93 0 0.00 0.0 0.00 73 ▇▁▁▁▁
cases_yem 0 1.00 0.01 0.10 0 0.00 0.0 0.00 1 ▇▁▁▁▁
cases_pse 0 1.00 3.58 6.75 0 0.00 0.0 4.00 31 ▇▁▁▁▁
cases_zaf 0 1.00 35.87 59.43 0 0.00 0.0 54.25 251 ▇▂▁▁▁
cases_dza 0 1.00 29.54 54.26 0 0.00 0.0 42.75 265 ▇▂▁▁▁
cases_bfa 1 0.99 5.49 11.28 0 0.00 0.0 3.50 54 ▇▁▁▁▁
cases_civ 0 1.00 9.55 22.19 0 0.00 0.0 3.00 105 ▇▁▁▁▁
cases_sen 0 1.00 4.10 6.10 0 0.00 0.0 7.25 23 ▇▂▂▁▁
cases_gha 0 1.00 11.33 35.87 0 0.00 0.0 2.25 208 ▇▁▁▁▁
cases_cmr 0 1.00 12.64 44.62 -2 0.00 0.0 0.00 309 ▇▁▁▁▁
cases_nga 0 1.00 5.88 19.39 -2 0.00 0.0 0.25 168 ▇▁▁▁▁
cases_mus 0 1.00 3.57 8.50 0 0.00 0.0 1.25 41 ▇▁▁▁▁
cases_cod 0 1.00 3.80 7.00 0 0.00 0.0 6.00 40 ▇▁▁▁▁
cases_rwa 0 1.00 1.60 3.21 0 0.00 0.0 2.00 19 ▇▂▁▁▁
cases_mdg 0 1.00 1.32 3.07 0 0.00 0.0 0.00 15 ▇▁▁▁▁
cases_ken 0 1.00 3.05 6.58 0 0.00 0.0 2.00 31 ▇▁▁▁▁
cases_zmb 0 1.00 0.71 2.42 0 0.00 0.0 0.00 19 ▇▁▁▁▁
cases_tgo 0 1.00 0.91 2.36 -1 0.00 0.0 0.25 14 ▇▁▁▁▁
cases_uga 0 1.00 0.60 2.50 0 0.00 0.0 0.00 19 ▇▁▁▁▁
cases_eth 0 1.00 1.21 2.26 0 0.00 0.0 2.00 9 ▇▁▁▁▁
cases_ner 0 1.00 7.12 17.41 0 0.00 0.0 0.00 94 ▇▁▁▁▁
cases_cog 0 1.00 1.74 6.20 0 0.00 0.0 0.00 43 ▇▁▁▁▁
cases_tza 0 1.00 2.77 11.24 0 0.00 0.0 0.00 84 ▇▁▁▁▁
cases_mli 0 1.00 2.67 6.50 0 0.00 0.0 0.00 29 ▇▁▁▁▁
cases_gin 0 1.00 6.76 20.34 0 0.00 0.0 0.00 145 ▇▁▁▁▁
cases_gnq 0 1.00 0.86 3.77 0 0.00 0.0 0.00 28 ▇▁▁▁▁
cases_nam 0 1.00 0.17 0.59 0 0.00 0.0 0.00 3 ▇▁▁▁▁
cases_swz 0 1.00 0.26 0.74 0 0.00 0.0 0.00 3 ▇▁▁▁▁
cases_moz 0 1.00 0.38 1.21 0 0.00 0.0 0.00 7 ▇▁▁▁▁
cases_syc 0 1.00 0.12 0.44 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_gab 0 1.00 1.30 3.95 0 0.00 0.0 0.00 23 ▇▁▁▁▁
cases_ben 0 1.00 0.59 2.17 0 0.00 0.0 0.00 17 ▇▁▁▁▁
cases_caf 0 1.00 0.13 0.40 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_eri 0 1.00 0.42 1.52 0 0.00 0.0 0.00 9 ▇▁▁▁▁
cases_cpv 0 1.00 0.60 4.70 0 0.00 0.0 0.00 45 ▇▁▁▁▁
cases_tcd 0 1.00 0.36 1.21 0 0.00 0.0 0.00 7 ▇▁▁▁▁
cases_mrt 0 1.00 0.08 0.31 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_zwe 0 1.00 0.27 0.79 0 0.00 0.0 0.00 5 ▇▁▁▁▁
cases_gmb 0 1.00 0.11 0.56 0 0.00 0.0 0.00 5 ▇▁▁▁▁
cases_lbr 0 1.00 1.08 3.03 0 0.00 0.0 0.00 17 ▇▁▁▁▁
cases_ago 0 1.00 0.26 0.92 0 0.00 0.0 0.00 5 ▇▁▁▁▁
cases_gnb 1 0.99 0.52 1.94 0 0.00 0.0 0.00 15 ▇▁▁▁▁
cases_bwa 0 1.00 0.22 0.98 0 0.00 0.0 0.00 7 ▇▁▁▁▁
cases_mwi 0 1.00 0.18 0.69 0 0.00 0.0 0.00 4 ▇▁▁▁▁
cases_stp 0 1.00 0.04 0.42 0 0.00 0.0 0.00 4 ▇▁▁▁▁
cases_bdi 0 1.00 0.07 0.32 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_sle 0 1.00 0.47 1.59 0 0.00 0.0 0.00 11 ▇▁▁▁▁
cases_ssd 0 1.00 0.04 0.25 0 0.00 0.0 0.00 2 ▇▁▁▁▁
cases_reu 0 1.00 4.43 10.29 -1 0.00 0.0 4.00 64 ▇▁▁▁▁
cases_myt 0 1.00 3.09 7.10 0 0.00 0.0 1.00 39 ▇▁▁▁▁

Pivot tibbles

  • Does this data follow the tidy data principle?
covid_cases
# A tibble: 92 × 211
   date       cases_chn cases_kor cases_aus cases_jpn cases_mys cases_phl
   <date>         <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
 1 2020-01-20         0         0         0         0         0         0
 2 2020-01-21        31         0         0         0         0         0
 3 2020-01-23       262         0         0         0         0         0
 4 2020-01-24       259         1         0         0         0         0
 5 2020-01-25       467         0         3         2         0         0
 6 2020-01-26       688         0         1         0         3         0
 7 2020-01-27       776         2         0         1         1         0
 8 2020-01-28      1776         0         1         2         0         0
 9 2020-01-29      1460         0         2         1         0         0
10 2020-01-30      1739         0         0         4         3         1
# ℹ 82 more rows
# ℹ 204 more variables: cases_sgp <dbl>, cases_nzl <dbl>, cases_vnm <dbl>,
#   cases_brn <dbl>, cases_khm <dbl>, cases_mng <dbl>, cases_fji <dbl>,
#   cases_lao <dbl>, cases_png <dbl>, cases_gum <dbl>, cases_pyf <dbl>,
#   cases_ncl <dbl>, cases_mnp <dbl>, cases_idn <dbl>, cases_tha <dbl>,
#   cases_ind <dbl>, cases_lka <dbl>, cases_mdv <dbl>, cases_bgd <dbl>,
#   cases_btn <dbl>, cases_npl <dbl>, cases_mmr <dbl>, cases_tls <dbl>, …
  • What do we want? Fewer columns, more rows -> we want to pivot it longer

Pivot tibbles

  • Let’s use pivot_longer() to turn our data into tidy data
covid_cases <- covid_cases |> 
  pivot_longer(
    cols = -date,
    names_to = "country", 
    names_pattern = "cases_(.+)",
    values_to = "cases"
  )
covid_cases
# A tibble: 19,320 × 3
   date       country cases
   <date>     <chr>   <dbl>
 1 2020-01-20 chn         0
 2 2020-01-20 kor         0
 3 2020-01-20 aus         0
 4 2020-01-20 jpn         0
 5 2020-01-20 mys         0
 6 2020-01-20 phl         0
 7 2020-01-20 sgp         0
 8 2020-01-20 nzl         0
 9 2020-01-20 vnm         0
10 2020-01-20 brn         0
# ℹ 19,310 more rows
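  • The reverse transformation (long back to wide) uses pivot_wider() with mirrored arguments; a minimal sketch:
covid_cases |> 
  pivot_wider(
    names_from = country,
    names_prefix = "cases_",
    values_from = cases
  )
# back to one `cases_*` column per country: 92 rows × 211 columns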

Pivot tibbles

  • Now that the data is a tidy tibble, let’s take a look at it again
skimr::skim(covid_cases)
Data summary
Name covid_cases
Number of rows 19320
Number of columns 3
_______________________
Column type frequency:
character 1
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 3 0 210 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2020-01-20 2020-04-21 2020-03-06 92

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cases 13 1 123.14 1128.03 -381 0 0 3 35386 ▇▁▁▁▁

Handling NAs

  • Let’s check the NAs in cases
  • You can filter your tibble with… filter()
covid_cases |> filter(is.na(cases))
# A tibble: 13 × 3
   date       country cases
   <date>     <chr>   <dbl>
 1 2020-01-20 bol        NA
 2 2020-01-25 egy        NA
 3 2020-01-27 brn        NA
 4 2020-01-27 cub        NA
 5 2020-02-02 ven        NA
 6 2020-02-05 dom        NA
 7 2020-02-21 kwt        NA
 8 2020-02-26 phl        NA
 9 2020-02-29 mkd        NA
10 2020-03-29 usa        NA
11 2020-03-30 bfa        NA
12 2020-04-14 mmr        NA
13 2020-04-16 gnb        NA

Handling NAs

  • Make your decision:
    • Remove rows with NAs
    • Data imputation, i.e. replace the NAs with something that makes sense, like the mean, mode, median, an interpolation, …
    • Something else?

Handling NAs

  • Remove rows with NAs using drop_na()
covid_cases |> drop_na()
# A tibble: 19,307 × 3
   date       country cases
   <date>     <chr>   <dbl>
 1 2020-01-20 chn         0
 2 2020-01-20 kor         0
 3 2020-01-20 aus         0
 4 2020-01-20 jpn         0
 5 2020-01-20 mys         0
 6 2020-01-20 phl         0
 7 2020-01-20 sgp         0
 8 2020-01-20 nzl         0
 9 2020-01-20 vnm         0
10 2020-01-20 brn         0
# ℹ 19,297 more rows

Handling NAs

  • You can further specify the column to drop NAs from
covid_cases |> drop_na(cases)
# A tibble: 19,307 × 3
   date       country cases
   <date>     <chr>   <dbl>
 1 2020-01-20 chn         0
 2 2020-01-20 kor         0
 3 2020-01-20 aus         0
 4 2020-01-20 jpn         0
 5 2020-01-20 mys         0
 6 2020-01-20 phl         0
 7 2020-01-20 sgp         0
 8 2020-01-20 nzl         0
 9 2020-01-20 vnm         0
10 2020-01-20 brn         0
# ℹ 19,297 more rows

Handling NAs

  • You can perform data imputation, for example with replace_na() (a mutate()-based alternative is shown after the output below)
median_case <- median(covid_cases$cases, na.rm = TRUE)
covid_no_na_cases <- covid_cases |> 
  replace_na(list(cases = median_case))

skimr::skim(covid_no_na_cases)
Data summary
Name covid_no_na_cases
Number of rows 19320
Number of columns 3
_______________________
Column type frequency:
character 1
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
country 0 1 3 3 0 210 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2020-01-20 2020-04-21 2020-03-06 92

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
cases 0 1 123.05 1127.65 -381 0 0 3 35386 ▇▁▁▁▁
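  • For reference, a mutate()-based version of the same imputation (a minimal sketch using dplyr’s if_else() or coalesce()):
covid_cases |> 
  mutate(cases = if_else(is.na(cases), median_case, cases))

# or, more compactly
covid_cases |> 
  mutate(cases = coalesce(cases, median_case))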

Select columns

  • You can select columns from a tibble with select()
covid_cases |> select(country, cases)
# A tibble: 19,320 × 2
   country cases
   <chr>   <dbl>
 1 chn         0
 2 kor         0
 3 aus         0
 4 jpn         0
 5 mys         0
 6 phl         0
 7 sgp         0
 8 nzl         0
 9 vnm         0
10 brn         0
# ℹ 19,310 more rows

Select columns

  • You can unselect columns with the - operator
covid_cases |>
  select(-country)
# A tibble: 19,320 × 2
   date       cases
   <date>     <dbl>
 1 2020-01-20     0
 2 2020-01-20     0
 3 2020-01-20     0
 4 2020-01-20     0
 5 2020-01-20     0
 6 2020-01-20     0
 7 2020-01-20     0
 8 2020-01-20     0
 9 2020-01-20     0
10 2020-01-20     0
# ℹ 19,310 more rows
covid_cases |>
  select(-cases)
# A tibble: 19,320 × 2
   date       country
   <date>     <chr>  
 1 2020-01-20 chn    
 2 2020-01-20 kor    
 3 2020-01-20 aus    
 4 2020-01-20 jpn    
 5 2020-01-20 mys    
 6 2020-01-20 phl    
 7 2020-01-20 sgp    
 8 2020-01-20 nzl    
 9 2020-01-20 vnm    
10 2020-01-20 brn    
# ℹ 19,310 more rows
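  • select() also accepts tidyselect helpers (standard tidyselect functions); a small sketch:
covid_cases |> 
  select(starts_with("c"))    # columns whose names start with "c": country, cases
covid_cases |> 
  select(where(is.numeric))   # columns by type: cases only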

Transforming tibbles

Before we start

# install.packages("readxl")
library(readxl)
titanic_df <- read_excel("data/Titanic3.xlsx")

Create and/or modify columns

  • You can create new columns or modify existing columns with mutate()
  • Example: modify columns to their correct data type and create a new column for year of birth
titanic_df |>
  select(pclass, survived, sex, age) |>
  mutate(
    pclass = as.factor(pclass),
    survived = as.factor(survived),
    sex = as.factor(sex),
    age = as.numeric(age),
    yob = 1912 - age
  )
# A tibble: 1,309 × 5
   pclass survived sex       age   yob
   <fct>  <fct>    <fct>   <dbl> <dbl>
 1 1st    1        female 29     1883 
 2 1st    1        male    0.917 1911.
 3 1st    0        female  2     1910 
 4 1st    0        male   30     1882 
 5 1st    0        female 25     1887 
 6 1st    1        male   48     1864 
 7 1st    1        female 63     1849 
 8 1st    0        male   39     1873 
 9 1st    1        female 53     1859 
10 1st    0        male   71     1841 
# ℹ 1,299 more rows

Create and/or modify columns

  • You can also work on the same column multiple times in one mutate(); the modifications are applied sequentially
titanic_df |>
  select(pclass, survived, sex, age) |>
  mutate(
    pclass = as.factor(pclass),
    survived = as.factor(survived),
    sex = as.factor(sex),
    age = as.numeric(age),
    age = round(age),
    yob = 1912 - age
  )
# A tibble: 1,309 × 5
   pclass survived sex      age   yob
   <fct>  <fct>    <fct>  <dbl> <dbl>
 1 1st    1        female    29  1883
 2 1st    1        male       1  1911
 3 1st    0        female     2  1910
 4 1st    0        male      30  1882
 5 1st    0        female    25  1887
 6 1st    1        male      48  1864
 7 1st    1        female    63  1849
 8 1st    0        male      39  1873
 9 1st    1        female    53  1859
10 1st    0        male      71  1841
# ℹ 1,299 more rows

Create and/or modify columns

  • To run the same function for multiple columns, you can use across()
titanic_df |>
  select(pclass, survived, sex, age) |>
  mutate(
    across(c(pclass, survived, sex), as.factor),
    age = as.numeric(age),
    age = round(age),
    yob = 1912 - age
  )
# A tibble: 1,309 × 5
   pclass survived sex      age   yob
   <fct>  <fct>    <fct>  <dbl> <dbl>
 1 1st    1        female    29  1883
 2 1st    1        male       1  1911
 3 1st    0        female     2  1910
 4 1st    0        male      30  1882
 5 1st    0        female    25  1887
 6 1st    1        male      48  1864
 7 1st    1        female    63  1849
 8 1st    0        male      39  1873
 9 1st    1        female    53  1859
10 1st    0        male      71  1841
# ℹ 1,299 more rows

Create and/or modify columns

  • To simply rename columns, you can use rename()
titanic_df |> rename(passenger_class = pclass)
# A tibble: 1,309 × 17
   passenger_class survived name      sex   age   sibsp parch ticket  fare cabin
   <chr>              <dbl> <chr>     <chr> <chr> <dbl> <dbl> <chr>  <dbl> <chr>
 1 1st                    1 Allen, M… fema… 29        0     0 24160  211.  B5   
 2 1st                    1 Allison,… male  0.91…     1     2 113781 152.  C22 …
 3 1st                    0 Allison,… fema… 2         1     2 113781 152.  C22 …
 4 1st                    0 Allison,… male  30        1     2 113781 152.  C22 …
 5 1st                    0 Allison,… fema… 25        1     2 113781 152.  C22 …
 6 1st                    1 Anderson… male  48        0     0 19952   26.5 E12  
 7 1st                    1 Andrews,… fema… 63        1     0 13502   78.0 D7   
 8 1st                    0 Andrews,… male  39        0     0 112050   0   A36  
 9 1st                    1 Appleton… fema… 53        2     0 11769   51.5 C101 
10 1st                    0 Artagave… male  71        0     0 PC 17…  49.5 <NA> 
# ℹ 1,299 more rows
# ℹ 7 more variables: embarked <chr>, boat <chr>, body <chr>, home_dest <chr>,
#   dob <chr>, family <chr>, agecat <chr>

Joining tibbles

Joining tibbles

  • Example: you have a tibble that maps the passenger class to its ticket name
pclass_names <- tribble(
  ~pclass, ~name,
  "1st", "First class",
  "2nd", "Business",
  "3rd", "Economy"
)

pclass_names
# A tibble: 3 × 2
  pclass name       
  <chr>  <chr>      
1 1st    First class
2 2nd    Business   
3 3rd    Economy    

Joining tibbles

  • Now, for that to be a new column of the dataset, you can join the tibbles
  • … but which type of join do we want to use?

Joining tibbles

Important

  • Direction matters: the left tibble and the right tibble are not interchangeable
  • Make sure the key columns exist on both tibbles

Joining

  • For our example, we will do a left join, where the original tibble is on the left and the passenger class name tibble is on the right
titanic_df |> 
  select(pclass, name) |> 
  left_join(
    pclass_names,
    by = "pclass"
  )
# A tibble: 1,309 × 3
   pclass name.x                          name.y     
   <chr>  <chr>                           <chr>      
 1 1st    Allen, Miss. Elisabeth Walton   First class
 2 1st    Allison, Master. Hudson Trevor  First class
 3 1st    Allison, Miss. Helen Loraine    First class
 4 1st    Allison, Mr. Hudson Joshua Crei First class
 5 1st    Allison, Mrs. Hudson J C (Bessi First class
 6 1st    Anderson, Mr. Harry             First class
 7 1st    Andrews, Miss. Kornelia Theodos First class
 8 1st    Andrews, Mr. Thomas Jr          First class
 9 1st    Appleton, Mrs. Edward Dale (Cha First class
10 1st    Artagaveytia, Mr. Ramon         First class
# ℹ 1,299 more rows
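  • In the output above, both tibbles have a name column, so dplyr disambiguates them as name.x and name.y
  • For comparison, a minimal sketch of other join verbs, which share the same interface:
titanic_df |> 
  select(pclass, name) |> 
  inner_join(pclass_names, by = "pclass")  # keep only rows with a match in both tibbles

titanic_df |> 
  select(pclass, name) |> 
  anti_join(pclass_names, by = "pclass")   # keep rows of the left tibble with no match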

Vectorise R functions

What is a vector?

  • A vector is a container of elements of the same class
  • In math, \([1\space 2\space 3]\) is a vector of 3 integers
  • In R, you can define vectors by putting them inside c()
    • c(1, 2, 3) is a vector of 3 numeric elements
    • c("ab", "cd", "ef") is a vector of 3 character elements
    • c(c(1, 2), c(2, 3), c(3, 4)) is flattened by c() into a single numeric vector of 6 elements; to keep 3 vectors as separate elements, use list() instead (see the check below)
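  • A quick check of the flattening behaviour (base R only):
length(c(c(1, 2), c(2, 3), c(3, 4)))     # 6: c() flattens everything into one numeric vector
length(list(c(1, 2), c(2, 3), c(3, 4)))  # 3: a list keeps the three vectors as separate elements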

What is a vector?

  • Question: is 12 a vector?
class(12)
[1] "numeric"
class(c(12))
[1] "numeric"
class(c(12, 13, 14))
[1] "numeric"
  • In R, most things are vectors. A single number by itself is also a vector
  • 12 itself is a vector that has one numeric element

What is a vector?

  • Conceptually, a tibble is a named list of vectors, one vector per column!
  • You can see it using the str() function
# covid_cases <- readr::read_rds("data/covid_cases.rds") |> 
#   pivot_longer(
#     cols = -date,
#     names_to = "country", 
#     names_pattern = "cases_(.+)",
#     values_to = "cases"
#   )
str(covid_cases)
tibble [19,320 × 3] (S3: tbl_df/tbl/data.frame)
 $ date   : Date[1:19320], format: "2020-01-20" "2020-01-20" ...
 $ country: chr [1:19320] "chn" "kor" "aus" "jpn" ...
 $ cases  : num [1:19320] 0 0 0 0 0 0 0 0 0 0 ...
  • covid_cases has 3 vectors:
    • date is a vector of dates
    • country is a vector of characters
    • cases is a vector of numbers

What is a vector?

  • You can extract a column vector with the $ operator and subset its elements with [ ]
covid_cases$date[10000:10010]
 [1] "2020-03-08" "2020-03-08" "2020-03-08" "2020-03-08" "2020-03-08"
 [6] "2020-03-08" "2020-03-08" "2020-03-08" "2020-03-08" "2020-03-08"
[11] "2020-03-08"
covid_cases$country[10000:10010]
 [1] "kgz" "mne" "geo" "lie" "mco" "vat" "fro" "ggy" "jey" "imn" "gib"
covid_cases$cases[10000:10010]
 [1] 0 0 3 0 0 0 1 0 0 0 0

What is a vector?

  • Or the tidyverse way with dplyr::slice() and dplyr::pull()
covid_cases |> slice(10000:10010) |> pull(date)
 [1] "2020-03-08" "2020-03-08" "2020-03-08" "2020-03-08" "2020-03-08"
 [6] "2020-03-08" "2020-03-08" "2020-03-08" "2020-03-08" "2020-03-08"
[11] "2020-03-08"
covid_cases |> slice(10000:10010) |> pull(country)
 [1] "kgz" "mne" "geo" "lie" "mco" "vat" "fro" "ggy" "jey" "imn" "gib"
covid_cases |> slice(10000:10010) |> pull(cases)
 [1] 0 0 3 0 0 0 1 0 0 0 0

Vectorisation

  • Instead of going through each element and performing an action, we can apply an action to all elements at once
  • Example: square a vector element by element with a for-loop, then take the square root of the whole result at once
data <- covid_cases$cases[10000:10010]

# before
data
 [1] 0 0 3 0 0 0 1 0 0 0 0
squared_data <- numeric() # create new object to hold new data
idx <- seq(1, length(data)) # create index vector from 1 to the length of data
for (i in idx) {
  squared_data[i] <- data[i]^2
}

# after
squared_data
 [1] 0 0 9 0 0 0 1 0 0 0 0
# before
squared_data
 [1] 0 0 9 0 0 0 1 0 0 0 0
# after
sqrt_data <- sqrt(squared_data)
sqrt_data
 [1] 0 0 3 0 0 0 1 0 0 0 0
# alternative using `sapply()`
sqrt_data <- sapply(squared_data, sqrt)
sqrt_data
 [1] 0 0 3 0 0 0 1 0 0 0 0

Note

Most math-related functions and operators in base R are already vectorised, e.g. sqrt(), log(), exp(), +, -, *, /
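A tiny illustration of this built-in vectorisation:
sqrt(c(1, 4, 9))  # element-wise: returns 1 2 3
c(1, 2, 3) * 10   # element-wise: returns 10 20 30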

Vectorisation

  • R has a functional style, which makes vectorised code more intuitive to write and read
  • On a technical level:
    • R itself might be slightly faster with vectorisation than with for-loops
    • for more complex and time-consuming functions, vectorised code is easier to parallelise

Vectorisation

  • Vectorisation shines when the actions are more complex and you have to write your own functions
  • Example: assume the square roots of the values in data are circle diameters; calculate the circle areas with vectorisation using sapply()
area_from_diameter <- function(d) {
  return(3.14 * (sqrt(d) / 2)^2)
}
sapply(data, area_from_diameter)
 [1] 0.000 0.000 2.355 0.000 0.000 0.000 0.785 0.000 0.000 0.000 0.000

Vectorisation with map()

  • In the tidyverse, we can perform vectorisation with the purrr::map() family
  • Let’s check how it works with ?purrr::map()
  • Previous example using map()
map(data, area_from_diameter) |> list_c()
 [1] 0.000 0.000 2.355 0.000 0.000 0.000 0.785 0.000 0.000 0.000 0.000

Note

We call list_c() after map() because map() is designed to take in a list (or vector) and return a list. list_c() combines that list back into a single vector
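A related purrr helper: map_dbl() returns a numeric (double) vector directly, so no list_c() step is needed, e.g.
map_dbl(data, area_from_diameter)
 [1] 0.000 0.000 2.355 0.000 0.000 0.000 0.785 0.000 0.000 0.000 0.000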

Vectorisation with map()

  • If you have 2 vectors that you want to step through together, element by element, you can use map2()
  • Example: data2 is a vector of square side lengths; for each pair of elements, we want the circle area (as before) plus the area of the square
data2 <- c(29, 37, 22, 35, 39, 29, 30, 33, 43, 36, 26)

some_fn <- function(d, l){
  return(3.14 * (sqrt(d) / 2)^2 + l^2) # circle area (as before) plus square area
}

map2(data, data2, some_fn) |> list_c()
 [1]  841.000 1369.000  486.355 1225.000 1521.000  841.000  900.785 1089.000
 [9] 1849.000 1296.000  676.000

Vectorisation with map()

  • You can create new columns, or edit current columns, with complex actions when using map() with mutate()
  • Example: using the mtcars dataset, create a new column called hp_p_wt, the horsepower per tonne (1,000 kg) of each car; wt is recorded in 1,000 lbs, so dividing by 2.205 converts it to 1,000 kg
some_other_fn <- function(h, w){
  h / (w / 2.205) # w is in 1,000 lbs; w / 2.205 is the weight in 1,000 kg
}

mtcars |> 
  mutate(
    hp_p_wt = map2(hp, wt, some_other_fn) |> unlist(),
    .before = mpg
  )
                      hp_p_wt  mpg cyl  disp  hp drat    wt  qsec vs am gear
Mazda RX4            92.57634 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4
Mazda RX4 Wag        84.36522 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4
Datsun 710           88.39009 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4
Hornet 4 Drive       75.44323 21.4   6 258.0 110 3.08 3.215 19.44  1  0    3
Hornet Sportabout   112.17297 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3
Valiant              66.91474 18.1   6 225.0 105 2.76 3.460 20.22  1  0    3
Duster 360          151.32353 14.3   8 360.0 245 3.21 3.570 15.84  0  0    3
Merc 240D            42.85580 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4
Merc 230             66.50000 22.8   4 140.8  95 3.92 3.150 22.90  1  0    4
Merc 280             78.84157 19.2   6 167.6 123 3.92 3.440 18.30  1  0    4
Merc 280C            78.84157 17.8   6 167.6 123 3.92 3.440 18.90  1  0    4
Merc 450SE           97.51843 16.4   8 275.8 180 3.07 4.070 17.40  0  0    3
Merc 450SL          106.40751 17.3   8 275.8 180 3.07 3.730 17.60  0  0    3
Merc 450SLC         105.00000 15.2   8 275.8 180 3.07 3.780 18.00  0  0    3
Cadillac Fleetwood   86.10000 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3
Lincoln Continental  87.40321 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3
Chrysler Imperial    94.88307 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3
Fiat 128             66.15000 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4
Honda Civic          70.99690 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4
Toyota Corolla       78.10627 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4
Toyota Corona        86.76876 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3
Dodge Challenger     93.96307 15.5   8 318.0 150 2.76 3.520 16.87  0  0    3
AMC Javelin          96.28821 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3
Camaro Z28          140.68359 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3
Pontiac Firebird    100.35761 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3
Fiat X1-9            75.20930 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4
Porsche 914-2        93.76402 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5
Lotus Europa        164.68275 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5
Ford Pantera L      183.63407 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5
Ferrari Dino        139.30505 19.7   6 145.0 175 3.62 2.770 15.50  0  1    5
Maserati Bora       206.91176 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5
Volvo 142E           86.45504 21.4   4 121.0 109 4.11 2.780 18.60  1  1    4
                    carb
Mazda RX4              4
Mazda RX4 Wag          4
Datsun 710             1
Hornet 4 Drive         1
Hornet Sportabout      2
Valiant                1
Duster 360             4
Merc 240D              2
Merc 230               2
Merc 280               4
Merc 280C              4
Merc 450SE             3
Merc 450SL             3
Merc 450SLC            3
Cadillac Fleetwood     4
Lincoln Continental    4
Chrysler Imperial      4
Fiat 128               1
Honda Civic            2
Toyota Corolla         1
Toyota Corona          1
Dodge Challenger       2
AMC Javelin            2
Camaro Z28             4
Pontiac Firebird       2
Fiat X1-9              1
Porsche 914-2          2
Lotus Europa           2
Ford Pantera L         4
Ferrari Dino           6
Maserati Bora          8
Volvo 142E             2

Appendix

Programming logics in R

If-else

  • R provides conditional logic: depending on the outcome of a test, execute a specific statement
if (logical statement) {
    do this
} else {
    do that
}

If-else

  • A logical statement is any statement (piece of R code) that evaluates to a TRUE/FALSE result
24 < 50 # This is a logic statement
[1] TRUE
TRUE & FALSE # This is also a logic statement
[1] FALSE
"R programming" == "fun" # Still a logic statement
[1] FALSE
TRUE # Believe it or not, still a logic statement
[1] TRUE
100 %in% seq(50, 120, 2) # Also logic statement
[1] TRUE

If-else

  • Example:
x <- 10

if (x == 10) {
  print("x is 10")
} else {
  print("x is not 10")
}
[1] "x is 10"

If-else

  • You can infinitely* add more “branches” to if-else with else if
if (logical statement 1) {
    do this
} else if (logical statement 2) {
    do that
} else if (logical statement 3) {
    do something different
} else {
    do something other thing
}

*: it is technically possible, but of course you should not add too many else ifs; make sure that your code stays readable and understandable

If-else

  • Example:
x <- 24

if (x < 10) {
  print("x is less than 10")
} else if (x < 20) {
  print("x is greater than 10 and less than 20")
} else if (x < 30) {
  print("x is greater than 20 and less than 30")
} else {
  print("x is greater or equal to 30")
}
[1] "x is greater than 20 and less than 30"

If-else

  • Another example:
target_year <- 2025

if ((target_year %% 4 == 0 & target_year %% 100 != 0) |
  (target_year %% 400 == 0)) {
  print(paste0(target_year, " is a leap year."))
} else {
  print(paste0(target_year, " is not a leap year."))
}
[1] "2025 is not a leap year."

If-else

  • You can assign the resulting value if you want:
x <- 24

y <- if (x < 10) {
  "x is less than 10"
} else if (x < 20) {
  "x is greater than 10 and less than 20"
} else if (x < 30) {
  "x is greater than 20 and less than 30"
} else {
  "x is greater or equal to 30"
}

print(y)
[1] "x is greater than 20 and less than 30"

If-else

  • You may also stumble into ifelse()
print(ifelse(x > 10, "x is larger than 10", "x is smaller than 10"))
[1] "x is larger than 10"
# ?ifelse
  • It is a vectorised version of the typical if (...) {...} else {...} structure
  • That means it takes a whole vector of conditions and returns a vector of results, as sketched below
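  • A minimal sketch on a small (made-up) vector of values:
values <- c(5, 12, 8, 40)
ifelse(values > 10, "larger than 10", "not larger than 10")
# returns: "not larger than 10" "larger than 10" "not larger than 10" "larger than 10"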

For-loop

  • Imagine that you have to repeat the same analysis for many files that are all in the same folder on your computer
  • A solution for that might be an iterative construct like a for-loop:
files <- dir()

for (filename in files) {
    infile <- read.table(filename, ...)
    do something with `infile`
}

For-loop

  • A simple example simulating the growth in the number of infectious people with \(R_0\):
vec <- rep(0, 50) # 50 time steps
vec[1] <- 2 # starting number of infectious
r0 <- 1.5 # basic reproduction number R_0

# loop from the 2nd time step to the end of vector
for (i in 2:length(vec)) {
  # the number of infectious at this time step is the previous step's count times R_0
  vec[i] <- vec[i - 1] * r0
}

vec
 [1] 2.000000e+00 3.000000e+00 4.500000e+00 6.750000e+00 1.012500e+01
 [6] 1.518750e+01 2.278125e+01 3.417188e+01 5.125781e+01 7.688672e+01
[11] 1.153301e+02 1.729951e+02 2.594927e+02 3.892390e+02 5.838585e+02
[16] 8.757878e+02 1.313682e+03 1.970523e+03 2.955784e+03 4.433676e+03
[21] 6.650513e+03 9.975770e+03 1.496366e+04 2.244548e+04 3.366822e+04
[26] 5.050234e+04 7.575350e+04 1.136303e+05 1.704454e+05 2.556681e+05
[31] 3.835021e+05 5.752532e+05 8.628798e+05 1.294320e+06 1.941479e+06
[36] 2.912219e+06 4.368329e+06 6.552493e+06 9.828740e+06 1.474311e+07
[41] 2.211466e+07 3.317200e+07 4.975800e+07 7.463699e+07 1.119555e+08
[46] 1.679332e+08 2.518999e+08 3.778498e+08 5.667747e+08 8.501620e+08
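  • As an aside, the same geometric sequence can be produced without a loop, because ^ is vectorised:
vec_no_loop <- 2 * r0^(0:49)  # 2 * 1.5^0, 2 * 1.5^1, ..., 2 * 1.5^49
all.equal(vec, vec_no_loop)   # TRUE (up to floating-point tolerance)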

For-loop

plot(vec)
lines(vec)

Conditionals within for-loops

  • Let’s combine if-else logics with for-loops:
vec <- seq(-10, 10)
neg_slope <- 0.01

for (i in 1:length(vec)) {
  x <- vec[i]
  if (x >= 0) {
    vec[i] <- x
  } else {
    vec[i] <- x * neg_slope
  }
}

vec
 [1] -0.10 -0.09 -0.08 -0.07 -0.06 -0.05 -0.04 -0.03 -0.02 -0.01  0.00  1.00
[13]  2.00  3.00  4.00  5.00  6.00  7.00  8.00  9.00 10.00
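  • The same transformation can also be written without a loop, using the vectorised ifelse() from earlier:
vec2 <- seq(-10, 10)
ifelse(vec2 >= 0, vec2, vec2 * neg_slope)  # same values as `vec` above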

Conditionals within for-loops

plot(vec)
lines(vec)