Exercises Day 5

Author

Tuyen Huynh and the team

Published

Last version: April 04, 2025

Notes and instructions

Piping

Answers are expected to mainly use the pipe operator |> rather than nesting function calls inside each other.
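As a small illustration of the difference, the sketch below computes the same value twice, once with nested calls and once with the native pipe. It assumes R 4.1 or later (where |> is available); the numbers are made up.

```r
# Nested style: has to be read from the innermost call outwards
nested <- round(mean(sqrt(c(1, 4, 9, 16))), 1)

# Piped style: each step reads top to bottom, in the order it runs
piped <- c(1, 4, 9, 16) |>
  sqrt() |>
  mean() |>
  round(1)

# Both styles give the same result
identical(nested, piped)
```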

Dataset

We will use covid_cases.rds for this exercise.

Data dictionary

Column	Description
`date`	date of the case report
211 columns with the format `cases_{country_code}`	number of reported cases of each country and archipelago

Country codes of interest (i.e. codes for the countries needed in the exercise)

Code	Country name
`chn`	China
`usa`	United States of America
`vnm`	Vietnam
`sgp`	Singapore

Exercises

Task 1: Data import

Read data
Load the tidyverse meta-package into R

Note

A “meta-package” is a package that contains other packages. Remember, tidyverse is just a collection of packages.

Task 2: Data cleaning and filtering

Does the data follow the tidy data standard?
- If not, pivot the data into a tibble that follows the tidy data standard
Skim the data quickly. Do the values of each variable make sense given its type?
- If not, filter out incorrect/impossible data
Filter the data to keep only weeks 3 to 12 of 2020
Save the results back into covid_cases

Tip: pivoting

If your data is not tidy:
- “Lengthen” the data using tidyr::pivot_longer(), i.e. fewer columns, more rows
- “Widen” the data using tidyr::pivot_wider(), i.e. fewer rows, more columns
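A minimal sketch of lengthening a wide table shaped like the one in the data dictionary. The toy tibble and its numbers are made up, and the new column names country and cases are only one possible choice.

```r
library(tidyr)

# Toy wide table (made-up numbers): one column per country
wide <- tibble::tibble(
  date = as.Date(c("2020-01-20", "2020-01-21")),
  cases_vnm = c(0, 2),
  cases_sgp = c(1, 3)
)

# One row per date and country; the "cases_" prefix is stripped
# from the old column names to leave only the country code
long <- wide |>
  pivot_longer(
    cols = starts_with("cases_"),
    names_to = "country",
    names_prefix = "cases_",
    values_to = "cases"
  )

long
```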

Tip: data cleaning

The main variable here is the case count for each country. It should be numeric and should not be lower than 0, because case counts cannot be negative.
You can remove or filter undesirable values using dplyr::filter()
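A short sketch of dropping impossible values, using a made-up tibble. Note that dplyr::filter() also drops rows where the condition evaluates to NA, which may or may not be what you want for the real data.

```r
library(dplyr)

# Toy data (made-up numbers) with one negative and one missing count
toy <- tibble::tibble(
  country = c("vnm", "vnm", "sgp", "sgp"),
  cases = c(5, -2, NA, 3)
)

# Keep only rows with a non-negative case count;
# the NA row is removed too because NA >= 0 is NA
clean <- toy |>
  filter(cases >= 0)

clean
```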

Task 3: Data transformation

Using the covid_cases object created in Task 2:
- Group the data and calculate the total number of cases per country
- Select the top 5 countries with highest total cases
- Extract the country codes and save them into a new object, e.g. top_countries

Use the covid_cases object created in Task 2 again:
- Group the data again, calculate the total number of cases per country again
- Modify the country column:
  - Keep the country code only for countries in the top 5 (generated above); change every country not in the top 5 to "Others"
  - Turn this column into a factor type using the forcats package (already in tidyverse)
  - Group the data again, this time, group by date and country
  - Calculate the total number of cases per date per country
  - Create a new column called pct_cases: each country's share of the total cases reported on that date, as a percentage
  - (Optional) remove rows with NA from the tibble
  - There are many ways to do this part of the task
- Save all of this into a new object, e.g. plot_data

Tip: data grouping and summarisation

Group data with dplyr::group_by() and ungroup it with dplyr::ungroup()
Summarise each group into a single row with dplyr::summarise()
You can have multiple groups at the same time. For example:

`summarise()` has grouped output by 'date'. You can override using the
`.groups` argument.
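A minimal sketch of summarising with two grouping variables, on made-up data. Leaving out the .groups argument produces a message like the one above; setting it explicitly, as here, silences the message.

```r
library(dplyr)

# Toy long-format data (made-up numbers)
toy <- tibble::tibble(
  date = as.Date(c("2020-01-20", "2020-01-20", "2020-01-20", "2020-01-21")),
  country = c("vnm", "vnm", "sgp", "sgp"),
  cases = c(1, 2, 3, 4)
)

# One row per date and country combination
by_date_country <- toy |>
  group_by(date, country) |>
  summarise(total_cases = sum(cases), .groups = "drop")

by_date_country
```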

You can choose not to summarise, but add a new column to each group with dplyr::mutate(). For example:

This will create a new column total_cases that contains the sum of cases for each country without reducing each group to a single row, as summarise() would. The group total is repeated on every row of its group.
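A sketch of a grouped mutate() of this kind, on made-up data; the column names follow the description above, but the toy values are assumptions.

```r
library(dplyr)

# Toy data (made-up numbers)
toy <- tibble::tibble(
  country = c("vnm", "vnm", "sgp"),
  cases = c(1, 2, 4)
)

# Same number of rows as the input; each row carries its group's total
with_totals <- toy |>
  group_by(country) |>
  mutate(total_cases = sum(cases)) |>
  ungroup()

with_totals
```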

Tip: get top n values in a tibble

You can select rows with the top n values in a tibble using dplyr::slice_max()
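For instance, on a made-up table of totals, the two largest rows can be selected as sketched below. By default slice_max() keeps ties, so it can return more than n rows.

```r
library(dplyr)

# Toy totals (made-up numbers)
totals <- tibble::tibble(
  country = c("chn", "usa", "vnm", "sgp"),
  total_cases = c(80, 50, 10, 20)
)

# Rows with the 2 largest total_cases, largest first
top2 <- totals |>
  slice_max(total_cases, n = 2)

top2
```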

Tip: get values of a column in a tibble

You can extract the values of a column in a tibble as a vector using dplyr::pull()
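A short sketch on made-up data: pull() returns a plain vector rather than a one-column tibble.

```r
library(dplyr)

# Toy table (made-up numbers)
totals <- tibble::tibble(
  country = c("chn", "usa"),
  total_cases = c(80, 50)
)

# A character vector of country codes
codes <- totals |>
  pull(country)

codes
```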

Tip: turn a tibble column into a factor

You can create and work with factors inside a tibble using the forcats package (already in tidyverse)
You can “lump”, or gather, factor levels that are not frequent with forcats::fct_lump_n()
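A minimal sketch of lumping on a made-up factor. Here the two most frequent levels are kept and the rest become "Others"; the label is set through the other_level argument.

```r
library(forcats)

# Toy factor: chn appears 3 times, usa twice, vnm and sgp once each
country <- factor(c("chn", "chn", "chn", "usa", "usa", "vnm", "sgp"))

# Keep the 2 most frequent levels, lump everything else together
lumped <- fct_lump_n(country, n = 2, other_level = "Others")

table(lumped)
```

By default frequency means the number of rows per level. The w argument of fct_lump_n() accepts weights, for example a column of case counts, if the lumping should be based on totals instead.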

Task 4: Join data

Use the country_region.rds dataset to get the regions and sub-regions for all countries.
Make these 2 tables:

Write a short paragraph summarising key insights from the table, using inline R code to dynamically display case numbers. For example:

Europe (192,068) and Asia (126,420) reported the highest number of cases, while Oceania (1,879) had the lowest.

Note that the case numbers are not hard-coded: they are generated within the text by inline R code.

Tip: join the tables with left_join()

You can join covid_cases and country_region, using the country column of covid_cases and the alpha_3 column of country_region.
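A sketch of such a join on two made-up tables. The column names country and alpha_3 come from the tip above; the toy values, and the assumption that both tables write the codes in the same letter case, are illustrative only.

```r
library(dplyr)

# Toy stand-ins for covid_cases and country_region (made-up values)
cases <- tibble::tibble(
  country = c("vnm", "sgp", "xxx"),
  cases = c(10, 20, 5)
)
regions <- tibble::tibble(
  alpha_3 = c("vnm", "sgp"),
  region = c("Asia", "Asia")
)

# Keep every row of cases; unmatched codes get NA for region
joined <- cases |>
  left_join(regions, by = c("country" = "alpha_3"))

joined
```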

Tip: make the total cases table

You can use group_by() and summarise() to compute the total cases by sum(cases).

Tip: make inline R code

In the paragraph, use inline R code such as `{r} scales::comma(tmp1$total_cases[tmp1$region == "Europe"])` to reference computed values, where tmp1 stands for your table of total cases per region. The function scales::comma() formats numbers with commas as thousands separators (e.g., 100,000 instead of 100000).
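The expression can be tried in the console before it goes into the text. The sketch below uses a hypothetical table named region_totals, filled with two of the figures from the example sentence above; your own table name and values will differ.

```r
# Hypothetical summary table; values copied from the example sentence
region_totals <- tibble::tibble(
  region = c("Europe", "Oceania"),
  total_cases = c(192068, 1879)
)

# The same expression that would go inside the inline code
europe_cases <- scales::comma(
  region_totals$total_cases[region_totals$region == "Europe"]
)

europe_cases
```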

Task 5: Data visualization

Using ggplot2, try your best to generate the following figure:

Tip: how do I draw that??

The ggplot2 reference page is very useful.
Here are all the functions used to generate the plot above, in sequence:
- ggplot()
- geom_area()
- scale_y_continuous()
- scale_x_date()
- scale_fill_discrete()
- ggtitle()
- Every function is connected with the + operator
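A rough sketch of how these functions chain together, on made-up data. The aesthetics, labels, and title here are assumptions for illustration and are not taken from the target figure, so treat it as a starting point rather than the solution.

```r
library(ggplot2)

# Toy data (made-up shares): two groups over three dates
toy <- data.frame(
  date = rep(as.Date("2020-01-20") + 0:2, times = 2),
  country = rep(c("chn", "Others"), each = 3),
  pct_cases = c(0.9, 0.7, 0.5, 0.1, 0.3, 0.5)
)

p <- ggplot(toy, aes(x = date, y = pct_cases, fill = country)) +
  geom_area() +                                    # stacked areas, one per country
  scale_y_continuous(labels = scales::percent) +   # show 0.5 as 50%
  scale_x_date(date_labels = "%d %b") +            # e.g. "20 Jan"
  scale_fill_discrete(name = "Country") +          # legend title
  ggtitle("Share of reported cases (toy data)")

p
```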