Exercises Day 5

Author

Tuyen Huynh and the team

Published

Last version: April 04, 2025

Notes and instructions

Piping

Answers are expected to rely mainly on the pipe operator |> rather than nesting function calls inside each other.
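As a small illustration of the difference, the sketch below computes the same value twice, once with nested calls and once with the native pipe. The vector x is made-up toy data, and the native pipe |> assumes R 4.1 or later.

```r
# Toy vector, purely for illustration
x <- c(4, 9, 16)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped calls: read from left to right, one step at a time
piped <- x |>
  sqrt() |>
  mean() |>
  round(1)

identical(nested, piped)
```

Both versions do the same work; the piped form simply lists the steps in the order they happen.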

Dataset

We will use covid_cases.rds for this exercise.

Data dictionary
Column                                              Description
date                                                date of the case report
cases_{country_code} (211 columns in this format)   number of reported cases for each country or archipelago

Country codes of interest (i.e. the codes of the countries needed for the exercise)

Code   Country name
chn    China
usa    United States of America
vnm    Vietnam
sgp    Singapore

Exercises

Task 1: Data import

  • Read data

  • Load the tidyverse meta-package into R

Note

A “meta-package” is a package that contains other packages. Remember, tidyverse is just a collection of packages.
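A minimal sketch of loading the tidyverse and reading an .rds file. To keep it runnable anywhere, it first writes a tiny made-up tibble to a temporary file; toy_path and its contents are hypothetical stand-ins, and in the exercise you would point readRDS() at wherever covid_cases.rds is stored.

```r
library(tidyverse)

# Hypothetical stand-in for covid_cases.rds so the sketch runs anywhere
toy_path <- tempfile(fileext = ".rds")
tibble(date = as.Date("2020-01-20"), cases_vnm = 0, cases_chn = 291) |>
  saveRDS(toy_path)

# For the exercise, replace toy_path with the location of covid_cases.rds
covid_raw  <- readRDS(toy_path)   # base R reader
covid_raw2 <- read_rds(toy_path)  # readr equivalent, attached by tidyverse
```

Either reader should return the same object; which one to use is a matter of preference.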

Task 2: Data cleaning and filtering

  • Does the data follow the tidy data standard?

    • If not, pivot the data into a tibble that follows the tidy data standard
  • Take a quick look at the data. Do the values of each variable make sense given its type?

    • If not, filter out incorrect/impossible data
  • Filter the data so that only weeks 3 to 12 of 2020 remain

  • Save the results back into covid_cases

  • If your data is not tidy:

    • “Lengthen” the data with tidyr::pivot_longer(), i.e. fewer columns, more rows

    • “Widen” the data with tidyr::pivot_wider(), i.e. fewer rows, more columns

  • The main variable here is case counts from each country. They should have a numeric type and should not be lower than 0. You can’t have negative case counts

  • You can remove or filter undesirable values using dplyr::filter()
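The hints above can be sketched on a toy table shaped like the data dictionary. Everything in wide is made up, including the planted negative value. The week filter assumes ISO week numbers via lubridate::isoweek(); lubridate::week() counts weeks differently, so treat that choice as an assumption rather than the intended definition.

```r
library(tidyverse)

# Hypothetical wide data: one row per date, one column per country
wide <- tibble(
  date      = as.Date(c("2020-01-06", "2020-01-13", "2020-01-20")),
  cases_chn = c(0, 41, 291),
  cases_vnm = c(0, 0, -1)   # -1 is an impossible value planted for the demo
)

long <- wide |>
  pivot_longer(
    cols         = starts_with("cases_"),
    names_to     = "country",
    names_prefix = "cases_",   # strip the prefix so only the code remains
    values_to    = "cases"
  ) |>
  filter(cases >= 0) |>        # drop impossible negative counts
  filter(
    lubridate::year(date) == 2020,
    between(lubridate::isoweek(date), 3, 12)   # assumed ISO weeks 3 to 12
  )
```

The result has one row per date and country, with the country code in its own column.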

Task 3: Data transformation

  • Using the covid_cases object created in Task 2:

    • Group the data and calculate the total number of cases per country

    • Select the top 5 countries with highest total cases

    • Extract the country codes and save them into a new object, e.g. top_countries

  • Use the covid_cases object created in Task 2 again:

    • Group the data again, calculate the total number of cases per country again

    • Modify the country column:

      • So that we only keep the names of countries in the top 5 (generated above). Countries not in the top 5 will be changed to "Others"

      • Turn this column into a factor type using the forcats package (already in tidyverse)

      • Group the data again, this time, group by date and country

      • Calculate the total number of cases per date per country

      • Create a new column called pct_cases, which is each country's percentage of the total cases on that date

      • (Optional) remove rows with NA from the tibble

      • There are many ways to do this part of the task

    • Save all of this into a new object, e.g. plot_data

  • You can choose not to summarise, but add a new column to each group with dplyr::mutate(). For example:
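A minimal sketch of this pattern, assuming a tidy tibble with country and cases columns; the toy values are made up for illustration.

```r
library(tidyverse)

# Hypothetical tidy data
toy <- tibble(
  country = c("chn", "chn", "vnm"),
  cases   = c(41, 291, 2)
)

with_totals <- toy |>
  group_by(country) |>
  mutate(total_cases = sum(cases)) |>  # one total per country, repeated on each row
  ungroup()
```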

This will create a new column total_cases that contains the sum of cases for each country without reducing each group into a single row like summarise(). Values will be duplicated for each group.

  • You can extract the values of a column in a tibble using dplyr::pull()
  • You can create and work with factors inside a tibble using the forcats package (already in tidyverse)
  • You can “lump”, or gather, factor levels that are not frequent with forcats::fct_lump_n()
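The three hints above can be combined as in the sketch below. The toy tibble and the choice of a top 2 (rather than top 5) are assumptions made to keep the example small. Two routes to the same "Others" level are shown: forcats::fct_other() with an explicit vector of codes, and forcats::fct_lump_n() weighted by case counts.

```r
library(tidyverse)

# Hypothetical tidy data
toy <- tibble(
  country = c("chn", "usa", "vnm", "sgp", "chn", "usa"),
  cases   = c(100, 50, 2, 3, 200, 80)
)

# Codes of the top 2 countries by total cases, as a plain character vector
top_countries <- toy |>
  group_by(country) |>
  summarise(total_cases = sum(cases)) |>
  slice_max(total_cases, n = 2) |>
  pull(country)

# Route 1: keep the listed codes, collapse everything else into "Others"
kept <- toy |>
  mutate(country = fct_other(country, keep = top_countries, other_level = "Others"))

# Route 2: let forcats pick the 2 largest levels, weighting rows by cases
lumped <- toy |>
  mutate(country = fct_lump_n(country, n = 2, w = cases, other_level = "Others"))
```

Route 1 makes the kept countries explicit, which helps when the same vector is reused later; route 2 is shorter when the ranking is only needed once.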

Task 4: Join data

  • Use the dataset country_region.rds to get the regions and sub-regions for all countries.

  • Make these 2 tables:

  • Write a short paragraph summarising key insights from the table, using inline R code to dynamically display case numbers. For example:

Europe (192,068) and Asia (126,420) reported the highest number of cases, while Oceania (1,879) had the lowest.

Note that the case numbers are not hard-coded; inline R code generates them within the text.

You can join covid_cases and country_region, using the country column of covid_cases and the alpha_3 column of country_region.

You can use group_by() and summarise() to compute the total cases by sum(cases).

In the paragraph, use inline R code such as `{r} scales::comma(tmp1$total_cases[tmp1$region == "Europe"])` to reference computed values, where tmp1 stands for your summary table. The function scales::comma() formats numbers with commas as thousands separators (e.g., 100,000 instead of 100000).
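A minimal sketch of the join and summary on miniature stand-ins for the two tables. covid_toy, region_toy and their values are hypothetical, and the sketch assumes the country codes match in case between the two tables, which may need checking in the real data.

```r
library(tidyverse)

# Hypothetical miniature versions of covid_cases and country_region
covid_toy <- tibble(
  country = c("chn", "vnm", "usa"),
  cases   = c(1200, 16, 5000)
)
region_toy <- tibble(
  alpha_3 = c("chn", "vnm", "usa"),
  region  = c("Asia", "Asia", "Americas")
)

tmp1 <- covid_toy |>
  left_join(region_toy, by = c("country" = "alpha_3")) |>  # match on country code
  group_by(region) |>
  summarise(total_cases = sum(cases))

# The expression you would place inside inline R code in the paragraph
asia_total <- scales::comma(tmp1$total_cases[tmp1$region == "Asia"])
asia_total
```

Because the number is computed from tmp1, the sentence updates automatically whenever the underlying data changes.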

Task 5: Data visualization

  • Using ggplot2, try your best to generate the following figure:

  • The ggplot2 reference page is very useful
  • Here are all the functions used to generate the plot above, in sequence:
    • ggplot()
    • geom_area()
    • scale_y_continuous()
    • scale_x_date()
    • scale_fill_discrete()
    • ggtitle()
    • Every function is connected with the + operator
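The function sequence above can be sketched as follows. This is not the target figure: plot_toy is made-up data, pct_cases is assumed to be stored as a proportion between 0 and 1, and the labels, breaks and title are placeholder choices.

```r
library(tidyverse)

# Hypothetical data in the shape of plot_data from Task 3
plot_toy <- tibble(
  date      = rep(as.Date(c("2020-01-13", "2020-01-20")), each = 2),
  country   = factor(rep(c("chn", "Others"), times = 2)),
  pct_cases = c(0.9, 0.1, 0.7, 0.3)   # assumed proportions, not 0 to 100
)

p <- ggplot(plot_toy, aes(x = date, y = pct_cases, fill = country)) +
  geom_area() +                                             # stacked areas per country
  scale_y_continuous(labels = scales::percent) +            # show 0.9 as 90%
  scale_x_date(date_breaks = "1 week", date_labels = "%d %b") +
  scale_fill_discrete(name = "Country") +
  ggtitle("Share of reported cases by country (toy data)")
```

Printing p draws the plot; if your pct_cases is already on a 0 to 100 scale, the y-axis labels would need a different formatter.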