How to explore data?

In this topic, we are focusing on getting our data ready for analysis and reporting. We will cover how to:

Interrogate and check the data for inaccuracies (explore)
Change the format and transform the data (tidy)
Visualise and clean the data (prepare).

You might also find this type of task referred to as data wrangling. You can expect to spend as much as 80% of your time on this part of the process, sometimes more!

artwork by Allison Horst

We are continuing to use one of the datasets available from Public Health Scotland:

Cancelled Planned Operations by Health Board

We’re also going to be using the Global Burden of Disease (GBD, or gbd in lowercase) dataset, which includes information on the causes of death globally. We will be using the number of deaths for the three broad cause categories: Injuries, Communicable diseases, and Non-communicable diseases. The project's datasets also include much more detailed breakdowns by type of disease, by country, and by metric. They are freely available from here:

Global Burden of Disease data

Here are the functions we’ll be covering in the topic videos:

base::library() we’re already familiar with for loading packages
base::names() allows us to view all the column names or variables in our dataset
dplyr::distinct() for exploring the unique values in our columns or variables
dplyr::filter() allows us to subset data by keeping only the rows that meet criteria based on our variables
dplyr::select() allows us to subset data by selecting specific columns or variables
dplyr::glimpse() provides information on dimensions, data types and a small preview
dplyr::arrange() & dplyr::desc() lets you sort your data (default is ascending order)
tidyr::pivot_longer() lengthens the data, increasing rows and decreasing columns
tidyr::pivot_wider() widens the data, increasing columns and decreasing rows
dplyr::mutate() for changing or adding columns
stringr::str_replace() for finding and replacing strings (text) in your data
ggplot2::ggplot() & ggplot2::aes() lets R know you are about to make a plot and what you’re going to plot
ggplot2::geom_point() lets R know you want to make a scatter plot
base::replace() for finding and replacing values in your data
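To give a feel for how the exploration functions in this list fit together, here is a minimal sketch. The toy_ops tibble, its column names and its values are all made up for illustration and are not the real Public Health Scotland data.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset
toy_ops <- tibble(
  HBName = c("NHS Fife", "NHS Lothian", "NHS Fife", "NHS Borders"),
  Month = c("2019-01", "2019-01", "2019-02", "2019-02"),
  TotalCancelled = c(12, 85, 9, 4)
)

names(toy_ops)     # view the column names
glimpse(toy_ops)   # dimensions, data types and a small preview

# unique values in one column
toy_ops %>%
  distinct(HBName)

# keep the rows for one board, keep two columns, sort largest first
fife_sorted <- toy_ops %>%
  filter(HBName == "NHS Fife") %>%
  select(Month, TotalCancelled) %>%
  arrange(desc(TotalCancelled))

fife_sorted
```

Each step takes a tibble in and passes a tibble on, which is why the functions chain so neatly with the pipe.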

artwork by Allison Horst

Explore Data

The main package we will be using to explore the data is dplyr, which is part of the tidyverse. Watch the following video to learn about some of the functions for carrying out your initial exploration of your data.

If you want to recreate the following session on your own computer, you can download the cancelled_operations project folder here: cancelled_operations.zip

Note: To open the project, make sure to “unzip” or “extract all” from the folder, before clicking on the cancelled_operations.Rproj file to open the project in RStudio.

Alternatively, here is the individual cancelled_messy file: phs_cancelled_messy.Rda. Note: To load the data, find the file on your computer, double click to open (select RStudio if prompted, and select “yes” to loading into your Global Environment).

Tidy Data

Watch the following video to learn about some of the functions for reshaping your data so that you can get it into a format which makes it easy to plot. If you want to recreate the following session on your own computer, use the cancelled_operations folder that you downloaded before.

Download the video transcript here

As described in the video, and visualised below, in a tidy dataset (a) each variable forms a column, (b) each observation forms a row, and (c) each cell is a single measurement.

Tidy data cartoon

artwork by Allison Horst
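As a rough illustration of reshaping towards the tidy format, the sketch below lengthens a small wide table and then widens it again. The wide tibble and its numbers are invented for the example; only the function names come from the topic list.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one column per year (values are made up)
wide <- tibble(
  cause = c("Injuries", "Communicable diseases"),
  `1990` = c(1, 2),
  `2017` = c(3, 4)
)

# Lengthen: year becomes a variable in its own column
long <- wide %>%
  pivot_longer(cols = -cause,
               names_to = "year",
               values_to = "deaths_millions") %>%
  mutate(year = as.numeric(year))   # column names arrive as text

# Widen: back to one column per year
wide_again <- long %>%
  pivot_wider(names_from = year, values_from = deaths_millions)

long
wide_again
```

In the long version each row is one observation (a cause in a year), which is the shape that plotting functions generally expect.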

Prepare Data

In this video we’re going to take a first look at visualising our data as a way of exploring it and looking for outliers or mistakes. If you want to recreate the following session on your own computer, use the cancelled_operations folder that you downloaded before.

Download the video transcript here
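The idea of plotting to spot a mistake and then cleaning it can be sketched as follows. The toy tibble is hypothetical, and the value 999 is a deliberately planted data-entry error, so treat this as an illustration rather than the exact steps in the video.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data with one obviously wrong value (999)
toy <- tibble(
  month = 1:6,
  cancelled = c(10, 12, 11, 999, 9, 13)
)

# A scatter plot makes the odd point stand out
p <- ggplot(toy, aes(x = month, y = cancelled)) +
  geom_point()
p

# Replace the suspicious value with NA (missing) rather than guessing a number
toy_clean <- toy %>%
  mutate(cancelled = replace(cancelled, cancelled == 999, NA))

toy_clean
```

Setting the value to NA is only one option. Whether to correct, remove or keep an outlier depends on what you can find out about where it came from.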

Other Useful Functions and Operations

Here are some other functions and operators you might find useful when wrangling data.

The c() combine function and the %in% operator

The combine function, as you may well have guessed, is useful for combining several values: c(). We do this by listing the different elements between the brackets, separated by a comma. Text has to be within inverted commas.

The c() function crops up all over the place, for example, in functions where we want to list multiple values as the input to one argument. But it is particularly useful for filtering data.

When combined with the special type of operator, the %in% operator, it can save us extra typing and make our code more readable. We use the %in% operator to identify if a value or set of values appears within another set of values.
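Before using them on a dataset, it can help to see c() and %in% on their own. This is a small sketch using made-up vectors; the object names are only for illustration.

```r
# Combine several values into one vector; text goes in inverted commas
causes <- c("Injuries", "Communicable diseases", "Non-communicable diseases")
years <- c(1990, 2017)

# Does a single value appear in the set?
one_check <- "Injuries" %in% causes
one_check

# Check several values at once: one TRUE/FALSE per value on the left
many_checks <- c("Injuries", "Cancer") %in% causes
many_checks
```

The result always has the same length as whatever is on the left of %in%, which is what makes it work row by row inside filter().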

Let’s quickly recap how we might filter data if we didn’t know about the %in% operator.

If we were interested in filtering our GBD dataset to only include “Communicable” and “Non-communicable” diseases, we might use the following code:

gbd_full %>%
  # also filtering for a single year to keep the result concise 
  filter(year == 1990) %>%
  filter(cause == "Communicable diseases" | cause == "Non-communicable diseases")

Which produces the following output:

GBD output

This works perfectly well, but what if we had a column where we wanted to look at data for a whole list of values? Perhaps we have a column of Health Boards and we want to look at data for all those in the East of Scotland. We would have to repeat | HBName == for each value.

This is where the %in% operator and c() function can save us some typing. The following code gives exactly the same output as our previous code:

gbd_full %>%
  # also filtering for a single year to keep the result concise 
  filter(year == 1990) %>%
  filter(cause %in% c("Communicable diseases", "Non-communicable diseases"))

Which produces the following output:

GBD output 2
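The Health Board case mentioned earlier might look something like the sketch below. The toy_boards tibble and its values are made up, and the choice of which boards count as the East of Scotland is an assumption for the example.

```r
library(dplyr)

# Hypothetical toy data; values are for illustration only
toy_boards <- tibble(
  HBName = c("NHS Borders", "NHS Fife", "NHS Lothian",
             "NHS Highland", "NHS Grampian"),
  cancelled = c(4, 12, 85, 20, 31)
)

# The long way: repeat | HBName == for every board
east_long <- toy_boards %>%
  filter(HBName == "NHS Borders" | HBName == "NHS Fife" | HBName == "NHS Lothian")

# The short way: store the list once (assumed grouping), then use %in%
east_boards <- c("NHS Borders", "NHS Fife", "NHS Lothian")

east_short <- toy_boards %>%
  filter(HBName %in% east_boards)

east_short
```

Storing the list in its own object also means it can be reused in later filters without retyping it.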

The paste() function for joining values together

We’ve just explored the c() function, which allows values to be “combined”. The values are not actually joined to make one new value, though: they remain separate.

If we want to make a new character variable by joining other variables together, we need to use the paste() function.

The paste() function allows us to paste several variables (e.g. words, numbers, or dates) into one character variable or sentence.

For example, we might want to create a new column with text which could be used as a label:

# create a new column
gbd_full %>%
  mutate(year_label_paste = paste("Year is", year))

Which produces the following output:

GBD output

In the paste() function we need to specify each element we would like to combine, separated by a comma. By default, the separation between the elements is a space, but we can change this using the sep = argument within the paste function.

Note: For more information on this, try looking the function up in the Help tab in RStudio, either by pressing F1 when the cursor is within the function name, or by searching in the Help tab search box.
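As a quick sketch of how the sep = argument changes the result, here are a few calls to paste() on simple made-up values (the object names are only for illustration):

```r
# Default separator is a single space
label_default <- paste("Year is", 1990)
label_default

# Change the separator with sep =
label_colon <- paste("Year", 1990, sep = ": ")
label_colon

# An empty separator joins the elements with nothing in between
label_joined <- paste("gbd", 1990, sep = "_")
label_joined

# paste() works element by element on vectors too
labels_many <- paste("Year is", c(1990, 2017))
labels_many
```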

The pull() function or the extract operator

We’ve seen how to use the select() function to select a specific column or columns from a dataset. The type of object we get out is the same as the type we put in, just smaller, so if we pipe a tibble (the name for a tidyverse data frame) into select(), we get a smaller tibble out.

Sometimes, however, we might need our output to be in a slightly different format. We might need it to be a vector, which we can think of as a simple list of values.

We can get a vector, either by using the base R extract operator $, or we can use the pull() function from the dplyr package.

Let’s see how both of these work:

# $ operator 
gbd_full$deaths_millions 
 
# pull() function 
gbd_full %>%
  pull(deaths_millions)

R will give us all the data for that variable - as seen in the following output:

GBD output
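To make the difference between select() and pull() concrete, here is a small sketch. The toy_gbd tibble reuses the column names from above, but its values are invented for the example.

```r
library(dplyr)

# Hypothetical stand-in for gbd_full (values are made up)
toy_gbd <- tibble(
  year = c(1990, 1990, 2017),
  cause = c("Injuries", "Communicable diseases", "Injuries"),
  deaths_millions = c(1.5, 2.5, 3.5)
)

# select() gives back a smaller tibble
as_tibble_out <- toy_gbd %>%
  select(deaths_millions)
is.data.frame(as_tibble_out)

# pull() gives back a plain vector, the same as using $
as_vector_out <- toy_gbd %>%
  pull(deaths_millions)
is.data.frame(as_vector_out)
identical(as_vector_out, toy_gbd$deaths_millions)

# Many base R functions expect a vector like this
mean(as_vector_out)
```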

The round() function

Not surprisingly, we can use the round() function from base R to round numeric variables. By default it rounds to the nearest whole number.

For example, in the GBD dataset we might want to round the total deaths column (deaths_millions) to no decimal places.

# round to whole number 
round(gbd_full$deaths_millions)

Which produces the following output:

GBD output

Note: Here we use the $ sign to specify which column from the gbd_full dataset we wanted. We did not pipe (%>%) gbd_full into round() because the first argument in this base R function is looking for a numeric vector (a list of numbers) rather than the name of a tibble or data frame, which is what we are more used to seeing in the tidyverse functions.
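If we do want to stay within a pipe, one option is to call round() inside mutate(), where it receives the column as a vector. The sketch below uses a made-up toy_gbd tibble and also shows the digits = argument for keeping some decimal places.

```r
library(dplyr)

# Made-up values for illustration
x <- c(1.234, 2.567, 9.81)

whole <- round(x)               # no decimal places (the default)
one_dp <- round(x, digits = 1)  # one decimal place
whole
one_dp

# Hypothetical stand-in for gbd_full
toy_gbd <- tibble(
  cause = c("Injuries", "Communicable diseases", "Non-communicable diseases"),
  deaths_millions = x
)

# Inside mutate(), round() is handed the column as a numeric vector
toy_rounded <- toy_gbd %>%
  mutate(deaths_rounded = round(deaths_millions, digits = 1))

toy_rounded
```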

Other Useful Packages

lubridate

Dates can often be a bit of a nightmare. They can come in many different formats and are prone to error and so can be awkward to work with. In the “Import Excel” video in the last topic we saw that R wasn’t able to recognise the date column when we used the Import Wizard.

Thankfully, like many problems in R, there is a package for just this issue and it’s called the lubridate package. It’s a tidyverse package which makes it easy to work with dates.

Here is a quick taster. You can:

parse (read in) a very wide range of formats with functions such as ymd(), dmy() and mdy(), which are named after the order of the year, month and day in your data
extract the relevant parts from your dates with year() and month() functions
easily calculate time intervals with the interval() function
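The three points above can be sketched in a few lines. The dates here are arbitrary examples, and time_length() is an extra lubridate helper, not mentioned above, used to turn the interval into a number of days.

```r
library(lubridate)

# Parse text into dates; the function name matches the order in the text
d1 <- ymd("2019-03-25")
d2 <- dmy("25/12/2019")

# Extract parts of a date
year(d1)
month(d1)

# Build an interval between the two dates and measure its length
span <- interval(d1, d2)
days_between <- time_length(span, "day")
days_between
```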

Lubridate cartoon

artwork by Allison Horst

janitor

Cleaning and tidying our data takes up a considerable amount of valuable time, but there’s one function which can help to make our lives a little bit easier when we’re going round the bend trying to fix badly named variables. It’s the clean_names() function from the janitor package.

The clean_names() function converts your variable names into snake case style. This just means that it converts capital letters to lower case and replaces spaces and punctuation with underscores. This makes the variables easier to work with, because column headers containing spaces or unusual characters and symbols are awkward to refer to in R.
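Here is a minimal sketch of clean_names() in action. The messy tibble and its badly named columns are invented for the example.

```r
library(dplyr)
library(janitor)

# Hypothetical data with awkward column names
messy <- tibble(
  `Health Board` = c("NHS Fife", "NHS Lothian"),
  `Total.Cancelled` = c(12, 85),
  `Month Year` = c("Jan 2019", "Jan 2019")
)

names(messy)

# Convert every column name to snake case in one step
clean <- messy %>%
  clean_names()

names(clean)
```

After cleaning, the columns can be typed directly in functions such as select() or filter() without needing backticks around them.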

Janitor clean names cartoon

artwork by Allison Horst