Introduction to R and A Basic Analysis of Athlete Load

R is a free, open-source programming language and software environment for statistical computing and graphics. Learning how to use R for data analysis and visualisation purposes can be a daunting task. However, there are a number of free online resources to guide basic analysis and troubleshoot where possible. These include:

  • Cookbook for R: An online guide to provide solutions for common tasks and problems when analysing data in R
  • R Tutorial: An introduction to statistics that explains basic concepts in R
  • Quick-R: A website that assists with data input, management and statistics in R
  • R-bloggers: A news and information site that pulls together blog posts on R
  • Stack Overflow: A question and answer forum on all things code, statistics and plotting

The above websites are a fantastic resource for getting started in R with basic analysis. To complement them, I have constructed a basic guide for sport science and physiology users, using athlete load as an example. Personally, I prefer to work in RStudio, which provides a free, friendly interface to run code and view plots.

To start, run the following code to load a .csv or .txt file into R and store it in an object named “RawLoadData”. You will need to substitute the file location below with your own.

# Read a .csv file into R
RawLoadData <- read.csv("/Volumes/Research/Thesis/Manuscripts/AthleteLoadData.csv")
# Read a .txt file into R
RawLoadData <- read.table("/Volumes/Research/Thesis/Manuscripts/AthleteLoadData.txt")
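
If you do not have a data file to hand, the round trip below may help to show what these functions expect. It is only a minimal sketch: the object names (ExampleData, ExampleFile, ExampleTxt) and the values are invented for illustration, and the files are written to a temporary location rather than to a real project folder.

```r
# Hypothetical example data; names and values are invented for illustration
ExampleData <- data.frame(Athletes = c("Charles", "Mia"),
                          Day = c(1L, 1L),
                          Load = c(42.5, 61.0))
# Write the data to a temporary .csv file, then read it back into R
ExampleFile <- tempfile(fileext = ".csv")
write.csv(ExampleData, ExampleFile, row.names = FALSE)
ReadBack <- read.csv(ExampleFile)
# Check that the column names and row count survived the round trip
names(ReadBack)
nrow(ReadBack)
# For a tab-delimited .txt file with column names in the first row,
# read.table() needs to be told about the header and the separator
ExampleTxt <- tempfile(fileext = ".txt")
write.table(ExampleData, ExampleTxt, sep = "\t", row.names = FALSE)
ReadBackTxt <- read.table(ExampleTxt, header = TRUE, sep = "\t")
```

Unlike read.csv, read.table does not assume that the first row holds column names, so header = TRUE is worth checking if your columns come through as V1, V2 and so on.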

R uses data.frames and matrices to store data. The difference between the two is that a matrix requires every element to be of the same type (all numeric or all character, for example), whereas a data.frame allows each column to have its own class. You can switch between the two using as.data.frame and as.matrix, although be aware that if you convert a data.frame containing columns of different classes, every value will become a character string in the matrix. To convert an object to a data.frame or matrix, use the following code:

# Create a data.frame
RawLoadData <- as.data.frame(RawLoadData)
# Create a matrix
RawLoadData <- as.matrix(RawLoadData)
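
The coercion described above is easy to see on a tiny example. The sketch below uses made-up values and object names (MixedData, MixedMatrix, NumericMatrix) chosen only for illustration.

```r
# A small data.frame with one character column and one numeric column
MixedData <- data.frame(Athletes = c("Charles", "Mia"),
                        Load = c(42.5, 61.0))
# Converting the whole data.frame forces every value to character
MixedMatrix <- as.matrix(MixedData)
is.character(MixedMatrix)   # TRUE
# Converting only the numeric column keeps the matrix numeric
NumericMatrix <- as.matrix(MixedData[, "Load", drop = FALSE])
is.numeric(NumericMatrix)   # TRUE
```

If you need to do arithmetic on the values afterwards, it is usually safer to keep mixed data in a data.frame and convert only the numeric columns.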

To create a data.frame of dummy athlete load data collected over seven days, use the code below.

# Create a list of athlete names
Athletes <- c("Charles", "Mia", "Alfie", "Sophie")
# Define the constants
NumberOfAthletes <- length(Athletes)
DaysOfLoad <- 7
# Set the seed so the same random values are generated each time
set.seed(28)
# Create the data.frame
RawLoadData <- data.frame(Athletes = rep(Athletes, DaysOfLoad),
                          Day = rep(1:DaysOfLoad, each = NumberOfAthletes),
                          Load = runif(NumberOfAthletes * DaysOfLoad, min = 0, max = 100))

The structure of a data.frame can be inspected by running the code below. It shows that the dataset consists of athlete names (character strings, or factors in versions of R before 4.0.0), days (integers) and load (a numeric variable).

str(RawLoadData)

Columns of a data.frame can be viewed by typing and running RawLoadData$Athletes; however, this command will not work for a matrix. To create a new column and add it to our existing `data.frame`, such as the playing position of each athlete, type and run the following code (here the same position is assigned to every row):

# To create a new column
RawLoadData$PlayingPosition <- c("Midcourt")
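
In practice, athletes rarely all share one position. One way to assign a different value per athlete is a named lookup vector, sketched below. The positions are hypothetical and invented purely for illustration; the data are recreated so the example runs on its own.

```r
# Recreate the dummy load data from above
Athletes <- c("Charles", "Mia", "Alfie", "Sophie")
set.seed(28)
RawLoadData <- data.frame(Athletes = rep(Athletes, 7),
                          Day = rep(1:7, each = 4),
                          Load = runif(28, min = 0, max = 100))
# Hypothetical positions, one per athlete, stored as a named vector
Positions <- c(Charles = "Midcourt", Mia = "Shooter",
               Alfie = "Defender", Sophie = "Midcourt")
# Look up the position for each row using the athlete's name
RawLoadData$PlayingPosition <- unname(Positions[as.character(RawLoadData$Athletes)])
# Cross-tabulate to confirm each athlete has a single position
table(RawLoadData$Athletes, RawLoadData$PlayingPosition)
```

The as.character() call keeps the lookup working by name even if the Athletes column is stored as a factor.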

Summary statistics of grouped data can easily be calculated with the plyr package, which will need to be installed into R prior to first use. To calculate the mean and SD of load for each day, purely as an example, use the following:

# Load the required package
require(plyr)
# To calculate the mean and SD for load, across each day
SummaryLoadData = ddply(RawLoadData, c("Day"), summarise,
Mean = mean(Load),
SD = sd(Load))
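
If you would rather not install an extra package, a similar table can be built with base R. The following is a minimal sketch using aggregate(); the object names DailyMean, DailySD and SummaryLoadBase are my own, and the data are recreated so the block runs on its own.

```r
# Recreate the dummy load data from above
Athletes <- c("Charles", "Mia", "Alfie", "Sophie")
set.seed(28)
RawLoadData <- data.frame(Athletes = rep(Athletes, 7),
                          Day = rep(1:7, each = 4),
                          Load = runif(28, min = 0, max = 100))
# Mean and SD of load for each day, calculated separately
DailyMean <- aggregate(Load ~ Day, data = RawLoadData, FUN = mean)
DailySD <- aggregate(Load ~ Day, data = RawLoadData, FUN = sd)
# Combine into one data.frame with Day, Mean and SD columns
SummaryLoadBase <- data.frame(Day = DailyMean$Day,
                              Mean = DailyMean$Load,
                              SD = DailySD$Load)
SummaryLoadBase
```

Both calls return one row per day in ascending order, which is why the columns can be combined directly.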

The above data can then be plotted using the ggplot2 package, which first needs to be installed into R. Use the code below to visualise the mean data:

# Load the required package
require(ggplot2)
# Basic plot of the mean load over a seven-day period
ggplot(SummaryLoadData, aes(x = Day, y = Mean)) +
geom_bar(stat = "identity")


A few tweaks will deliver a much more visually pleasing plot. To have a white background, coloured bars, clearer axis labels and bolder tick marks and labels, use the following code:

# A few tweaks to create a neater plot
ggplot(SummaryLoadData, aes(x = Day, y = Mean, fill = factor(Day))) +
geom_bar(stat = "identity") +
ylab("Average Load (AU)\n") +
xlab("\nDay") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 80)) +
scale_x_continuous(breaks = c(1:7)) +
theme_classic() +
theme(legend.position = "none",
axis.line = element_line(colour = "black", size = .75, linetype = "solid"),
axis.title.x = element_text(face = "bold", size = 15),
axis.title.y = element_text(face = "bold", size = 15),
axis.text.y = element_text(face = "bold"),
axis.text.x = element_text(face = "bold"),
axis.ticks = element_line(size = .5))


To plot individual load data, use the following code:

# Individual responses
ggplot(RawLoadData, aes(x = Day, y = Load, fill = factor(Athletes))) +
geom_bar(stat = "identity") +
ylab("Load (AU)\n") +
xlab("\nDay") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
scale_x_continuous(breaks = c(1:7)) +
theme_classic() +
theme(strip.text.x = element_text(size = 12, face = "bold"),
strip.background = element_rect(colour = "black", size = 1.5),
legend.position = "none",
axis.line = element_line(colour = "black", size = .75, linetype = "solid"),
axis.title.x = element_text(face = "bold", size = 12),
axis.title.y = element_text(face = "bold", size = 12),
axis.text.y = element_text(face = "bold"),
axis.text.x = element_text(face = "bold"),
axis.ticks = element_line(size = .5)) +
facet_wrap(~ Athletes)


To overlay the average load over each individual’s data, use the following code:

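# Extend the Summary Load data.frame by repeating each daily row once per athlete (four athletes)
SummaryLoadData <- SummaryLoadData[rep(seq_len(nrow(SummaryLoadData)), each=4),]
# Add the mean column to the RawLoadData frame (this relies on RawLoadData being ordered by Day)
RawLoadData$Mean <- SummaryLoadData$Mean
# Plot individual athlete load data with the daily mean overlaid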
ggplot(RawLoadData, aes(x = Day, y = Load, fill = factor(Athletes))) +
geom_bar(stat = "identity") +
geom_line(aes(y = Mean, x = Day), color = "Black", size = 1) +
geom_point(aes(y = Mean, x = Day), color = "Black", size = 1.5) +
ylab("Load (AU)\n") +
xlab("\nDay") +
scale_y_continuous(expand = c(0, 0), limits = c(0, 100)) +
scale_x_continuous(breaks = c(1:7)) +
theme_classic() +
theme(strip.text.x = element_text(size = 12, face = "bold"),
strip.background = element_rect(colour = "black", size = 1.5),
legend.position = "none",
axis.line = element_line(colour = "black", size = .75, linetype = "solid"),
axis.title.x = element_text(face = "bold", size = 12),
axis.title.y = element_text(face = "bold", size = 12),
axis.text.y = element_text(face = "bold"),
axis.text.x = element_text(face = "bold"),
axis.ticks = element_line(size = .5)) +
facet_wrap(~ Athletes)
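
The row-repetition step above depends on RawLoadData being sorted by day with exactly four athletes per day. A minimal base R sketch of an alternative uses ave(), which attaches each day's mean to every row regardless of row order. The names ShuffledData and DailyMean are my own, chosen for illustration, and the data are recreated so the block runs on its own.

```r
# Recreate the dummy load data from earlier in this post
Athletes <- c("Charles", "Mia", "Alfie", "Sophie")
set.seed(28)
RawLoadData <- data.frame(Athletes = rep(Athletes, 7),
                          Day = rep(1:7, each = 4),
                          Load = runif(28, min = 0, max = 100))
# Shuffle the rows to show that the result does not depend on row order
ShuffledData <- RawLoadData[sample(nrow(RawLoadData)), ]
# ave() returns, for every row, the mean Load of that row's Day
ShuffledData$DailyMean <- ave(ShuffledData$Load, ShuffledData$Day, FUN = mean)
# Check one day against a direct calculation
DayThree <- ShuffledData[ShuffledData$Day == 3, ]
all.equal(DayThree$DailyMean[1], mean(DayThree$Load))
```

The resulting column can be passed to geom_line and geom_point in the same way as the Mean column in the plot above.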


The above is only a small introduction to R’s analysis and visualisation capabilities. How do you analyse and present athlete load data?

Analysing and Visualising Repeated Measures Data

Scientists working in exercise physiology often design experiments containing repeated measurements on different athletes or participants over the course of time.

One example, in a sport science setting, is the monitoring of a team-sport athlete’s response to training and competition loads. The countermovement jump (CMJ) is used to monitor an athlete’s neuromuscular status. A CMJ will often be performed by an athlete prior to training or a match and monitored over the course of a week, tournament or season.

To show how repeated measures data can be analysed and visualised in R, I have created a (hypothetical) example of different athletes performing two trials of a CMJ at two different times of the day, monitored over a three-day period. I have chosen Peak Velocity as my dependent variable purely for display purposes. A useful variable to monitor the neuromuscular status of Australian Rules athletes appears to be Flight Time:Contraction Time.

# Create a list of athlete names
Athletes = c("Gus", "Hudson", "Bobby", "Tom", "Jessie")
# CMJ performed at two different time points
TimeOfDay = c("AM", "PM")
# Set the start date of data collection
StartDate = as.Date("2016-02-01")
# Set the seed, to reuse the same set of random values
set.seed(60)
# Create a data.frame containing dummy raw CMJ data
# (the shorter columns are recycled to fill all 60 rows)
CMJRawData = data.frame(Name = rep(Athletes, each = 4),
                        Day = rep(weekdays(StartDate + 0:2), each = 20),
                        Trial = as.numeric(rep(1:2, each = 1)),
                        TimeOfDay = rep(TimeOfDay, each = 2),
                        PeakVelocity = runif(60, 1.5, 2.8))
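
Because the data.frame above is built by recycling vectors of different lengths, it is worth confirming that every athlete really has two trials at each time of day on each day. The sketch below is one way to check, assuming only base R; DesignCheck is a name I have made up. It also converts Day to a factor in order of appearance, since character columns are otherwise sorted alphabetically when plotted.

```r
# Recreate the dummy CMJ data from above
Athletes <- c("Gus", "Hudson", "Bobby", "Tom", "Jessie")
TimeOfDay <- c("AM", "PM")
StartDate <- as.Date("2016-02-01")
set.seed(60)
CMJRawData <- data.frame(Name = rep(Athletes, each = 4),
                         Day = rep(weekdays(StartDate + 0:2), each = 20),
                         Trial = as.numeric(rep(1:2, each = 1)),
                         TimeOfDay = rep(TimeOfDay, each = 2),
                         PeakVelocity = runif(60, 1.5, 2.8))
# Keep the days in chronological rather than alphabetical order
CMJRawData$Day <- factor(CMJRawData$Day, levels = unique(CMJRawData$Day))
# Count the rows for every athlete, day and time of day combination
DesignCheck <- table(CMJRawData$Name, CMJRawData$Day, CMJRawData$TimeOfDay)
# Every combination should contain exactly two trials
all(DesignCheck == 2)
```

With English day names the alphabetical and chronological orders happen to match for these three days, but the factor conversion protects against other start dates and locales.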

Plots can be created in R using the base packages; however, I prefer to use ggplot2 due to its easy-to-follow syntax and ability to create complex figures in a visually pleasing manner. This package will need to be installed into R prior to loading.

# Load the required ggplot2 package
require(ggplot2)
# Create a basic box and whisker plot to visualise Peak Velocity
ggplot(data = CMJRawData, aes(x = Day, y = PeakVelocity)) +
 geom_boxplot(aes(fill = TimeOfDay))

The above plot is OK; however, I personally prefer a cleaner plotting background plus emphasised ticks and correct axis labels. The code below creates a much more visually pleasing plot that can be used for presentation or scientific purposes.

# Create a neater looking plot with correct scientific notation
ggplot(data = CMJRawData, aes(x = Day, y = PeakVelocity)) +
geom_boxplot(aes(fill = TimeOfDay)) +
ylab(expression(Peak ~ Velocity ~ (m.s^-1))) +
scale_y_continuous(expand = c(0, 0), limits = c(0, 4)) +
theme_classic() +
theme(legend.title = element_blank(),
legend.text = element_text(size = 13),
strip.text.x = element_text(size = 15, face = "bold"),
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.line.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_text(color="black", size = 15, vjust=1.5),
axis.line.y = element_line(colour = "black"),
legend.position = "bottom") +
facet_wrap(~ Day, scales="free_x")


I find the above plot much easier to read. The labels or ticks can be emphasised by including face = "bold" where appropriate. The y-axis scale can also be adjusted to zoom in on the figure and remove the white space, provided the limits still span all of the data (values outside the limits are dropped before the boxes are drawn), by changing the line:

scale_y_continuous(expand = c(0, 0), limits = c(0, 4))

To obtain summary statistics for a repeated measures dataset, install the plyr package into R and run the following code, which uses plyr's ddply function. Other statistics can be added as extra lines inside the call; the standard error (SE), for example, is sd(PeakVelocity) / sqrt(length(PeakVelocity)).

# Load the required package (ddply and summarise come from plyr)
require(plyr)
# Obtain the Mean and SD for Peak Velocity, over each day and time of day
PeakVelocity = ddply(CMJRawData, c("Day", "TimeOfDay"), summarise,
Mean = mean(PeakVelocity),
SD = sd(PeakVelocity))
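
Base R has no built-in standard error function, so the SE has to be calculated by hand. The sketch below assumes the usual definition, SD divided by the square root of the number of observations, and uses aggregate() so that it runs without any extra packages. StandardError and PeakVelocitySE are names I have invented for this example.

```r
# Recreate the dummy CMJ data from above
Athletes <- c("Gus", "Hudson", "Bobby", "Tom", "Jessie")
TimeOfDay <- c("AM", "PM")
StartDate <- as.Date("2016-02-01")
set.seed(60)
CMJRawData <- data.frame(Name = rep(Athletes, each = 4),
                         Day = rep(weekdays(StartDate + 0:2), each = 20),
                         Trial = as.numeric(rep(1:2, each = 1)),
                         TimeOfDay = rep(TimeOfDay, each = 2),
                         PeakVelocity = runif(60, 1.5, 2.8))
# Hypothetical helper: standard error of the mean
StandardError <- function(x) sd(x) / sqrt(length(x))
# SE of Peak Velocity for each day and time of day
PeakVelocitySE <- aggregate(PeakVelocity ~ Day + TimeOfDay,
                            data = CMJRawData, FUN = StandardError)
# Rename the summarised column so it is clear what it holds
names(PeakVelocitySE)[3] <- "SE"
PeakVelocitySE
```

Each day and time of day combination contains ten observations here (five athletes by two trials), so every SE is that group's SD divided by the square root of ten.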

Of interest to many scientists and practitioners is the individual response to training. This can be calculated and plotted using the code below, to track athletes over time. Note: I have calculated a mean Peak Velocity measure from the two trials at each time of day.

# Calculate the Mean and SD for Peak Velocity, for each athlete, over each day and time of day
PeakVelocityIR = ddply(CMJRawData, c("Day", "TimeOfDay", "Name"), summarise,
Mean = mean(PeakVelocity),
SD = sd(PeakVelocity))
# Plot individual responses across time for each day
ggplot(data = PeakVelocityIR, aes(x = TimeOfDay, y = Mean, colour = Name, group = Name)) +
geom_point() +
geom_line() +
ylab(expression(Mean ~ Peak ~ Velocity ~ (m.s^-1))) +
xlab("\nTime of Day") +
scale_y_continuous(expand = c(0, 0), limits = c(1.5, 2.75)) +
theme_classic() +
theme(legend.text = element_text(size = 13),
strip.text.x = element_text(size = 15, face = "bold"),
axis.title.y = element_text(color="black", size = 15, vjust=1.5),
axis.title.x = element_text(color="black", size = 15),
axis.line.y = element_line(colour = "black")) +
facet_wrap(~ Day, scales="free_x") 


Individual responses can also be displayed with a mean or group average overlay. This is displayed below.

# Create a new column, called Name, so the group average can be plotted alongside the athletes
PeakVelocity$Name = c("Average")
# Plot the individual responses plus means
ggplot(PeakVelocity, aes(x = TimeOfDay, y = Mean, group = Name)) +
geom_point(colour = "black", size = 4) +
geom_line(colour = "black") +
geom_point(data = PeakVelocityIR, aes(x = TimeOfDay, y = Mean,
group = Name, colour = Name)) +
geom_line(data = PeakVelocityIR, aes(x = TimeOfDay, y = Mean,
group = Name, colour = Name)) +
ylab(expression(Mean ~ Peak ~ Velocity ~ (m.s^-1))) +
xlab("\nTime of Day") +
scale_y_continuous(expand = c(0, 0), limits = c(1.5, 2.75)) +
theme_classic() +
theme(legend.text = element_text(size = 13),
strip.text.x = element_text(size = 15, face = "bold"),
axis.title.y = element_text(color="black", size = 15, vjust=1.5),
axis.title.x = element_text(color="black", size = 15),
axis.line.y = element_line(colour = "black")) +
facet_wrap(~ Day, scales="free_x")


The above is only a sample of how repeated measures data can be visualised. What methods do you use to display repeated measures data? How do you clearly communicate individual responses?