k-means Clustering in R

In our paper on discovering frequently recurring sequences of movement within team-sport athlete data, k-means clustering was used on velocity data. Rather than setting pre-determined velocity thresholds to identify when an athlete is walking or sprinting, k-means clustering binned each athlete’s data into one of four groups. The code below uses dummy netball data to visualise and analyse velocity data via k-means clustering. The technique takes a set of observations (velocity data) and a chosen number of groups (k = 4). The k-means algorithm picks a starting center for each group, allocates each data point to the closest center, recalculates each center as the mean of its points, and repeats these two steps until the allocations stop changing. More on k-means clustering can be found here.
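The assign-then-update loop can be sketched in a few lines of R. This is a minimal illustration only: `simple_kmeans_1d` and the toy velocities are hypothetical names and values made up for this sketch, and it follows the plain Lloyd-style iteration rather than the Hartigan-Wong variant that R's built-in `kmeans()` uses by default.

```r
# Minimal Lloyd-style k-means for a numeric vector (illustrative sketch only)
simple_kmeans_1d <- function(x, centers, max_iter = 100) {
  for (i in seq_len(max_iter)) {
    # Assignment step: index of the nearest center for every observation
    cluster <- apply(abs(outer(x, centers, "-")), 1, which.min)
    # Update step: each center becomes the mean of its assigned observations
    new_centers <- as.numeric(
      tapply(x, factor(cluster, levels = seq_along(centers)), mean)
    )
    # Keep the old center if a cluster ended up with no observations
    empty <- is.na(new_centers)
    new_centers[empty] <- centers[empty]
    # Stop once the centers no longer move
    if (isTRUE(all.equal(new_centers, centers))) break
    centers <- new_centers
  }
  list(centers = centers, cluster = cluster)
}

# Hypothetical velocities (m/s) forming two obvious groups
toy_velocity <- c(0.4, 0.5, 0.6, 3.9, 4.0, 4.1)
fit <- simple_kmeans_1d(toy_velocity, centers = c(0, 1))
fit$centers
fit$cluster
```

Starting from deliberately poor centers (0 and 1), the loop settles on one slow group and one fast group after a couple of passes. For real data, the built-in `kmeans()` used later in this post is the better choice.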

To start, we consider dummy X, Y data from an example netball Wing Attack:

head(DummyData, 10)
# A tibble: 10 x 6
   Position       X      Y Sample  Time Velocity
   <chr>      <dbl>  <dbl>  <dbl> <dbl>    <dbl>
 1 WA       23.7438 1.8930      1  0.01 1.238103
 2 WA       23.7555 1.9015      2  0.02 1.446167
 3 WA       23.7689 1.9111      3  0.03 1.648393
 4 WA       23.7841 1.9219      4  0.04 1.864618
 5 WA       23.8012 1.9338      5  0.05 2.083315
 6 WA       23.8202 1.9469      6  0.06 2.307834
 7 WA       23.8409 1.9609      7  0.07 2.498980
 8 WA       23.8632 1.9759      8  0.08 2.687545
 9 WA       23.8870 1.9915      9  0.09 2.845699
10 WA       23.9118 2.0076     10  0.10 2.956772
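If you are preparing your own data, the Velocity column in these printed rows appears consistent with the distance between consecutive X, Y samples divided by the time between them. That is an observation from the rows above and an assumption about how the dummy data were built, not something stated in the data file. A quick check on the first three rows:

```r
# First three rows copied from the printed output above
X    <- c(23.7438, 23.7555, 23.7689)
Y    <- c(1.8930, 1.9015, 1.9111)
Time <- c(0.01, 0.02, 0.03)

# Distance between consecutive samples (m) divided by elapsed time (s)
step_distance <- sqrt(diff(X)^2 + diff(Y)^2)
step_velocity <- step_distance / diff(Time)
round(step_velocity, 2)
```

The two values come out close to the Velocity entries for rows 2 and 3; small differences are expected because the printed coordinates are rounded.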

It took me a very long time to draw a netball court in R with ggplot2! If you wish to plot your own dummy data on a netball court, save time by using my netball court line markings below. You may need to shift the (0, 0) origin to match your own X-Y coordinates.

NetballCourt

Load the required packages and data into your R environment.

# Load required packages
library(ggplot2)
library(readxl)
# Load dummy data into environment (adjust the paths to where you saved the files)
DummyData <- read_excel("C:/Users/Downloads/DummyData.xlsx")
NetballCourt <- read_excel("C:/Users/Downloads/NetballCourt.xlsx")

First, visualise the dummy trajectory on the netball court:

# Plot X-Y positions coloured by velocity, then draw the court markings on top
ggplot(DummyData) +
  geom_point(aes(x = X, y = Y, color = Velocity)) +
  scale_colour_gradientn(colours = rainbow(7)) +
  coord_equal() +
  theme_bw() +
  # Strip the grid, axes and labels so only the court and trajectory remain
  theme(plot.background = element_blank(),
        legend.position = "bottom",
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.line = element_blank(),
        axis.text.x = element_blank(),
        axis.text.y = element_blank(),
        axis.ticks = element_blank(),
        axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
  # In ggplot2 3.4.0 and later, use linewidth = 1 instead of size = 1 for lines
  geom_path(data = NetballCourt, aes(X, Y), colour = "black", size = 1)

The image below should appear in your “Plots” window, although its exact appearance depends on your own XY data:

[Figure: dummy trajectory on the netball court, coloured by velocity]

Then, to inspect velocity over the 60 seconds of hypothetically recorded time, we can run:

# Visualise velocity over time (seconds)
ggplot(data = DummyData,
 aes(x = Time, y = Velocity, color = Velocity)) +
 geom_point() +
 scale_colour_gradientn(colours = rainbow(7)) +
 xlab("\n Time (s)") +
 ylab(expression(Velocity ~ (m.s^-1))) +
 scale_x_continuous(limits = c(0, 60), expand = c(0, 0), breaks = seq(0, 60, by = 20)) +
 scale_y_continuous(limits = c(0, 6), expand = c(0, 0), breaks = seq(0, 6, by = 2)) +
 theme_classic() +
 theme(legend.position = "none")

This should again appear in your “Plots” window as:

[Figure: velocity over time, coloured by velocity]

Next, we commence the k-means clustering. First, we set the seed so that the random starting centers, and therefore the results, are reproducible, and we ask the algorithm to split the data into four groups. Using the code below, the cluster centroids, sizes and within-cluster variation can be assessed.

# Perform k-means clustering on the velocity column
# First place the column into a one-column matrix
Velocity <- matrix(DummyData$Velocity, ncol = 1)
# Declare the number of clusters, for example, four groups of velocity
nClusters <- 4
# Set the seed so the random initial cluster centers are reproducible
set.seed(1)
# Run the k-means clustering
VelocityClusters <- kmeans(Velocity, nClusters)
# To obtain the centers, size and within cluster sum of squares
VelocityClusters$centers
VelocityClusters$size
VelocityClusters$withinss

By running the code above, we can see that the cluster centroids are:

> VelocityClusters$centers 
 [,1]
1 0.5
2 3.8
3 1.3
4 2.3
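Because each point ends up with its nearest centroid, one-dimensional clusters are separated roughly halfway between neighbouring centroids. Those midpoints are the data-driven counterpart to the pre-determined velocity thresholds mentioned at the start. The sketch below uses the rounded centroids printed above, so the boundaries are approximate, and the variable names are illustrative only.

```r
# Centroids as printed above (m/s), in cluster order 1 to 4
centers <- c(0.5, 3.8, 1.3, 2.3)

# Sort the centroids, then take the midpoint between each adjacent pair
sorted_centers <- sort(centers)
boundaries <- (head(sorted_centers, -1) + tail(sorted_centers, -1)) / 2
boundaries
```

With these centroids the approximate boundaries fall at 0.9, 1.8 and 3.05 m/s.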

We now need to record which cluster each data point falls into. To do this, we add the vector of cluster assignments to the dummy data frame as a new column, and store the centroids in their own data frame.

# Add the cluster data info back into the dummy data
DummyData$Cluster <- factor(VelocityClusters$cluster)
Centers <- as.data.frame(VelocityClusters$centers)

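Now, we create a new column of descriptive labels, guided by the “Centers” data frame. Notationally, we may refer to clusters 1 to 4 as walking, sprinting, jogging and running respectively, based on their centroids as ordered above. Cluster numbers are arbitrary, so check the centroids before labelling if you change the seed or the data.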

# Create a new column of labels, based on the "Centers" data.frame
DummyData$NotationalDescriptor <- factor(DummyData$Cluster,
                                         levels = c(1, 2, 3, 4),
                                         labels = c("Walk", "Sprint", "Jog", "Run"))
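The hard-coded label order above only holds for this seed and this data set. One way to avoid checking the centroids by hand is to rank them and label clusters from slowest to fastest. The sketch below uses simulated velocities (hypothetical values, not the dummy netball data), and `nstart = 10` is an optional `kmeans()` argument that tries several random starts.

```r
set.seed(1)
# Hypothetical velocities (m/s) scattered around four speeds
toy_velocity <- c(rnorm(50, 0.5, 0.1), rnorm(50, 1.3, 0.1),
                  rnorm(50, 2.3, 0.1), rnorm(50, 3.8, 0.1))
fit <- kmeans(toy_velocity, centers = 4, nstart = 10)

# rank() gives each cluster's position when centroids are sorted slowest to fastest
speed_rank <- rank(fit$centers[, 1])

# Look up the rank of each point's cluster, then label ranks 1 to 4
descriptor <- factor(speed_rank[fit$cluster],
                     levels = 1:4,
                     labels = c("Walk", "Jog", "Run", "Sprint"))

# Mean velocity per label now increases from Walk to Sprint
label_means <- tapply(toy_velocity, descriptor, mean)
label_means
```

The same two lines (`rank()` followed by `factor()`) could be applied to `VelocityClusters` so that the labels stay correct whichever number k-means gives each cluster.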

To assess the number of data points within each cluster, we can run the following code to generate a bar plot of the counts.

# Plot the number of data points within each cluster
ggplot(data = DummyData, aes(x = NotationalDescriptor, fill = NotationalDescriptor)) +
  # In ggplot2 3.4.0 and later, use linewidth = 1 instead of size = 1 for the bar outline
  geom_bar(colour = "black", size = 1) +
  xlab("\n Notational Descriptor") +
  ylab("Count \n") +
  scale_y_continuous(expand = c(0, 0), limits = c(0, 2500)) +
  theme_classic() +
  theme(legend.position = "none")

The figure below should then appear in your “Plots” window:

[Figure: bar plot of the number of data points per notational descriptor]

We can see that walking and jogging have more data points assigned than sprinting and running.
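If you prefer numbers to a plot, the same counts can be tabulated and converted into time spent in each cluster. The sketch below uses ten made-up labels and assumes a 0.01 s gap between samples, as in the Time column of the dummy data; the variable names are illustrative.

```r
# Hypothetical descriptors for ten samples recorded 0.01 s apart
descriptor <- factor(c("Walk", "Walk", "Walk", "Walk", "Jog",
                       "Jog", "Jog", "Run", "Run", "Sprint"),
                     levels = c("Walk", "Jog", "Run", "Sprint"))
sample_interval <- 0.01  # assumed seconds between samples

counts <- table(descriptor)                       # samples per cluster
proportions <- prop.table(counts)                 # share of all samples
seconds <- as.numeric(counts) * sample_interval   # time spent in each cluster

summary_table <- data.frame(Descriptor = names(counts),
                            Count = as.numeric(counts),
                            Proportion = as.numeric(proportions),
                            Seconds = seconds)
summary_table
```

On the real data, replacing `descriptor` with `DummyData$NotationalDescriptor` should give the values behind the bar plot.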

To plot our original velocity over time figure, with each point now assigned to the appropriate cluster, we run the following code:

# Plot based on the notational descriptor
ggplot(data = DummyData, aes(x = Time, y = Velocity, color = NotationalDescriptor)) +
  geom_point() +
  xlab("\n Time (s)") +
  ylab(expression(Velocity ~ (m.s^-1))) +
  scale_x_continuous(limits = c(0, 60), expand = c(0, 0), breaks = seq(0, 60, by = 20)) +
  scale_y_continuous(limits = c(0, 6), expand = c(0, 0), breaks = seq(0, 6, by = 2)) +
  # The line of code below changes the legend heading, when we wish to add a space between words
  labs(colour = 'Notational Descriptor') +
  theme_classic() +
  theme(legend.position = "bottom")

The following plot can then be created:

[Figure: velocity over time, coloured by notational descriptor]

If you have any questions surrounding the above or are interested in more R code from our recent paper on how to obtain movement sequences, please contact me.

Happy clustering!

 
