The Office — Data Analysis and Visualization with R

Vaishali Verma
6 min readOct 29, 2020

The visual that helps us identify the person that is the glue, connecting most of the recurring cast of characters.

The Design

This visual is a copy of the one originally made by Jake Kaupp. In the process of learning how to wrangle and visualize data with R, my plan is to recreate graphics that appeal to me and I could not resist when I saw Jake’s visualization about this. I had a hard time understanding his code and had to write to him demanding explanation. Very kind of him to answer back.

The Code

The graphic is an arc diagram which is a type of network visualization between characters across the entire run of The Office. A network is made up of nodes(or vertices) and edges(or links).

The data is available within package ‘schrute’ . We install it using
install.packages() and attach its library().

install.packages('schrute')
library(schrute)

A tibble of 55,130*12 is loaded.

Since my interest is particularly in the ‘season’, ‘episode’, ‘character’ columns, I remove the remaining noise.

the_office <- subset(theoffice, select = -c(1,4,5,6,8,9,10,11,12))

Next I understand the distinct characters who have recorded on screen for every episode in every season. I decide to count the number of their appearances, take note of only those who have appeared more than 50 times and filter them out. These characters are the nodes of the network.

nodes <- the_office %>%
distinct(season, episode, character) %>%
count(character, sort = TRUE) %>%
filter(n>50)

A 17*2 tibble of character and n appears.

Once the people are identified, I build the episode list ep_list . This is the list of people who appear in an episode. Without having this list, I cannot build the edges of the network. I need to see the number of people and their records in every episode.

semi-join allows me to have all rows of episode list, for which there are matching values in nodes. This automatically moves people with less than 50 number of appearances.

If I group_by season and episode, I get the number of characters that have appeared in every single episode. I then summarize it in a column with header ‘to’ . I’m using a toString(character) to summarize the characters in each episode as a string separated by commas. Wrapping that in a list is a trick to make a nested data structure with a single element in each row.

ep_list <- the_office %>%
distinct(season, episode, character) %>%
semi_join(nodes) %>%
group_by(season, episode) %>%
summarise(to = list(toString(character)))

A 186*3 tibble appears with headers — Season, Episode, To (list format)

Now that I have my nodes and ep_list ready, I will use them to build my edges. Edges are the arcs that show connection and travel in between the nodes.

To make explanation easier, I will explore code in ‘edges’ line-by-line

I try to identify distinct appearances for every character in every season and episode.

edges <- the_office %>%
distinct(season, episode, character)

A tibble of 4006*3 with headers — season, episode, character is returned.

This gives me the list of people who appear throughout the episode. I again semi-join with nodes to consider people with more than 50 appearances.

To see the number of appearances people have together in every episode, ep_list (list of people in every episode) has to be joined. I use inner_join which returns all rows of edges, as they are above, along with matching values in ep_list.

edges <- the_office %>%
distinct(season, episode, character) %>%
semi_join(nodes) %>%
inner_join(ep_list)

A tibble of 2528*4 with headers — season, episode, character, to(list) is returned.

List is a nested object and we need to unnest it to perform further operations. Since it is under the header ‘to’, we unnest that.

These unnested elements have to sit on separate_rows in the tibble.

Since the list is split on each row, there is blank / white space (ws) before every name that needs to be trimmed using trimws We mutate ‘to’ again.

To keep it simple, rename of character to from.

edges <- the_office %>%
distinct(season, episode, character) %>%
semi_join(nodes) %>%
inner_join(ep_list) %>%
unnest(to) %>%
separate_rows(to) %>%
mutate(to = trimws(to)) %>%
rename(from = character)

A tibble of 35,062*4 with headers — season, episode, character, to is returned.

Since we cannot count the appearances of people with themselves, we filter them out.

Irrespective of season and episodes, we have to count the number of appearances people have had together. The total count will be in a column called ‘size’.

edges <- the_office %>%
distinct(season, episode, character) %>%
semi_join(nodes) %>%
inner_join(ep_list) %>%
unnest(to) %>%
separate_rows(to) %>%
mutate(to = trimws(to)) %>%
rename(from = character) %>%
filter(from != to) %>%
count(from, to, name = "size")

A tibble of 272*3 with headers — from, to, size is returned.

left join nodes (which contains the number of records) to the ‘to’ and ‘from’ . I use repeated left join to get the interaction numbers for both ‘to’ and ‘from’.

edges <- the_office %>%
distinct(season, episode, character) %>%
semi_join(nodes) %>%
inner_join(ep_list) %>%
unnest(to) %>%
separate_rows(to) %>%
mutate(to = trimws(to)) %>%
rename(from = character) %>%
filter(from != to) %>%
count(from, to, name = "size") %>%
left_join(nodes, by = c('from' = 'character')) %>%
left_join(nodes, by = c('to' = 'character'))

A tibble of 272*5 with headers — from, to, size, n.x, n.y is returned.

A temporary variable, temp, is introduced to help count the correct interactions. Example, Jim → Andy and Andy → Jim are the same. That is done by comparing n.x and n.y to organize the labels to a consistent pairing.

I keep the distinct temp and .keep_all = TRUE helps with keeping the associated columns too.

edges <- the_office %>%
distinct(season, episode, character) %>%
semi_join(nodes) %>%
inner_join(ep_list) %>%
unnest(to) %>%
separate_rows(to) %>%
mutate(to = trimws(to)) %>%
rename(from = character) %>%
filter(from != to) %>%
count(from, to, name = "size") %>%
left_join(nodes, by = c('from' = 'character')) %>%
left_join(nodes, by = c('to' = 'character')) %>%
mutate(temp = if_else(n.x>n.y, paste0(from,to), paste0(to,from))) %>%
distinct(temp, .keep_all = TRUE) %>%
subset(select = -c(n.x,n.y,temp))

A tibble of 138*3 with headers — ‘from’, ‘to’, ‘size’ is returned.

A tbl_graph of nodes and edges is made for plotting.

graph <-  tbl_graph(nodes, edges)

The plan is to find the most important person using centrality_degree() and to highlight it in a different colour. Thus the colour of important node and edges going-to and coming-from that node are highlighted.

important <- graph %>%
mutate(degree = centrality_degree()) %>%
as_tibble() %>%
arrange(desc(degree)) %>%
top_n(1) %>%
pull(character)
edge_position <- which (nodes$character == important)

Ready to plot.
This last part is a mix of some assigning visual elements to the nodes and edges by a combinations of activate, mutate and if_else comparing the node or edge to the important variable or comparing edge position by edge_pos.

Once you hit the ggraph it enters into using a ggplot2 extension package called ggraph. It’s a grammar of graph graphics and plots elements in the data by each specific layer/geom. First makes the object with the layout, then adds edges in arc shapes with the indicated aesthetics (geom_node_arc), the nodes at points (goem_node_point), then text at nodes (geom_node_text). Then it adds labels with labs. I then set aesthetic mappings with all the scale_???_identity functions as they are all directly mapped in the data.

In order to pay homage to Jake, I use his custom theme_jk

dev.copy() is used to give the file name

dev.print() is to print what is on screen within the given dimensions

dev.off() is to be able to see plots made after this plot else they will not appear in the plot window.

dev.copy(png, "TheOffice.png")
dev.print(width = 10, height = 7)
dev.off()

This works!!

--

--