# Samples of Thoughts

## about data, statistics and everything in between

I recently downloaded my own Facebook data and wanted to find out what kind of data gems I could find. There are some clear advantages when analyzing your own data, foremost, you’re the expert and know the “ground truth” behind the data. That said, there can still be big surprises!

In my case, the most interesting parts of the analysis could be boiled down in two graphics. Since there’s also a time factor in the data, I thought this is a good opportunity to learn about animated plots and indeed, it works quite beautifully with the two plots. In this post, I’ll outline how to make two animated plot, a racing bar chart, and a moving line plot. As a cherry on top, I also show how to combine the two animations!

## Some data prep

For this analysis, I only concentrated on the messenger data which can be found in the folder messages/inbox. Unfortunately, Facebook doesn’t use proper JSON encoding which makes it quite a hassle to load the data without any decoding errors. For this analysis, I’m only using the aggregated messages so I simply ignored any decoding mistakes in the text. I wrote a few functions that load all the data (using a lot of {{purr}}) and result in a data frame with the number of messages per chat and per (active) chat participant, per day. I’m not going into the details of the data loading and preprocessing code, if you’re interested you can find it on Github.

library(tidyverse)
# script with functions can be found here:
d <- read_all_messages(data_path)
chatname day sender_name num_messages
facebookuser_0_mgnzvrba 2015-07-09 Corrie Bartelheimer 19

Normally, the chatname consists of the name of the Facebook friend but if they at some point deleted their account, it is replaced by the anonymous “Facebook User”. You can still read the messages though, meaning, if you’ve deleted your Facebook data at some point, your conversations remain with Facebook.

## Mirror, Mirror on the Wall Who Writes the Most

The first plot I wanted to do is a ranking with which friend I exchange most messages. Since this certainly changes over time, this data perfectly lends itself for a racing bar chart.

For this, we’ll need to compute the total number of messages for each chat at any given time and check if at that time, the chat was among the top ten. However, if we naively compute the messages per month, there will be some months where I didn’t chat at all with a friend and thus there is no data point for this month. So I add a zero-message data point for missing months. For this, I generate a grid that for each chatpartner includes one row for each month which I then join to the message data set.

first_month <- as.Date(min(d$day)) %>% floor_date(unit="month") last_month <- as.Date(max(d$day)) %>% floor_date(unit="month")
all_months <- expand_grid(
month=seq.Date(from=first_month, to=last_month,
by="month"),
chatname = d$chatname %>% unique )  I use a monthly time step, so I first compute the number of messages per chat per month and then the cumulative sum for each chat and month: dcs <- d %>% mutate(month = floor_date(day, unit="month") %>% as_date() ) %>% select(month, chatname, num_messages) %>% # full_join introduces NA values for num_message at missing time points full_join(all_months, by=c("month", "chatname")) %>% group_by(chatname, month) %>% # the sum of NA values is 0 summarise(num_messages = sum(num_messages, na.rm = T), .groups = "drop_last") %>% arrange(month) %>% mutate(cs_messages = cumsum(num_messages)) %>% ungroup() Next, we compute the ranks for each month: dcs <- dcs %>% group_by(month) %>% arrange(-cs_messages) %>% mutate(rank = row_number()) %>% ungroup()  Now we restrict the data to include only chatpartners that, at any given time, were in the top 10: dcs_filtered <- dcs %>% filter( rank <= 10 ) All other tutorials I found on racing bar charts in R use either geom_bar() or geom_tile(). However, I wasn’t too happy with the hard corners and wanted to go for a rounder look, so I decided to use chicklets instead from the package {{ggchicklet}}. It works almost the same as geom_bar() but isn’t as flexible with switching the x and y aesthetics so it is necessary to use it with coord_flip(). I manually added some friend categories to color the bars roughly by when and where I met the chatpartner. library(ggchicklet) p <- dcs_filtered %>% ggplot(aes(y=cs_messages, x=-rank, group=chatname, fill=friend_cat)) + geom_chicklet( width=0.8, radius=unit(7, "pt"), position = "identity") + coord_flip() + geom_text(aes(label = anon_label), family="Lato", hjust = "left", colour = "grey30", nudge_y = +100, size = 5) + geom_text(y=15000, x=-7.5, aes(label=as.character(year(month))), family="Lobster Two", size=30, color="grey30") + plot_styling # full styling code on github If you prefer bars instead of chicklets, simply replace geom_chicklet() in the code above with geom_bar(stat = "identity"). To animate the plot, we add the following commands from {{gganimate}}: library(gganimate) p_anim <- p + transition_time(month) + ease_aes('linear') p_anim Voilà, a racing chicklet chart! 🎉 I added anonymous labels to some of my most important chat partners so you can follow the story of the chart. ## Getting a Smoother Look This is already looking not too bad but it isn’t looking very smooth and we also don’t have any control about the speed of the animation. It is possible to get a slightly smoother look by playing around with the parameters of how many frames per seconds fps and how many frames nframes should be computed in total. Note that the more frames you compute, the longer it takes. So during the creation of the animation, I recommend to use a small number for faster development iterations. duration <- 30 # in seconds fps <- 10 nframes <- duration*fps animate(p_anim, nframes = nframes, fps = fps, width = 1100, height = 700, end_pause = 15) anim_save("race.gif")  However, this only helps to a certain degree. The rank variable that determines the vertical position of a bar is an integer, meaning if a bar increases in rank, it simply swaps position with the bar above. To get a much smoother look, we can use the package {{tweenr}}. It interpolates values for both rank and the number of messages between two time points. The package can be used as follows: tween_dcs <- dcs %>% select(chatname, month, cs_messages, rank) %>% mutate(ease = "linear", # tween_elements() doesn't work with date objects month = as.integer(month)) %>% tween_elements(., time="month", group="chatname", ease="ease", timerange = range(dcs$month %>% as.integer),
nframes = nframes ) %>%
select(month, cs_messages, rank, .frame, chatname = .group) %>%
mutate(month = as_date(month)) %>%
filter( rank <= 10 )

The magic happens in the function tween_elements() where we specify which column holds the time variable (month), which one the group (chatname) and which column specifies the easing to be used. I’m using the same easing function for each group but technically one could specify different easings for the different groups.

Since we’re using the same plot as before and are just changing the data, we can use some ggplot magic by simply adding the new data to the plot:

p_anim %+% tween_dcs

This trick is also useful when you want to get a static version of the plot without overplotting all time points:

p %+% (tween_dcs %>% filter(month == "2020-01-01"))

I find it quite nice to see how different times of my life are reflected in the chart: In the beginning, I mostly chat with friends and family from my home town. Slowly, some new friends appear when I start university and then more friends I met during my exchange year. I started some new hobbies such as dancing and learning Portuguese in a way “diversifying” my friends. You can probably also guess from the plot when I met my partner. There were some surprises as well. For example, one of my best friends doesn’t even appear in the top ten. We never used Facebook that much and either chatted on WhatsApp or went straight for a call. Other people that appeared in the top ten, I didn’t even recognize at first: When I started studying Portuguese, I made some penpal friends to practice the language. We must have chatted quite a lot back then but we never managed to meet. Similarly with the second place, I was completely surprised. I met the person at a random event in Berlin, we hang out a few times but even though we were never really close, apparently we chatted a lot back then.

## Following the Time Line

Parts of this story comes out better if we concentrate on the time line of the different friend categories instead of focusing on single chatpartners. Making an animated time line plot is actually easier than a racing bar chart. Let me walk you through the steps.

As before, we aggregate the data again, only this time we aggregate by friend category and month (actually, I’m using every two month because the resulting plot looks smoother). Also, instead of the cumulative sum, we use a simple sum per month.

dns <- dcs %>%
mutate(month = floor_date(month, unit="2 months")  ) %>%
select(month, friend_cat, num_messages) %>%
group_by(friend_cat, month) %>%
summarise(num_messages = sum(num_messages, na.rm = T),
.groups = "drop") %>%
filter(month >= "2011-01-01" & num_messages != 0)

As before, we use the the {{tweenr}} package again to obtain a smoother look:

tween_dns <- dns %>%
mutate(ease = "linear", month = as.integer(month)) %>%
tween_elements(., time="month", group="friend_cat", ease="ease",
timerange = range(dns\$month %>% as.integer),
nframes = nframes ) %>%
select(month, num_messages, .frame, friend_cat = .group) %>%
mutate(month = as_date(month)) 

The basic code for the plot is only two lines:

p_line <- tween_dns %>%
ggplot(aes(x=month, y=num_messages,  color=friend_cat)) +
geom_line(size=1.3) +
line_styling # full styling code on github

The advantage with this plot is that if we want to obtain a static version of it, we don’t have to restrict the data we feed in:

p_line

That’s because the plot itself doesn’t change over time (like it did with the ranking) but rather more data is added, or revealed over time. So to animate it, we use transition_reveal(). The function view_follow() means that we use a dynamic x axis to follow the data.

p_line +
view_follow(fixed_y = TRUE) +
transition_reveal(month) +
ease_aes('linear')

## Composing the Big Picture

Now you might wonder, could we put both animations into one big animation? Well yes, but unfortunately not with {{gganimate}}. What we can do instead is to manually recreate what {{gganimate}} does under the hood: We iterate over each frame (as computed by {{tweenr}}) and save it as .png file. This gives you a flip book of your chart. I then used ImageMagick to create the .gif. It is possible to also generate the animation this way using R (e.g. with {{animation}}) but my RStudio crashed every time I tried (it’s quite memory and computing intensive!) so I instead did it the manual way.

I used {{patchwork}} to combine the two plots (if you don’t know the package yet, I recommend to go check it out, it’s great!). Now let’s generate the flip book:

library(patchwork)

# fix axes limits
p_line <- p_line + scale_y_continuous(limits = c(1, 4098))
p <- p + scale_x_continuous("", limits = c(-10.4, -0.6))

# generate flip book
for (i in 0:nframes) {
composition <- ( p %+% (tween_dcs %>% filter(.frame == i)) +
p_line %+% (tween_dns %>% filter(.frame <= i) ) +
plot_layout(nrow=2, guides = "collect")
) +
plot_annotation(caption = caption) &
comp_styling # full styling code on github

composition +
ggsave(
here::here(paste0(folder_path,
"composition-", str_pad(i, 3, pad = "0"), ".png") ),
width = 12, height = 15, dpi = 50  )

}

Afterwards (it takes a while to generate all images), I execute the following command in the folder with the images:

convert -delay 10 -loop 0 composition-*.png comp.gif

The loop option signifies how many times the .gif should loop around, 0 means that it loops indefinitely. The delay option gives in microseconds how long to wait until the next frame. One second has 100 microseconds so a delay of 10 should gives us 10 frames per seconds (consistent with the settings above).

And voilà, here is the final animated plot:

Full code.