Happy Valentine's Day

Motivation

Valentine’s Day is coming up! What better gift is there to give/receive than the gift of data? B-)

Download data

I recently learned that anyone can download all of their Facebook data, so I decided to check it out and bring it into R. To access your data, go to Facebook and click the white down arrow in the upper-right corner. From there, select Settings, then “Your Facebook Information” from the column on the left. On the Facebook Information screen, select “View” next to “Download Your Information.” Here you can choose the kind of data you want, a date range, and a format. I only wanted my messages, so under “Your Information” I deselected everything except messages. The format can be JSON or HTML; I selected HTML. After you click “Create File,” the archive takes a while to compile, and you’ll get a notification when it’s ready. You’ll need to reenter your password to download the file. The result is a ZIP file containing all the selected information.

setwd('/Users/juliejung/Documents/GitHub/jungjulie.com/content/post/2-Happy-Valentines-Day') #set working directory

First load the libraries you’ll need:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(rvest)
## Loading required package: xml2
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggplot2)

Take a look at your contacts. Who have you messaged?

# Path to the local HTML file to be scraped
url <- "/Users/juliejung/Documents/GitHub/jungjulie.com/content/post/2-Happy-Valentines-Day/ToScrap/your_messages.html"
# Read the HTML from that file
raw <- read_html(url)

# this is the list of people I had messages with
people <- raw %>%
  html_nodes("._2lek a") %>% 
  html_text %>% 
  data_frame()

Now I want to look only at the messages exchanged with K. First I’ll open an HTML session on that conversation’s file and read the page into R.

# making html session
KLsession <- html_session("file:///Users/juliejung/Documents/GitHub/jungjulie.com/content/post/2-Happy-Valentines-Day/ToScrap/KLmessage.html")
# reading the same file into R
KLwebpage <- read_html("/Users/juliejung/Documents/GitHub/jungjulie.com/content/post/2-Happy-Valentines-Day/ToScrap/KLmessage.html")

Then, I’ll use CSS selectors to scrape the sender names. To find the HTML node I wanted, I used the SelectorGadget Chrome plugin (https://selectorgadget.com/). Then I converted the data to text and stuffed it in a data frame.

KLsender_data <- KLwebpage %>%
  html_nodes("._2lel") %>% # Use a CSS selector (found with SelectorGadget) to scrape the sender names
  html_text() %>% # Convert the nodes to text
  data_frame()

Next, I renamed the column from the default “.”, which was notably less helpful, to “sender”.

KLsender_data <- KLsender_data %>% 
  rename(sender = ".")
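To see what the selector step does without a real download, here is a minimal sketch on a made-up HTML snippet. The snippet and the toy_* names are assumptions for illustration; only the "._2lel" class is taken from the code above, and a real export will be structured differently. It also names the column when the data frame is created, which is one way to avoid the rename step.

```r
library(rvest)

# Made-up HTML standing in for a message file (not real export markup)
toy_html <- '<html><body>
  <div class="_2lel">Julie</div>
  <div class="_2lel">K</div>
  <div class="_2lel">Julie</div>
</body></html>'

toy_page <- read_html(toy_html)

toy_senders <- toy_page %>%
  html_nodes("._2lel") %>%  # every node with class "_2lel"
  html_text()               # keep only the text inside each node

# Naming the column up front, so there is no "." column to rename
toy_sender_data <- data.frame(sender = toy_senders, stringsAsFactors = FALSE)
print(toy_sender_data)
```

If the selector matches nothing, html_text() returns an empty character vector, so checking nrow() on the result is a quick sanity test.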

Now you can clear the selection in SelectorGadget and select all the messages instead. Visually check that every message is highlighted, and add or remove elements by clicking with your cursor.

KLmessage_data <- KLwebpage %>%
  html_nodes("._2let div:nth-child(2)") %>%
  html_text() %>%
  data_frame()

Rename the column “message”.

KLmessage_data <- KLmessage_data %>% 
  rename(message = ".")

Do the same for time.

KLwhen_data <- KLwebpage %>%
  html_nodes("._2lem") %>%
  html_text() %>%
  data_frame()

Rename the column “when”.

KLwhen_data <- KLwhen_data %>% 
  rename(when = ".")

Next, split the timestamps into separate month, day, year, and time columns, which are more useful.

KLwhen_data <- KLwhen_data %>%
  separate(when, c("monthday", "year", "time"), ",")

KLwhen_data <- KLwhen_data %>%
  separate(monthday, c("month", "day"), " ")
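Here is a small sketch of the two separate() calls on made-up timestamps. I'm assuming the export writes times like "Feb 14, 2019, 10:30 PM"; that format is my guess from the code above, so check it against your own file. The toy_* names are hypothetical. Splitting on ", " (comma plus space) instead of "," keeps stray leading spaces out of the year and time columns.

```r
library(dplyr)
library(tidyr)

# Made-up timestamps in the assumed "Mon D, YYYY, H:MM AM/PM" format
toy_when <- data.frame(
  when = c("Feb 14, 2019, 10:30 PM", "Dec 3, 2018, 9:05 AM"),
  stringsAsFactors = FALSE
)

toy_split <- toy_when %>%
  separate(when, c("monthday", "year", "time"), sep = ", ") %>%  # three comma-separated pieces
  separate(monthday, c("month", "day"), sep = " ")               # "Feb 14" becomes "Feb" and "14"

print(toy_split)
```

All four columns come out as character strings, which is why the later steps still have to parse the time and date.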

Now we have successfully scraped all the features from the webpage that we want. Let’s combine them to create a comprehensive dataframe!

# Combine the three data frames into one
KL_df<-data.frame(Sender = KLsender_data, Message = KLmessage_data, When = KLwhen_data)
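One thing that tripped me up is how data.frame() names the combined columns. A toy sketch with made-up values (the toy_* names are hypothetical): a one-column data frame keeps its own column name, while a multi-column one gets the argument name as a prefix. That is why the code later refers to both message and When.month.

```r
# Made-up stand-ins for the three scraped data frames
toy_sender  <- data.frame(sender = c("Julie", "K"), stringsAsFactors = FALSE)
toy_message <- data.frame(message = c("hi", "hello"), stringsAsFactors = FALSE)
toy_when    <- data.frame(month = c("Feb", "Feb"), day = c("14", "14"),
                          stringsAsFactors = FALSE)

toy_df <- data.frame(Sender = toy_sender, Message = toy_message, When = toy_when)

# Single-column inputs keep their own name; the two-column input is prefixed
print(names(toy_df))
```

This only works when all three pieces have the same number of rows, so it is worth comparing nrow() for each before combining.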

Visualize the data for the first time! Here I’ve grouped our message frequency by month and ordered the months from fewest to most messages.

KL_df %>% 
  group_by(When.month) %>% 
  summarise(n = n()) %>% 
  ggplot(aes(x = reorder(When.month, n), y = n)) +
  geom_col(aes(fill = When.month))
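If you want to check the numbers behind a plot like this, here is a sketch of just the counting step on made-up data (toy_msgs and monthly_counts are hypothetical names). Sorting the rows with arrange(n) gives the same fewest-to-most order that reorder(When.month, n) gives the bars.

```r
library(dplyr)

# Made-up messages: three in Jan, two in Feb, one in Mar
toy_msgs <- data.frame(
  When.month = c("Jan", "Jan", "Feb", "Jan", "Mar", "Feb"),
  stringsAsFactors = FALSE
)

monthly_counts <- toy_msgs %>%
  group_by(When.month) %>%
  summarise(n = n()) %>%  # one row per month with its message count
  arrange(n)              # fewest messages first

print(monthly_counts)
```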

Here we can see that we exchange messages far more often in the winter months than in the summer months. This is most definitely an artifact of my research schedule. For the entire lifespan of our relationship (since 2015), I’ve been conducting field research on frogs in Panama during the summer months (June to August). During that time we switch from Facebook Messenger, which uses more international data, to something like Viber or WhatsApp. We also usually take a little vacation together either before or after the field season to make up for the time we’re apart, which might explain why September and May are the two months with the lowest message frequency.

To explore these data further, I converted the times from 12-hour character strings to 24-hour time.

KL_df$When.time <- format(strptime(KL_df$When.time, "%I:%M %p"), format="%H:%M:%S")
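A quick sketch of that conversion on made-up times (toy_times is a hypothetical name). Note that "%p" depends on your locale, so this assumes an English locale where AM and PM are recognized; anything that fails to parse comes back as NA.

```r
# Made-up 12-hour times, including the tricky midnight case
toy_times <- c("10:30 PM", "9:05 AM", "12:15 AM")

# Parse as 12-hour clock time with AM/PM, then print as 24-hour time
toy_times_24 <- format(strptime(toy_times, "%I:%M %p"), format = "%H:%M:%S")

print(toy_times_24)

# Count how many times failed to parse
print(sum(is.na(toy_times_24)))
```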

Then I combined the day, month, and year into a single string and parsed it as a date.

# putting it in day month year
KL_df <- KL_df %>% 
  mutate(dmy = paste(When.day, When.month, When.year, sep = " "))

# getting the dates properly formatted 
KL_df$dmy <- as_date(x = KL_df$dmy, format = "%d %B %Y", tz = 'EST')
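Here is the same idea as a toy sketch, with base R's as.Date() swapped in for as_date() so it runs without lubridate. The values and toy_* names are made up, and the month names assume an English locale.

```r
# Made-up date pieces, all stored as character strings
toy_dates <- data.frame(
  day   = c("14", "3"),
  month = c("Feb", "Dec"),
  year  = c("2019", "2018"),
  stringsAsFactors = FALSE
)

# Paste the pieces into "14 Feb 2019", then parse day / abbreviated month / year
toy_dates$date <- as.Date(
  paste(toy_dates$day, toy_dates$month, toy_dates$year),
  format = "%d %b %Y"
)

print(toy_dates$date)
```

Once the column is a real Date, sorting and plotting on a time axis behave properly, which is not the case for the character version.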

This is a chart of number of messages per day, throughout the history of our relationship. I can see patterns like a general uptick in messages but it’s too granular to be very useful.

KL_df %>% 
  group_by(dmy) %>% 
  summarise(n = n()) %>% 
  arrange(dmy) %>% 
  ggplot(aes(x = dmy, y = n)) +
  geom_col()

This cuts the data up by week, which is more useful!

KL_df$week <- as.Date(cut(KL_df$dmy, breaks = "week"))

KL_df %>% 
  group_by(week) %>% 
  summarise(n = n()) %>% 
  arrange(week) %>% 
  ggplot(aes(x = week, y = n)) +
    geom_col()
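To make the binning less mysterious, here is a toy sketch of what cut() does to a few made-up dates (toy_days and the other toy_* names are hypothetical). With breaks = "week", each date is mapped to the Monday that starts its week, which is the default for cut() on dates; with breaks = "month", each date is mapped to the first of its month.

```r
# Made-up dates: a Thursday, the following Sunday, and the Monday after that
toy_days <- as.Date(c("2019-02-14", "2019-02-17", "2019-02-18"))

# Each date is labelled with the start of its week (weeks start on Monday)
toy_week <- as.Date(cut(toy_days, breaks = "week"))
print(toy_week)

# Each date is labelled with the first day of its month
toy_month <- as.Date(cut(toy_days, breaks = "month"))
print(toy_month)
```

Grouping by these labels is what turns thousands of daily points into a handful of weekly or monthly bars.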

This cuts the data up by month, also useful!

# same thing, cutting by month
KL_df$month_cut <- as.Date(cut(KL_df$dmy, breaks = "month"))

# chart of monthly messages over time
KL_df %>% 
  group_by(month_cut) %>% 
  summarise(n = n()) %>% 
  arrange(month_cut) %>% 
  ggplot(aes(x = month_cut, y = n)) +
  geom_col() +
  scale_x_date(date_breaks = "1 month") +
  labs(title = "Facebook Messages Exchanged Over Time",
       subtitle = "With One Person Since 2015",
       x = "Date",
       y = "Number of FB Messages Exchanged") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1),
        panel.background = element_rect(fill = "lightblue"),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank())

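# make sure the messages are character strings, then lowercase them
KL_df$message<-as.character(KL_df$message)
KL_df$message<-tolower(KL_df$message)

# save the combined data frame to a CSV file
write.csv(KL_df,'kl-jj-msg-mining.csv')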

Making a prettier graphic:

library(ggplot2)
library(ggrepel)
library(extrafont)
library(ggthemes)
library(reshape)
library(grid)
## Registering fonts with R
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:lubridate':
## 
##     stamp
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
## The following object is masked from 'package:dplyr':
## 
##     rename
library(scales)
library(RColorBrewer)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
my_theme <- function() {
  # Define colors for the chart
  palette <- brewer.pal("Greys", n=9)
  color.background = palette[2]
  color.grid.major = palette[4]
  color.panel = palette[3]
  color.axis.text = palette[9]
  color.axis.title = palette[9]
  color.title = palette[9]
  # Create basic construction of chart
  theme_bw(base_size=9, base_family="Palatino") + 
  # Set the entire chart region to a light gray color
  theme(panel.background=element_rect(fill=color.panel, color=color.background)) +
  theme(plot.background=element_rect(fill=color.background, color=color.background)) +
  theme(panel.border=element_rect(color=color.background)) +
  # Format grid
  theme(panel.grid.major=element_line(color=color.grid.major,size=.25)) +
  theme(panel.grid.minor=element_blank()) +
  theme(axis.ticks=element_blank()) +
  # Format legend
  theme(legend.position="bottom") +
  theme(legend.background = element_rect(fill=color.background)) +
  theme(legend.text = element_text(size=8,color=color.axis.title)) + 
  theme(legend.title = element_blank()) + 
  
  #Format facet labels
  theme(strip.text.x = element_text(size = 8, face="bold"))+
  # Format title, axis labels, and tick labels
  theme(plot.title=element_text(color=color.title, size=28)) +
  theme(axis.text.x=element_text(size=8)) +
  theme(axis.text.y=element_text(size=8)) +
  theme(axis.title.x=element_text(size=8)) +
  theme(axis.title.y=element_text(size=8)) +
  #Format title and facet_wrap title
  theme(strip.text = element_text(size=8), plot.title = element_text(size = 16, colour = "black", vjust = 1, hjust=0))+
    
  # Plot margins
  theme(plot.margin = unit(c(.2, .2, .2, .2), "cm"))
}
KL_df %>% 
  group_by(month_cut) %>% 
  summarise(n = n()) %>% 
  arrange(month_cut) %>% 
  ggplot(aes(x = month_cut, y = n)) +
  my_theme() +
  geom_point(size=1) +
  geom_line(size=0.6)+
  scale_x_date(labels = date_format("%b %Y"), date_breaks = "1 month")+
  labs(title = "Facebook Messages Exchanged Over Time",
       subtitle = "With One Person Since 2015",
       x = "Date",
       y = "Number of FB Messages Exchanged") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))

It has been fun to match up dates with special events and life occurrences. You could even add this info to your plot interactively if you want. I initially used Plotly to introduce these elements, but as this post has garnered more page views than I expected, I’ve edited that section out since it contained more personal info than I wanted the public sphere of the world wide web to know :P But I’ll make a separate post about interactive plots with Plotly soon!

Next steps

I would love to use the tm package to further analyze the words in our messages, but it requires a newer version of R than I can currently upgrade to on my ancient laptop :’( When I figure out how to install an old version of the package (or finally get a new laptop), I plan on text stemming, making a pretty word cloud, and performing sentiment analysis on our messages. I’m particularly interested in when we started using the word “love” and whether our message frequency is inversely correlated with the number of messages exchanged between me and my mom. My umma has this hypothesis that I communicate with her much more when I’m single, and I’d like to see if she’s on to something. It might be trickier than I think to figure this out because my mom and I message in Korean about as much as we do in English, and I’m not sure whether the available methods are language dependent. It’ll be fun to find out!