When excel files get delivered through e-mail attachments, how can we extract the data and consolidate into a single table?
Here I present an automated process to extract the attachments from Outlook emails and consolidate them using R. I use the RDCOMClient
package (github.com/omegahat/RDCOMClient or stat.berkeley.edu/~nolan/stat133/Fall05/lec..). Thus, this solution will work only on Windows.
For instance, my research team is stationed at remote areas, where they have no access to internet. They record measurements hourly and record them in an excel template. At the end of the day, they will email me the excel file as an attachment with a common subject (REA0001 - Measurements
). I need to extract these hourly measurements into one table for analysis. If I receive 50 emails in a day, am I going to manually open each email attachment and, copy the data into a file? This is a very time consuming, tedious and error prone approach. Thus, I would use the following piece of code to automate my job.
Step 1 : Extract emails with the same subject from outlook mailbox
In this example, every email is sent by the same subject. Therefor, I use the subject to search for the email in the mail box. Also the Outlook application needs to be opened while running the code.
library(RDCOMClient)
library(dplyr)
library(stringr)
working_dir<-"C:/Users/geethika.wijewardena/Workspace/R-extract-email-attachments/"
#--------------------------------------------
# Extract emails from outlook
#--------------------------------------------
# Create a new instance of Outlook COM server class
outlook_app <- COMCreate("Outlook.Application")
# Create a search object to search the mail box by given criteria (e.g. subject)
search <- outlook_app$AdvancedSearch(
"Inbox",
"urn:schemas:httpmail:subject = 'REA0001 - Measurements'"
)
# Allow some time for the search to complete
Sys.sleep(5)
results <- search$Results()
Step 2: Filter emails by date and extract data in attachment
The results
object above contains all emails with the given subject. However, since I need only the ones I received today, I filter the emails by date. Next, for each email, I save the attachment. I present two approaches to save the attachment by:
- a) filename of the attachment and
- b) name of the sender (in case if the filename is inconsistent).
In approach (a), each saved attachment is read and loaded into a dynamic variable of its filename within the loop The Measurement
field is renamed by the filename. Later, all these tibbles are joined/consolidated into a single table.
#------------------------------------------------------------------------------
# Approach (a)
# Extract emails and save the attachment by the name of the attachment
#------------------------------------------------------------------------------
# Filter search results by receive date
for (i in 1:results$Count()){
receive_date <- as.Date("1899-12-30") + floor(results$Item(i)$ReceivedTime())
if(receive_date >= as.Date("2019-10-09")) {
# Get the attachment of each email and save it by the name of the attachment
# in a given file path
email <- results$Item(i)
attachment_file <- paste0(working_dir,email$Attachments(1)[['DisplayName']])
email$Attachments(1)$SaveAsFile(attachment_file)
# Read each attachment and assign data into a variable (which is the filename)
# generated dynamically,
df_name <- str_sub(email$Attachments(1)[['DisplayName']],1,-6)
data <- readxl::read_excel(attachment_file, col_types =c("date", "numeric"),
col_names = T) %>%
rename(!!df_name := "Case")%>%
mutate(Hour = str_sub(as.character(Hour),11,nchar(as.character(Hour))))
assign(df_name, data)
}
}
# Consolidate all dataframes into one
dat <- lapply(ls(pattern="REA"), function(x) get(x)) %>%
purrr::reduce(full_join, by = "Hour")
In approach (b), getDataFromEmailAtt() function filters each email by the date, saves them by the name of the sender and returns the tibble with the Measurement
field renamed by the sender's name. This function is called within a loop which joins/consolidates each data set into one table.
#------------------------------------------------------------------------------
# Approach (b)
# Extract emails and save the attachment by the name of the sender
#------------------------------------------------------------------------------
getDataFromEmailAtt<- function(results, i){
# Function to extract data from email attachement, save it it a specified
# directory by the name of the sender, read the saved excel file and return
# a dataframe with a given colum named by the sender's name.
# Args: results - object returned by search$Results() of RDCOMClient for
# outlook applications.
# i - order number of the extracted emails in the results object
# Returns: Dataset of the email attachment with given column renamed by the
# sender's name
#
receive_date <- as.Date("1899-12-30") + floor(results$Item(i)$ReceivedTime())
if(receive_date >= as.Date("2019-10-09")) {
# Get the attachment of each email and save it by the name of the attachment
# in a given file path
email <- results$Item(i)
attachment_file <- paste0(working_dir,email[['SenderName']],'.xlsx')
email$Attachments(1)$SaveAsFile(attachment_file)
data <- readxl::read_excel(attachment_file, col_names = T) %>%
rename(!!df_name := "Measurement")%>%
mutate(Hour = str_sub(as.character(Hour),11,nchar(as.character(Hour))))
return(data)
}
}
# Get the first dataset
dat <- getDataFromEmailAtt(results, i=1)
# Append datasets of the other emails
for (i in 2:results$Count()){
data <- getDataFromEmailAtt(results, i)
dat <- dat %>% inner_join(data, by=c('Hour'))
Material of this example is at my GitHub repo github.com/geethika01/R-extract-email-attac...