See this tutorial as a Colab Notebook
This tutorial is designed to help you download your own GMail data and analyze it for sentiment. I have left out any results from this tutorial in the interest of privacy – I don’t want people who sent me emails to have their information displayed. Of course, the code is provided “as is”, without warranty of any kind.
First you’ll need to download your GMail data from Google here: https://takeout.google.com/
Remember that the downloaded file contains every email you’ve sent or received, including any passwords, social security numbers, etc. contained within. Be safe and guard your data! The export can take a while, depending on the amount of data you have.
If you don’t already have it installed, you’ll also need to install StanfordCoreNLP using the following instructions: https://stackoverflow.com/questions/32879532/stanford-nlp-for-python.
We need to import the relevant libraries:
import mailbox
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from dateutil import parser  # used later to parse the Date headers
from pycorenlp import StanfordCoreNLP
And load in the “.mbox” file containing your mailbox information that you downloaded from Google:
mb = mailbox.mbox('YourEmails.mbox')
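Before processing a large export, it can help to confirm that the file opens and contains what you expect. The following is a minimal sketch using only the standard library; make_demo_mbox and summarize_mbox are hypothetical helper names, and the demo runs against a throwaway mbox file with made-up messages so that no real mail is needed.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def make_demo_mbox(path, n_messages=2):
    """Write a throwaway mbox file containing a few made-up messages."""
    mb = mailbox.mbox(path)
    try:
        for n in range(n_messages):
            msg = EmailMessage()
            msg['From'] = 'me@example.com'
            msg['Subject'] = f'Test message {n}'
            msg.set_content('Hello there.')
            mb.add(msg)
        mb.flush()
    finally:
        mb.close()


def summarize_mbox(path, limit=3):
    """Return the message count and the first few Subject headers."""
    mb = mailbox.mbox(path)
    try:
        subjects = [msg['Subject'] for _, msg in zip(range(limit), mb)]
        return len(mb), subjects
    finally:
        mb.close()


# Demo on a temporary file; point summarize_mbox at your own export instead.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.mbox')
make_demo_mbox(demo_path)
print(summarize_mbox(demo_path))
```

Calling summarize_mbox('YourEmails.mbox') on a real export should give a quick sanity check, though counting messages in a very large file may take some time.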
Next we’ll define some functions we’ll need. The first function, findcharsets, collects the character sets declared in a message:
def findcharsets(msg):
    # Collect every character set declared by the message or its parts
    charsets = set()
    for c in msg.get_charsets():
        if c is not None:
            charsets.add(c)
    return charsets
The following function, getBody, descends into the first part of a multipart message and returns its decoded text:
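def getBody(msg):
    # Descend into the first part until we reach a non-multipart payload
    while msg.is_multipart():
        msg = msg.get_payload()[0]
    t = msg.get_payload(decode=True)
    if t is None:
        return None
    # Try each declared charset; decoding only once avoids calling decode on a str
    for charset in findcharsets(msg):
        try:
            return t.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue
    return t.decode('utf-8', errors='replace')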
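To see the "first part" logic in isolation, here is a small sketch that builds a made-up multipart message and extracts its plain-text part. It is a simplified stand-in rather than the function above: first_text_part is a hypothetical name, and it relies on the standard library's get_content_charset with a UTF-8 fallback, which is an assumption about what a reasonable default looks like.

```python
from email.message import EmailMessage


def first_text_part(msg):
    """Follow the first part of each multipart container and decode it."""
    while msg.is_multipart():
        msg = msg.get_payload()[0]
    raw = msg.get_payload(decode=True)  # bytes after undoing the transfer encoding
    charset = msg.get_content_charset() or 'utf-8'  # assumed fallback
    return raw.decode(charset, errors='replace')


# A made-up message with a plain-text part followed by an HTML alternative
demo = EmailMessage()
demo.set_content('Caf\u00e9 at noon?')
demo.add_alternative('<p>Caf\u00e9 at noon?</p>', subtype='html')

body = first_text_part(demo)
print(body.strip())
```

Taking only the first part is a design choice that favors the plain-text version of an email; messages whose first part is an attachment or HTML would need extra handling.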
The following code loops through the messages, collects the keys (headers) associated with each one along with the text body of the email, and puts them into a Pandas dataframe. Initially we can start with just 20 emails, but you’ll want to increase that once you’re sure the basics are running.
mbox_dict = {}
number_of_emails = 20
for i in range(number_of_emails):
    msg = mb[i]
    mbox_dict[i] = {}
    try:
        if msg:
            # Copy every header into the row for this email
            for header in msg.keys():
                mbox_dict[i][header] = msg[header]
            # Flatten whitespace and strip the '>' quote markers from the body
            mbox_dict[i]['Body'] = getBody(msg).replace('\n',' ').replace('\r',' ').replace('\t',' ').replace('>','')
    except AttributeError:
        print('Attribute Error')
    except TypeError:
        print('TypeError')
df = pd.DataFrame.from_dict(mbox_dict, orient='index')
Next, look at the contents of your dataframe to check that it has populated correctly by evaluating the name of your dataframe:
df
You’ll see columns which are the keys associated with the emails, such as “X-Gmail-Labels”, “X-GM-THRID”, etc. One column, “Body”, will contain the actual content of the email.
The following will remove those emails with no content, keep only the emails you sent, and reset the index of the dataframe for good measure.
df = df[df['Body'].notnull()]
# na=False treats emails without a From header as not sent by you
df = df[df['From'].str.contains('Your Name Here', na=False)]
df = df.reset_index()
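One pitfall with this kind of filter is a row whose From header is missing, since a missing value cannot be used directly in a boolean mask. The sketch below uses a small made-up dataframe (df_demo and its contents are purely illustrative) and the na parameter of str.contains to treat missing senders as non-matches.

```python
import pandas as pd

# Made-up rows standing in for parsed emails
df_demo = pd.DataFrame({
    'From': ['Your Name Here <me@example.com>',
             'Someone Else <other@example.com>',
             None],
    'Body': ['See you at noon.', 'Thanks!', None],
})

# Drop rows with no body, then keep only rows whose sender matches
sent = df_demo[df_demo['Body'].notnull()]
sent = sent[sent['From'].str.contains('Your Name Here', na=False)]
sent = sent.reset_index(drop=True)

print(sent)
```

Note that str.contains interprets the pattern as a regular expression by default, so a name containing characters such as parentheses or dots may need regex=False.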
Next up, we run StanfordCoreNLP locally and send it our emails one at a time so it can assign sentiment scores. What actually happens here is that it assigns a sentiment value to each sentence in the email, and we then average those values. This is okay for now, but we should keep in mind this average may or may not be a good measure of the actual overall sentiment of the email.
nlp = StanfordCoreNLP('http://localhost:9000')
dates = []
sentiments = []
for i, row in df.iterrows():
    res = nlp.annotate(row['Body'], properties={
        'annotators': 'sentiment',
        'outputFormat': 'json',
        'timeout': 1000,
    })
    # One value per sentence; stays empty if the server did not return JSON
    sentiment_values = []
    if isinstance(res, dict):
        for s in res["sentences"]:
            sentiment_values.append(int(s["sentimentValue"]))
    # Average over sentences; NaN marks emails with no usable result
    avg = np.mean(sentiment_values) if sentiment_values else np.nan
    dates.append(row['Date'])
    sentiments.append(avg)
    print("ave sent", avg, '_', row['Body'][0:40])
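The averaging step can be tried without a running server. The following sketch feeds a hand-written dictionary, shaped like the "sentences" list the loop above reads, into a hypothetical helper called average_sentiment. The mock values are invented for illustration and do not come from CoreNLP.

```python
from statistics import mean


def average_sentiment(res):
    """Average the per-sentence sentiment values, or None if there are none."""
    if not isinstance(res, dict):
        return None  # e.g. the server returned an error string instead of JSON
    values = [int(s['sentimentValue']) for s in res.get('sentences', [])]
    return mean(values) if values else None


# Invented response: three mid-scale sentences and one low-scoring sentence
mock_response = {
    'sentences': [
        {'sentimentValue': '2'},
        {'sentimentValue': '2'},
        {'sentimentValue': '2'},
        {'sentimentValue': '0'},
    ]
}

print(average_sentiment(mock_response))
print(average_sentiment('request timed out'))
```

This illustrates the caveat about averaging: the single low-scoring sentence is diluted by the other three, so the email-level number alone does not reveal that one sentence was scored very differently. Keeping the minimum or the full list alongside the mean is one possible alternative.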
Next, we parse the dates and change them to an easier format, the ordinal day number:
# Parse each Date header, keep only the calendar date, and convert it to an ordinal
d = [parser.parse(x) for x in dates]
d = [x.date() for x in d]
d = [x.toordinal() for x in d]
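For readers who want to see what the ordinal conversion produces, here is a small sketch on a made-up Date header. It swaps in the standard library's email.utils.parsedate_to_datetime instead of dateutil, which is an assumption that the headers follow the usual email date format; header_to_ordinal is a hypothetical helper name.

```python
from datetime import date
from email.utils import parsedate_to_datetime


def header_to_ordinal(date_header):
    """Convert an email Date header to a proleptic Gregorian ordinal day."""
    return parsedate_to_datetime(date_header).date().toordinal()


example_header = 'Tue, 03 Sep 2019 10:15:00 -0700'  # made-up example
ordinal = header_to_ordinal(example_header)

# The ordinal is a plain integer, so it can be converted back for axis labels
print(ordinal, date.fromordinal(ordinal).isoformat())
```

Ordinals are convenient for fitting a line because they are plain integers counted in days, and date.fromordinal recovers a readable date when labeling a plot.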
Now we make the lists into numpy arrays, fit a line (just to see), and make a scatter plot, ordinal date vs sentiment, to see if there are any obvious trends over time:
x = np.array(d)
y = np.array(sentiments)
# This pulls only indices where both x and y are finite so we can fit a line
idx = np.isfinite(x) & np.isfinite(y)
# Fit a line to the data and print the result
ab = np.polyfit(x[idx], y[idx], 1)
print(ab)
# Plot the data
plt.figure(figsize=(10,5))
plt.scatter(x[idx], y[idx], 1, alpha=0.5)
plt.title("Sentiment of Sent Emails")
plt.show()
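To make the output of the line fit easier to interpret, the sketch below runs the same masking and fitting steps on a tiny made-up dataset that lies exactly on y = 2x + 1, with one missing value. The numbers are invented for illustration only.

```python
import numpy as np

# Made-up data on the line y = 2x + 1, with one missing y value
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.0, 3.0, np.nan, 7.0, 9.0])

# Keep only the points where both coordinates are finite
mask = np.isfinite(x_demo) & np.isfinite(y_demo)

# For degree 1, polyfit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(x_demo[mask], y_demo[mask], 1)
print(slope, intercept)
```

With ordinal dates on the x axis, the slope reads as the change in average sentiment per day, so even a real trend would likely show up as a very small number.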
Here is a plot of the sentiment of my sent GMail emails, plotted over time, including emails from August 2015 through September 2019. I didn’t notice any obvious long-term trends, but I did notice increases in the frequency of sent emails during certain periods. Unsurprisingly, these were around when I had major projects completing.
Once you have your data in this format, there are many possible analyses you might want to look at in more detail, like sent or received rates over time, response times as a function of time, who you are sending emails to or receiving emails from, etc.
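As one example of such a follow-up, sent rates over time can be approximated by counting emails per calendar month. This is a rough sketch using only the standard library; emails_per_month is a hypothetical helper, the Date strings are made up, and headers that cannot be parsed are simply skipped.

```python
from collections import Counter
from email.utils import parsedate_to_datetime


def emails_per_month(date_headers):
    """Count emails per 'YYYY-MM' bucket, skipping unparseable headers."""
    counts = Counter()
    for header in date_headers:
        try:
            when = parsedate_to_datetime(header)
        except (TypeError, ValueError):
            continue  # missing or malformed Date header
        counts[when.strftime('%Y-%m')] += 1
    return dict(sorted(counts.items()))


# Made-up Date headers for illustration
demo_dates = [
    'Tue, 03 Sep 2019 10:15:00 -0700',
    'Wed, 04 Sep 2019 08:00:00 +0000',
    'Sat, 15 Aug 2015 12:00:00 +0000',
    'not a date',
]
monthly = emails_per_month(demo_dates)
print(monthly)
```

With a real dataframe, passing the Date column as a list (for example df['Date'].tolist()) should produce counts that can be plotted as a bar chart to look for busy periods.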
Snippets and guidance came from the following sources:
Email import: https://pysd-cookbook.readthedocs.io/en/latest/data/Emails/Email_Data_Formatter.html
https://jellis18.github.io/post/2018-01-17-mail-analysis/
mbox body extraction: https://stackoverflow.com/questions/7166922/extracting-the-body-of-an-email-from-mbox-file-decoding-it-to-plain-text-regard
Stanford Sentiment analysis in python: https://stackoverflow.com/questions/32879532/stanford-nlp-for-python