Analysing the Enron Email Corpus
0. Introduction to NLP and Sentiment Analysis
1. Natural Language Processing with NLTK
3. Build a sentiment analysis program
4. Sentiment Analysis with Twitter
5. Analysing the Enron Email Corpus
6. Build a Spam Filter using the Enron Corpus
First the videos; the transcript is below.
The Enron Email Corpus is one of the biggest publicly available email datasets in the world: almost half a million files spread over 2.5 GB. Normally, emails are very sensitive and rarely released to the public, but because of the shocking nature of Enron’s collapse, everything was released.
Because it is so large, it makes analysis complicated. The question always is: Where do we even start?
In this first video, I give an introduction to Enron and the email corpus. Since the collapse was only 15 years ago (it’s 2016 now), I guess everyone reading this has heard of Enron, a company that was one of the biggest in the world one day, and bankrupt the next.
Download the emails from here. On Windows, you’ll need 7zip to unzip them.
Once you download the files, spend some time looking at their structure, and how they are arranged. You will find the actual emails are in MIME format.
We won’t be using IPython in this example, as it kept crashing (and sometimes took down Firefox). I guess it couldn’t handle the gigabytes of data we have to work with.
You will need a code editor. I recommend VS Code.
For running the scripts, you will need to use the command line. On Windows, I recommend PowerShell, a supercharged version of the old DOS prompt. Linux/Mac users can stick to their default terminals.
Let’s get started. The dataset is huge, so we will go step by step. As the old joke goes, how do you eat an elephant?
One bite at a time!
We will take mini steps. Some of these might seem easy, but still follow them, as they will combine into the final whole.
Loop through all the files in the dataset
All the code is here. The file is Enron1.py
To start off, we will see how to loop through all the files and folders in the email set.
We import the os library:
import os
In the next step, we need to provide the path to where the Enron emails are:
rootdir = "C:\\Users\\Shantnu\\Desktop\\Data Sources\\maildir\\lay-k"
Warning for Windows users: make sure you replace each backslash (\) in the path with a double backslash (\\).
Also note that we are starting in the lay-k directory. These are the emails of Kenneth Lay, one-time CEO and director, who was convicted of fraud and conspiracy in 2006 but died before he could be sentenced. We are doing it this way to speed up our code, as otherwise it may take a long time to run.
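If the doubled backslashes get tiring, a raw string is one alternative. The snippet below is only a sketch to show that the two spellings give the same path; the path is the one from this tutorial, so swap in your own. PureWindowsPath is used here just so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath

# The escaped form used in this tutorial.
escaped = "C:\\Users\\Shantnu\\Desktop\\Data Sources\\maildir\\lay-k"

# The same path as a raw string: the r prefix stops Python from
# treating the backslashes as escape characters.
raw = r"C:\Users\Shantnu\Desktop\Data Sources\maildir\lay-k"

print(escaped == raw)

# PureWindowsPath parses Windows-style paths without touching the disk.
folder_name = PureWindowsPath(raw).name
print(folder_name)
```

Either form can be assigned to rootdir; the rest of the code does not change.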
We next use the os.walk() function to loop over the directories:
for directory, subdirectory, filenames in os.walk(rootdir):
    print(directory, subdirectory, len(filenames))
C:\Users\Shantnu\Desktop\Data Sources\maildir\lay-k ['all_documents', 'business', 'calendar', 'compaq', 'deleted_items',
'discussion_threads', 'elizabeth', 'enron', 'family', 'inbox', 'notes_inbox', 'sec_panel', 'sent', 'sent_items', '_sent
'] 0
C:\Users\Shantnu\Desktop\Data Sources\maildir\lay-k\all_documents [] 1127
C:\Users\Shantnu\Desktop\Data Sources\maildir\lay-k\business [] 2
C:\Users\Shantnu\Desktop\Data Sources\maildir\lay-k\calendar [] 8
C:\Users\Shantnu\Desktop\Data Sources\maildir\lay-k\compaq [] 1
C:\Users\Shantnu\Desktop\Data Sources\maildir\lay-k\deleted_items [] 1126
This will print the directory, its subdirectories and the number of files (I’m not printing the actual file names, as that clutters the output, but the os.walk function actually returns each file it finds).
This is it for the first example. It was very short, but it shows us how we can loop over all the files in the email set. Now, change the root directory so that instead of pointing to the lay-k folder, it points to the main one, and run the code again.
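If you want to see what os.walk() hands back without waiting on the real dataset, here is a small sketch that builds a throwaway folder tree and walks it. The folder names (inbox, sent) and the file counts are made up for the example; they just imitate the maildir layout.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as rootdir:
    # Build a tiny stand-in for maildir: two folders with a few files each.
    for folder, count in (("inbox", 3), ("sent", 2)):
        os.makedirs(os.path.join(rootdir, folder))
        for i in range(1, count + 1):
            with open(os.path.join(rootdir, folder, str(i)), "w") as f:
                f.write("placeholder")

    total_files = 0
    # os.walk yields one (directory, subdirectories, filenames) tuple per folder.
    for directory, subdirectories, filenames in os.walk(rootdir):
        print(os.path.basename(directory), sorted(subdirectories), len(filenames))
        total_files += len(filenames)

print("Total files:", total_files)
```

The top-level folder reports its two subfolders and no files, and each subfolder then reports its own files, which is the same pattern as the lay-k output above.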
You will see it takes a fair time to run. If just printing the directories and files takes so long, how long would analysing them take?
Next, we learn how to open the emails.
Opening the Emails in Python
We work with enron2.py.
Go to the lay-k folder, then all_documents, and open file 1.
Have a look at the MIME structure.
Originally, I was going to parse the file by hand (using regexes). I even wrote some code for it, but it was hundreds of lines long.
I searched around till I found a good library to open emails in Python, and this reduced the code from hundreds to tens of lines.
from email.parser import Parser
The email.parser is the library we want.
file_to_read = "C:\\Users\\Shantnu\\Desktop\\Data Sources\\maildir\\lay-k\\all_documents\\1"
Again, put in the path to the file you want to read. Windows users, remember to use double backslashes. This is the same file you viewed earlier. Let’s open the file:
with open(file_to_read, "r") as f:
    data = f.read()
We now create an email parser instance:
email = Parser().parsestr(data)
Once that is done, getting the to, from, subject etc fields is fairly easy.
print("\nTo: " , email['to'])
print("\n From: " , email['from'])
print("\n Subject: " , email['subject'])
To: mmilken@knowledgeu.com
From: rosalee.fleming@enron.com
Subject: Re: testing
The only thing slightly confusing is how to get the body. You need to call a function for that:
print("\n \n Body: " , email.get_payload())
Body: Hi -
We did receive the e-mail.
Rosalee for Ken Lay
"Michael Milken" <mmilken@knowledgeu.com> on 07/02/99 10:21:40 AM
To: Kenneth Lay/Corp/Enron@Enron
cc:
Subject: testing
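To experiment with the parser without touching the dataset, you can also parse a message straight from a string. This is only a sketch: the addresses and text below are made up, and the variable is called message (rather than email) purely to avoid clashing with the name of the email package.

```python
from email.parser import Parser

# A made-up message: headers first, then a blank line, then the body.
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: testing\n"
    "\n"
    "We did receive the e-mail.\n"
)

message = Parser().parsestr(raw_message)

print(message["to"])
print(message["FROM"])          # header names are matched case-insensitively
print(message["cc"])            # a header that is not present gives None
print(message.is_multipart())   # a plain text message has a single part

body = message.get_payload()    # for a single-part message this is a string
print(body)
```

Note that a missing header comes back as None instead of raising an error, which matters later when we collect the fields into lists.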
Okay, now that we know how to loop over every single file in the dataset, and open emails, let’s combine the two. We will see who sent and received the most emails. We will also do some basic text analysis of Kenneth Lay’s emails.
Analyse the Emails
The file is Enron3.py.
import os
from email.parser import Parser
rootdir = "C:\\Users\\Shantnu\\Desktop\\Data Sources\\maildir\\lay-k\\family"
We import what we need. This time, we are working in the lay-k\family folder, as it contains the fewest files, and will allow us to run our script quickly.
I have written a function to extract the data from the emails. Let’s look at the whole function, and then we’ll study it line by line:
def email_analyse(inputfile, to_email_list, from_email_list, email_body):
    with open(inputfile, "r") as f:
        data = f.read()
    email = Parser().parsestr(data)
    to_email_list.append(email['to'])
    from_email_list.append(email['from'])
    email_body.append(email.get_payload())
Okay, so the function is:
def email_analyse(inputfile, to_email_list, from_email_list, email_body):
email_analyse() takes 4 parameters: the input email file, a list for the To addresses, a list for the From addresses, and a list for the body (text) of each email.
The first thing to do is open the file, and parse the email with our parser:
with open(inputfile, "r") as f:
    data = f.read()
email = Parser().parsestr(data)
Now that we have created our email instance, we can read the different fields and store them in the lists:
to_email_list.append(email['to'])
from_email_list.append(email['from'])
email_body.append(email.get_payload())
We append the 3 fields to their respective lists, so that email[‘to’] is appended to to_email_list, and so on.
The lists are passed in as arguments, and since Python lists are mutable, the function modifies the caller’s lists in place; nothing needs to be returned. Each time you call this function, it adds to the lists, so they grow with every call.
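If the idea of a function changing a list it was handed feels odd, this tiny sketch shows it in isolation. The function add_sender and the addresses are made up for the illustration; they are not part of the Enron scripts.

```python
def add_sender(sender, from_email_list):
    # append() changes the list object the caller passed in;
    # the function does not need to return anything.
    from_email_list.append(sender)


senders = []
add_sender("a@example.com", senders)
add_sender("b@example.com", senders)

print(senders)
```

After two calls, senders holds both addresses, even though add_sender never returned a value.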
Now we come to the main code:
to_email_list = []
from_email_list = []
email_body = []
We create the 3 empty lists that will be passed to our function.
Next, we call the os.walk() function we saw earlier:
for directory, subdirectory, filenames in os.walk(rootdir):
    for filename in filenames:
        email_analyse(os.path.join(directory, filename), to_email_list, from_email_list, email_body)
The code is fairly easy. We just loop over all the directories and files, and call our email_analyse function each time. The part you might not get is:
os.path.join(directory, filename)
We are doing it this way, because we want the complete path to the file. Normally, filename will just contain the name, like 1, 2 etc. But for our open() function, we need the whole path.
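Here is a quick sketch of what os.path.join() does, using a relative folder so it runs anywhere. The folder and file names are just examples borrowed from the dataset layout.

```python
import os

directory = os.path.join("maildir", "lay-k", "family")
filename = "1"

# join() glues the parts together with the separator of the current
# operating system: a backslash on Windows, a forward slash elsewhere.
full_path = os.path.join(directory, filename)

print(full_path)
print(os.path.basename(full_path))  # back to just the file name
```

This is also why the loop does not build the path by hand with string concatenation: the same script keeps working on Windows, Linux and Mac.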
So the way the code works is: it loops over every single file in the directory, opens it, extracts the 3 fields we want, and appends them to the lists. So far, so easy.
There is a problem with the code.
To show you why, we will write our lists to file:
with open("to_email_list.txt", "w") as f:
    for to_email in to_email_list:
        if to_email:
            f.write(to_email)
            f.write("\n")
We loop over the to_email_list, and write each element to the file. Some of these elements may be None due to parsing problems or corrupt emails, which is why we check this:
if to_email:
I am writing a newline (\n) after each email, so that each email is written to a separate line in the file.
Similarly for the other lists:
with open("from_email_list.txt", "w") as f:
    for from_email in from_email_list:
        if from_email:
            f.write(from_email)
            f.write("\n")
with open("email_body.txt", "w") as f:
    for email_bod in email_body:
        if email_bod:
            f.write(email_bod)
            f.write("\n")
Let’s see the files we created. Ignore the email body file for now. Let’s open from_email_list.txt:
beau@rrhinvestments.com
beau@rrhinvestments.com
mrslinda@lplpi.com
mrslinda@lplpi.com
mrslinda@lplpi.com
nlay@att.net
nlay@att.net
sally.keepers@enron.com
That is correct. One email per line.
Let’s open to_email_list.txt:
kenneth.lay@enron.com
kenneth.lay@enron.com
kenneth.lay@enron.com
klay@enron.com, poppopindc@aol.com, dherrold@enlineresources.com,
nlay@worldnet.att.net, lizard_ar@yahoo.com,
marlenen@rowefurniture.com, aka@zdimensional.com,
rnegri@mindspring.com, bourneb@umsystem.edu, sharon@travelpark.com,
jessi020674@aol.com, jimemerson@hotmail.com,
......... snipped ...................
The first three lines are correct, but in the fourth one, there are a dozen email addresses on one line. Our code didn’t work. Why not?
In the lay-k/family folder, open up the file 4.
You will see that the To field contains dozens of email addresses. When we parse that file, all those email addresses are written as one entry in the list, when each should be added separately.
Fixing our Code
Files: Enron4.py and Enron5.py. The latter is a version of the former that works on the whole dataset, and so has the write to file code removed, to prevent the script taking days to finish.
The video goes into detail on how and why I made the choices I did. In the transcript, I will only give the final version of the code.
Most of the code is similar to the last example, so I will only show the new code.
In the function email_analyse, we add this code:
if email['to']:
    email_to = email['to']
    email_to = email_to.replace("\n", "")
    email_to = email_to.replace("\t", "")
    email_to = email_to.replace(" ", "")
Like I said earlier, some of the fields are empty (due to data corruption, or old MIME formats that can’t be parsed; we don’t know). So we check if the email[‘to’] field exists.
If it does, we remove all the newline characters (\n), all the tabs (\t) and all the spaces. This is to ensure the field is split correctly. So after our replacing, something like:
klay@enron.com, poppopindc@aol.com, dherrold@enlineresources.com,
nlay@worldnet.att.net, lizard_ar@yahoo.com,
marlenen@rowefurniture.com, aka@zdimensional.com,
will become
klay@enron.com,poppopindc@aol.com,dherrold@enlineresources.com,nlay@worldnet.att.net,lizard_ar@yahoo.com,marlenen@rowefurniture.com,aka@zdimensional.com,
All the whitespace is gone. We can now extract all the individual emails:
email_to = email_to.split(",")
This will return a list with all the emails in the file (as the emails are separated by a comma). Note that if the file only had one email, the code will still work, as it will return a list with just one element.
We now loop over the list and append it to the to_email_list:
for email_to_1 in email_to:
    to_email_list.append(email_to_1)
The rest of the function is the same.
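As an aside, the standard library can also do this splitting for us. The sketch below uses getaddresses() from email.utils instead of the replace-and-split approach; it is not what Enron4.py does, just an alternative you may want to try. The To field here is a shortened, partly made-up example (the display name "Lay, Kenneth" is invented to show a case where splitting on commas would go wrong).

```python
from email.utils import getaddresses

# A To field folded over two lines, with a display name that
# itself contains a comma.
to_field = (
    '"Lay, Kenneth" <kenneth.lay@enron.com>, klay@enron.com,\n'
    "\tnlay@worldnet.att.net"
)

# getaddresses() takes a list of header values and returns
# (display name, address) pairs.
pairs = getaddresses([to_field])
addresses = [address for _name, address in pairs]

print(addresses)
```

A plain split(",") would cut the quoted name in half, whereas getaddresses() treats it as part of one address. The manual approach in this tutorial is fine for the lay-k folder; this is simply something to keep in mind if you see odd entries in your lists.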
In the main code, we remove the code to write to a file, as we don’t need it. Instead, we want to count the emails, to find out who the top 10 most popular addresses are.
We will use the Counter class from Python’s collections module for this.
from collections import Counter
print(Counter(to_email_list).most_common(10))
We convert the to_email_list to a Counter object. This will count how often each unique email address appears in the list. We then call the most_common() function to get the ten most common addresses.
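To see what Counter gives back, here is a small sketch on a hand-made list. The list is invented for the example; only the addresses themselves are borrowed from the dataset.

```python
from collections import Counter

to_email_list = [
    "klay@enron.com",
    "kenneth.lay@enron.com",
    "klay@enron.com",
    "jeff.skilling@enron.com",
    "klay@enron.com",
    "kenneth.lay@enron.com",
]

# most_common(n) returns the n most frequent items as (item, count) pairs,
# ordered from the most frequent down.
top_two = Counter(to_email_list).most_common(2)

print(top_two)
```

The result is a list of tuples, which is exactly the shape of the output you will see when we run this on the real lists.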
The final code is:
print("\nTo email addresses: \n")
print(Counter(to_email_list).most_common(10))
print("\nFrom email addresses: \n")
print(Counter(from_email_list).most_common(10))
Remember, we are still in the lay-k (or Kenneth Lay’s) folder. The result is:
To email addresses:
[('kenneth.lay@enron.com', 2039), ('klay@enron.com', 1903), ('jeff.skilling@enron.com', 372), ('mark.koenig@enron.com',
313), ('mark.frevert@enron.com', 304), ('greg.whalley@enron.com', 304), ('steven.kean@enron.com', 278), ('mike.mcconnell
@enron.com', 261), ('jeffrey.mcmahon@enron.com', 251), ('john.sherriff@enron.com', 244)]
From email addresses:
[('rosalee.fleming@enron.com', 856), ('brown_mary_jo@lilly.com', 82), ('leonardo.pacheco@enron.com', 78), ('savont@email
.msn.com', 66), ('tori.wells@enron.com', 58), ('elizabeth.davis@compaq.com', 50), ('no.address@enron.com', 47), ('kather
ine.brown@enron.com', 47), ('mrslinda@lplpi.com', 40), ('lizard_ar@yahoo.com', 36)]
Most of the emails were sent to Lay’s own addresses (as we are looking at his data). It seems he had two email addresses: kenneth.lay@enron.com and klay@enron.com.
The most emails he received were from Rosalee Fleming.
If you look at Enron5.py, it’s the same file, but the root directory is the whole email set. This means it will take longer to run. I have removed the write to file functions, as otherwise the code takes a whole day or two to run. If you run this file, it will find the most common to and from emails in the whole company.
This is the result:
To:
[('richard.shapiro@enron.com', 15149), ('jeff.dasovich@enron.com', 14207), ('tana.jones@enron.com', 12828), ('steven.kea
n@enron.com', 12754), ('sara.shackleton@enron.com', 11433), ('james.steffes@enron.com', 10347), ('mark.taylor@enron.com'
, 9787), ('pete.davis@enron.com', 9281), ('susan.mara@enron.com', 9064), ('paul.kaufman@enron.com', 8522)]
From:
[('kay.mann@enron.com', 16735), ('vince.kaminski@enron.com', 14368), ('jeff.dasovich@enron.com', 11411), ('pete.davis@en
ron.com', 9149), ('chris.germany@enron.com', 8801), ('sara.shackleton@enron.com', 8777), ('enron.announcements@enron.com
', 8587), ('tana.jones@enron.com', 8490), ('steven.kean@enron.com', 6759), ('kate.symes@enron.com', 5438)]
You can Google these people to know who they were. In the To list, the top person is Richard Shapiro. He was the Vice President and lobbyist (“bribery guy”) for Enron. A lot of his emails are about handing dollars to politicians, and getting favourable laws passed. The fact that he received the most emails shows he was in touch with everything that was happening.
The top address in the From list, that is, the person who sent the most emails, is Kay Mann, who was the head of legal for Enron. The fact that she sent so many emails is ironic, seeing as how Enron was breaking every law in the book (keep in mind that most employees, including Kay Mann, were innocent. Only the top executives were guilty, and most went to prison).
But I still found this fact very funny. A company with such an active legal department, and yet the executives ignored (or didn’t care about) the law.
Right, to the final analysis.
Most Common Words in Kenneth Lay’s emails
Make sure you’ve done the NLTK intro (and its part 2) before tackling this.
If you remember, we wrote all of K Lay’s emails to a file. Let’s see if we can analyse that.
I’ve renamed the file email_body.txt to ken_lay_emails.txt.
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
We import our nltk libraries.
with open("ken_lay_emails.txt", "r") as f:
    data = f.read()
We read the file.
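words = word_tokenize(data)
# NLTK names its stop word list in lowercase; build the set once so each lookup is fast
stop_words = set(stopwords.words('english'))
useful_words = [word for word in words if word not in stop_words]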
Next we tokenize it, and remove all the stop words.
Finally, we find the frequency of words:
frequency = nltk.FreqDist(useful_words)
print(frequency.most_common(100))
[(',', 82443), ('.', 59781), ('--', 27631), ('?', 26239), ('>', 22136), (':', 21640),
('@', 20084), ('Enron', 14536), ('I', 13970), ("'s", 10955), ("''", 10811),
(')', 10030), ('(', 9479), ('<', 7817), ('-', 7340), ('The', 7138), (';', 5981),
('company', 5943), ('``', 5721), ('employees', 5234), ('$', 4969), ('energy', 4437),
('To', 4020), ('made', 3994), ('=20', 3921), ('would', 3843), ('California', 3734),
('Lay', 3468), ('consumers', 3441), ('Ken', 3402), ('http', 3213), ('Please', 3200),
('We', 3051), ('&', 3048), ('million', 2866), ('!', 2678), ('stock', 2670), ('Mr.', 2621), ('pay', 2491), ('...', 2444)
, ('funds', 2393), ('retirement', 2329), ('bills', 2298), ('bankruptcy', 2297), ('millions', 2294),
('donate', 2260), ('declared', 2250), ("n't", 2217), ('year', 2201), ('As', 2199), ('New', 2190), ('ENRON', 2182), ('help', 2154),
('information', 2095), ('know', 2086), ('last', 2060), ('time', 2039), ('*', 2036), ('well', 1998), ('If', 1933), ('us', 1911),
('many', 1899), ("'", 1879), ('please', 1869), ('new', 1821), ('provide', 1818), ('This', 1789), ('Subject', 1783), ('2000', 1763), ('like', 1749),
('business', 1706), ('meeting', 1701), ('%', 1678), ('Houston', 1658), ('York', 1558), ('2001', 1544), ('keep', 1521), ('=', 1521), ('ECT', 1519),
("'m", 1517), ('also', 1516), ('efforts', 1504), ('And', 1488), ('In', 1475), ('one', 1458), ('enron.com', 1449), ('money', 1426), ('state', 1423),
('set', 1422), ('Sincerely', 1398), ('call', 1385), ('100', 1373), ('October', 1363), ('result', 1362), ('largest', 1354), ('may', 1353),
('plans', 1352), ('Times', 1351), ('A', 1297), ('Communications', 1294)]
Some words like communication and energy are expected, as that’s what Enron did.
But I kept seeing million, millions and interestingly, bankruptcy.
The word California was mentioned 3734 times. The company had some (bad, I think) history there.
Finally, the word meeting was mentioned about 1,700 times. K Lay sure was a busy man.
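You will have noticed that the top of the list is mostly punctuation, and that Enron and ENRON are counted separately. One possible clean-up, sketched below, is to keep only alphabetic tokens and lowercase them before counting. This is not part of the script above: the token list and the three-word stop word set are made up so the example runs without NLTK, and Counter stands in for nltk.FreqDist.

```python
from collections import Counter

# A hand-made token list imitating what word_tokenize() produced above.
tokens = ["Enron", ",", "enron", "The", "company", ".", "=20",
          "company", "ENRON", "the"]

# Hypothetical, tiny stop word list; the real script uses NLTK's.
stop_words = {"the", "a", "of"}

# isalpha() drops punctuation and leftovers such as "=20";
# lower() merges Enron, enron and ENRON into one entry.
cleaned = [
    token.lower()
    for token in tokens
    if token.isalpha() and token.lower() not in stop_words
]

frequency = Counter(cleaned)
print(frequency.most_common(2))
```

The trade-off is that lowercasing loses the difference between, say, a name and an ordinary word, so whether this is an improvement depends on what you want to learn from the emails.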
And that’s it. This was a quick intro to the Enron corpus.
While doing this analysis, I was struck by how long it took to analyse the whole thing. Combining all the words in all the emails took a whole day for me (of course, the final file was almost a gigabyte in size).
This led me to believe I should introduce lessons on multi-processing earlier than I was planning. We’ll come back to the Enron email corpus again in the future.