Week 4

In this lesson, you will learn how to implement batch (i.e., not real-time) processing using typical batch-processing workflow techniques. You will also gain an understanding of how frameworks such as Hadoop, Spark, and TensorFlow parallelize certain computational tasks.

Objectives

After completing this week, you should be able to:

  • Create a scalable, end-to-end machine learning pipeline using Spark

Readings

  • Read chapter 10 in Designing Data-Intensive Applications
  • Read chapter 3 in Deep Learning with Python

Weekly Resources

Assignment 4

In this assignment, you will create a workflow to process emails from Enron that were made available by the Federal Energy Regulatory Commission during its investigation of the company. The original data is not in a machine-friendly format, so we will use Python's built-in email package to read the emails and create a machine-friendly dataset.
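As a rough illustration of the email package, the following sketch parses a single raw message (the path is the sample file shown later on this page; error handling is omitted):

import email

# A minimal sketch: read one raw Enron message with the standard-library
# email package. The path is the sample message referenced below.
with open('data/external/enron/zipper-a/inbox/114.', errors='ignore') as f:
    msg = email.message_from_file(f)

print(msg['Subject'])                 # access a header by name
print(msg['From'], '->', msg['To'])
body = msg.get_payload()              # the unparsed message body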

The data/external/enron folder contains a partial copy of the original Enron email dataset (you can download the full dataset here). Each folder represents a single user's email account. Each of those folders contains the user's top-level mail folders, which in turn contain the individual emails. The following is the directory structure of a single user folder; a short sketch for enumerating these files appears after the tree.

enron/zipper-a
├── all_documents
│   ├── 1.
│   ├── 10.
│   ├── 11.
│   ├── 12.
│   ├── 13.
│   ├── 14.
.
.
.
│   ├── 8.
│   └── 9.
└── tss
    ├── 1.
    ├── 10.
    ├── 11.
    ├── 12.
        .
        .
        .
    ├── 4.
    ├── 5.
    ├── 6.
    ├── 7.
    ├── 8.
    └── 9.
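
As referenced above, one way to enumerate these message files is to walk the extracted directory tree with pathlib. This is a minimal sketch, assuming the dataset has been extracted to data/external/enron:

from pathlib import Path

# Walk enron/<username>/<top-level folder>/<message file> and collect
# (username, relative id, raw text) tuples.
root = Path('data/external/enron')
records = []
for msg_path in sorted(root.glob('*/*/*')):
    if not msg_path.is_file():
        continue
    rel = msg_path.relative_to(root)
    records.append((rel.parts[0], str(rel), msg_path.read_text(errors='ignore')))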

The example at /enron/zipper-a/inbox/114. demonstrates the email structure. The file starts with standard email headers and then includes a plain-text message body. This is typical of most of the emails, though some message bodies are encoded as HTML.

Message-ID: <6742786.1075845426893.JavaMail.evans@thyme>
Date: Thu, 7 Jun 2001 11:05:33 -0700 (PDT)
From: jeffrey.hammad@enron.com
To: andy.zipper@enron.com
Subject: Thanks for the interview
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Hammad, Jeffrey </O=ENRON/OU=NA/CN=RECIPIENTS/CN=NOTESADDR/CN=CBBE377A-24F58854-862567DD-591AE7>
X-To: Zipper, Andy </O=ENRON/OU=NA/CN=RECIPIENTS/CN=AZIPPER>
X-cc: 
X-bcc: 
X-Folder: \Zipper, Andy\Zipper, Andy\Inbox
X-Origin: ZIPPER-A
X-FileName: Zipper, Andy.pst

Andy,

Thanks for giving me the opportunity to meet with you about the Analyst/ Associate program.  I enjoyed talking to you, and look forward to contributing to the success that the program has enjoyed.  

Thanks and Best Regards,

Jeff Hammad

For this assignment, you will be parsing each one of those emails into a structured format. The following are the fields that should appear in the structured output.

Field                       Description
-------------------------   -----------------------------------------------
username                    The username of this mailbox (the name of email folder)
original_msg                The original, unparsed email message
payload                     The unparsed payload of the email
Message-ID                  'Message-ID' from email header
Date                        Parsed datetime from email header
From                        'From' from email header
To                          'To' from email header
Subject                     'Subject' from email header
Mime-Version                'Mime-Version' from email header
Content-Type                'Content-Type' from email header
Content-Transfer-Encoding   'Content-Transfer-Encoding' from email header
X-From                      'X-From' from email header
X-To                        'X-To' from email header
X-cc                        'X-cc' from email header
X-bcc                       'X-bcc' from email header
X-Folder                    'X-Folder' from email header
X-Origin                    'X-Origin' from email header
X-FileName                  'X-FileName' from email header
Cc                          'Cc' from email header
Bcc                         'Bcc' from email header

The dsc650/assignments/assignment04 folder contains partially completed code and placeholder files for this assignment.

Assignment 4.1

Load the data from the enron.zip dataset into a Spark DataFrame. See the Spark SQL Getting Started Guide for details on how to create a DataFrame. The final DataFrame should have the following schema.

df.printSchema()

root
 |-- id: string (nullable = true)
 |-- username: string (nullable = true)
 |-- original_msg: string (nullable = true)

Assignment 4.2

Implement a function that takes the path to an email file and returns a dictionary containing the fields listed in the previous table. The dsc650/assignments/assignment04/examples folder contains example messages with both plain-text and HTML payloads. It is recommended that you parse these examples first to ensure your read_email function works properly.

The following is an example of parsing a plain email message.

plain_msg_example = """
Message-ID: <6742786.1075845426893.JavaMail.evans@thyme>
Date: Thu, 7 Jun 2001 11:05:33 -0700 (PDT)
From: jeffrey.hammad@enron.com
To: andy.zipper@enron.com
Subject: Thanks for the interview
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Hammad, Jeffrey </O=ENRON/OU=NA/CN=RECIPIENTS/CN=NOTESADDR/CN=CBBE377A-24F58854-862567DD-591AE7>
X-To: Zipper, Andy </O=ENRON/OU=NA/CN=RECIPIENTS/CN=AZIPPER>
X-cc: 
X-bcc: 
X-Folder: \Zipper, Andy\Zipper, Andy\Inbox
X-Origin: ZIPPER-A
X-FileName: Zipper, Andy.pst

Andy,

Thanks for giving me the opportunity to meet with you about the Analyst/ Associate program.  I enjoyed talking to you, and look forward to contributing to the success that the program has enjoyed.  

Thanks and Best Regards,

Jeff Hammad
"""
parsed_msg = parse_email(plain_msg_example)
print(parsed_msg.text)

This yields the following output.

Andy,

Thanks for giving me the opportunity to meet with you about the Analyst/ Associate program. I enjoyed talking to you, and look forward to contributing to the success that the program has enjoyed.

Thanks and Best Regards,

Jeff Hammad
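
A minimal sketch of such a parser using the standard-library email package follows. Note that it returns a plain dictionary, whereas the example above accesses a .text attribute, so the course scaffold presumably wraps the parsed result in a small helper object; treat the field handling here as an assumption:

import email
import email.utils

HEADER_FIELDS = [
    'Message-ID', 'From', 'To', 'Subject', 'Mime-Version', 'Content-Type',
    'Content-Transfer-Encoding', 'X-From', 'X-To', 'X-cc', 'X-bcc',
    'X-Folder', 'X-Origin', 'X-FileName', 'Cc', 'Bcc',
]

def parse_email(original_msg, username=None):
    """Parse a raw email string into a dictionary of the required fields."""
    msg = email.message_from_string(original_msg)
    record = {field: msg.get(field) for field in HEADER_FIELDS}
    record['username'] = username
    record['original_msg'] = original_msg
    record['payload'] = msg.get_payload()
    # Convert the Date header into a Python datetime where possible.
    record['Date'] = email.utils.parsedate_to_datetime(msg['Date']) if msg['Date'] else None
    return record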

Assignment 4.3

Finally, you will put together a feature-extraction workflow using Spark Pipelines. You will use Spark's MLlib to extract words from the email text and then convert those words into numeric features using a count vectorizer.
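
A minimal version of that pipeline might look like the following; the column names and the parsed_df input are assumptions, and vocabSize=3 only mirrors the tiny three-term vocabulary visible in the sample output below:

from pyspark.ml import Pipeline
from pyspark.ml.feature import CountVectorizer, Tokenizer

# Assumes a DataFrame parsed_df with an 'id' column and a 'text' column
# holding each email's parsed body.
tokenizer = Tokenizer(inputCol='text', outputCol='words')
vectorizer = CountVectorizer(inputCol='words', outputCol='features', vocabSize=3)

pipeline = Pipeline(stages=[tokenizer, vectorizer])
result = pipeline.fit(parsed_df).transform(parsed_df)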

result.select('id', 'words', 'features').show()

+--------------------+--------------------+--------------------+
|                  id|               words|            features|
+--------------------+--------------------+--------------------+
|        shively-h/2_|[, what, is, this...|      (3,[0],[16.0])|
|        shively-h/1_|[can, you, please...|(3,[0,1,2],[5.0,1...|
|shively-h/peoples...|[pgl, and, north,...|(3,[0,1,2],[10.0,...|
|shively-h/peoples...|[, pgl, and, nort...|(3,[0,1,2],[51.0,...|
|shively-h/peoples...|[, pgl, and, nort...|(3,[0,1,2],[53.0,...|
|shively-h/peoples...|[pgl, and, north,...|(3,[0,1,2],[10.0,...|
|shively-h/peoples...|[, pgl, and, nort...|(3,[0,1,2],[96.0,...|
|shively-h/peoples...|[pgl, and, north,...|(3,[0,1,2],[13.0,...|
|shively-h/peoples...|[pgl, and, north,...|(3,[0,1,2],[13.0,...|
| shively-h/inbox/60_|[i, just, wanted,...|(3,[0,1,2],[47.0,...|
|shively-h/inbox/134_|[this, is, an, al...|(3,[0,1,2],[6.0,3...|
| shively-h/inbox/70_|[, please, find, ...|(3,[0,1,2],[8.0,7...|
|shively-h/inbox/118_|[frank, ermis, -,...| (3,[0,2],[9.0,1.0])|
| shively-h/inbox/28_|[dear, body, shop...|(3,[0,1,2],[8.0,3...|
|shively-h/inbox/178_|[, as, you, know,...|(3,[0,1,2],[10.0,...|
| shively-h/inbox/73_|[hunter, shively,...|(3,[0,1],[13.0,3.0])|
|  shively-h/inbox/7_|[hunter, --, this...|(3,[0,1,2],[11.0,...|
| shively-h/inbox/31_|[this, is, very, ...|(3,[0,1,2],[27.0,...|
| shively-h/inbox/74_|[, , , -----origi...|(3,[0,1,2],[33.0,...|
| shively-h/inbox/97_|[hunter,, , i, wi...|(3,[0,1,2],[11.0,...|
+--------------------+--------------------+--------------------+
only showing top 20 rows

Submission Instructions

If you are using Jupyter, you can create a zip archive by running the Package Assignments.ipynb notebook.

You can create this archive in Bash (or a similar Unix shell) using the following commands.

cd dsc650/assignments
zip -r assignment04_DoeJane.zip assignment04

Likewise, you can create a zip archive using Windows PowerShell with the following command.

Compress-Archive -Path assignment04 -DestinationPath 'assignment04_DoeJane.zip'

Discussion

You are required to make a minimum of 10 posts each week. As in previous courses, any topic counts toward your discussion total; as long as you are active on more than 2 days per week with 10 posts, you will receive full credit.


Last update: March 12, 2023