Week 9¶
In this lesson, we will learn about real-time data streams, message systems, and transactions in distributed systems.
Objectives¶
After completing this week, you should be able to:
- Implement scalable stream processing in Spark
- Explain different approaches to transactions in distributed systems and the associated trade-offs
Readings¶
- Read chapter 11 in Designing Data-Intensive Applications
- (Optional) Read chapter 8 in Designing Data-Intensive Applications
Weekly Resources¶
- etcd
- Kafka Use Cases
- Kafka Introduction
- The Log: What every software engineer should know about real-time data's unifying abstraction
- Representational State Transfer (REST)
- Spark Structured Streaming
- Zookeeper
Assignment 9¶
In the second part of the exercise, you will create two streaming dataframes using the accelerations and locations folders.
Assignment 9.1¶
Start by creating a simple Spark Streaming application that reads data from the accelerations and locations topics and uses the Kafka sink to save the results to the LastnameFirstname-simple topic.
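The read-then-write flow described above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the course's reference solution: the helper names, the bootstrap server address, and the checkpoint directory are hypothetical, and it assumes the Spark session was started with the spark-sql-kafka connector package available.

```python
def kafka_source_options(bootstrap_servers, topics):
    """Build reader options for Spark's Kafka source (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        # A comma-separated list subscribes to several topics at once.
        "subscribe": ",".join(topics),
        "startingOffsets": "earliest",
    }


def kafka_sink_options(bootstrap_servers, topic, checkpoint_dir):
    """Build writer options for Spark's Kafka sink (hypothetical helper)."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "topic": topic,
        # Streaming queries that write to Kafka need a checkpoint location.
        "checkpointLocation": checkpoint_dir,
    }


def start_simple_query(spark, bootstrap_servers, sink_topic, checkpoint_dir):
    """Read both source topics and forward the records to the sink topic.

    `spark` is an existing SparkSession; all other arguments are assumptions
    about your environment.
    """
    source = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, ["accelerations", "locations"]))
        .load()
    )
    # The Kafka sink expects a `value` column (string or binary); `key` is optional.
    records = source.selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
    )
    return (
        records.writeStream.format("kafka")
        .options(**kafka_sink_options(bootstrap_servers, sink_topic, checkpoint_dir))
        .start()
    )
```

A call might look like start_simple_query(spark, "localhost:9092", "DoeJane-simple", "/tmp/checkpoints/simple"), where the server address and checkpoint path are placeholders for your own setup.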
Assignment 9.2¶
Define a watermark on the locations dataframe using the timestamp column. Set the threshold for the watermark at "30 seconds". Set a window of "15 seconds" and compute the mean speed of each ride defined by the ride_id. Save the results in LastnameFirstname-windowed and set the output mode to update.
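One possible shape for the watermark and window step is sketched below. It assumes the locations dataframe has already been parsed into columns named timestamp (a timestamp type), ride_id, and speed; the speed column name, the function names, and the JSON output layout are assumptions rather than part of the assignment. The small tumbling_window_bounds helper is illustrative only: it shows how a fixed 15-second window, aligned to the Unix epoch as Spark does by default, buckets an event time.

```python
def tumbling_window_bounds(epoch_seconds, width_seconds=15):
    """Return the [start, end) bounds of the epoch-aligned tumbling window
    that contains the given event time (illustrative helper)."""
    start = (epoch_seconds // width_seconds) * width_seconds
    return start, start + width_seconds


def windowed_mean_speed(locations_df):
    """Mean speed per ride in 15-second windows, tolerating 30 seconds of lateness."""
    # Imported here so the helper above can be used without PySpark installed.
    from pyspark.sql import functions as F

    return (
        locations_df
        .withWatermark("timestamp", "30 seconds")
        .groupBy(F.window("timestamp", "15 seconds"), "ride_id")
        .agg(F.mean("speed").alias("mean_speed"))
    )


def start_windowed_query(windowed_df, bootstrap_servers, sink_topic, checkpoint_dir):
    """Write the aggregated rows to Kafka as JSON using update output mode."""
    from pyspark.sql import functions as F

    records = windowed_df.select(
        F.col("ride_id").cast("string").alias("key"),
        F.to_json(F.struct("window", "ride_id", "mean_speed")).alias("value"),
    )
    return (
        records.writeStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("topic", sink_topic)
        .option("checkpointLocation", checkpoint_dir)
        .outputMode("update")
        .start()
    )
```

In update mode, only the windows whose aggregate changed in a micro-batch are written out, so the sink topic may receive several successive values for the same window and ride.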
Assignment 9.3¶
Join the two streams together on the ride_id as an inner join. Save the results in LastnameFirstname-joined.
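A stream-stream inner join on ride_id could be sketched as follows. It assumes both dataframes have already been parsed into columns and that each contains a ride_id column. The prefixing helper and the acc and loc prefixes are hypothetical; they are added so that columns sharing a name in both streams (for example timestamp) stay distinguishable after the join.

```python
def prefixed_names(columns, prefix, keep=("ride_id",)):
    """Prefix every column name except the join key (hypothetical helper)."""
    return [c if c in keep else f"{prefix}_{c}" for c in columns]


def join_on_ride_id(accelerations_df, locations_df):
    """Inner-join two streaming dataframes on ride_id."""
    # Rename non-key columns so the joined result has no duplicate names.
    acc = accelerations_df.toDF(*prefixed_names(accelerations_df.columns, "acc"))
    loc = locations_df.toDF(*prefixed_names(locations_df.columns, "loc"))
    # Passing the key as a string keeps a single ride_id column in the output.
    return acc.join(loc, on="ride_id", how="inner")
```

Without watermarks and a time-range condition, Spark has to keep all past input from both sides as join state, so the state can grow without bound on a long-running query. For a short exercise this may be acceptable, but it is worth keeping in mind.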
Submission Instructions¶
For this assignment, you will submit a zip archive containing the contents of the dsc650/assignments/assignment09/ directory. Use the naming convention of assignment09_LastnameFirstname.zip for the zip archive. You can create this archive in Bash (or a similar Unix shell) using the following commands.
cd dsc650/assignments
zip -r assignment09_DoeJane.zip assignment09
Likewise, you can create a zip archive using Windows PowerShell with the following command.
Compress-Archive -Path assignment09 -DestinationPath 'assignment09_DoeJane.zip'
Discussion Board¶
You are required to make a minimum of 10 posts each week. As in previous courses, any topic counts toward your discussion count. As long as you are active on more than 2 days per week and make at least 10 posts, you will receive full credit. Refer to the optional topic below as a starting place.
Describe how different database systems handle transactions. Pick three or more different systems to compare and contrast.