Image: ESA - C.Carreau (SEMPDN9OY2F)

NETS 212: Scalable and Cloud Computing (Fall 2021)

What is the "cloud"? How do we build software systems and components that scale to millions of users and petabytes of data, and are "always available"?

In the modern Internet, virtually all large Web services run atop multiple geographically distributed data centers: Google, Yahoo, Facebook, iTunes, Amazon, eBay, Bing, etc. Services must scale across thousands of machines, tolerate faults, and support thousands of concurrent requests. Increasingly, the major providers (including Amazon, Google, Microsoft, HP, and IBM) are looking at "hosting" third-party applications in their data centers - forming so-called "cloud computing" services. A significant number of these services also process "streaming" data: geocoding information from cell phones, tweets, streaming video, etc.

This course, aimed at a sophomore with exposure to basic programming within the context of a single machine, focuses on the issues and programming models related to such cloud and distributed data processing technologies: data partitioning, storage schemes, stream processing, and "mostly shared-nothing" parallel algorithms.

NETS212 is a required course for the NETS program and a core requirement for the Data Science Minor. It also counts as a project elective for CSCI and ASCS, and as an Information Systems Elective for SSE.

Instructor

Andreas Haeberlen
Office hours: Mondays 1:00-2:00pm (Zoom)

Teaching assistants

Alex Rand alexrand@seas.upenn.edu OH: Mondays 2:00-3:00pm (5th floor GRW bump space)
Pranav Aurora pranava@seas.upenn.edu OH: Mondays 3:15-4:15pm (5th floor GRW bump space)
Selene Li seleneli@sas.upenn.edu OH: Mondays 5:15-6:15pm (5th floor GRW bump space)
Kevin Chen kevc528@seas.upenn.edu OH: Tuesdays noon-1:00pm (5th floor GRW bump space)
Matthew Jortberg jortberg@seas.upenn.edu OH: Tuesdays 1:45-2:45pm (5th floor GRW bump space)
Maxwell Du maxdu@seas.upenn.edu OH: Wednesdays noon-1:00pm (5th floor GRW bump space)
Divya Somayajula divyas22@seas.upenn.edu OH: Wednesdays 3:30-4:30pm (5th floor GRW bump space)
Silvi Kabra skabra@seas.upenn.edu OH: Thursdays 8:30-9:30am (5th floor GRW bump space)
Jonathan Cheng joncheng@seas.upenn.edu OH: Thursdays 1:45-2:45pm (5th floor GRW bump space)
Jerry Wu jerryzwu@seas.upenn.edu OH: Fridays 11:40am-12:40pm (5th floor GRW bump space)
Philip Chea ph163k8@seas.upenn.edu OH: Fridays 1:45-2:45pm (Virtual)
Charles Herrmann crh23@seas.upenn.edu OH: Fridays 3:30-4:30pm (5th floor GRW bump space)

Format

The format will be two 1.5-hour lectures per week, plus assigned readings. There will be regular homework assignments, two midterms, and a term project. We will use Piazza for course-related discussions, and there will be occasional lab sessions.

COVID-19 info

As of July 2021, the plan is to hold the class in person. The University could change this, however, depending on COVID-19 trends and positivity rates on campus and within the surrounding communities. Please keep in mind that Penn currently requires everyone to wear masks while indoors; this includes the lectures, lab sessions, and all office hours.

Time and location

Tuesdays/Thursdays 10:15-11:45am (DRLB A1)

Prerequisites

CIS 120, Introduction to Programming
CIS 160, Discrete Mathematics
Co-requisite: CIS 121, Data Structures*
(* In Fall 2021, NETS212 and CIS121 are in the same time slot. We will work around this; it is okay to take NETS212 without CIS121.)

Textbooks

Spark: The Definitive Guide, by Bill Chambers and Matei Zaharia (O'Reilly)
ISBN 9781491912218; read online for free, or buy for approx. $54.

Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer (Morgan & Claypool)
ISBN 978-1608453429; read online for free, or buy for approx. $40.

Additional materials will be provided as handouts or in the form of light technical papers.

Grading

Homework 30%, Term project 30%, Exams 35%, Participation/quizzes 5%

Policies

You are encouraged to discuss your homework assignments with your classmates; however, any code you submit must be your own work. You may not share code with others or copy code from outside sources, except where the assignment specifically allows it. Plagiarism can have serious consequences.

Recordings and other materials

At this time, we are not planning to record the lectures. All course materials (slides, handouts, framework code, etc.) are solely for your personal, educational use and may not be shared, copied, or redistributed without permission of the instructor. You are not allowed to make your own recordings of class sessions. Unauthorized sharing or recording is a violation of the Code of Academic Integrity.

Project and awards

The final team project is to build a small Facebook-like application using Node.js and Amazon's DynamoDB. Based on network analysis, the application should make friend recommendations; it should also visualize the social network. In previous years, Facebook and Citadel Securities have sponsored awards for the best term project. You can learn more about the winners from previous years in the Hall of Fame.

Assignments

Homework assignments will be available for download; you can submit your solution here. If necessary, you can request an extension.

Tentative schedule

Date Topic Details Reading Remarks
Aug 31 Introduction [Slides] Course introduction
"What is the Cloud, and why is it interesting?"
Data-centric computing
Course goals
Logistics
Policies
Overview of topics
Sep 2 Classes canceled due to Hurricane Ida
Sep 7 The Cloud [Slides] What is the Cloud? [Video] [Quiz]
Cloud hardware [Video] [Quiz]
Problems with classical scaling [Video] [Quiz]
Utility computing [Video] [Quiz]
Kinds of clouds [Video] [Quiz]
Virtualization [Video] [Quiz]
Cloud challenges [Video] [Quiz]
Armbrust: A view of cloud computing
Sep 9 Concurrency [Slides] Scalability and parallelization; Amdahl's law [Video] [Quiz]
Synchronization/concurrency/consistency [Video] [Quiz]
Mutual exclusion and locking [Video] [Quiz]
NUMA, shared-nothing [Video] [Quiz]
Frontend/backend, sharding [Video] [Quiz]
Vogels: Eventually consistent
Sep 14 The Internet [Slides] The Internet; packet switching [Video] [Quiz]
Path properties; TCP [Video] [Quiz]
HW1 overview [Video] [Quiz]
MDN: A re-introduction to JavaScript HW0 due; HW1 released
Sep 14 Last day to add
Sep 16 Faults and Failures [Slides] Fault models [Video] [Quiz]
Examples of non-crash faults [Video] [Quiz]
Replication; durability and availability [Video] [Quiz]
Primary-backup replication [Video] [Quiz]
Quorum replication [Video] [Quiz]
Network partitions; CAP theorem [Video] [Quiz]
Tseitlin: The antifragile organization
Sep 21 Cloud basics [Slides] History of cloud computing [Video] [Quiz]
Interacting with the cloud [Video] [Quiz]
EC2 basics [Video] [Quiz]
EBS basics [Video] [Quiz]
Overview of some other AWS services [Video] [Quiz]
"Cloud computing features, issues, and challenges: a big picture" HW1MS1 due
Sep 23 Cloud storage [Slides] Key-value stores [Video] [Quiz]
KVS and concurrency [Video] [Quiz]
KVS and the Cloud [Video] [Quiz]
Case study: S3 [Video] [Quiz]
Case study: DynamoDB [Video] [Quiz]
Cooper et al.: PNUTS to Sherpa - Lessons from Yahoo!'s Cloud Database
Sep 28 Spark [Slides] Introduction to programming for big data and Spark [Video] [Quiz]
An example big data problem [Video] [Quiz]
Parallelizable operations in Java [Video] [Quiz]
Programming in Spark [Video] [Quiz]
Key-value pair RDDs in Spark [Video] [Quiz]
"Spark textbook, Chapters 2 and 3" HW1MS2 due; HW2 released
Sep 30 Programming in Spark [Slides] Overview of programming in Spark [Video] [Quiz]
Spark jobs [Video] [Quiz]
A simple Spark job: processing CSV data [Video] [Quiz]
Spark jobs with multiple stages [Video] [Quiz]
Distributed Spark jobs [Video] [Quiz]
Distributed programming considerations [Video] [Quiz]
"Spark textbook, Chapters 4-8"
Oct 5 Understanding Spark [Slides] Overview and midterm reminder [Video] [Quiz]
Origins of Spark [Video] [Quiz]
Cluster storage for Spark and other big data engines [Video] [Quiz]
Using HDFS [Video] [Quiz]
The Spark platform [Video]
Higher-level Spark [Video] [Quiz]
Zaharia et al.: Cluster Computing with Working Sets HW2MS1 due
Oct 7 First midterm exam
Oct 11 Last day to drop
Oct 12 Graph algorithms [Slides] Distributed graph algorithms [Video]
Distributed graphs [Video] [Quiz]
Graph algorithms in Spark [Video] [Quiz]
Single-source shortest path [Video] [Quiz]
K-Means clustering [Video] [Quiz]
Naive Bayes learning [Video] [Quiz]
"Lin & Dyer, Chapter 5" HW2MS2 due
Oct 14 No class (Fall break)
Oct 19 Random-walk algorithms [Slides] Random-surfer model [Video] [Quiz]
Naive PageRank [Video] [Quiz]
Full PageRank [Video] [Quiz]
Adsorption / label propagation [Video] [Quiz]
Baluja et al.: Video Suggestion and Discovery for YouTube HW3 released
Oct 21 Iterative processing [Slides] Iterative processing [Video]
Bulk synchronous parallelism [Video] [Quiz]
Pregel and graph processing [Video] [Quiz]
Overview of deep neural nets [Video] [Quiz]
MXNet [Video] [Quiz]
Malewicz et al.: 'Pregel - A System for Large-Scale Graph Processing'
Oct 26 Web programming [Slides] Web overview [Video] [Quiz]
HTML and CSS [Video] [Quiz]
Client/server model [Video] [Quiz]
The Domain Name System [Video] [Quiz]
HTTP and HTTPS [Video] [Quiz]
Server design [Video] [Quiz]
"Cloudflare: HTTP/3: The past, the present, and the future" Project handout released
Oct 28 Node.js [Slides] Motivation: CGI and servlets [Video] [Quiz]
Node.js; basic operation [Video] [Quiz]
Hello world with Node [Video] [Quiz]
Accessing data [Video] [Quiz]
Cookies and sessions [Video] [Quiz]
"Node at LinkedIn: the pursuit of thinner, lighter, faster" HW3 due; HW4 released
Oct 29 Last day to designate course as pass/fail
Nov 2 Dynamic content [Slides] Project overview [Video] [Quiz]
Project advice [Video] [Quiz]
The Document Object Model [Video] [Quiz]
XMLHttpRequest [Video] [Quiz]
React: Facebook's Functional Turn on Writing JavaScript Team formation deadline
Nov 4 AJAX [Slides] AJAX overview [Video] [Quiz]
AJAX with jQuery [Video] [Quiz]
socket.io and async [Video] [Quiz]
Working with APIs [Video] [Quiz]
HW4MS1 due (on Nov 5)
Nov 8 Last day to withdraw
Nov 9 Web services and XML [Slides] Web services [Video] [Quiz]
Data interchange; challenges [Video] [Quiz]
Data formats [Video] [Quiz]
XML [Video] [Quiz]
Working with XML [Video] [Quiz]
DTDs [Video] [Quiz]
XML Schema [Video] [Quiz]
XML DOM [Video] [Quiz]
First project check-in
Nov 11 Security [Slides] Cryptography; RSA [Video] [Quiz]
Digital signatures [Video] [Quiz]
Attacks and Defenses (Part 1) [Video] [Quiz]
Attacks and Defenses (Part 2) [Video] [Quiz]
Current OWASP Top 10 HW4MS2 due
Nov 16 Databases [Slides] Motivations for databases and data management [Video] [Quiz]
Relational model, data streams [Video] [Quiz]
SQL basics; declarative approach; query optimization [Video] [Quiz]
Transactions; ACID [Video] [Quiz]
F1: A Distributed SQL Database That Scales HW4MS3 due; second project check-in
Nov 18 Peer-to-peer [Slides] Decentralization [Video] [Quiz]
Partly centralized systems; BitTorrent [Video] [Quiz]
Unstructured overlays; epidemic protocols [Video] [Quiz]
Structured overlays; consistent hashing; KBR [Video] [Quiz]
Case study: Pastry [Video] [Quiz]
Security challenges [Video] [Quiz]
Rodrigues and Druschel: P2P systems
Nov 23 Special topics Accountability
Secure provenance and forensics
A Case for the Accountable Cloud Third project check-in
Nov 25 Thanksgiving - no class
Nov 30 Case study: Bitcoin Distributed ledgers
Bitcoin and Proof-of-Work
Bitcoin Script
Challenges in Bitcoin
Nakamoto: Bitcoin Fourth project check-in
Dec 2 Case study: Facebook Facebook's TAO
Scalability in TAO
Fault handling in TAO
Facebook's Haystack
Haystack design
Bronson et al.: TAO: Facebook's Distributed Data Store for the Social Graph
Dec 7 Special topics Differential privacy
Federated analytics
Differential Privacy: The Pursuit of Protections by Default
Dec 9 Second midterm exam
Dec 15 - Dec 22 Project demos; written reports due