Image: ESA - C.Carreau (SEMPDN9OY2F)

NETS 212: Scalable and Cloud Computing (Fall 2020)

What is the "cloud"? How do we build software systems and components that scale to millions of users and petabytes of data, and are "always available"?

In the modern Internet, virtually all large Web services run atop multiple geographically distributed data centers: Google, Yahoo, Facebook, iTunes, Amazon, eBay, Bing, etc. Services must scale across thousands of machines, tolerate faults, and support thousands of concurrent requests. Increasingly, the major providers (including Amazon, Google, Microsoft, HP, and IBM) are looking at "hosting" third-party applications in their data centers - forming so-called "cloud computing" services. A significant number of these services also process "streaming" data: geocoding information from cell phones, tweets, streaming video, etc.

This course, aimed at a sophomore with exposure to basic programming within the context of a single machine, focuses on the issues and programming models related to such cloud and distributed data processing technologies: data partitioning, storage schemes, stream processing, and "mostly shared-nothing" parallel algorithms.

NETS212 is a required course for the NETS program and for the Data Science Minor.

Instructors

Andreas Haeberlen
Office hour: Wednesdays 10:00-11:00am (on Zoom)

Zachary G. Ives
Office hour: Mondays 2:00-3:00pm (via gather.town)

Teaching assistants

Chaim Fishman chaimj@sas.upenn.edu OH: Sundays 10:30-11:30am EST
Sarah Payne paynesa@sas.upenn.edu OH: Mondays 3:00-4:00pm EST
Peter Chen cbaile@seas.upenn.edu OH: Mondays 5:00-6:00pm EST
Joan Shaho jshaho@seas.upenn.edu OH: Tuesdays noon-1:00pm EST
Stefan Papazov spapazov@seas.upenn.edu OH: Tuesdays 6:00-7:00pm EST
Vatsal Jain vatsal99@seas.upenn.edu OH: Wednesdays 1:00-2:00am EST
Alexander Go alexdgo@seas.upenn.edu OH: Wednesdays 3:00-4:00am EST
Anthony Mansur amansur@seas.upenn.edu OH: Wednesdays 4:00-5:00pm EST
Tashweena Heeramun htash@seas.upenn.edu OH: Thursdays 8:00-9:00am EST
Bharath Jaladi bjaladi@seas.upenn.edu OH: Thursdays 11:00am-noon EST
Jamie Wang jamwa@wharton.upenn.edu OH: Thursdays 3:30-4:30pm EST
Lydia Ma malydia@wharton.upenn.edu OH: Fridays 11:00am-noon EST
Vraj Shroff vshroff@sas.upenn.edu OH: Fridays 10:00-11:00am EST

We will be using ohq.io for the TA office hours.

Format

The Fall 2020 version of this class will be entirely online, due to COVID-19. We will make prerecorded lectures available for download, and we will use the class slots for discussion, review, and Q&A. The review sessions will be recorded as well. There will be regular homework assignments, two midterms (online, via GradeScope), and a final team project. We will use Piazza for course-related discussions, and there will be occasional lab sessions.

Time and location

Q&A: Tuesdays 1:30pm EDT (Zoom link)

Prerequisites

CIS 120, Introduction to Programming
CIS 160, Discrete Mathematics
Co-requisite: CIS 121, Data Structures

Textbooks

Spark: The Definitive Guide, by Bill Chambers and Matei Zaharia (O'Reilly)
ISBN 9781491912218; read online for free, or buy for approx. $54.

Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer (Morgan & Claypool)
ISBN 978-1608453429; read online for free, or buy for approx. $40.

Additional materials will be provided as handouts or in the form of light technical papers.

Grading

Homework 35%, Term project 35%, Exams 20%, Participation/quizzes 10%

Policies

You are encouraged to discuss your homework assignments with your classmates; however, any code you submit must be your own work. You may not share code with others or copy code from outside sources, except where the assignment specifically allows it. Plagiarism can have serious consequences.

Recordings and other materials

We will make the recordings from the lectures, Q&A sessions, and labs available on this web page for the duration of this course. These recordings, as well as the other course materials (slides, handouts, framework code) are solely for your personal, educational use and may not be shared, copied, or redistributed without permission of the instructors. You are not allowed to record class sessions yourself. Unauthorized sharing or recording is a violation of the Code of Academic Integrity.

Project and awards

The final team project is to build a small Facebook-like application using Node.js and Amazon's DynamoDB. Based on network analysis, the application should make friend recommendations; it should also visualize the social network. In previous years, Facebook sponsored an award for the best term project. You can learn more about the winners from previous years in the Hall of Fame.
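As a rough illustration of what "friend recommendations based on network analysis" could look like, the sketch below ranks non-friends by the number of mutual friends they share with a user. This is only one simple heuristic, chosen here as an assumption; the project handout may prescribe a different algorithm. The function name `recommend_friends`, the in-memory `graph` dictionary, and the sample users are all hypothetical, and a real application would read the friendship data from its database rather than from a literal.

```python
from collections import Counter


def recommend_friends(graph, user, k=3):
    """Return up to k non-friends of `user`, ranked by mutual-friend count.

    `graph` is a hypothetical in-memory adjacency map: name -> set of friends.
    """
    friends = graph.get(user, set())
    counts = Counter()
    for friend in friends:
        for candidate in graph.get(friend, set()):
            # Skip the user and anyone who is already a friend.
            if candidate != user and candidate not in friends:
                counts[candidate] += 1
    # Highest mutual-friend count first; ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Made-up sample network, for illustration only.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol", "dave"},
    "carol": {"alice", "bob", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}

print(recommend_friends(graph, "alice"))  # ['dave', 'erin']
```

In this sample, dave shares two friends with alice (bob and carol) and erin shares one (carol), so dave is ranked first.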

Assignments

Homework assignments will be available for download; you can submit your solution here. If necessary, you can request an extension.

Tentative schedule

Date Topic Details Reading Remarks
Sep 1 Introduction [Q&A] Course introduction [Video] [Slides]
What is the Cloud, and why is it interesting? [Video] [Slides]
Data-centric computing [Video] [Slides]
Course goals [Video] [Slides]
Logistics [Video] [Slides]
Policies [Video] [Slides]
Overview of topics [Video] [Slides]
Sep 3 The Cloud What is the Cloud? [Video] [Slides] [Quiz]
Cloud hardware [Video] [Slides] [Quiz]
Problems with classical scaling [Video] [Slides] [Quiz]
Utility computing [Video] [Slides] [Quiz]
Kinds of clouds [Video] [Slides] [Quiz]
Virtualization [Video] [Slides] [Quiz]
Cloud challenges [Video] [Slides] [Quiz]
Armbrust: A view of cloud computing HW0 released
Sep 8 Concurrency [Q&A] Scalability and parallelization; Amdahl's law [Video] [Slides] [Quiz]
Synchronization/concurrency/consistency [Video] [Slides] [Quiz]
Mutual exclusion and locking [Video] [Slides] [Quiz]
NUMA, shared-nothing [Video] [Slides] [Quiz]
Frontend/backend, sharding [Video] [Slides] [Quiz]
Vogels: Eventually consistent
Sep 10 The Internet The Internet; packet switching [Video] [Slides] [Quiz]
Path properties; TCP [Video] [Slides] [Quiz]
HW1 overview [Video] [Slides] [Quiz]
MDN: A re-introduction to JavaScript HW0 due; HW1 released
Sep 15 Faults and Failures [Q&A] Fault models [Video] [Slides] [Quiz]
Examples of non-crash faults [Video] [Slides] [Quiz]
Replication; durability and availability [Video] [Slides] [Quiz]
Primary-backup replication [Video] [Slides] [Quiz]
Quorum replication [Video] [Slides] [Quiz]
Network partitions; CAP theorem [Video] [Slides] [Quiz]
Tseitlin: The antifragile organization
Sep 15 Last day to add
Sep 17 Cloud basics History of cloud computing [Video] [Slides] [Quiz]
Interacting with the cloud [Video] [Slides] [Quiz]
EC2 basics [Video] [Slides] [Quiz]
EBS basics [Video] [Slides] [Quiz]
Overview of some other AWS services [Video] [Slides] [Quiz]
Cloud computing features, issues, and challenges: a big picture HW1MS1 due
Sep 22 Cloud storage [Q&A] Key-value stores [Video] [Slides] [Quiz]
KVS and concurrency [Video] [Slides] [Quiz]
KVS and the Cloud [Video] [Slides] [Quiz]
Case study: S3 [Video] [Slides] [Quiz]
Case study: DynamoDB [Video] [Slides] [Quiz]
Cooper et al.: PNUTS to Sherpa - Lessons from Yahoo!'s Cloud Database
Sep 24 Spark Introduction to programming for big data and Spark [Video] [Slides] [Quiz]
An example big data problem [Video] [Slides] [Quiz]
Parallelizable operations in Java [Video] [Slides] [Quiz]
Programming in Spark [Video] [Slides] [Quiz]
Key-value pair RDDs in Spark [Video] [Slides] [Quiz]
HW1MS2 due; HW2 released
Sep 29 Programming in Spark [Q&A] Overview of programming in Spark [Video] [Slides]
Spark jobs [Video] [Slides] [Quiz]
A simple Spark job: processing CSV data [Video] [Slides] [Quiz]
Spark jobs with multiple stages [Video] [Slides] [Quiz]
Distributed Spark jobs [Video] [Slides] [Quiz]
Distributed programming considerations [Video] [Slides] [Quiz]
Oct 1 Understanding Spark Overview and midterm reminder [Video] [Slides]
Origins of Spark [Video] [Slides] [Quiz]
Cluster storage for Spark and other big data engines [Video] [Slides] [Quiz]
Using HDFS [Video] [Slides] [Quiz]
The Spark platform [Video] [Slides] [Quiz]
Higher-level Spark [Video] [Slides] [Quiz]
Zaharia et al.: Cluster Computing with Working Sets
Oct 6 Graph algorithms [Q&A] Distributed graph algorithms [Video] [Slides]
Distributed graphs [Video] [Slides] [Quiz]
Graph algorithms in Spark [Video] [Slides] [Quiz]
Single-source shortest path [Video] [Slides] [Quiz]
K-Means clustering [Video] [Slides] [Quiz]
Naive Bayes learning [Video] [Slides] [Quiz]
Lin & Dyer, Chapter 5 HW2MS1 due
Oct 8 First midterm exam
Oct 12 Last day to drop
Oct 13 Random-walk algorithms Random-surfer model [Video] [Slides] [Quiz]
Naive PageRank [Video] [Slides] [Quiz]
Full PageRank [Video] [Slides] [Quiz]
Adsorption / label propagation [Video] [Slides] [Quiz]
HW2MS2 due; HW3 released
Oct 15 Class canceled
Oct 20 Iterative processing Iterative processing [Video] [Slides]
Bulk synchronous parallelism [Video] [Slides] [Quiz]
Pregel and graph processing [Video] [Slides] [Quiz]
Overview of deep neural nets [Video] [Slides] [Quiz]
MXNet [Video] [Slides] [Quiz]
Oct 22 Web programming Web overview [Video] [Slides] [Quiz]
HTML and CSS [Video] [Slides] [Quiz]
Client/server model [Video] [Slides] [Quiz]
The Domain Name System [Video] [Slides] [Quiz]
HTTP and HTTPS [Video] [Slides] [Quiz]
Server design [Video] [Slides] [Quiz]
Cloudflare: HTTP/3: The past, the present, and the future HW3 due
Oct 27 Node.js Motivation: CGI and servlets [Video] [Slides] [Quiz]
Node.js; basic operation [Video] [Slides] [Quiz]
"Hello world" with Node [Video] [Slides] [Quiz]
Accessing data [Video] [Slides] [Quiz]
Cookies and sessions [Video] [Slides] [Quiz]
Node at LinkedIn: the pursuit of thinner, lighter, faster HW4 released
Oct 29 Dynamic content Project overview [Video] [Slides] [Quiz]
Project advice [Video] [Slides] [Quiz]
The Document Object Model [Video] [Slides] [Quiz]
XMLHttpRequest [Video] [Slides] [Quiz]
React: Facebook's Functional Turn on Writing JavaScript Project handout released
Oct 30 Last day to designate course as pass/fail
Nov 2 Team formation deadline; project begins; HW4MS1 due
Nov 3 Class canceled (Election Day)
Nov 5 AJAX AJAX overview [Video] [Slides] [Quiz]
AJAX with jQuery [Video] [Slides] [Quiz]
socket.io and async [Video] [Slides] [Quiz]
Working with APIs [Video] [Slides] [Quiz]
Nov 9 Last day to withdraw
Nov 10 Web services Web services [Video] [Slides] [Quiz]
Data interchange; challenges [Video] [Slides] [Quiz]
Data formats [Video] [Slides] [Quiz]
Research spotlight: Juneau - Managing & Guiding Data Analytics & Data Science [Video] [Slides]
HW4MS2 due
Nov 10 First project check-in
Nov 12 XML XML [Video] [Slides] [Quiz]
Working with XML [Video] [Slides] [Quiz]
DTDs [Video] [Slides] [Quiz]
XML Schema [Video] [Slides] [Quiz]
XML DOM [Video] [Slides] [Quiz]
Nov 17 Security Cryptography; RSA [Video] [Slides] [Quiz]
Digital signatures [Video] [Slides] [Quiz]
Attacks and Defenses (Part 1) [Video] [Slides] [Quiz]
Attacks and Defenses (Part 2) [Video] [Slides] [Quiz]
Current OWASP Top 10 HW4MS3 due
Nov 17 Second project check-in
Nov 19 Databases Motivations for databases and data management [Video] [Slides] [Quiz]
Relational model, data streams [Video] [Slides] [Quiz]
SQL basics; declarative approach; query optimization [Video] [Slides] [Quiz]
Transactions; ACID [Video] [Slides] [Quiz]
F1: A Distributed SQL Database That Scales
Nov 24 Peer-to-peer Decentralization [Video] [Slides] [Quiz]
Partly centralized systems; BitTorrent [Video] [Slides] [Quiz]
Unstructured overlays; epidemic protocols [Video] [Slides] [Quiz]
Structured overlays; consistent hashing; KBR [Video] [Slides] [Quiz]
Case study: Pastry [Video] [Slides] [Quiz]
Security challenges [Video] [Slides] [Quiz]
Rodrigues and Druschel: P2P systems
Nov 24 Third project check-in
Nov 26 Thanksgiving - no class
Dec 1 Case study: Bitcoin Distributed ledgers [Video] [Slides] [Quiz]
Bitcoin and Proof-of-Work [Video] [Slides] [Quiz]
Bitcoin Script [Video] [Slides] [Quiz]
Challenges in Bitcoin [Video] [Slides] [Quiz]
Nakamoto: Bitcoin
Dec 1 Fourth project check-in
Dec 3 Case study: Facebook Facebook's TAO [Video] [Slides] [Quiz]
Scalability in TAO [Video] [Slides] [Quiz]
Fault handling in TAO [Video] [Slides] [Quiz]
Facebook's Haystack [Video] [Slides] [Quiz]
Haystack design [Video] [Slides] [Quiz]
Bronson et al.: TAO: Facebook's Distributed Data Store for the Social Graph
Dec 8 Second midterm exam
Dec 10 Monday schedule - no class
Dec 15-22 Project demos (via Zoom); written reports due