
NETS 212: Scalable and Cloud Computing (Fall 2021)

What is the "cloud"? How do we build software systems and components that scale to millions of users and petabytes of data, and are "always available"?

In the modern Internet, virtually all large Web services run atop multiple geographically distributed data centers: Google, Yahoo, Facebook, iTunes, Amazon, eBay, Bing, etc. Services must scale across thousands of machines, tolerate faults, and support thousands of concurrent requests. Increasingly, the major providers (including Amazon, Google, Microsoft, HP, and IBM) are looking at "hosting" third-party applications in their data centers - forming so-called "cloud computing" services. A significant number of these services also process "streaming" data: geocoding information from cell phones, tweets, streaming video, etc.

This course, aimed at a sophomore with exposure to basic programming within the context of a single machine, focuses on the issues and programming models related to such cloud and distributed data processing technologies: data partitioning, storage schemes, stream processing, and "mostly shared-nothing" parallel algorithms.

NETS212 is a required course for the NETS program and a core requirement for the Data Science Minor. It also counts as a project elective for CSCI and ASCS, and as an Information Systems Elective for SSE.

Instructor

Andreas Haeberlen
Office hour: Mondays 1:00-2:00pm (Zoom)

Teaching assistants

Alex Rand alexrand@seas.upenn.edu OH: Mondays 2:00-3:00pm (5th floor GRW bump space)
Pranav Aurora pranava@seas.upenn.edu OH: Mondays 3:15-4:15pm (5th floor GRW bump space)
Selene Li seleneli@sas.upenn.edu OH: Mondays 5:15-6:15pm (5th floor GRW bump space)
Kevin Chen kevc528@seas.upenn.edu OH: Tuesdays noon-1:00pm (5th floor GRW bump space)
Matthew Jortberg jortberg@seas.upenn.edu OH: Tuesdays 1:45-2:45pm (5th floor GRW bump space)
Maxwell Du maxdu@seas.upenn.edu OH: Wednesdays noon-1:00pm (5th floor GRW bump space)
Divya Somayajula divyas22@seas.upenn.edu OH: Wednesdays 3:30-4:30pm (5th floor GRW bump space)
Silvi Kabra skabra@seas.upenn.edu OH: Thursdays 8:30-9:30am (5th floor GRW bump space)
Jonathan Cheng joncheng@seas.upenn.edu OH: Thursdays 1:45-2:45pm (5th floor GRW bump space)
Jerry Wu jerryzwu@seas.upenn.edu OH: Fridays 11:40am-12:40pm (5th floor GRW bump space)
Philip Kaw ph163k8@seas.upenn.edu OH: Fridays 1:45-2:45pm (Towne 319)
Charles Herrmann crh23@seas.upenn.edu OH: Fridays 3:30-4:30pm (5th floor GRW bump space)

Format

The format will be two 1.5-hour lectures per week, plus assigned readings. There will be regular homework assignments, two midterms, and a term project. We will use Piazza for course-related discussions, and there will be occasional lab sessions.

COVID-19 info

As of July 2021, the plan is to hold the class in person. The University could change this, however, depending on COVID-19 trends and positivity rates on campus and within the surrounding communities. Please keep in mind that Penn currently requires everyone to wear masks while indoors; this includes the lectures, lab sessions, and all office hours.

Time and location

Tuesdays/Thursdays 10:15-11:45am (DRLB A1)

Prerequisites

CIS 120, Introduction to Programming
CIS 160, Discrete Mathematics
Co-requisite: CIS 121, Data Structures*
(* In Fall 2021, NETS212 and CIS121 are in the same time slot. We will work around this; it is okay to take NETS212 without CIS121.)

Textbooks

Spark: The Definitive Guide, by Bill Chambers and Matei Zaharia (O'Reilly)
ISBN 9781491912218; read online for free, or buy for approx. $54.

Data-Intensive Text Processing with MapReduce, by Jimmy Lin and Chris Dyer (Morgan & Claypool)
ISBN 978-1608453429; read online for free, or buy for approx. $40.

Additional materials will be provided as handouts or in the form of light technical papers.

Grading

Homework 30%, Term project 30%, Exams 35%, Participation/quizzes 5%

Policies

You are encouraged to discuss your homework assignments with your classmates; however, any code you submit must be your own work. You may not share code with others or copy code from outside sources, except where the assignment specifically allows it. Plagiarism can have serious consequences.

Recordings and other materials

At this time, we are not planning to record the lectures. All course materials (slides, handouts, framework code, etc.) are solely for your personal, educational use and may not be shared, copied, or redistributed without permission of the instructor. You are not allowed to make your own recordings of class sessions. Unauthorized sharing or recording is a violation of the Code of Academic Integrity.

Project and awards

The final team project is to build a small Facebook-like application using Node.js and Amazon's DynamoDB. Based on network analysis, the application should make friend recommendations; it should also visualize the social network. In previous years, Facebook and Citadel Securities have sponsored awards for the best term project. You can learn more about the winners from previous years in the Hall of Fame.

Assignments

Homework assignments will be available for download; you can submit your solution here. If necessary, you can request an extension.

Tentative schedule

Date Topic Details Reading Remarks
Aug 31 Introduction [Slides] Course introduction
"What is the Cloud, and why is it interesting?"
Data-centric computing
Course goals
Logistics
Policies
Overview of topics
Sep 2 Classes canceled due to Hurricane Ida
Sep 7 The Cloud [Slides] What is the Cloud? [Video] [Quiz]
Cloud hardware [Video] [Quiz]
Problems with classical scaling [Video] [Quiz]
Utility computing [Video] [Quiz]
Kinds of clouds [Video] [Quiz]
Virtualization [Video] [Quiz]
Cloud challenges [Video] [Quiz]
Armbrust: A view of cloud computing
Sep 9 Concurrency [Slides] Scalability and parallelization; Amdahl's law [Video] [Quiz]
Synchronization/concurrency/consistency [Video] [Quiz]
Mutual exclusion and locking [Video] [Quiz]
NUMA, shared-nothing [Video] [Quiz]
Frontend/backend, sharding [Video] [Quiz]
Vogels: Eventually consistent
Sep 14 The Internet [Slides] The Internet; packet switching [Video] [Quiz]
Path properties; TCP [Video] [Quiz]
HW1 overview [Video] [Quiz]
MDN: A re-introduction to JavaScript HW0 due; HW1 released
Sep 14 Last day to add
Sep 16 Faults and Failures Fault models [Video] [Quiz]
Examples of non-crash faults [Video] [Quiz]
Replication; durability and availability [Video] [Quiz]
Primary-backup replication [Video] [Quiz]
Quorum replication
Network partitions; CAP theorem
Tseitlin: The antifragile organization
Sep 21 Cloud basics History of cloud computing
Interacting with the cloud
EC2 basics
EBS basics
Overview of some other AWS services
"Cloud computing features, issues, and challenges: a big picture" HW1MS1 due
Sep 23 Cloud storage Key-value stores
KVS and concurrency
KVS and the Cloud
Case study: S3
Case study: DynamoDB
Cooper et al.: PNUTS to Sherpa - Lessons from Yahoo!'s Cloud Database
Sep 28 Spark Introduction to programming for big data and Spark
An example big data problem
Parallelizable operations in Java
Programming in Spark
Key-value pair RDDs in Spark
HW1MS2 due; HW2 released
Sep 30 Programming in Spark Overview of programming in Spark
Spark jobs
A simple Spark job: processing CSV data
Spark jobs with multiple stages
Distributed Spark jobs
Distributed programming considerations
Oct 5 Understanding Spark Overview and midterm reminder
Origins of Spark
Cluster storage for Spark and other big data engines
Using HDFS
The Spark platform
Higher-level Spark
Zaharia et al.: Cluster Computing with Working Sets HW2MS1 due
Oct 7 First midterm exam
Oct 11 Last day to drop
Oct 12 Graph algorithms Distributed graph algorithms
Distributed graphs
Graph algorithms in Spark
Single-source shortest path
K-Means clustering
Naive Bayes learning
"Lin & Dyer, Chapter 5" HW2MS2 due
Oct 14 No class (Fall break)
Oct 19 Random-walk algorithms Random-surfer model
Naive PageRank
Full PageRank
Adsorption / label propagation
HW3 released
Oct 21 Iterative processing Iterative processing
Bulk synchronous parallelism
Pregel and graph processing
Overview of deep neural nets
MXNet
Oct 26 Web programming Web overview
HTML and CSS
Client/server model
The Domain Name System
HTTP and HTTPS
Server design
"Cloudflare: HTTP/3: The past, the present, and the future" Project handout released
Oct 28 Node.js Motivation: CGI and servlets
Node.js; basic operation
Hello world with Node
Accessing data
Cookies and sessions
"Node at LinkedIn: the pursuit of thinner, lighter, faster" HW3 due; HW4 released
Oct 29 Last day to designate course as pass/fail
Nov 2 Dynamic content Project overview
Project advice
The Document Object Model
XMLHttpRequest
React: Facebook's Functional Turn on Writing JavaScript HW4MS1 due; team formation deadline
Nov 4 AJAX AJAX overview
AJAX with jQuery
socket.io and async
Working with APIs
Nov 8 Last day to withdraw
Nov 9 Web services Web services
Data interchange; challenges
Data formats
Research spotlight: Juneau - Managing & Guiding Data Analytics & Data Science
HW4MS2 due; first project check-in
Nov 11 XML XML
Working with XML
DTDs
XML Schema
XML DOM
HW4MS3 due (on Nov 12)
Nov 16 Security Cryptography; RSA
Digital signatures
Attacks and Defenses (Part 1)
Attacks and Defenses (Part 2)
Current OWASP Top 10 Second project check-in
Nov 18 Databases Motivations for databases and data management
Relational model, data streams
SQL basics; declarative approach; query optimization
Transactions; ACID
F1: A Distributed SQL Database That Scales
Nov 23 Peer-to-peer Decentralization
Partly centralized systems; BitTorrent
Unstructured overlays; epidemic protocols
Structured overlays; consistent hashing; KBR
Case study: Pastry
Security challenges
Rodrigues and Druschel: P2P systems Third project check-in
Nov 25 Thanksgiving - no class
Nov 30 Case study: Bitcoin Distributed ledgers
Bitcoin and Proof-of-Work
Bitcoin Script
Challenges in Bitcoin
Nakamoto: Bitcoin Fourth project check-in
Dec 2 Case study: Facebook Facebook's TAO
Scalability in TAO
Fault handling in TAO
Facebook's Haystack
Haystack design
Bronson et al.: TAO: Facebook's Distributed Data Store for the Social Graph
Dec 7 TBA Topic TBA
Dec 9 Second midterm exam
Dec 15 - Dec 22 Project demos; written reports due