Data-Intensive Distributed Computing

Part	Description	Dates	Related Assignment
1	Introduction to Big Data	Jan. 9	CS451 - A0 CS431 - A0
2	MapReduce Algorithm Design	Jan. 11, 16, 18	CS451 - A1 CS431 - A1
3	From MapReduce to Spark	Jan. 23, 25, 30	CS451 - A2 CS431 - A2
4	Analyzing Text	Feb. 1, 6	CS451 - A3
5	Analyzing Graphs	Feb. 8, 13, 15	CS451 - A4 CS431 - A3
	Reading Week!	Feb. 17-25	-
6	Data Mining and Machine Learning	Feb. 27, 29, Mar. 5, 7	CS451 - A5 CS431 - A4
7	Analyzing Relational Data	Mar. 12, 14🎉, 19	CS451 - A6 CS431 - A5
8	Real-Time Analytics (Streaming)	Mar. 21, 26	CS451 - A7 CS431 - A6
9	Mutable State (Bigtable / HBase)	Mar. 28, Apr. 2	-
10	Analyzing Graphs, Redux (Giraph, Spark GraphX)	Apr. 4	-

Note that the following slides are from last term. I will be updating them as time permits. A bit of JavaScript on this page puts an "updated" note beside any file that changes.

Part 1: Introduction to Big Data

Topics

What's this course about?
Why big data?
Scaling models

Slides

PDF Module 1 - Introduction

Part 2: MapReduce Algorithm Design

Topics

MapReduce programming model
Cloud computing and datacenters
Hadoop API
Hadoop physical execution
MapReduce design patterns
Intermediate aggregation and combiners
Partitioning, grouping, and sorting
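
As a rough illustration of the "intermediate aggregation and combiners" topic above, the following single-process Python sketch simulates a word-count job with and without a combiner. It is not Hadoop code: the function names (`map_words`, `combine`, `shuffle`, `reduce_counts`, `run_job`) and the tiny input are made up for illustration, and the "shuffle" is just an in-memory grouping.

```python
from collections import defaultdict

def map_words(line):
    """Mapper: emit (word, 1) for every token in one input line."""
    return [(word, 1) for word in line.lower().split()]

def combine(pairs):
    """Combiner: pre-sum the counts produced by a single mapper."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

def shuffle(mapper_outputs):
    """Group values by key across all mapper outputs, sorted by key."""
    grouped = defaultdict(list)
    for pairs in mapper_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(sorted(grouped.items()))

def reduce_counts(grouped):
    """Reducer: sum the values collected for each key."""
    return {word: sum(counts) for word, counts in grouped.items()}

def run_job(splits, use_combiner=True):
    """Run the toy job; return (word counts, number of pairs shuffled)."""
    outputs = []
    for split in splits:  # one hypothetical mapper per input split
        pairs = [p for line in split for p in map_words(line)]
        outputs.append(combine(pairs) if use_combiner else pairs)
    shuffled_pairs = sum(len(o) for o in outputs)
    return reduce_counts(shuffle(outputs)), shuffled_pairs

# Two made-up input splits, each a list of lines.
splits = [["a b a", "a c"], ["b b a"]]

if __name__ == "__main__":
    print(run_job(splits, use_combiner=False))
    print(run_job(splits, use_combiner=True))
```

Both runs produce the same counts; the combiner only reduces how many intermediate pairs cross the simulated shuffle. This works because addition is associative and commutative, which is the usual condition for reusing reducer logic as a combiner.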

Readings

Data-Intensive Text Processing with MapReduce
Hadoop: The Definitive Guide (4th Edition):
- Chapter 1: Meet Hadoop
- Chapter 2: MapReduce
- Chapter 3: The Hadoop Distributed Filesystem (Focus on the mechanics of the HDFS commands and don't worry so much about learning the Java API all at once—you'll pick it up in time.)
- Chapter 5: Hadoop I/O (Read sections "Serialization" and "File-Based Data Structures")
- Chapter 6: Developing a MapReduce Application (Skip sections "Setting Up the Development Environment", "Writing a Unit Test with MRUnit" and "MapReduce Workflows")
- Chapter 7: How MapReduce Works (Skip section on "Configuration Tuning")
- Chapter 8: MapReduce Types and Formats
- Chapter 9: MapReduce Features (Read sections on "Counters", "Sorting", and "Side Data Distribution")

Slides

PDF Module 2 - MapReduce

Part 3: From MapReduce to Spark

Topics

Evolution of dataflow abstractions
MapReduce, Pig, Spark, etc.

Readings

Learning Spark (Optional):
- Chapter 1: Introduction to Data Analysis with Spark
- Chapter 2: Downloading Spark and Getting Started (Skip section on downloading)
- Chapter 3: Programming with RDDs
- Chapter 4: Working with Key/Value Pairs
- Chapter 5: Loading and Saving Your Data (Stop when you get to Structured Data with Spark SQL)

Slides

Part 4: Analyzing Text

Topics

Language models and machine translation
Inverted indexing and search

Readings

Large Language Models in Machine Translation (Optional)
Data-Intensive Text Processing with MapReduce — Chapter 4: Inverted Indexing for Text Retrieval

Slides

PDF Module 4 - Text Processing

Part 5: Analyzing Graphs

Topics

Graph representations
Parallel breadth-first search
PageRank and random walks
Issues and challenges with dataflow abstractions
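
To make the "PageRank and random walks" topic above a little more concrete, here is a minimal single-machine sketch of iterative PageRank over an adjacency-list graph. The damping factor of 0.85, the fixed iteration count, the even redistribution of dangling-node mass, and the four-node example graph are all assumptions chosen for illustration; they are not taken from the course materials.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterative PageRank on a dict mapping node -> list of out-neighbours.

    Every node that appears as a target must also appear as a key.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}  # start from a uniform distribution

    for _ in range(iterations):
        incoming = {v: 0.0 for v in nodes}
        dangling = 0.0
        # "Map"-like step: each node splits its rank evenly over its out-links.
        for v, out_links in graph.items():
            if out_links:
                share = rank[v] / len(out_links)
                for u in out_links:
                    incoming[u] += share
            else:
                dangling += rank[v]  # mass from nodes with no out-links
        # "Reduce"-like step: sum incoming mass, then apply the random jump.
        rank = {
            v: (1.0 - damping) / n + damping * (incoming[v] + dangling / n)
            for v in nodes
        }
    return rank

# Made-up example graph in adjacency-list form.
example_graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

if __name__ == "__main__":
    for node, score in sorted(pagerank(example_graph).items()):
        print(node, round(score, 4))
```

Each iteration reads the whole graph structure and rewrites every rank, which hints at why iterative graph algorithms are a challenge for plain dataflow abstractions.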

Readings

Data-Intensive Text Processing with MapReduce — Chapter 5: Graph Algorithms

Slides

PDF Module 5 - Graphs

Part 6: Data Mining and Machine Learning

Topics

Supervised machine learning: binary classification
Logistic regression, gradient descent, stochastic gradient descent, ensemble methods
Production machine learning pipelines
Hashing: minhash
Clustering: k-means

Readings

Tom Mitchell. Naive Bayes and Logistic Regression. (This book chapter serves as supplemental reading and goes into classification in more detail than in lecture.)
Deisenroth et al., Mathematics for Machine Learning: Chapter 12, Classification with Support Vector Machines. (Optional supplemental reading)
Deisenroth et al., Mathematics for Machine Learning: Chapter 11, Density Estimation with Gaussian Mixture Models. (This book chapter serves as supplemental reading and goes into clustering with Gaussian mixture models in more detail than in lecture.)

Slides

Part 7: Analyzing Relational Data

Topics

OLTP vs. OLAP
Data warehousing and data lakes, ETL
SQL-on-Hadoop: relational data processing with MapReduce and Spark
Optimizations for relational processing: row vs. column stores, vectorized processing
Semistructured data and record reconstruction (Parquet)

Readings

Data-Intensive Text Processing with MapReduce — Chapter 6: Processing Relational Data
MapReduce: A major step backwards

Slides

PDF Module 7 - Relational Data

Part 8: Real-Time Analytics

Topics

Stream processing semantics, issues, and frameworks
Introduction to Apache Spark Streaming
Probabilistic data structures (HyperLogLog counters, Bloom filters, count-min sketches, etc.)
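
As one small example of the probabilistic data structures listed above, the sketch below implements a basic Bloom filter in plain Python. The sizing (1024 bits, 3 hash functions) and the way hash functions are derived (SHA-256 over a seed-prefixed string) are assumptions made for illustration, not a recommendation for production use.

```python
import hashlib

class BloomFilter:
    """Set-membership sketch: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit keeps the code readable; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        """Yield one bit position per hash function for the given item."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set every bit position associated with the item."""
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        """True if the item may have been added; False if it definitely was not."""
        return all(self.bits[pos] for pos in self._positions(item))

if __name__ == "__main__":
    seen = BloomFilter()
    for user in ["alice", "bob", "carol"]:
        seen.add(user)
    print("alice" in seen)    # added items are always reported as present
    print("mallory" in seen)  # usually False, but a false positive is possible
```

The filter never stores the items themselves, so memory stays fixed no matter how many elements arrive on the stream; the price is a false-positive rate that grows as more bits are set.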

Readings

Zaharia et al. Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013.

Slides

PDF Module 8 - Beyond Batch Processing

Part 9: Mutable State

Topics

Bigtable/HBase: Log-structured merge trees
Distributed hash tables
Consistency, latency, and availability tradeoffs

Readings

The original Bigtable paper.
The original DHT paper.
Daniel Abadi. Consistency Tradeoffs in Modern Distributed Database System Design, Computer, 45(2):37-42, 2012.

Slides

PDF Module 9 - Mutable State

Part 10: Analyzing Graphs, Redux

Topics

Bulk synchronous parallel: "think like a vertex" (Giraph)

Readings

Mining of Massive Datasets: Link Analysis Section 5.4
Sherif Sakr. Large-Scale Graph Processing Systems, 2016.

Slides

PDF Module 10 - Graphs Redux

Syllabus Data-Intensive Distributed Computing (Winter 2024)
