Resilient Distributed Datasets Notes
Limitations of MapReduce
- lacks an abstraction for leveraging distributed memory; data reuse must go through stable storage
- inefficient for iterative algorithms
- inefficient for interactive data mining (repeated ad-hoc queries over the same data)
Resilient distributed datasets
- read-only, partitioned collection of records
- created only through coarse-grained transformations (map, filter, join, ...)
- lineage: each RDD remembers how it was derived, so lost partitions can be recomputed
- API: lazy transformations plus actions such as count, collect, save (sketched below)
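A minimal sketch of the API in action, adapted from the paper's console log mining example; spark is the driver's connection to the cluster and the HDFS path is a placeholder:
// load error lines lazily and mark them for in-memory reuse
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()                     // nothing computed yet: transformations are lazy
// actions trigger computation on the workers
errors.count()
// assuming the timestamp is field 3 of a tab-separated format:
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()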
RDD vs. DSM
- recovery is more lightweight: lineage lets lost partitions be recomputed in parallel, without checkpointing full state
- backup tasks (as in MapReduce) can mitigate stragglers, since tasks are deterministic
- data locality
Not suitable for RDDs
- applications that make asynchronous fine-grained updates to shared state (e.g., a web application's storage, an incremental crawler)
Spark programming interface
- driver program defines RDDs and invokes actions on them
- driver connects to a cluster of workers that store RDD partitions in RAM and run tasks (see the setup sketch below)
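The examples below follow the paper's notation, where spark is the driver's already-connected context; in open-source Spark this would be a SparkContext. A sketch, with the master URL and app name as placeholders:
import org.apache.spark.SparkContext

// connect the driver to a standalone cluster master (placeholder URL)
val spark = new SparkContext("spark://master:7077", "rdd-notes")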
Example: logistic regression
// parse text into points, then persist the resulting RDD in RAM
val points = spark.textFile(...)
                  .map(parsePoint).persist()
// repeatedly run map and reduce on points to compute the gradient
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}
Example: PageRank
// load the link graph and persist it
val links = spark.textFile(...).map(...).persist()
// initial ranks RDD
var ranks = // RDD of (URL, rank) pairs
// repeatedly compute a new ranks RDD from the old one
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    // here links is url's list of outgoing links (shadows the links RDD)
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  // (a is the damping-related constant, N the total number of pages)
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}
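The join above re-shuffles links on every iteration unless links and ranks share a partitioning. A hedged sketch using a hash partitioner (parseLinks and the partition count are hypothetical; HashPartitioner is the standard partitioner in open-source Spark):
// partition the link graph by URL once, so each join(ranks) can reuse the
// same partitioning instead of re-shuffling links across the network
val links = spark.textFile("hdfs://...")                // placeholder path
                 .map(parseLinks)                        // hypothetical (URL, outgoing links) parser
                 .partitionBy(new HashPartitioner(100))  // 100 partitions is an assumed value
                 .persist()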
Representing RDDs (common interface, sketched after this list)
- dependencies on parent RDDs
  - narrow dependencies: each parent partition is used by at most one child partition
  - wide dependencies: a child partition may depend on many parent partitions (needs a shuffle)
- partitions: the set of atomic pieces of the dataset
- preferredLocations: data locality hints for computing a partition
- iterator: computes the elements of a partition given iterators over its parents
- partitioner: metadata on whether the RDD is hash/range partitioned
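A minimal Scala sketch of this common interface; the method names follow the paper, but the placeholder types and signatures are assumptions:
// placeholder types, assumed for illustration only
class Partition
trait Dependency
trait Partitioner

// hypothetical sketch of the interface each RDD implements
trait RDD[T] {
  def partitions: Seq[Partition]                      // atomic pieces of the dataset
  def dependencies: Seq[Dependency]                   // narrow or wide deps on parent RDDs
  def iterator(p: Partition,
               parentIters: Seq[Iterator[_]]): Iterator[T]  // compute one partition
  def preferredLocations(p: Partition): Seq[String]   // locality hints (e.g., HDFS block hosts)
  def partitioner: Option[Partitioner]                // set if hash/range partitioned
}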
Implementation details
- Job scheduling
  - lineage graph is split into stages of per-partition tasks; narrow dependencies are pipelined within a stage, wide dependencies form stage boundaries (see the sketch after this list)
- interpreter integration
  - class shipping: workers fetch the classes compiled for each interpreter line over HTTP
  - modified code generation: closures reference the singleton objects for previous lines directly
- memory management
  - LRU eviction at RDD granularity when memory runs low
  - per-RDD persistence priority set by users
- checkpointing
  - useful for RDDs with long lineage chains, where recomputation would be too expensive
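To make the stage split concrete, a hypothetical word-count pipeline (the path is a placeholder; the stage numbering in the comments is illustrative):
// narrow dependencies (flatMap, map) are pipelined inside one stage;
// the wide dependency from reduceByKey starts a new stage with a shuffle
val words  = spark.textFile("hdfs://...")       // stage 1: read input blocks
                  .flatMap(_.split(" "))         // stage 1: narrow, pipelined
                  .map(word => (word, 1))        // stage 1: narrow, pipelined
val counts = words.reduceByKey(_ + _)            // stage 2: wide dependency (shuffle)
counts.persist()                                  // kept in RAM, subject to LRU eviction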
Expressing existing programming models in Spark
- MapReduce (sketch below)
- SQL
- Iterative MapReduce
- Batched Stream Processing
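The paper notes that MapReduce maps onto flatMap plus groupByKey (or reduceByKey when a combiner exists). A minimal sketch; the signature and the user functions myMap/myReduce are assumptions:
import org.apache.spark.rdd.RDD

// map phase emits key/value pairs, the shuffle groups them by key,
// and the reduce phase folds each group into a result
def mapReduce[A, K, V, R](input: RDD[A],
                          myMap: A => Seq[(K, V)],
                          myReduce: (K, Seq[V]) => R): RDD[R] =
  input.flatMap(myMap)
       .groupByKey()
       .map { case (k, vs) => myReduce(k, vs.toSeq) }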