Resilient Distributed Datasets Notes
Limitations of MapReduce
- lacks an abstraction for leveraging distributed memory; data reuse must go through stable storage
- inefficient for iterative algorithms
- inefficient for interactive data mining (repeated ad-hoc queries over the same data)
Resilient distributed datasets
- read-only, partitioned collection of records
- created only through coarse-grained transformations (map, filter, join, ...)
- lineage: each RDD remembers how it was derived, so lost partitions can be recomputed
- API: lazy transformations plus actions such as count, collect, save (sketched below)
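A minimal sketch of the API in action, adapted from the paper's console log mining example; spark is the driver's connection to the cluster and the HDFS path is a placeholder:
// load error lines lazily and mark them for in-memory reuse
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(_.startsWith("ERROR"))
errors.persist()                     // nothing computed yet: transformations are lazy
// actions trigger computation on the workers
errors.count()
// assuming the timestamp is field 3 of a tab-separated format:
errors.filter(_.contains("HDFS"))
      .map(_.split('\t')(3))
      .collect()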
RDD vs. DSM
- recovery is more lightweight: lineage lets lost partitions be recomputed in parallel, without checkpointing full state
- backup tasks (as in MapReduce) can mitigate stragglers, since tasks are deterministic
- data locality
Not suitable for RDDs
- applications that make asynchronous fine-grained updates to shared state (e.g., a web application's storage, an incremental crawler)
Spark programming interface
- driver program defines RDDs and invokes actions on them
- driver connects to a cluster of workers that store RDD partitions in RAM and run tasks (see the setup sketch below)
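The examples below follow the paper's notation, where spark is the driver's already-connected context; in open-source Spark this would be a SparkContext. A sketch, with the master URL and app name as placeholders:
import org.apache.spark.SparkContext

// connect the driver to a standalone cluster master (placeholder URL)
val spark = new SparkContext("spark://master:7077", "rdd-notes")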
Example: logistic regression
// parse text into points, then persist the resulting RDD in RAM
val points = spark.textFile(...)
                  .map(parsePoint).persist()
// repeatedly run map and reduce on points to compute the gradient
var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  val gradient = points.map { p =>
    p.x * (1/(1+exp(-p.y*(w dot p.x)))-1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient
}
Example: PageRank
// load the link graph and persist it
val links = spark.textFile(...).map(...).persist()
// initial ranks RDD
var ranks = // RDD of (URL, rank) pairs
// repeatedly compute a new ranks RDD from the old one
for (i <- 1 to ITERATIONS) {
  // Build an RDD of (targetURL, float) pairs
  // with the contributions sent by each page
  val contribs = links.join(ranks).flatMap {
    // here links is url's list of outgoing links (shadows the links RDD)
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank/links.size))
  }
  // Sum contributions by URL and get new ranks
  // (a is the damping-related constant, N the total number of pages)
  ranks = contribs.reduceByKey((x,y) => x+y)
                  .mapValues(sum => a/N + (1-a)*sum)
}
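The join above re-shuffles links on every iteration unless links and ranks share a partitioning. A hedged sketch using a hash partitioner (parseLinks and the partition count are hypothetical; HashPartitioner is the standard partitioner in open-source Spark):
// partition the link graph by URL once, so each join(ranks) can reuse the
// same partitioning instead of re-shuffling links across the network
val links = spark.textFile("hdfs://...")                // placeholder path
                 .map(parseLinks)                        // hypothetical (URL, outgoing links) parser
                 .partitionBy(new HashPartitioner(100))  // 100 partitions is an assumed value
                 .persist()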
Representing RDDs (common interface, sketched after this list)
- dependencies on parent RDDs
  - narrow dependencies: each parent partition is used by at most one child partition
  - wide dependencies: a child partition may depend on many parent partitions (needs a shuffle)
- partitions: the set of atomic pieces of the dataset
- preferredLocations: data locality hints for computing a partition
- iterator: computes the elements of a partition given iterators over its parents
- partitioner: metadata on whether the RDD is hash/range partitioned
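A minimal Scala sketch of this common interface; the method names follow the paper, but the placeholder types and signatures are assumptions:
// placeholder types, assumed for illustration only
class Partition
trait Dependency
trait Partitioner

// hypothetical sketch of the interface each RDD implements
trait RDD[T] {
  def partitions: Seq[Partition]                      // atomic pieces of the dataset
  def dependencies: Seq[Dependency]                   // narrow or wide deps on parent RDDs
  def iterator(p: Partition,
               parentIters: Seq[Iterator[_]]): Iterator[T]  // compute one partition
  def preferredLocations(p: Partition): Seq[String]   // locality hints (e.g., HDFS block hosts)
  def partitioner: Option[Partitioner]                // set if hash/range partitioned
}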
Implementation details
- Job scheduling
  - lineage graph is split into stages of per-partition tasks; narrow dependencies are pipelined within a stage, wide dependencies form stage boundaries (see the sketch after this list)
- interpreter integration
  - class shipping: workers fetch the classes compiled for each interpreter line over HTTP
  - modified code generation: closures reference the singleton objects for previous lines directly
- memory management
  - LRU eviction at RDD granularity when memory runs low
  - per-RDD persistence priority set by users
- checkpointing
  - useful for RDDs with long lineage chains, where recomputation would be too expensive
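To make the stage split concrete, a hypothetical word-count pipeline (the path is a placeholder; the stage numbering in the comments is illustrative):
// narrow dependencies (flatMap, map) are pipelined inside one stage;
// the wide dependency from reduceByKey starts a new stage with a shuffle
val words  = spark.textFile("hdfs://...")       // stage 1: read input blocks
                  .flatMap(_.split(" "))         // stage 1: narrow, pipelined
                  .map(word => (word, 1))        // stage 1: narrow, pipelined
val counts = words.reduceByKey(_ + _)            // stage 2: wide dependency (shuffle)
counts.persist()                                  // kept in RAM, subject to LRU eviction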
Expressing existing programming models in Spark
- MapReduce (sketch below)
- SQL
- Iterative MapReduce
- Batched Stream Processing
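The paper notes that MapReduce maps onto flatMap plus groupByKey (or reduceByKey when a combiner exists). A minimal sketch; the signature and the user functions myMap/myReduce are assumptions:
import org.apache.spark.rdd.RDD

// map phase emits key/value pairs, the shuffle groups them by key,
// and the reduce phase folds each group into a result
def mapReduce[A, K, V, R](input: RDD[A],
                          myMap: A => Seq[(K, V)],
                          myReduce: (K, Seq[V]) => R): RDD[R] =
  input.flatMap(myMap)
       .groupByKey()
       .map { case (k, vs) => myReduce(k, vs.toSeq) }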