GFS Paper Notes
The Google File System
Points of departure from traditional distributed file systems
- component failures are the norm, not the exception
- files are huge (multi-GB files are common)
- most files are mutated by appending new data rather than overwriting
- flexible API, co-designed with the applications
Assumptions for the design overview
- components often fail
- a modest number of large files
- workloads
- large streaming reads
- small random reads
- large, sequential writes that append data; files are seldom modified once written
- well-defined semantics for multiple clients appending concurrently to the same file
- high sustained bandwidth matters more than low latency
API
- usual operations: create, delete, open, close, read, write
- snapshot
- copy-on-write (COW)
- record append
Architecture
- single master, multiple chunkservers accessed by multiple clients
- files are divided into fixed-size chunks, each identified by a globally unique 64-bit chunk handle; chunks are replicated on multiple chunkservers (three by default)
- Chunk size: 64MB
- master maintains all file system metadata
- namespace
- access control information
- mapping from files to chunks
- current location of chunks
- chunk lease management
- gc
- chunk migration
- heartbeat messages between master and chunkservers
- not a POSIX API
- the master serves only metadata; all data-bearing communication goes directly to chunkservers (see the read-path sketch after this list)
- neither clients nor chunkservers cache file data
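A minimal sketch of the client read path implied by this architecture. The names here (`lookup`, `ChunkInfo`, the addresses) are my own stand-ins, not the paper's RPC interface; what is taken from the paper is that clients turn byte offsets into chunk indices, ask the master only for metadata, and fetch the actual bytes from a chunkserver.

```go
package main

import "fmt"

const chunkSize = 64 << 20 // fixed 64 MB chunks

// ChunkInfo is a hypothetical master reply: chunk handle plus replica locations.
type ChunkInfo struct {
	Handle   uint64
	Replicas []string // chunkserver addresses
}

// lookup stands in for the single metadata RPC to the master; clients cache
// the reply so most reads never touch the master at all.
func lookup(path string, chunkIndex int64) ChunkInfo {
	return ChunkInfo{Handle: 42, Replicas: []string{"cs1:7070", "cs2:7070", "cs3:7070"}}
}

// Read maps a byte offset to (chunk index, offset within chunk), asks the
// master only for metadata, then fetches the bytes from a chunkserver.
func Read(path string, offset, length int64) {
	chunkIndex := offset / chunkSize
	chunkOffset := offset % chunkSize
	info := lookup(path, chunkIndex)
	fmt.Printf("read %d bytes of chunk %d from %s at chunk offset %d\n",
		length, info.Handle, info.Replicas[0], chunkOffset)
}

func main() {
	Read("/data/logs/part-0001", 3*chunkSize+128, 4096)
}
```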
Metadata in master
- Three major types
- namespace(persisted, replicated)
- mapping from files to chunks(persisted, replicated)
- location of each chunk's replicas (not persisted; the master asks chunkservers for it)
- All in memory
- chunk locations
- polled from chunkservers at master startup and kept current via heartbeats (see the sketch after this list)
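A rough sketch of what the master's three in-memory tables could look like (field names are my own): the namespace and file-to-chunk mapping are persisted via the operation log, while chunk locations are rebuilt from chunkserver polling and heartbeats.

```go
package main

import "fmt"

// FileMeta is a stand-in for per-file metadata such as access control info.
type FileMeta struct {
	Owner string
	Mode  uint32
}

// MasterMetadata sketches the three in-memory tables. Namespace and FileChunks
// changes go through the operation log and are replicated; Locations is not
// persisted and is rebuilt from chunkserver polling and heartbeats.
type MasterMetadata struct {
	Namespace  map[string]FileMeta // full pathname -> file metadata
	FileChunks map[string][]uint64 // full pathname -> ordered chunk handles
	Locations  map[uint64][]string // chunk handle -> chunkserver addresses
}

func main() {
	m := MasterMetadata{
		Namespace:  map[string]FileMeta{"/data/logs/part-0001": {Owner: "app", Mode: 0644}},
		FileChunks: map[string][]uint64{"/data/logs/part-0001": {42, 43}},
		Locations:  map[uint64][]string{42: {"cs1:7070", "cs2:7070", "cs3:7070"}},
	}
	fmt.Println(m.Locations[m.FileChunks["/data/logs/part-0001"][0]])
}
```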
Operation log
- persistent record of critical metadata changes; also defines the order of concurrent operations
- replicated on remote machines; changes are visible to clients only after the log is flushed locally and remotely
- the master checkpoints its state so recovery only replays the log tail
Consistency Model
- relaxed consistency model
- file namespace mutations (e.g. creation) are atomic, handled entirely by the master
- the state of a file region after a data mutation can be consistent, defined, or inconsistent, depending on the mutation type and whether it succeeded
Implications for the application level
- atomically rename a file to a permanent name once it has been fully written (see the sketch after this list)
- application-level checkpoints recording how much has been successfully written
- checksums and unique record identifiers to detect padding and rare duplicates
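A local-filesystem analogy for the "write then atomically rename" pattern, not GFS client code: the writer produces the file under a temporary name and renames it only when complete, so a reader never observes a partially written checkpoint.

```go
package main

import "os"

// writeCheckpoint writes the data under a temporary name and renames it into
// place only when complete, so a reader sees either the old checkpoint or the
// new one, never a half-written file.
func writeCheckpoint(path string, data []byte) error {
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0644); err != nil {
		return err
	}
	return os.Rename(tmp, path) // the rename is the atomic "commit"
}

func main() {
	_ = writeCheckpoint("/tmp/app.checkpoint", []byte("serialized state"))
}
```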
Leases
- the master grants a chunk lease to one of the replicas, the primary, which picks a serial order for all mutations to the chunk (sketched below)
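A sketch of the lease state the master might keep; the field names are invented, but the 60-second initial lease term is from the paper, and extensions are piggybacked on heartbeat messages.

```go
package main

import (
	"fmt"
	"time"
)

// Lease records which replica currently acts as primary for a chunk.
type Lease struct {
	ChunkHandle uint64
	Primary     string    // chunkserver address holding the lease
	Expires     time.Time // extensions are piggybacked on heartbeats
}

// grantLease picks one replica as primary for an initial 60-second term.
func grantLease(handle uint64, replicas []string) Lease {
	return Lease{
		ChunkHandle: handle,
		Primary:     replicas[0],
		Expires:     time.Now().Add(60 * time.Second),
	}
}

func main() {
	l := grantLease(42, []string{"cs1:7070", "cs2:7070", "cs3:7070"})
	fmt.Println(l.Primary, "is primary until", l.Expires.Format(time.RFC3339))
}
```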
Atomic Record Appends
- the region written by a successful record append is defined, while intervening regions (padding, fragments from failed attempts) may be inconsistent
- at-least-once semantics; writers embed checksums and unique record IDs so readers can skip padding and duplicates (reader-side sketch below)
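A sketch of reader-side filtering under at-least-once appends. The record layout (an ID plus a CRC32 over the payload) is invented for illustration; the idea of using checksums to discard padding/fragments and unique identifiers to drop duplicates is from the paper.

```go
package main

import (
	"fmt"
	"hash/crc32"
)

// Record is a hypothetical self-validating record format.
type Record struct {
	ID       uint64 // unique per logical record, assigned by the writer
	Payload  []byte
	Checksum uint32 // crc32 over Payload
}

// filterRecords keeps only valid, first-seen records: bad checksums are
// treated as padding or fragments, repeated IDs as duplicated appends.
func filterRecords(records []Record) []Record {
	seen := make(map[uint64]bool)
	var out []Record
	for _, r := range records {
		if crc32.ChecksumIEEE(r.Payload) != r.Checksum {
			continue // padding or corrupted region
		}
		if seen[r.ID] {
			continue // duplicate from a retried append
		}
		seen[r.ID] = true
		out = append(out, r)
	}
	return out
}

func main() {
	p := []byte("event")
	recs := []Record{
		{ID: 1, Payload: p, Checksum: crc32.ChecksumIEEE(p)},
		{ID: 1, Payload: p, Checksum: crc32.ChecksumIEEE(p)}, // duplicate append
		{ID: 2, Payload: p, Checksum: 0},                     // bad checksum -> skipped
	}
	fmt.Println(len(filterRecords(recs))) // 1
}
```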
Master Operation
- Namespace locking: an operation on /d1/d2/.../dn/leaf acquires read locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf
- this allows concurrent mutations in the same directory: multiple file creations can be executed concurrently in the same directory, each acquiring a read lock on the directory name and a write lock on the file name (see the locking sketch below)
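A minimal sketch of the per-pathname read/write locking described above, assuming one RWMutex per pathname: read-lock every ancestor directory name, then read- or write-lock the leaf. Two concurrent creations in the same directory only write-lock different leaf names, so they proceed in parallel.

```go
package main

import (
	"strings"
	"sync"
)

// NamespaceLocks keeps one read-write lock per full pathname.
type NamespaceLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.RWMutex
}

func NewNamespaceLocks() *NamespaceLocks {
	return &NamespaceLocks{locks: map[string]*sync.RWMutex{}}
}

func (n *NamespaceLocks) lockFor(path string) *sync.RWMutex {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.locks[path] == nil {
		n.locks[path] = &sync.RWMutex{}
	}
	return n.locks[path]
}

// prefixes("/d1/d2/leaf") -> ["/d1", "/d1/d2"].
func prefixes(path string) []string {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	var out []string
	for i := 1; i < len(parts); i++ {
		out = append(out, "/"+strings.Join(parts[:i], "/"))
	}
	return out
}

// LockPath takes read locks on every ancestor directory name and a write
// (or read) lock on the full pathname; the returned function releases them
// in reverse order.
func (n *NamespaceLocks) LockPath(path string, write bool) (unlock func()) {
	var release []func()
	for _, p := range prefixes(path) {
		l := n.lockFor(p)
		l.RLock()
		release = append(release, l.RUnlock)
	}
	leaf := n.lockFor(path)
	if write {
		leaf.Lock()
		release = append(release, leaf.Unlock)
	} else {
		leaf.RLock()
		release = append(release, leaf.RUnlock)
	}
	return func() {
		for i := len(release) - 1; i >= 0; i-- {
			release[i]()
		}
	}
}

func main() {
	ns := NewNamespaceLocks()
	// Creating /home/user/foo: read locks on /home and /home/user,
	// write lock on /home/user/foo.
	unlock := ns.LockPath("/home/user/foo", true)
	unlock()
}
```

The paper additionally acquires locks in a consistent total order (first by namespace-tree level, then lexicographically within a level) to prevent deadlock; the sketch above skips that detail.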
- Replica placement: creation, re-replication, rebalancing (placement sketch below)
- limit the number of "recent" creations on each chunkserver
- place new replicas on chunkservers with below-average disk space utilization
- spread replicas of a chunk across racks
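A toy version of the creation-time placement heuristics listed above; the struct fields, thresholds, and selection loop are my own, but the three criteria (low disk utilization, capped recent creations, rack diversity) are from the paper.

```go
package main

import (
	"fmt"
	"sort"
)

// ChunkServer is a toy view of the state the master tracks per server.
type ChunkServer struct {
	Addr            string
	Rack            string
	DiskUtilization float64 // fraction of disk space in use
	RecentCreations int     // chunk creations in the recent window
}

// pickReplicas prefers servers with low disk utilization, skips servers with
// too many recent creations, and avoids placing two replicas on one rack.
func pickReplicas(servers []ChunkServer, n, maxRecent int) []ChunkServer {
	sort.Slice(servers, func(i, j int) bool {
		return servers[i].DiskUtilization < servers[j].DiskUtilization
	})
	usedRacks := map[string]bool{}
	var picked []ChunkServer
	for _, s := range servers {
		if len(picked) == n {
			break
		}
		if s.RecentCreations >= maxRecent || usedRacks[s.Rack] {
			continue
		}
		picked = append(picked, s)
		usedRacks[s.Rack] = true
	}
	return picked
}

func main() {
	servers := []ChunkServer{
		{"cs1:7070", "rackA", 0.40, 1},
		{"cs2:7070", "rackA", 0.35, 0},
		{"cs3:7070", "rackB", 0.55, 0},
		{"cs4:7070", "rackC", 0.60, 2},
	}
	for _, s := range pickReplicas(servers, 3, 5) {
		fmt.Println(s.Addr, s.Rack)
	}
}
```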
- GC
- master logs the deletion
- file is just renamed to a hidden name that includes the deletion timestamp
- during the master's regular scan of the file system namespace, it removes any such hidden files that have existed for more than three days (sketched below)
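A sketch of the lazy deletion flow, with an invented hidden-name convention that embeds the deletion timestamp; the three-day grace period before the namespace scan reclaims the name is from the paper.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

const hiddenPrefix = ".deleted-" // hypothetical hidden-name convention

// hideFile is what "delete" does in the namespace: rename the file to a
// hidden name carrying the deletion timestamp (the deletion is also logged).
func hideFile(namespace map[string]string, path string, now time.Time) {
	slash := strings.LastIndex(path, "/")
	hidden := path[:slash+1] + hiddenPrefix +
		strconv.FormatInt(now.Unix(), 10) + "-" + path[slash+1:]
	namespace[hidden] = namespace[path]
	delete(namespace, path)
}

// sweep runs during the master's regular namespace scan and removes hidden
// files whose deletion timestamp is more than three days old.
func sweep(namespace map[string]string, now time.Time) {
	for name := range namespace {
		base := name[strings.LastIndex(name, "/")+1:]
		if !strings.HasPrefix(base, hiddenPrefix) {
			continue
		}
		tsStr := strings.SplitN(strings.TrimPrefix(base, hiddenPrefix), "-", 2)[0]
		ts, err := strconv.ParseInt(tsStr, 10, 64)
		if err != nil {
			continue
		}
		if now.Sub(time.Unix(ts, 0)) > 72*time.Hour {
			delete(namespace, name) // metadata gone; orphaned chunks are GC'd later
		}
	}
}

func main() {
	ns := map[string]string{"/data/old.log": "chunk handles..."}
	hideFile(ns, "/data/old.log", time.Now().Add(-96*time.Hour)) // deleted 4 days ago
	sweep(ns, time.Now())
	fmt.Println(len(ns)) // 0
}
```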
Fault Tolerance
- fast recovery
- replication of the master's state and of chunks
- shadow masters provide read-only access when the primary master is down
- checksums to detect corrupted chunk data (sketch below)
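A sketch of block-level checksumming for detecting corrupted chunk data; the 64 KB checksum block size is from the paper, while the rest (CRC32, in-memory layout) is assumed for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"hash/crc32"
)

const blockSize = 64 << 10 // 64 KB checksum granularity

// checksums computes one CRC32 per 64 KB block of chunk data.
func checksums(data []byte) []uint32 {
	var sums []uint32
	for off := 0; off < len(data); off += blockSize {
		end := off + blockSize
		if end > len(data) {
			end = len(data)
		}
		sums = append(sums, crc32.ChecksumIEEE(data[off:end]))
	}
	return sums
}

// verify recomputes block checksums before serving a read; a mismatch means
// the chunkserver reports corruption and the client reads another replica.
func verify(data []byte, stored []uint32) bool {
	got := checksums(data)
	if len(got) != len(stored) {
		return false
	}
	for i := range got {
		if got[i] != stored[i] {
			return false
		}
	}
	return true
}

func main() {
	chunk := bytes.Repeat([]byte("x"), 3*blockSize+100)
	sums := checksums(chunk)
	chunk[blockSize+5] ^= 0xFF              // simulate disk corruption
	fmt.Println(verify(chunk, sums))        // false -> read from another replica
}
```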