Lecture 4: Consistency

Outline
  Consistency
  Consistency models
  Strict consistency
  Sequential consistency
  IVY

Consistency = a constraint on the memory model of a storage system:
  what is the effect of concurrent reads and writes?
Interesting when data is replicated and accessed concurrently

Replicated data is an important topic in distributed systems
  For performance and fault tolerance
  Examples:
    caching for low latency or reduced server load
    replicate over multiple servers for low latency or increased throughput
    replicate for fault tolerance

How do we know if caching/replication is correct?
  We need to know how to think about correct execution of distributed programs.
  Most of these ideas come from multiprocessors and databases of 20/30 years ago.
  For now, just correctness and efficiency, not fault tolerance.

Replicating content that isn't modified (or has a single writer) is "simple"
  Immutable objects: printed newspaper
  Single writer: Web browser caching
    Browser obeys HTTP expiration directive and last-modified
Topic gets much more interesting when there are concurrent reads and writes.

Let's assume we are implementing a traditional memory (i.e. with LD/ST)
  matches many read/write abstract interfaces (e.g., file systems)

Naive distributed memory
  [diagram: M0, M1, M2, LAN]
  each machine has a local copy of all of memory
  read: from local memory
  write: send update msg to each other host (but don't wait)
  fast: never waits for communication
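Here is a minimal Go sketch of that naive design, just to make the structure
concrete (all names -- naiveMem, update, and so on -- are invented for this
illustration, and Go channels stand in for the LAN). Each machine keeps a full
local copy, reads locally, and on a write sends update messages to its peers
without waiting for them to be applied. A Go channel delivers in FIFO order,
so this sketch can only show reads racing ahead of updates; a real network
that reorders or delays messages is what produces Problems A and B below.

    package main

    import (
        "fmt"
        "sync"
    )

    // update is the message a writer sends to every other machine.
    type update struct {
        addr  string
        value int
    }

    // naiveMem is one machine's view of memory: a complete local copy,
    // plus channels to the other machines.
    type naiveMem struct {
        mu    sync.Mutex
        local map[string]int
        peers []chan update // outgoing updates, one channel per other machine
    }

    func newNaiveMem(inbox chan update, peers []chan update) *naiveMem {
        m := &naiveMem{local: map[string]int{}, peers: peers}
        // Apply updates from other machines in the background, whenever they arrive.
        go func() {
            for u := range inbox {
                m.mu.Lock()
                m.local[u.addr] = u.value
                m.mu.Unlock()
            }
        }()
        return m
    }

    // Read never communicates: it just returns the local copy's value.
    func (m *naiveMem) Read(addr string) int {
        m.mu.Lock()
        defer m.mu.Unlock()
        return m.local[addr]
    }

    // Write updates the local copy and sends updates to the peers,
    // but does not wait for them to be applied.
    func (m *naiveMem) Write(addr string, value int) {
        m.mu.Lock()
        m.local[addr] = value
        m.mu.Unlock()
        for _, p := range m.peers {
            p <- update{addr, value} // buffered send: returns immediately
        }
    }

    func main() {
        in0 := make(chan update, 100)
        in1 := make(chan update, 100)
        m0 := newNaiveMem(in0, []chan update{in1})
        m1 := newNaiveMem(in1, []chan update{in0})

        m0.Write("v0", 42)
        m0.Write("done0", 1)
        // M1's reads race with update delivery: it may see stale values for
        // both, and on a network that reorders messages it could even see
        // done0 == 1 while v0 is still 0.
        fmt.Println(m1.Read("done0"), m1.Read("v0"))
    }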
Does this memory work well?

Example 1:
  M0:
    v0 = f0();
    done0 = 1;
  M1:
    while(done0 == 0)
      ;
    v1 = f1(v0);
    done1 = 1;
  M2:
    while(done1 == 0)
      ;
    v2 = f2(v0, v1);
  Intuitive intent:
    M2 should execute f2() with results from M0 and M1
    waiting for M1 implies waiting for M0

Example 1 won't work with naive distributed memory:
  Problem A:
    [time diagram]
    M0's writes of v0 and done0 may be interchanged by the network,
      leaving v0 unset but done0=1
    how to fix? would lab RPC fix this?
      -> yes, if machines don't execute the next write until the previous
         write is acked by all other machines
  Problem B:
    [time diagram]
    M2 sees M1's writes before M0's writes
      i.e., M2 and M1 disagree on the order of M0's and M1's writes
    how to fix? the solution from above will do for this simple program
      (would not be so easy if multiple machines wrote to the same variable)

Naive distributed memory is fast but has unexpected behavior
  maybe it isn't "correct"
  maybe we should never have expected Example 1 to work

How can we write correct distributed programs w/ shared storage?
  Memory system promises to behave according to certain rules.
  We write programs assuming those rules.
  Rules are a "consistency model"
  Contract between memory system and programmer

What makes a good consistency model?
  There are no "right" or "wrong" models
  A model may make it harder or easier to program
    i.e. lead to more or less intuitive results
  A model may be harder or easier to implement efficiently
  Also application dependent
    A consistency model for Web pages is different from one for memory

How about "strict consistency":
  Rule 1: a LD gets the value of the most recent previous ST to the same address
  Rule 2: (implied) LD/ST instructions across all machines get a unique
    time stamp from a global clock
  Essentially the same as on a uniprocessor
  Very intuitive consistency model

  Example:
    M1:        ST(x)1
    -----------------------------------
    M2: LD(x)0        LD(x)1  LD(x)1
    (the first LD precedes the ST in real time)
    valid under strict consistency!

  Example:
    M1: ST(x)1
    -----------------------------------
    M2:         LD(x)0  LD(x)1
    (both LDs follow the ST in real time)
    Not valid under strict consistency!

  Would strict consistency avoid problems A and B?

How do you implement strict consistency?
  Time:  1    2    3    4
  M0:    ST             ST
  M1:         LD   LD
  Time between instructions << speed-of-light delay between machines!
  How is LD@2 even aware of ST@1?
  How does ST@4 know to pause until LD@3 has finished?
    how does ST@4 know how long to wait?
  Too hard to implement!

A more reasonable model: sequential consistency
  Rule 1: the result must be the same as if all machines' LD/ST operations
    were executed in some total order
  Rule 2: the LD/ST operations of each individual machine appear in that
    total order in the order specified by its program

  Example:
    M1: ST(x)1
    -----------------------
    M2: LD(x)1 LD(x)2
    -----------------------
    M3: LD(x)1 LD(x)2
    -----------------------
    M4: ST(x)2
    Valid under sequential consistency (but not under strict consistency)

  Example:
    M1: ST(x)1
    -----------------------
    M2: LD(x)1 LD(x)2
    -----------------------
    M3: LD(x)2 LD(x)1
    -----------------------
    M4: ST(x)2
    Not valid under sequential consistency!

  Example:
    M1:        ST(x)1
    -----------------------
    M2: LD(x)1
    (the LD precedes the ST in real time)
    Valid under sequential consistency, even though it doesn't respect causality!
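To make the definition concrete, here is a small brute-force Go sketch (the
names and the history representation are invented for illustration) that
decides whether a tiny history like the ones above is sequentially consistent:
it searches for some total order that respects each machine's program order
(Rule 2) and in which every LD returns the value of the most recent ST to that
address (Rule 1), assuming every location starts at 0. The search is
exponential in the number of operations, so it is only useful for hand-sized
examples.

    package main

    import "fmt"

    // op is a single LD or ST: value is what the ST wrote, or what the LD observed.
    type op struct {
        isStore bool
        addr    string
        value   int
    }

    // sequentiallyConsistent reports whether there is some total order of all
    // machines' operations that preserves each machine's program order and
    // makes every LD return the most recent ST to its address (0 if none).
    func sequentiallyConsistent(machines [][]op) bool {
        next := make([]int, len(machines)) // index of each machine's next unplaced op
        return explore(machines, next, map[string]int{})
    }

    func explore(machines [][]op, next []int, mem map[string]int) bool {
        allDone := true
        for i, ops := range machines {
            if next[i] >= len(ops) {
                continue // all of this machine's ops are already placed
            }
            allDone = false
            o := ops[next[i]]
            if o.isStore {
                old, existed := mem[o.addr]
                mem[o.addr] = o.value
                next[i]++
                if explore(machines, next, mem) {
                    return true
                }
                next[i]-- // backtrack
                if existed {
                    mem[o.addr] = old
                } else {
                    delete(mem, o.addr)
                }
            } else if mem[o.addr] == o.value { // a LD must see the current value
                next[i]++
                if explore(machines, next, mem) {
                    return true
                }
                next[i]--
            }
        }
        return allDone // succeeded only if every op was placed in the total order
    }

    func main() {
        // The example above in which M3 sees the two stores in the opposite
        // order from M2 -- expect "false".
        h := [][]op{
            {{isStore: true, addr: "x", value: 1}},        // M1: ST(x)1
            {{addr: "x", value: 1}, {addr: "x", value: 2}}, // M2: LD(x)1 LD(x)2
            {{addr: "x", value: 2}, {addr: "x", value: 1}}, // M3: LD(x)2 LD(x)1
            {{isStore: true, addr: "x", value: 2}},        // M4: ST(x)2
        }
        fmt.Println(sequentiallyConsistent(h))
    }

Running it on this history prints false, matching the "not valid" verdict
above; putting M3's loads back in the same order as M2's makes it print true,
matching the first sequential-consistency example.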
A sequentially consistent system would not have Problems A/B
  Problem A
    M0's execution order was: v0=  done0=
    M1 saw: done0=  v0=
    but each machine's operations must appear in its execution order,
      so this cannot happen with sequential consistency
  Problem B
    M1 saw: v0=  done0=  done1=
    M2 saw: done1=  v0=
    this cannot occur given a single total order,
      so it cannot happen with sequential consistency

Better performance than strict consistency
  The system has some freedom in how it interleaves different machines' ops
    not forced to order by op start time according to a global clock,
      as in strict consistency
    the system can delay a read or write while it finds current values

Performance is still not great
  Once a machine's write completes, other machines' reads must see the new data
  Thus communication cannot be omitted or much delayed
  Thus either reads or writes (or both) will be expensive

A simple implementation of sequential consistency
  [diagram]
  single memory server
  each machine sends r/w ops to the server, in order, waiting for the reply
  the server picks an order among the waiting ops
  the server executes them one by one, sending replies

This simple implementation will be slow
  the single server will get overloaded
  no local cache, so all operations wait for the server
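Here is a minimal Go sketch of that single-server design, under the same
assumptions (reliable delivery, no failures); the names -- request, serve,
store, load -- are invented. One goroutine plays the memory server: it picks
an order simply by receiving operations one at a time and applying each before
replying. Each client blocks for the reply, so it has at most one outstanding
operation, and the Example 1 handshake behaves as intended.

    package main

    import "fmt"

    // request is one LD or ST sent to the single memory server.
    // The server replies on `reply` once the operation has been applied.
    type request struct {
        isStore bool
        addr    string
        value   int
        reply   chan int // LD result (a ST just gets an ack)
    }

    // serve is the single memory server: it picks an order among waiting ops
    // by receiving them one at a time, applies each, then replies.
    func serve(ops chan request) {
        mem := map[string]int{}
        for r := range ops {
            if r.isStore {
                mem[r.addr] = r.value
            }
            r.reply <- mem[r.addr]
        }
    }

    // Client-side helpers: send the op, then wait for the server's reply,
    // so each machine has at most one operation in flight.
    func store(ops chan request, addr string, v int) {
        done := make(chan int)
        ops <- request{true, addr, v, done}
        <-done
    }

    func load(ops chan request, addr string) int {
        done := make(chan int)
        ops <- request{false, addr, 0, done}
        return <-done
    }

    func main() {
        ops := make(chan request)
        go serve(ops)

        finished := make(chan bool)
        // M0 from Example 1: write v0, then done0.
        go func() {
            store(ops, "v0", 42)
            store(ops, "done0", 1)
            finished <- true
        }()
        // M1: spin until done0 is set; by then v0 must already be applied.
        go func() {
            for load(ops, "done0") == 0 {
            }
            fmt.Println("v0 =", load(ops, "v0")) // always prints 42
            finished <- true
        }()
        <-finished
        <-finished
    }

The price is exactly the one noted above: every LD and ST pays a round trip to
the single server, and that server is a bottleneck.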
Idea 1: partition memory across multiple servers
  eliminates the single-server bottleneck
  can serve many machines in parallel if they don't use the same memory
  The Lamport paper from 1979 shows the system is sequentially consistent if:
    1. each machine executes one LD/ST at a time, waiting for it to complete
    2. LD/ST ops to the same memory location are executed in FIFO order
  i.e. you can have lots of independent machines and memory systems

Idea 2: if a memory location is not written, you can replicate it
  i.e. cache it on each machine, so reads are fast
  but must ensure reads and writes are ordered
    once a write modifies the location, no read should return the old value
    thus must revoke cached copies before writing
    this delays writes to improve read performance

Which brings us to IVY, which uses both ideas, and more.

IVY = Integrated shared Virtual memory at Yale
  Memory Coherence in Shared Virtual Memory Systems, Li and Hudak, PODC 1986

Why is IVY cool?
  Acts like an expensive shared-memory multiprocessor
    on a network of cheap machines
    [diagram: LAN, machines w/ RAM, MGR]
  Runs threaded code w/o modification
    e.g. matrix multiply, physical simulation, sort
  Lab 5/6 is closely related to IVY, though much simpler

IVY big picture
  [diagram: M0+pagedMem, M1+pagedMem, LAN]
  Operates on pages of memory, stored in machine DRAM (no memory server)
  Uses VM hardware to intercept reads/writes

Simplified IVY:
  Only one copy of a page at a time (on one machine)
  All other copies marked invalid in VM tables
  If M0 faults (read or write):
    Find the one copy -- e.g. on M1
    Tell M1 to invalidate it in its VM tables
    Fetch the page contents from M1
    M0 marks the page read+write in its VM tables
  Provides sequential consistency:
    Order of reads/writes set by the order in which the page moves
  Slow: what if a page is read by many machines, never written?
    Have to fault + send the page for every read

IVY allows multiple reader copies between writes
  No need to force an order for reads that occur between two writes
  Let them occur concurrently -- a copy of the page at each reader

Thus IVY's core strategy:
  Either
    * multiple read-only copies and no writeable copies, or
    * one writeable copy, no other copies
  => Before a write, invalidate all other copies
  => Must track one writer (owner) and the copies (copy_set)

Why crucial to invalidate all copies before a write?
  Once a write completes, all subsequent reads *must* see the new data
  If one could read stale data, this could occur:
    M0: ST(v)0  ST(v)99  ST(done)1
    --------------------------------------------
    M1:                            LD(v)0  LD(done)1  LD(v)0
  But we know this must not happen with sequential consistency

Message types:
  [don't list these on board, just for reference]
  RQ read query (reader to MGR)
  RF read forward (MGR to owner)
  RD read data (owner to reader)
  RC read confirm (reader to MGR)
  WQ write query (writer to MGR)
  IV invalidate (MGR to copy_set members)
  IC invalidate confirm (copy holder to MGR)
  WF write forward (MGR to owner)
  WD write data (owner to writer)
  WC write confirm (writer to MGR)

==========================

This is a copy of the code in Section 3.1 of Li and Hudak's Memory Coherence
in Shared Virtual Memory Systems (1986), with some bugs fixed. We've stripped
out the code for the case in which the manager takes faults -- in this
version, the manager does not run application code. Messages are delivered
reliably. There are no failures.

Each node has
  struct {
    Lock lock;
    Enum access{nil, read, write};
  } ptable[NUM_PAGES];

The manager also has
  struct {
    Lock lock;
    Int owner;
    Bitmap copy_set[NUM_PROCS];
  } info[NUM_PAGES];

ReadFaultHandler(PageNumber p):
  lock(ptable[p].lock)
  ask manager for read access to p
  wait for someone to send me p's content
  ptable[p].access = read
  send confirmation to manager
  unlock(ptable[p].lock)

ReadServer(PageNumber p, MachineID request_node):
  lock(ptable[p].lock)
  if I am owner of p:
    ptable[p].access = read      # owner downgrades to read-only
    send copy of p to request_node
  unlock(ptable[p].lock)

  if I am manager:
    lock(info[p].lock)
    info[p].copy_set |= request_node
    ask info[p].owner to send copy of p to request_node
    wait for confirmation from request_node
    unlock(info[p].lock)

WriteFaultHandler(PageNumber p):
  lock(ptable[p].lock)
  ask manager for write access to p
  wait for someone to send me p's content
  ptable[p].access = write
  send confirmation to manager
  unlock(ptable[p].lock)

WriteServer(PageNumber p, MachineID request_node):
  lock(ptable[p].lock)
  if I am owner of p:
    send copy of p to request_node
    ptable[p].access = nil       # old owner gives up its copy entirely
  unlock(ptable[p].lock)

  if I am manager:
    lock(info[p].lock)
    send invalidate to each node in info[p].copy_set
    wait for all invalidate confirmations
    info[p].copy_set = empty
    ask info[p].owner to send copy of p to request_node
    info[p].owner = request_node
    wait for confirmation from request_node
    unlock(info[p].lock)

InvalidateServer(PageNumber p):
  # no lock...
  ptable[p].access = nil
  send confirmation to manager

===================================

scenario 1: M0 has a writeable copy, M1 wants to read
  [time diagram: M0, M1]
  0. page fault on M1, since the page must have been marked invalid
  1. M1 sends RQ to MGR
  2. MGR sends RF to M0, MGR adds M1 to copy_set
  3. M0 marks the page as access=read, sends RD to M1
  4. M1 marks access=read, sends RC to MGR

scenario 2: now M2 wants to write
  [time diagram: M0, M1, M2]
  0. page fault on M2
  1. M2 sends WQ to MGR
  2. MGR sends IV to copy_set (i.e. M1)
  3. M1 sends IC msg to MGR
  4. MGR sends WF to M0, sets owner=M2, copy_set={}
  5. M0 sends WD to M2, access=nil
  6. M2 marks the page r/w, sends WC to MGR

what if there were no RC message?
  i.e. MGR unlocked after sending RF?
  could the RF be overtaken by a subsequent WF?
  or does IV/IC + ptable[p].lock hold up any subsequent RF?
    but the invalidate can't acquire the ptable lock -- deadlock?

no IC?
  i.e. MGR didn't wait for holders of copies to ack?

no WC?
  e.g. MGR unlocked after sending WF to M0?
  MGR would send subsequent RF, WF to M2 (the new owner)
  What if such a WF/RF arrived at M2 before the WD?
    No problem! M2 holds ptable[p].lock until it gets the WD
  RC + info[p].lock prevents an RF from being overtaken by a WF
  so it's not clear why WC is needed!
    but I am not confident in this conclusion

what if two machines want to write the same page at the same time?
what if one machine reads just as ownership is changing hands?

does IVY provide strict consistency?
  no: MGR might process two STs in the order opposite to their issue times
  no: a ST may take a long time to revoke read access on other machines,
    so LDs may get old data long after the ST issues

In what situations will IVY perform well?
  1. Page read by many machines, written by none
  2. Page written by just one machine at a time, not used at all by others
  Cool that IVY moves pages around in response to changing use patterns

Will a page size of e.g. 4096 bytes be good or bad?
  good if spatial locality, i.e. the program looks at large blocks of data
  bad if the program writes just a few bytes in a page
    subsequent readers copy the whole page just to get a few new bytes
  bad if false sharing
    i.e. two unrelated variables on the same page
      and at least one is frequently written
    the page will bounce between different machines
    even read-only users of a non-changing variable will get invalidations
      even though those computers never use the same location

What about IVY's performance?
  after all, the point was speedup via parallelism

What's the best we could hope for in terms of performance?
  Nx faster on N machines

What might prevent us from getting Nx speedup?
  Network traffic (moving lots of pages)
  locks
  Many machines writing the same page
  application is inherently non-scalable

How well do they do?
  Figure 4: near-linear for PDE
  Figure 6: very sub-linear for sort
  Figure 7: near-linear for matrix multiply

Why did sort do poorly?
  Here's my guess
  Partitions data over machines
  Phase 1: local sort of 2*N partitions for N machines
  Phase 2: 2N-1 merge-splits; each round sends all data over the network
  Phase 1 probably gets linear speedup
  Phase 2 probably does not -- limited by LAN speed
    also more machines may mean more rounds
  So for a small # of machines, the local sort dominates, and more machines help
  For a large # of machines, communication dominates, and more machines don't help
  Also, more machines shifts the work from an n*log(n) local sort
    toward an n^2 bubble-ish sort

How could one speed up IVY?
  paper suggests splitting up the MGR
  or eliminating the MGR and using broadcast to find pages

next week: relax the consistency model
  allow multiple writers to the same page!
***

Paper intro says DSM subsumes RPC -- is that true?
  When would DSM be better than RPC?
    More transparent. Easier to program.
  When would RPC be better?
    Isolation. Control over communication. Tolerate latency.
    Portability. Define your own semantics.
  Might you still want RPC in your DSM system?
    For efficient sleep/wakeup?

Known oddities in the Section 3.1 pseudo-code
  Fault handlers must wait for the owner to send p before confirming to the manager
  Deadlock if the owner has the page r/o and takes a write fault
  Worrisome that there is no clear lock ordering between ptable[p].lock and info[p].lock
  Write server / manager must set owner=request_node
  Manager parts of the fault handlers don't ask the owner for the page
  Does processing of the invalidate request hold ptable[p].lock?
    probably can't -- deadlock