Shared Memory Multiprocessor

OS for a parallel machine
Memory / interconnection network (ICN) challenges
– NUMA effects
– deep memory hierarchy
– false sharing

Principles
– cache-conscious decisions
– limit sharing of system data structures
– keep memory accesses local
(access paths: CPU – local memory vs. CPU – ICN – remote memory)

CPU issues VPN -> TLB lookup -> miss -> page table lookup -> miss -> locate the page in the file -> I/O into a free page frame -> page table update with (VPN, PFN) -> TLB update -> page fault service complete
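A toy walk-through of that path (a sketch only; the fixed-size arrays and trivial replacement policy are simplifying assumptions, not real kernel structures):

#include <stdio.h>

#define NUM_PAGES 16
#define TLB_SIZE   4

static int page_table[NUM_PAGES];                /* VPN -> PFN, -1 = not present */
static int tlb_vpn[TLB_SIZE], tlb_pfn[TLB_SIZE]; /* tiny direct-mapped TLB */
static int next_free_frame = 0;

int translate(int vpn) {
    for (int i = 0; i < TLB_SIZE; i++)           /* TLB lookup */
        if (tlb_vpn[i] == vpn)
            return tlb_pfn[i];
    if (page_table[vpn] < 0) {                   /* PT lookup -> miss: page fault */
        int pfn = next_free_frame++;             /* "I/O" brings the page into a frame */
        page_table[vpn] = pfn;                   /* PT update with (vpn, pfn) */
    }
    int slot = vpn % TLB_SIZE;                   /* TLB update */
    tlb_vpn[slot] = vpn;
    tlb_pfn[slot] = page_table[vpn];
    return tlb_pfn[slot];                        /* page-fault service complete */
}

int main(void) {
    for (int i = 0; i < NUM_PAGES; i++) page_table[i] = -1;
    for (int i = 0; i < TLB_SIZE; i++)  tlb_vpn[i] = -1;
    printf("vpn 3 -> pfn %d\n", translate(3));
    printf("vpn 3 -> pfn %d (TLB hit this time)\n", translate(3));
    return 0;
}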

Parallel OS + page fault service
Easy scenario
– multiprocess workload: independent threads T1 … Tn running on nodes N1 … Nn
* threads independent
* page tables distinct
* no serialization
Hard scenario
– multithreaded workload: one process with T1, T2, T3, T4 sharing an address space (e.g., T1, T3 on N1 and T2, T4 on N2)
* address space shared
* page table shared
* shared entries in the processors' TLBs

Recipe for a scalable structure in a parallel OS
For every subsystem
– determine the functional needs of that service
– to ensure concurrent execution of the service
* minimize shared data structures
(less sharing -> more scalable)
– where possible, replicate / partition system data structures (see the sketch below)
-> less locking
-> more concurrency
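A minimal sketch of the partition idea, assuming per-node free-frame lists; the node count, list size, and function name are illustrative, not from the notes:

#define NNODE      2      /* number of NUMA nodes (assumed) */
#define FRAMES_PER 4      /* free frames tracked per node (assumed) */

struct free_list {
    int frames[FRAMES_PER];
    int count;
};

static struct free_list per_node[NNODE];   /* one free list per node */

/* Allocate from the local node first; fall back to remote nodes on a miss.
 * A real kernel would still take a per-node lock, but never a global one. */
int alloc_frame(int node) {
    for (int n = 0; n < NNODE; n++) {
        struct free_list *fl = &per_node[(node + n) % NNODE];
        if (fl->count > 0)
            return fl->frames[--fl->count];
    }
    return -1;   /* no free frames anywhere */
}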

Tornado's secret sauce: the clustered object
Object reference: illusion of a single object; under the covers, multiple representations
Degree of clustering
– choice of the implementor of the service
* singleton rep, or one rep per core
* protected procedure calls (PPC) to keep the representations consistent
Page-fault path: TLB miss -> Process object -> Region -> FCM (file cache manager) -> COR -> DRAM manager -> page frame
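A minimal sketch of the clustered-object idea, assuming a counter object with one representation per core; NCPU, the padding, and the names are illustrative assumptions:

#define NCPU 4

/* each representation sits on its own cache line to avoid false sharing */
struct counter_rep { long value; char pad[64 - sizeof(long)]; };

/* one "object reference" for callers, NCPU representations underneath */
struct clustered_counter { struct counter_rep reps[NCPU]; };

void counter_inc(struct clustered_counter *c, int cpu) {
    c->reps[cpu].value++;              /* touch only the local representation */
}

long counter_read(struct clustered_counter *c) {
    long sum = 0;
    for (int i = 0; i < NCPU; i++)     /* consistency handled in software,   */
        sum += c->reps[i].value;       /* only when the full value is needed */
    return sum;
}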

Scheduling

Scheduling – First Principles
Thread T1 alternates between CPU bursts, synchronization, and I/O; the kernel scheduler decides which thread runs next.

Who should the scheduler pick? FCFS, highest static priority, highest dynamic priority, or the thread whose memory contents are in the CPU cache (cache affinity).

Memory hierarchy refresher
CPU -> L1 cache (1-2 cycles) -> L2 cache (~10 cycles) -> memory (~100 cycles)

Cache affinity scheduling
T1 descheduled, then T1 rescheduled; intervening threads on P1 may have polluted the cache in between

FCFS: ignores affinity for fairness
Fixed processor: Ti always runs on P_fixed
Last processor: Ti runs on P_last (where it last ran)

Minimum intervening: schedule Ti on the processor Pj for which the number of intervening threads I_ij is minimum

Implementation issues
Queue based
– global queue
– affinity-based local queues

Ti's priority = BPi + agei + affinityi
– determines Ti's position in the queue
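A sketch of that priority computation; the notes give only the sum, so equal weighting of the three terms is an assumption:

struct thread {
    int base_priority;   /* BPi: static base priority */
    int age;             /* how long Ti has been waiting in the queue */
    int affinity;        /* e.g., higher if Ti last ran on this processor */
};

int effective_priority(const struct thread *t) {
    /* determines Ti's position in the (local or global) run queue */
    return t->base_priority + t->age + t->affinity;
}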

Performance
Figures of merit
– throughput -> system-centric
– response time -> user-centric
– variance in response time -> user-centric

Multicore multithreaded processors

Lightweight RPC

RPC and client-server systems
e.g., a file system structured as a server
RPC: usually across machines
Client and server on the same machine?
– performance vs. safety trade-off

RPC vs. Simple Procedure call
Procedure call
– caller and callee in the same process
– everything set up at compile time
RPC (call – return)
– call: trap into the kernel -> server procedure runs; return: trap back to the client
– everything happens at run time

Copying overhead
client stack -> client stub (marshal) -> RPC message -> kernel buffer -> server domain -> server stub (unmarshal) -> server stack

How to remove the overheads?
– set up (binding) => one-time cost
* server exports the entry point and A-stack size for each call
* client imports s.foo => name server checks, kernel grants a procedure descriptor (PD) and sets up the binding
* the A-stack is shared memory mapped into both the client and server address spaces

Making RPC cheap (actual calls)
client address space -> args onto the A-stack -> results come back via the A-stack
kernel transfers control
-> server address space

Originally: client stack -> RPC msg -> kernel buffer -> server domain -> server stack
Now: client stack -> A-stack (shared) -> server stack: only two copies, with marshal/unmarshal done entirely in user space
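A sketch of the A-stack idea, with the shared mapping stood in for by a static buffer and a hypothetical add() procedure; the kernel trap and domain switch of the real system are elided:

#include <stdio.h>

struct add_args { int a, b, result; };

static struct add_args a_stack;          /* stands in for the shared A-stack */

/* server procedure: works directly off the A-stack, writes the result back */
void server_add(void) {
    a_stack.result = a_stack.a + a_stack.b;
}

/* client stub: copy args onto the A-stack, "trap" to the server, read result */
int rpc_add(int a, int b) {
    a_stack.a = a;                       /* copy: client stack -> A-stack */
    a_stack.b = b;
    server_add();                        /* kernel would switch domains here */
    return a_stack.result;               /* result comes back via the A-stack */
}

int main(void) {
    printf("2 + 3 = %d\n", rpc_add(2, 3));
    return 0;
}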

Queueing Lock

(Figure: the per-lock queue of requesters: the current holder is running, later requesters spin on their own nodes, and a new requester joins at the tail; an empty queue means no request yet.)

One queue per lock
q_node {
  got_it;   // T/F: has the lock been handed to me?
  next;     // my successor in the queue
}
Lock(L, me)
* join L;                        // atomic enqueue at the tail
* await predecessor to signal;   // spin on my own got_it flag
Unlock(L, me)
* remove me from L;
* signal successor;
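A sketch of such a queueing (MCS-style) lock in C11 atomics; the type and function names are illustrative:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;   /* my successor in the queue */
    atomic_bool locked;             /* true while I must keep spinning */
} qnode;

typedef struct {
    _Atomic(qnode *) tail;          /* last requester; NULL means lock free */
} mcs_lock;

void mcs_acquire(mcs_lock *L, qnode *me) {
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    qnode *pred = atomic_exchange(&L->tail, me);   /* join L atomically */
    if (pred != NULL) {
        atomic_store(&pred->next, me);             /* link behind predecessor */
        while (atomic_load(&me->locked))
            ;                                      /* spin on my own flag only */
    }
}

void mcs_release(mcs_lock *L, qnode *me) {
    qnode *succ = atomic_load(&me->next);
    if (succ == NULL) {
        qnode *expected = me;                      /* maybe I am the tail */
        if (atomic_compare_exchange_strong(&L->tail, &expected, NULL))
            return;                                /* no one waiting: lock is free */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                                      /* a joiner is mid-enqueue */
    }
    atomic_store(&succ->locked, false);            /* signal successor */
}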

Barrier Synchronization
T1, T2, … Tn arrive at the barrier
count = N;                 // initialized once
decrement(count);          // atomically, on arrival
if (count == 0)
    count = N;             // last arrival resets the count
else
    while (count > 0);     // spin
* all threads spin except the last one to arrive

Communications
Tree barrier
– each node of the tree has its own count and locksense: count tracks arrivals at the barrier, locksense flips when everyone has arrived
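A sketch of the count + locksense (sense-reversing) combination in C11 atomics; flipping the sense each episode is what makes the barrier safely reusable. The thread count and the thread-local sense flag are assumptions:

#include <stdatomic.h>
#include <stdbool.h>

#define NUM_THREADS 4                     /* participating threads (assumed) */

static atomic_int  count = NUM_THREADS;   /* arrivals still expected */
static atomic_bool locksense = false;

void barrier_wait(bool *my_sense) {
    *my_sense = !*my_sense;                        /* sense for this episode */
    if (atomic_fetch_sub(&count, 1) == 1) {
        atomic_store(&count, NUM_THREADS);         /* last arrival: reset count */
        atomic_store(&locksense, *my_sense);       /* then release everyone */
    } else {
        while (atomic_load(&locksense) != *my_sense)
            ;                                      /* spin on locksense only */
    }
}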

Shared Memory Machine

Dance Hall Architecture
CPUs (each with a private cache) on one side of the interconnection network, memories on the other

Write invalidate
One CPU writes y -> y' in its cache; copies of y in the other caches are invalidated, and main memory is updated later

Scalability
Expectation with more processors
(Figure: performance vs. number of processors; the actual curve falls below the expected one, and the gap is overhead.)

Synchronization
primitives for shared memory programming

Barrier
T1, T2, …Tn

P1: modify struct(A);
P2: wait for the modification; then use struct(A);

Atomic operations
LOCK(L):
    if (L == 0)
        L = 1;            // grab the lock
    else {
        while (L == 1);   // wait for it to be released
        go back;          // retry from the top
    }
// the check and the set must happen atomically, hence the RMW instructions below

unlock(L):
    L = 0;

Atomic RMW instructions
test_and_set(L)
– return the current value in L
– set L to 1

fetch_and_inc(L)
– return the current value in L
– increment L
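One way to express the two primitives with C11 atomics (an assumed mapping, for illustration); both return the old value and update memory in one atomic step:

#include <stdatomic.h>

int test_and_set(atomic_int *loc) {
    return atomic_exchange(loc, 1);      /* old value; location becomes 1 */
}

int fetch_and_inc(atomic_int *loc) {
    return atomic_fetch_add(loc, 1);     /* old value; location incremented */
}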

Scalability metrics: latency, waiting time, contention
Lock algorithms
Barrier algorithms

Naive spinlock: spin directly on test_and_set(L)
Spin-on-read refinement:
Lock(L):
    while (test(L) == locked);                  // spin on the cached value
    if (test_and_set(L) == locked) go back;     // attempt the atomic grab; retry on failure
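A sketch of that spin-on-read ("test-and-test-and-set") lock using C11 atomics; the names are illustrative:

#include <stdatomic.h>
#include <stdbool.h>

typedef atomic_bool ttas_lock_t;        /* false = free, true = locked */

void ttas_lock(ttas_lock_t *L) {
    for (;;) {
        while (atomic_load(L))
            ;                           /* spin on the cached value, no bus traffic */
        if (!atomic_exchange(L, true))
            return;                     /* test-and-set won: lock acquired */
        /* someone else got there first: go back to spinning */
    }
}

void ttas_unlock(ttas_lock_t *L) {
    atomic_store(L, false);
}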

CPU & Device Virtualization

CPU virtualization
* illusion of owning the CPU
* handle program discontinuities
Device virtualization

First part
– illusion of CPU ownership for each VM
Second part (common to full + para virtualization)
– delivering events to the parent guest OS
– address translation on every memory access

Full virtualization
– trap and emulate
– guest -> hypervisor communication is implicit (traps)
– hypervisor -> guest: software interrupts (events)
Paravirtualization
– more opportunity for innovation
– guest -> hypervisor communication is explicit (hypercalls)
– hypervisor -> guest: software interrupts (events)

Data transfer
– implicit (full virtualization) vs. explicit (paravirtualization) => opportunity to innovate

Xen's asynchronous I/O rings
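A sketch of the shared-ring idea behind such asynchronous I/O rings: a producer/consumer ring of request descriptors so guest and backend exchange work without copying the payload. Field names and sizes are assumptions, not Xen's actual layout:

#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 8                       /* number of slots (power of two) */

struct io_request { uint64_t buffer_addr; uint32_t length; uint32_t id; };

struct io_ring {
    struct io_request req[RING_SIZE];
    volatile uint32_t req_prod;           /* advanced by the guest (producer) */
    volatile uint32_t req_cons;           /* advanced by the backend (consumer) */
};

/* guest side: publish a request descriptor if there is room */
bool ring_put(struct io_ring *r, struct io_request rq) {
    if (r->req_prod - r->req_cons == RING_SIZE)
        return false;                     /* ring full */
    r->req[r->req_prod % RING_SIZE] = rq;
    r->req_prod++;                        /* real code would add a memory barrier here */
    return true;
}

/* backend side: consume the next request, if any */
bool ring_get(struct io_ring *r, struct io_request *out) {
    if (r->req_cons == r->req_prod)
        return false;                     /* ring empty */
    *out = r->req[r->req_cons % RING_SIZE];
    r->req_cons++;
    return true;
}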

Measuring Time
CPU, memory, storage, network

Memory Virtualization

Thorny issue
– handling virtual memory, a key OS functionality

Applications
OS memory management subsystem
hardware physical memory

Processes P1, P2, …
Guest OSes (Windows, Linux): see "physical memory"
Hypervisor: manages machine memory
Hardware
Each process is in its own protection domain -> a distinct page table in its guest OS

Virtualized translation
VPN -> (guest PT) -> PPN -> (shadow PT, S-PT) -> MPN
– the S-PT is maintained by the hypervisor
– the composed VPN -> MPN translation is what gets installed into the TLB / hardware page table
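A sketch of that two-level translation with the page tables reduced to plain arrays (an illustration, not how a hypervisor actually stores them):

#include <stdint.h>

#define NUM_PAGES 1024

uint32_t guest_pt[NUM_PAGES];    /* VPN -> PPN, maintained by the guest OS   */
uint32_t shadow_pt[NUM_PAGES];   /* PPN -> MPN, maintained by the hypervisor */

/* the hypervisor effectively installs the composed VPN -> MPN translation
 * into the TLB / hardware page table, skipping the PPN at run time */
uint32_t translate(uint32_t vpn) {
    uint32_t ppn = guest_pt[vpn];
    uint32_t mpn = shadow_pt[ppn];
    return mpn;
}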

Efficient mapping (paravirtualized)
Shift the burden to the guest OS
– it maintains the illusion of contiguous "physical memory"
– which maps to discontiguous hardware (machine) pages

Virtualization

memory systems, data centers, JVM, VirtualBox, IBM VM, Google Glass, cloud computing, Dalvik (Android), VMware Workstation, Inception, and so on.

Platform virtualization
Alize Inc.
apps: app, app, app
guest OSes: Windows, Linux, Windows
Shared hardware

Hypervisors
Native (bare metal)
guest OS, guest OS, guest OS
Hypervisor
Shared hardware

Microkernel Approach

Microkernel-based OS structure
app1, app2, …
File system, memory manager, CPU scheduler: each service in its own address space
Microkernel: simple abstractions only (address spaces, IPC)
Hardware

Border crossings
– implicit + explicit costs
Protected procedure calls (between services)
– ~100x the cost of a normal procedure call

L3 microkernel
Thesis: it's all about efficient implementation
Proof by construction to debunk the myths about microkernel-based OS structure

Strikes against the microkernel
Kernel-user switches
– border crossing cost
Address space switches
– basis of PPC for cross-protection-domain calls

Thread switches + IPC
– kernel mediation for PPC

Memory effects
– locality loss when crossing protection domains (e.g., fs <-> storage services competing for the caches)

Debunking the user-kernel border crossing myth
– empirical proof
L3: border-crossing cost measured in processor cycles
* includes TLB and cache misses

Mach
– 900 cycles

SPIN and Exokernel used Mach as the basis for decrying microkernel-based design

Address space switches
VPN (tag : index) matched against TLB entries (tag, PFN) -> PFN; with address-space-tagged TLBs, no flush is needed on a switch

Large protection domains
– e.g., fs occupies an entire hardware address space; another service (e.g., storage) occupies another
– need a TLB flush on a switch if the TLB is not address-space tagged
– explicit cost << implicit cost; e.g., 864 cycles for a TLB flush on the Pentium

Upshot for address space switching
– small protection domains vs. large protection domains: for large domains the implicit (cache) cost dominates the explicit switching cost

Memory effects
(memory hierarchy: CPU, TLB, L1, L2, L3, memory, VM)

Reasons for Mach's expensive border crossing
– focus on portability -> large memory footprint -> lesser locality -> more cache misses

Exokernel Approach

Decouple authorization from use
– secure bindings
– library OS: the semantics of use live in the library
– exokernel: the library OS asks it for a resource
– hardware: the exokernel binds the library OS to hardware resources

Examples of candidate resources
– TLB entry
* virtual-to-physical mapping done by the library OS
* binding presented to the exokernel
* exokernel installs it into the hardware TLB
* processes in the library OS then use it many times without exokernel intervention
– Packet filter (see the sketch below)
* predicates loaded into the kernel by the library OS
* checked on packet arrival by the exokernel
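A sketch of what such a packet-filter predicate might look like; the "UDP to port 5000" rule and the assumption of an option-less IPv4 header are purely illustrative:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical predicate loaded by a library OS: "UDP packets to port 5000
 * are mine". The exokernel evaluates it on packet arrival and, on a match,
 * delivers the packet to that library OS. */
bool my_packet_filter(const uint8_t *pkt, size_t len) {
    if (len < 24)
        return false;                        /* too short for the fields below */
    uint8_t  proto     = pkt[9];             /* IPv4 protocol field */
    uint16_t dest_port = (uint16_t)((pkt[22] << 8) | pkt[23]);   /* UDP dest port */
    return proto == 17 && dest_port == 5000; /* 17 = UDP */
}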

Implementing secure bindings
– hardware mechanisms
* e.g., TLB entry
– software caching
* a shadow TLB kept in software for each library OS

(Figure: library OS on top of the exokernel, which manages the hardware TLB and CPU.)

CPU scheduling
– a linear vector of "time slots", each slot one time quantum owned by some library OS (sketched below)
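A sketch of the linear time-slot vector; the slot count, the libos_id type, and the example allocation are assumptions:

#define NUM_SLOTS 8

typedef int libos_id;                       /* which library OS owns a slot */

/* each slot is one time quantum; the vector encodes the CPU allocation */
libos_id schedule[NUM_SLOTS] = { 0, 0, 1, 2, 1, 0, 2, 1 };

libos_id next_to_run(int *cursor) {
    libos_id who = schedule[*cursor];
    *cursor = (*cursor + 1) % NUM_SLOTS;    /* advance to the next quantum */
    return who;
}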

Revocation of resources
– space (memory) and time (CPU)
– the exokernel upcalls revoke() into the library OS with a repossession vector naming the resources being taken back