Scheduling

Scheduling – First Principles
Thread (T1): runs on the CPU until it blocks for synchronization or I/O; the kernel scheduler then decides who runs next.

Choices for who runs next: FCFS, highest static priority, highest dynamic priority, or the thread whose memory contents are still in the CPU cache.

Memory hierarchy refresher
CPU -> L1 cache (1-2 cycles) -> L2 cache (~10 cycles) -> memory (~100 cycles)

cache affinity scheduling
T1 descheduled from P1, later rescheduled
intervening threads that ran on P1 in the meantime may have displaced T1's cache contents

FCFS: ignores affinity in favor of fairness
Fixed processor: Ti always runs on the same processor P_fixed
Last processor: Ti prefers P_last, the processor it last ran on

Minimum intervening: schedule Ti on the processor Pj whose intervening count I_ij (threads that ran on Pj since Ti last did) is minimum

Implementation issues
queue based
- global queue
- affinity-based local queues

Ti's priority = BPi (base priority) + agei + affinityi
- determines Ti's position in the queue (see the sketch below)
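A minimal C sketch of this priority rule for an affinity-based local queue; the struct fields and the equal weighting of the three terms are assumptions, not prescribed by the notes:

    #include <stddef.h>

    /* Hypothetical per-thread record for an affinity-based local queue. */
    struct thread {
        int base_priority;   /* BPi: static priority assigned at creation       */
        int age;             /* agei: grows while the thread waits in the queue */
        int affinity;        /* affinityi: higher if this CPU still caches Ti   */
        struct thread *next;
    };

    /* Priority used to order the queue: BPi + agei + affinityi. */
    static int priority(const struct thread *t)
    {
        return t->base_priority + t->age + t->affinity;
    }

    /* Pick the highest-priority thread from one processor's local queue. */
    struct thread *pick_next(struct thread *head)
    {
        struct thread *best = head;
        for (struct thread *t = head; t != NULL; t = t->next)
            if (priority(t) > priority(best))
                best = t;
        return best;
    }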

Performance
Figures of merit
- throughput -> system centric
- response time -> user centric
- variance in response time -> user centric

Multicore multithreaded processors

Lightweight RPC

RPC and client-server systems
e.g., a file system service
RPC is usually between machines; what if client and server are on the same machine?
- trade-off: performance vs. safety (separate protection domains)

RPC vs. Simple Procedure call
Procedure call
caller and callee live in the same process; everything is resolved at compile time
RPC: call - return across protection domains
the call traps into the kernel, which dispatches to the server procedure
the return traps back to the caller
all of this happens at run time

copying overhead (four copies in each direction)
client stack -> client stub (marshal into RPC msg) -> kernel buffer -> server domain -> server stub (unmarshal) -> server stack

How to remove overheads?
- setup (binding) => one-time cost
  client imports s.foo; through the name server, the kernel validates the import and grants a procedure descriptor (PD): entry point, A-stack size, number of simultaneous calls allowed
  the A-stack is shared memory mapped into both the client and the server address space

Making RPC cheap (the actual call)
client address space -> args copied onto the shared A-stack -> results returned the same way
kernel -> switches the calling thread into the server address space

Original: client stack -> RPC msg -> kernel buffer -> server domain -> server stack

Now: client stack -> shared A-stack -> server stack: two copies in all, the marshal into and the unmarshal out of the A-stack (sketch below)
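A small C sketch of this call path; the A-stack layout and the function names are illustrative, not the actual LRPC interface:

    #include <string.h>

    /* Illustrative A-stack: a region mapped into BOTH the client's and the
     * server's address space at bind time (field name and size are assumptions). */
    struct astack {
        char data[512];
    };

    /* Client side, copy #1: marshal arguments from the client stack straight into
     * the shared A-stack, then trap to the kernel, which only switches the thread
     * into the server's protection domain (no intermediate kernel buffer). */
    void client_call(struct astack *as, const void *args, size_t len)
    {
        memcpy(as->data, args, len);      /* copy #1: client stack -> A-stack */
        /* trap(pd);  hypothetical kernel trap carrying the procedure descriptor */
    }

    /* Server side, copy #2: unmarshal from the A-stack onto the server stack
     * (or use the arguments in place) and run the procedure; results go back
     * through the same A-stack on the return path. */
    void server_entry(struct astack *as, void *local, size_t len)
    {
        memcpy(local, as->data, len);     /* copy #2: A-stack -> server stack */
    }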

Queueing Lock

(Diagram: per-lock queue of requesters: no request yet; current holder running; me spinning behind my predecessor; a new requester joining at the tail.)

One queue per lock
q_node {
  got_it;  // T/F: has my predecessor handed me the lock?
  next;    // my successor in the queue
}
Lock(L, me)
* join L;                          // atomic (swing the tail pointer to me)
* await predecessor to signal me;  // spin on my own got_it flag
Unlock(L, me)
* remove me from L;
* signal successor;
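A C11 sketch of this queueing lock in the MCS style the notes describe: one qnode per requester, an atomic join on the tail, and each waiter spinning only on its own got_it flag (sequentially consistent atomics for clarity):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* One qnode per thread per lock attempt, as in the notes. */
    typedef struct qnode {
        _Atomic(struct qnode *) next;
        atomic_bool got_it;
    } qnode;

    typedef struct {
        _Atomic(qnode *) tail;   /* last requester; NULL means "no request yet" */
    } qlock;

    void lock(qlock *L, qnode *me)
    {
        atomic_store(&me->next, NULL);
        atomic_store(&me->got_it, false);
        /* join L atomically: swing the tail pointer to me */
        qnode *pred = atomic_exchange(&L->tail, me);
        if (pred != NULL) {
            atomic_store(&pred->next, me);          /* link behind predecessor */
            while (!atomic_load(&me->got_it))       /* spin on MY flag only    */
                ;
        }
    }

    void unlock(qlock *L, qnode *me)
    {
        qnode *succ = atomic_load(&me->next);
        if (succ == NULL) {
            /* no visible successor: try to reset the tail to "no request" */
            qnode *expected = me;
            if (atomic_compare_exchange_strong(&L->tail, &expected, NULL))
                return;
            /* a new requester is joining; wait for it to link itself in */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store(&succ->got_it, true);          /* signal successor */
    }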

Barrier Synchronization
T1, T2, ... Tn
count = N;             // init
decrement(count);      // atomically
if (count == 0)
  count = N;           // last arriver resets for the next use
else
  while (count > 0);   // wait
* spin by all except the last thread to arrive (sense-reversal sketch below)
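As written, this counting barrier has a reuse hazard: the last arriver can reset count to N before a slow spinner has noticed it reach 0. A minimal C11 sketch of the usual fix, sense reversal (the same count/locksense idea the tree barrier below uses); the names and the thread count are mine:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define N 4                      /* number of participating threads */

    static atomic_int  count = N;
    static atomic_bool sense = true; /* shared "locksense", flips each episode */

    /* Each thread keeps a private local_sense, initially true (== sense). */
    void barrier(bool *local_sense)
    {
        *local_sense = !*local_sense;            /* which episode am I waiting for? */
        if (atomic_fetch_sub(&count, 1) == 1) {  /* I am the last arriver */
            atomic_store(&count, N);             /* reset for the next episode */
            atomic_store(&sense, *local_sense);  /* release everyone else */
        } else {
            while (atomic_load(&sense) != *local_sense)
                ;                                /* spin on the sense flag */
        }
    }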

Communication
Tree barrier
- count and locksense at each node: count records arrivals at the barrier; locksense flips to release the waiters

Shared Memory Machine

Dance Hall Architecture
CPUs (each with a cache) on one side of an interconnection network; the memory modules on the other

Write invalidate
a CPU writes y -> y' in its cache; the other cached copies of y are invalidated before the write reaches main memory

Scalability
Expectation with more processors: performance should keep growing.
(Graph: performance vs. number of processors; the actual curve falls below the expected one, and the gap is overhead.)

Synchronization
primitives for shared memory programming

Barrier
T1, T2, …Tn

P1: modify struct(A)
P2: wait for the modification; then use struct(A)

Atomic operations
LOCK(L):
if (L == 0)
  L = 1;          // grab the lock
else
  while (L == 1); // wait
  go back;        // then retry from the top

unlock(L):
L = 0;

the test and the set must be one atomic step; otherwise two threads can both see L == 0 and both take the lock, hence the atomic RMW instructions below

atomic RMW instructions
test-and-set(L)
* return the current value in memory location L
* set L to 1

fetch-and-inc(L)
* return the current value in memory location L
* increment L
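What these two primitives guarantee, expressed with C11 atomics; the hardware exposes them as single instructions (e.g., an atomic exchange or add), so the function wrappers here are purely illustrative:

    #include <stdatomic.h>

    /* test_and_set(L): atomically return the old value of *L and set it to 1. */
    int test_and_set(atomic_int *L)
    {
        return atomic_exchange(L, 1);
    }

    /* fetch_and_inc(L): atomically return the old value of *L and increment it. */
    int fetch_and_inc(atomic_int *L)
    {
        return atomic_fetch_add(L, 1);
    }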

Latency, waiting time, and contention: the figures of merit for
lock algorithms
barrier algorithms

Naive Spinlock
Lock(L):
while (test_and_set(L) == locked);   // spin directly on the atomic T+S
Spin-on-read refinement:
while (test(L) == locked);           // spin on an ordinary read of the cached copy
if (test_and_set(L) == locked) go back;
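The two variants above as compilable C11 code; the LOCKED/UNLOCKED encoding is an assumption:

    #include <stdatomic.h>

    #define UNLOCKED 0
    #define LOCKED   1

    /* Naive spinlock: every spin iteration is an atomic RMW (bus traffic). */
    void spin_lock_naive(atomic_int *L)
    {
        while (atomic_exchange(L, LOCKED) == LOCKED)
            ;
    }

    /* Spin-on-read refinement: spin on the cached copy with plain loads, and
     * only try test-and-set when the lock looks free; "go back" on failure. */
    void spin_lock(atomic_int *L)
    {
        for (;;) {
            while (atomic_load(L) == LOCKED)             /* while (test(L) == locked) */
                ;
            if (atomic_exchange(L, LOCKED) == UNLOCKED)  /* T+S; else go back         */
                return;
        }
    }

    void spin_unlock(atomic_int *L)
    {
        atomic_store(L, UNLOCKED);
    }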

CPU & Device Virtualization

CPU virtualization
* illusion for each VM of owning the CPU
* handle program discontinuities (interrupts, exceptions, system calls)
Device virtualization

First part
- illusion of ownership of the CPU for each VM
Second part (common to full and para virtualization)
- delivering events to the parent guest OS
- address translation on every memory access

Full virtualization
- trap and emulate
- implicit: guest -> hypervisor (privileged instructions trap into the hypervisor)
- software interrupts (events): hypervisor -> guest
Para virtualization
- more opportunity for innovation
- explicit: guest -> hypervisor (hypercalls)
- software interrupts: hypervisor -> guest

Data transfer
– implicit, explicit => opportunity to innovate

Xen’s async i/o rings
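A much-simplified C sketch of the shared-ring idea: guest and driver domain exchange request/response descriptors through producer/consumer indices in a shared page and notify each other with lightweight events instead of copying data through the hypervisor. The layout and names here are illustrative only; the real Xen rings use grant references, event channels, and different structures.

    #include <stdint.h>

    #define RING_SIZE 64            /* power of two */

    struct io_desc { uint64_t id, addr, len; };

    struct io_ring {
        volatile uint32_t req_prod, req_cons;   /* request  indices */
        volatile uint32_t rsp_prod, rsp_cons;   /* response indices */
        struct io_desc req[RING_SIZE];
        struct io_desc rsp[RING_SIZE];
    };

    /* Guest side: enqueue a request if there is room, then (elsewhere) notify
     * the backend with a lightweight event instead of copying any data. */
    int guest_post_request(struct io_ring *r, struct io_desc d)
    {
        if (r->req_prod - r->req_cons == RING_SIZE)
            return -1;                               /* ring full */
        r->req[r->req_prod % RING_SIZE] = d;
        __atomic_thread_fence(__ATOMIC_RELEASE);     /* descriptor before index */
        r->req_prod++;
        return 0;
    }

    /* Backend (driver domain) side: drain outstanding requests. */
    int backend_next_request(struct io_ring *r, struct io_desc *out)
    {
        if (r->req_cons == r->req_prod)
            return -1;                               /* nothing pending */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);
        *out = r->req[r->req_cons % RING_SIZE];
        r->req_cons++;
        return 0;
    }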

Measuring Time
- accurate accounting of CPU, memory, storage, and network usage

Memory Virtualization

Thorny issue
- handling virtual memory: the key functionality to get right

Applications
OS memory management subsystem
hardware physical memory

p1, p2 (processes inside the guests)
Windows, Linux (guest OSes): each sees its own "physical memory"
Hypervisor: machine memory
Hardware
each process in its own protection domain -> a distinct page table in its respective guest OS

virtualized
VPN -(guest PT)-> PPN -(S-PT)-> MPN
the shadow page table (S-PT) is updated by the hypervisor
the resulting VPN -> MPN translation is what gets installed in the TLB / hardware page table
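A toy C sketch of the two-step translation the hypervisor maintains; the single-level arrays, sizes, and function names are illustrative only (real page tables are multi-level):

    #include <stdint.h>

    #define NPAGES 1024   /* toy sizes */

    /* Guest OS page table: VPN -> PPN ("physical" as the guest sees it). */
    static uint32_t guest_pt[NPAGES];
    /* Hypervisor's per-VM map: PPN -> MPN (real machine frames). */
    static uint32_t ppn_to_mpn[NPAGES];
    /* Shadow page table kept by the hypervisor: VPN -> MPN.
     * This is what gets installed in the hardware PT / TLB. */
    static uint32_t shadow_pt[NPAGES];

    /* Invoked (e.g., on a trapped guest PT update) to keep the S-PT in sync:
     * compose the two mappings so the hardware translates VPN -> MPN directly. */
    void update_shadow(uint32_t vpn)
    {
        uint32_t ppn = guest_pt[vpn];
        shadow_pt[vpn] = ppn_to_mpn[ppn];
    }

    /* What the hardware effectively does on a memory access. */
    uint64_t translate(uint32_t vpn, uint32_t offset)
    {
        return (uint64_t)shadow_pt[vpn] * 4096u + offset;   /* MPN * page size + offset */
    }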

Efficient Mapping
shift the burden to the guest OS
- maintain the illusion of contiguous "physical memory"
- map it to discontiguous hardware pages

Virtualization

memory systems, data centers, the JVM, VirtualBox, IBM VM, Google Glass, cloud computing, Dalvik (Android), VMware Workstation, Inception, and so on.

Platform virtualization
Alize Inc.
app | app | app
Windows | Linux | Windows (guest OSes)
Shared Hardware

Hypervisors
Native(bare metal)
guest os, guest os, guest os
Hypervisor
Shared Hardware

Microkernel Approach

Microkernel-based OS structure
app1, app2, ...
file system, memory manager, cpu scheduler: each system service in its own address space
microkernel: simple abstractions only (address spaces, IPC)
Hardware

Border crossings
- implicit + explicit costs
Protected procedure calls (PPC)
- roughly 100x the cost of a normal procedure call

L3 microkernel
Thesis: it is all about efficient implementation
proof by construction to debunk the myths about microkernel-based OS structure

Strikes Against Microkernel
Kernel-user switches
- border crossing cost
Address space switches
- the basis for PPC across protection domains

Thread switches + IPC
- kernel mediation for PPC

Memory effects
- loss of locality (e.g., FS <-> storage modules living in separate address spaces)

Debunking the User-Kernel Border Crossing Myth
- empirical proof
L3: border crossing cost measured in processor cycles
* includes TLB and cache misses

Mach
- 900 cycles

SPIN and Exokernel used Mach as the basis for decrying microkernel-based design

Address space switches
TLB lookup: the VPN (tag : index) is matched against the TLB entries (tag, PFN); a hit yields the PFN

Large protection domains
- e.g., the file system gets its own hardware address space
- need a TLB flush on switch if the TLB is not address-space tagged
- hardware address space switch
- explicit cost << implicit cost; e.g., 864 cycles for a TLB flush on the Pentium

Upshot for address space switching
- small protection domains: switching can be made cheap
- large protection domains: the explicit switch cost is dwarfed by the implicit cost (lost cache and TLB contents)

Memory effects
- memory hierarchy: VM, memory, L3, L2, L1, CPU (TLB)

Reasons for Mach's expensive border crossing
- focus on portability
- large memory footprint, less locality, more cache misses

Exokernel Approach

Decouple authorization from use
via secure bindings
library OS: the semantics of use stay in the library
exokernel: library OS asks for a resource
hardware: exokernel binds the library OS to the hardware resource

Examples of candidate resources
- TLB entry
* virtual-to-physical mapping computed by the library OS
* binding presented to the exokernel
* exokernel puts it into the hardware TLB
* the process in the library OS then uses it multiple times without exokernel intervention
- Packet filter
* predicates loaded into the kernel by the library OS
* checked on packet arrival by the exokernel

Implementing secure bindings
- hardware mechanisms
* e.g., a TLB entry
- software caching
* a shadow TLB in software for each library OS (sketch below)
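A toy C sketch of such a per-library-OS software (shadow) TLB, as one way an exokernel might cache validated bindings across context switches; it is direct-mapped with made-up fields, not the actual Exokernel data structure:

    #include <stdint.h>

    #define STLB_ENTRIES 64

    struct stlb_entry { uint32_t vpn, pfn; uint8_t valid; };

    /* One shadow TLB per library OS: saved/restored around context switches
     * so the library OS does not take a burst of misses after every switch. */
    struct shadow_tlb { struct stlb_entry e[STLB_ENTRIES]; };

    /* Record a validated (secure) binding for this library OS. */
    void stlb_insert(struct shadow_tlb *t, uint32_t vpn, uint32_t pfn)
    {
        struct stlb_entry *s = &t->e[vpn % STLB_ENTRIES];  /* direct-mapped */
        s->vpn = vpn; s->pfn = pfn; s->valid = 1;
    }

    /* Lookup used when (re)loading the hardware TLB for this library OS. */
    int stlb_lookup(const struct shadow_tlb *t, uint32_t vpn, uint32_t *pfn)
    {
        const struct stlb_entry *s = &t->e[vpn % STLB_ENTRIES];
        if (s->valid && s->vpn == vpn) { *pfn = s->pfn; return 1; }
        return 0;   /* miss: the library OS must re-present the mapping */
    }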

Library OS
Exokernel
TLB, CPU

CPU scheduling
- a linear vector of "time slots"; each library OS gets its time quantum in the vector

Revocation of resources
applies to both space (memory) and time (CPU)
exokernel -> library OS: revoke(repossession vector) upcall, listing the resources being taken back

Monolithic Structure

App1, App2, ... Appn -> each app in its own hardware address space
OS services and device drivers -> the OS in its own hardware address space
Hardware -> managed by the OS

DOS-like
Microkernel-based
Monolithic

SPIN mechanisms for events
event-based communication model
an event is raised to the SPIN dispatcher, which invokes the registered event handlers (C sketch after the example below)

Example (protocol stack): an Ethernet or ATM packet handler raises an "IP packet arrival" event; the IP handler dispatches to ICMP, UDP, and TCP handlers, which in turn serve ping, RPC, packet forwarding, HTTP, and so on.
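SPIN itself expresses this in Modula-3, with interfaces enforced at compile time; purely to illustrate the event/handler dispatch pattern, here is a small C sketch with made-up names:

    #include <stdio.h>

    #define MAX_HANDLERS 8

    typedef void (*handler_fn)(void *pkt);

    struct event {
        const char *name;                 /* e.g. "ip_pkt_arrival" */
        handler_fn  handlers[MAX_HANDLERS];
        int         nhandlers;
    };

    /* An extension registers a handler on an event it cares about. */
    void register_handler(struct event *e, handler_fn h)
    {
        if (e->nhandlers < MAX_HANDLERS)
            e->handlers[e->nhandlers++] = h;
    }

    /* The dispatcher: raising an event invokes every handler mapped to it. */
    void raise_event(struct event *e, void *pkt)
    {
        for (int i = 0; i < e->nhandlers; i++)
            e->handlers[i](pkt);
    }

    static void tcp_handler(void *pkt) { printf("tcp handled %s\n", (char *)pkt); }
    static void udp_handler(void *pkt) { printf("udp handled %s\n", (char *)pkt); }

    int main(void)
    {
        /* e.g. the Ethernet packet handler raises "ip_pkt_arrival"... */
        struct event ip_arrival = { .name = "ip_pkt_arrival" };
        register_handler(&ip_arrival, tcp_handler);
        register_handler(&ip_arrival, udp_handler);

        char pkt[] = "payload";
        raise_event(&ip_arrival, pkt);    /* ...and every registered handler runs */
        return 0;
    }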

Default core services in SPIN
Memory management
- physical address
* allocate, deallocate, reclaim
- virtual address
* allocate, deallocate
- translation
* create/destroy address space, add/remove mapping
- event handlers
* page fault, access fault, bad address
CPU scheduling
- SPIN abstraction: strand
* semantics defined by the extension
- event handlers
* block, unblock, checkpoint, resume
- SPIN global scheduler
* interacts with the application-level (extension) schedulers