Systems from Components

specify -> code -> optimize -> deploy
IOA, OCaml, NuPrl

digging deeper from spec to implementation

Spring Operating System
-> spring system
Java RMI, EJB

How to innovate OS?
brand new OS, or a better implementation of a known OS?
Marketplace needs
- large, complex server software
=> Take the "Intel inside" approach… in this case "Unix inside" for box makers like Sun Microsystems

Spring Approach
Strong interfaces, open, flexible, and extensible
connects to other machines via network proxies; microkernel (threads/IPC)
all other services outside the kernel

Nucleus – “microkernel” of Spring
Domain, Door table, Door IDs, nucleus, door, target domain
Object invocation across the network
client domain, server domain, proxy A/B, nucleus

client domain -> door -> nucleus B -> front object -> underlying object -> ACL

Sources of overhead in RPC

marshaling
data copying
control transfer
protocol processing

Three copies
-client stub
-kernel buffer
-DMA to controller

Reducing copies?
-marshal into kernel buffer directly
-shared descriptors between client stub and kernel

Routing on the internet
Application
transport
network
link
physical

Active networks: the app payload can carry QoS hints, and the protocol stack is augmented so a packet (capsule) carries IP header + code + payload; routers execute the code to pick the next destination instead of doing a fixed dst lookup.

Potential Apps
protocol independent multicast
reliable multicast
congestion notification
private IP (PIP)
anycasting

Distributed system definition

N1, N2, … Nn connected by a LAN/WAN
(fiber, cable, satellite)
no physical shared memory between nodes
message transmission time is not negligible compared to event computation time

Even a cluster is a distributed system

Beliefs:
- processes are sequential -> events within a process are totally ordered (h->i, f->g, d->e, …)
- a send happens before the corresponding receive (a->b, e->f, …)

Lamport's logical clock
each node knows
* its own events
* its communication events
Lamport's logical clock conditions
* condition 1: a node's own event times increase monotonically: a -> b within a process => Ci(a) < Ci(b)
* condition 2: a receive is timestamped after its matching send: Ci(send) < Cj(receive)
Total order: Lamport's logical clocks, plus PID to break ties
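A minimal C sketch of these rules; the struct and function names are illustrative, not from the lecture.

/* Illustrative sketch of Lamport's clock rules (names are assumptions). */
typedef struct {
	long time;   /* Ci: this node's logical clock */
	int  pid;    /* used only to break ties for the total order */
} lamport_clock_t;

/* Condition 1: a node's own events get monotonically increasing timestamps.
   A send is just a local event whose timestamp rides on the message. */
long local_or_send_event(lamport_clock_t *c) {
	return ++c->time;
}

/* Condition 2: a receive must be timestamped after the send it matches. */
long receive_event(lamport_clock_t *c, long msg_time) {
	c->time = (c->time > msg_time ? c->time : msg_time) + 1;
	return c->time;
}

/* Total order: compare timestamps, break ties with PID. */
int before(long ta, int pa, long tb, int pb) {
	return (ta < tb) || (ta == tb && pa < pb);
}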

Latency: elapsed time
Throughput: events per unit time
Bandwidth: a throughput measure
RPC performance
- hardware overhead
- software overhead
Focus of this lesson: how to reduce software overhead

Components of RPC Latency
1. Client call
2. Controller latency
3. Time on wire
4. Interrupt handling
5. Server setup to execute call

MPI’s interface

MPI_Init(&argc, &argv);
MPI_Barrier(MPI_COMM_WORLD);
MPI_Finalize();
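A complete toy program using just these three calls (assuming an MPI installation and mpicc) could look like this:

/* Toy MPI program: every rank waits at the barrier before printing "after". */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
	int rank;
	MPI_Init(&argc, &argv);                  /* start the MPI runtime */
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);
	printf("rank %d before barrier\n", rank);
	MPI_Barrier(MPI_COMM_WORLD);             /* no rank continues until all ranks arrive */
	printf("rank %d after barrier\n", rank);
	MPI_Finalize();                          /* shut the runtime down */
	return 0;
}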

Optimize Counter

#include <stdlib.h>
#include <stdio.h>
#include <mpi.h>
#include "gtmp.h"

static MPI_Status* status_array;
static int P;

void gtmpi_init(int num_threads){
	P = num_threads;
	status_array = (MPI_Status*) malloc((P - 1)* sizeof(MPI_Status));
}

void gtmpi_barrier(){
	int vpid, i;

	MPI_Comm_rank(MPI_COMM_WORLD, &vpid);

	/* Announce arrival: send a zero-byte message to every other process */
	for(i = 0; i < vpid; i++)
		MPI_Send(NULL, 0, MPI_INT, i, 1, MPI_COMM_WORLD);
	for(i = vpid + 1; i < P; i++)
		MPI_Send(NULL, 0, MPI_INT, i, 1, MPI_COMM_WORLD);

	/* Wait until every other process has announced its arrival */
	for(i = 0; i < vpid; i++)
		MPI_Recv(NULL, 0, MPI_INT, i, 1, MPI_COMM_WORLD, &status_array[i]);
	for(i = vpid + 1; i < P; i++)
		MPI_Recv(NULL, 0, MPI_INT, i, 1, MPI_COMM_WORLD, &status_array[i-1]);
}
void gtmpi_finalize(){
	if(status_array != NULL){
		free(status_array);
	}
}
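The barrier above costs O(P^2) messages per episode, since every process sends to and receives from every other process. One possible optimization, sketched here rather than offered as a definitive solution, funnels arrivals through rank 0 so an episode costs 2(P-1) messages:

/* Sketch: centralized (counter-style) barrier coordinated by rank 0.
   Assumes the same P initialized in gtmpi_init above. */
void gtmpi_barrier_centralized(void){
	int vpid, i;
	MPI_Status status;

	MPI_Comm_rank(MPI_COMM_WORLD, &vpid);

	if (vpid == 0){
		/* arrival phase: collect a notification from every other rank */
		for (i = 1; i < P; i++)
			MPI_Recv(NULL, 0, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
		/* release phase: wake everyone up */
		for (i = 1; i < P; i++)
			MPI_Send(NULL, 0, MPI_INT, i, 1, MPI_COMM_WORLD);
	} else {
		MPI_Send(NULL, 0, MPI_INT, 0, 1, MPI_COMM_WORLD);           /* I have arrived */
		MPI_Recv(NULL, 0, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);  /* wait for release */
	}
}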

Optimize Tree

#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#include "gtmp.h"

/*
From the MCS Paper: A sense-reversing centralized barrier
shared count: integer := P
shared sense : Boolean := true
processor private local_sense : Boolean := true
	local_sense := not local_sense // each processor
	if fetch_and_decrement(&count) = 1
		count := P
		sense := local_sense
	else
		repeat until sense = local_sense
*/
typedef struct _node_t{
	int k;
	int count;
	int locksense;
	struct _node_t* parent;
} node_t;

static int num_leaves;
static node_t* nodes;

void gtmp_barrier_aux(node_t* node, int sense);

node_t* _gtmp_get_node(int i){
	return &nodes[i];
}

void gtmp_init(int num_threads){
	int i, v, num_nodes;
	node_t* curnode;

	/*Setting constants */
	v = 1;
	while( v < num_threads)
		v *= 2;

	num_nodes = v - 1;
	num_leaves = v/2;

	/* Setting up the tree */
	nodes = (node_t*) malloc(num_nodes * sizeof(node_t));

	for(i = 0; i < num_nodes; i++){
		curnode = _gtmp_get_node(i);
		curnode->k = i < num_threads - 1 ? 2 : 1; /* number of arrivals this node waits for */
		curnode->count = curnode->k;
		curnode->locksense = 0;
		curnode->parent = _gtmp_get_node((i-1)/2);
	}

	curnode = _gtmp_get_node(0);
	curnode->parent = NULL;
}

void gtmp_barrier(){
	node_t* mynode;
	int sense;

	mynode = _gtmp_get_node(num_leaves - 1 + (omp_get_thread_num() % num_leaves));
	sense = !mynode->locksense;
	gtmp_barrier_aux(mynode, sense);		
}

void gtmp_barrier_aux(node_t* node, int sense){
	int test;

	/* Atomically read and decrement this node's arrival count */
#pragma omp critical
{
	test = node->count;
	node->count--;
}

	if( 1 == test ){
		/* Last arriver at this node: continue up the tree, then reset and release */
		if(node->parent != NULL)
			gtmp_barrier_aux(node->parent, sense);
		node->count = node->k;
		node->locksense = !node->locksense;
	}
	/* Everyone else spins until the sense flips */
	while(node->locksense != sense);
}
void gtmp_finalize(){
	free(nodes);
}
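For context, a hypothetical driver (not part of the notes' code) exercising this barrier from an OpenMP region could look like:

/* Hypothetical test driver for the tree barrier above. */
#include <omp.h>
#include "gtmp.h"

int main(void){
	int i, num_threads = 8;            /* assumed thread count */
	omp_set_num_threads(num_threads);
	gtmp_init(num_threads);

#pragma omp parallel private(i)
	{
		for (i = 0; i < 1000; i++)
			gtmp_barrier();            /* no thread starts round i+1 before all finish round i */
	}

	gtmp_finalize();
	return 0;
}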

Shared Memory Multiprocessor

OS for a parallel machine
memory/ICN challenges
- NUMA effects
- deep memory hierarchy
- false sharing

Principles
- cache-conscious decisions
- limit shared system data structures
- keep memory accesses local (CPU to local memory, rather than across the ICN to remote memory)

CPU -> VPN -> TLB lookup -> miss -> page table lookup -> miss -> locate file -> I/O -> page frame allocated -> PT updated (VPN, PFN) -> TLB updated -> page fault service complete

Parallel OS + page fault service
Easy scenario
- multiprocess workload (T1 on N1, …, Tn on Nn)
* threads independent
* page tables distinct
* no serialization
Hard scenario
- multithreaded workload: one process, threads share an address space (e.g. T1, T3 on N1; T2, T4 on N2)
* address space shared
* page table shared
* shared entries in processor TLBs

Recipe for scalable structure in a parallel OS
For every subsystem:
- determine the functional needs of that service
- to ensure concurrent execution of the service:
* minimize shared data structures
  (less sharing -> more scalable)
* where possible, replicate/partition system data structures (see the counter sketch below)
  -> less locking
  -> more concurrency
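As a concrete illustration (mine, not the lecture's) of replicating/partitioning a system data structure, a per-CPU counter avoids a single contended lock:

/* Sketch: a partitioned, per-CPU statistics counter. NCPUS and the cpu index
   passed by the caller are assumptions for illustration. */
#define NCPUS 8
#define CACHE_LINE 64

struct percpu_counter {
	long value;
	char pad[CACHE_LINE - sizeof(long)];   /* padding keeps slots on separate cache lines (no false sharing) */
};

static struct percpu_counter counters[NCPUS];

/* Each CPU touches only its own slot: no lock, no cache-line ping-pong. */
void count_event(int cpu){
	__atomic_fetch_add(&counters[cpu].value, 1, __ATOMIC_RELAXED);
}

/* Reads sum the slots; the total may be slightly stale but needs no lock. */
long read_count(void){
	long total = 0;
	int i;
	for (i = 0; i < NCPUS; i++)
		total += __atomic_load_n(&counters[i].value, __ATOMIC_RELAXED);
	return total;
}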

Tornado's secret sauce: clustered object
object reference: illusion of a single object; under the covers, multiple representations
Degree of clustering
- choice of the implementor of the service
* singleton rep, one rep per core, …
* protected procedure call (PPC) for consistency
Page fault path through Tornado's objects: TLB miss -> Process object -> Region -> FCM -> COR/DRAM manager -> page frame

Scheduling

Scheduling – First Principles
A thread (T1) computes on the CPU, synchronizes, does I/O; the kernel scheduler picks what runs next.

Choices: FCFS, highest static priority, highest dynamic priority, thread whose memory contents are in the CPU cache

Memory hierarchy refresher
CPU -> L1 cache (1-2 cycles) -> L2 cache (~10 cycles) -> memory (~100 cycles)

Cache affinity scheduling
T1 descheduled, later T1 rescheduled
intervening threads on P1 may have polluted the cache

FCFS: ignores affinity for fairness
Fixed processor: Ti always on the same fixed P
Last processor: Ti on the P where it last ran

Minimum intervening: schedule Ti on the Pj with the fewest intervening threads (minimum Iij)

implementation issues
queue based
-global queue
-affinity-based local queues

Ti's priority = BPi + agei + affinityi
- determines Ti's position in the queue
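A toy rendering of this priority rule (the field names and equal weights are assumptions, not from the lecture):

/* Toy priority computation for queue placement. */
struct sched_info {
	int base_priority;   /* BPi */
	int age;             /* time spent waiting in the queue */
	int affinity;        /* e.g. cache warmth on this processor */
};

int effective_priority(const struct sched_info *t){
	/* Larger value => placed closer to the head of the queue. */
	return t->base_priority + t->age + t->affinity;
}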

Performance
Figures of merit
-throughput -> system centric
-response time -> user centric
-variance -> user centric

Multicore multithreaded processors

Lightweight RPC

RPC and client-server systems
e.g. File System
RPC: usually remote
client-server on same machine?
-performance vs. safety

RPC vs. simple procedure call
Procedure call: caller and callee live in the same process; everything is resolved at compile time.
RPC: the call traps into the kernel, which transfers control to the server procedure; the return traps back. Everything happens at run time.

copying overhead
client stack -> client stub -> RPC msg -> kernel buffer -> server domain -> server stub -> server stack

How to remove overheads?
- set up (binding) => one-time cost
  client imports s.foo via the name server; the kernel grants access and sets up a procedure descriptor (PD) holding the entry point, A-stack size, etc.
  the A-stack is mapped shared between the client and server address spaces

Making RPC cheap (actual calls)
client address space: args copied into the A-stack; result copied back out
kernel transfers control to the server address space

Originally: client stack -> RPC msg -> kernel buffer -> server domain -> server stack

Now: client stack -> shared A-stack -> server stack: only two copies (marshal in, unmarshal out)

Queueing Lock

(queue states: no request yet; curr running while me spinning behind it; me running; new requester joins)

List for each lock
qnode {
  got_it; // T/F
  next;
}
lock(L, me)
* join L;                       // atomic
* await predecessor to signal;  // spin
unlock(L, me)
* remove me from L;
* signal successor;
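A C sketch of this queueing (MCS-style) lock using GCC's atomic builtins; it illustrates the pseudocode above and is not code from the lecture:

/* Each lock is the tail pointer of a queue of qnodes; each waiter spins on
   its own got_it flag, so spinning causes no traffic on a shared location. */
typedef struct qnode {
	struct qnode *next;
	int got_it;                      /* set by the predecessor when it hands over the lock */
} qnode_t;

void q_lock(qnode_t **L, qnode_t *me){
	me->next = NULL;
	me->got_it = 0;
	/* join L atomically: swap ourselves in as the new tail */
	qnode_t *pred = __atomic_exchange_n(L, me, __ATOMIC_ACQ_REL);
	if (pred != NULL){
		pred->next = me;                                  /* link behind the predecessor */
		while (!__atomic_load_n(&me->got_it, __ATOMIC_ACQUIRE))
			;                                             /* spin, waiting for the signal */
	}
}

void q_unlock(qnode_t **L, qnode_t *me){
	if (me->next == NULL){
		/* no visible successor: try to swing the tail back to empty */
		qnode_t *expected = me;
		if (__atomic_compare_exchange_n(L, &expected, NULL, 0,
		                                __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE))
			return;
		/* a newcomer grabbed the tail; wait until it links itself in */
		while (__atomic_load_n(&me->next, __ATOMIC_ACQUIRE) == NULL)
			;
	}
	__atomic_store_n(&me->next->got_it, 1, __ATOMIC_RELEASE);  /* signal successor */
}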

Barrier Synchronization (T1, T2, … Tn)
count = N;             // init
decrement(count);      // atomically, on arrival
if (count == 0)
    count = N;         // last arriver resets the count
else
    while (count > 0); // spin by all except the last one
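Reusing the same count across consecutive barrier episodes is racy; the usual fix, also used in the tree code earlier in these notes, is sense reversal. A sketch with GCC atomics (the variable names are illustrative):

/* Sense-reversing counting barrier: initialize count = N and sense = 0,
   and give each thread a private local_sense initialized to 0. */
static int count;               /* threads yet to arrive this episode */
static int sense;               /* flips once per episode */
static int N;                   /* total number of threads */

void barrier(int *local_sense){
	*local_sense = !*local_sense;                       /* each thread flips its private sense */
	if (__atomic_fetch_sub(&count, 1, __ATOMIC_ACQ_REL) == 1){
		count = N;                                      /* last arriver resets the count ... */
		__atomic_store_n(&sense, *local_sense, __ATOMIC_RELEASE);  /* ... then releases everyone */
	} else {
		while (__atomic_load_n(&sense, __ATOMIC_ACQUIRE) != *local_sense)
			;                                           /* spin until this episode's release */
	}
}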

Communication: tree barrier
count, locksense at each node: count records arrivals at the barrier; locksense flips to release the waiters

Shared Memory Machine

Dance Hall Architecture
CPU, cache, interconnection network, memory

Write invalidate
when a CPU writes y -> y' in its cache, copies of y in the other caches are invalidated

Scalability
Expectation: performance grows with the number of processors; actual performance falls short of the expected curve because of overhead.

Synchronization
primitives for shared memory programming

Barrier
T1, T2, …Tn

P1:modify struct(A)
P2:wait for mod; use struct(A);

Atomic operations
lock(L):
    if (L == 0)
        L = 1;
    else
        while (L == 1);  // wait
        go back;

unlock(L):
    L = 0;

The check of L and the set of L in lock() must execute as one atomic step, which is why hardware provides atomic RMW instructions.

Atomic RMW instructions
test_and_set(L)
- return the current value in memory location L
- set L to 1

fetch_and_inc(L)
- return the current value in memory location L
- increment L
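One way these primitives map onto GCC's atomic builtins (a sketch; real hardware exposes them as single instructions):

/* test_and_set: atomically read the old value and write 1. */
int test_and_set(int *L){
	return __atomic_exchange_n(L, 1, __ATOMIC_ACQ_REL);
}

/* fetch_and_inc: atomically read the old value and add 1. */
int fetch_and_inc(int *L){
	return __atomic_fetch_add(L, 1, __ATOMIC_ACQ_REL);
}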

Metrics: latency, waiting time, contention
Lock algorithms
Barrier algorithms

Naive spinlock (spin on test-and-set)
lock(L):
    while (test_and_set(L) == locked);   // every spin iteration is an atomic RMW on the shared lock

Spin on read (test-and-test-and-set)
lock(L):
    while (L == locked);                 // spin on the cached copy
    if (test_and_set(L) == locked) go back;
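A C sketch of the spin-on-read variant, building on the test_and_set shown earlier (the lock values 0 = free, 1 = locked are assumptions for illustration):

void spin_lock(int *L){
	for (;;){
		while (__atomic_load_n(L, __ATOMIC_RELAXED) == 1)
			;                                            /* test: spin on the (cached) value */
		if (__atomic_exchange_n(L, 1, __ATOMIC_ACQUIRE) == 0)
			return;                                      /* test_and_set won the lock */
		/* lost the race: go back to spinning */
	}
}

void spin_unlock(int *L){
	__atomic_store_n(L, 0, __ATOMIC_RELEASE);
}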