Shared memory design considerations

-different APIs/mechanisms for synchronization
-OS provides the shared memory, then stays out of the way
-data passing/sync protocols are up to the programmer

large segment => manager for allocating/freeing mem from shared segment
many small segments => use pool of segments, queue of segment ids
=> communicate segment IDs among processes

synchronization is like … waiting for a coworker to finish so you can continue working
-may repeatedly check to continue
-may wait for a signal to continue
-waiting hurts performance

mutexes
condition variables
why more?
– error-prone / correctness / ease-of-use
– lack of expressive power

Low level support
-hardware atomic instructions

Spinlocks (basic sync construct)
a spinlock is like a mutex
-mutual exclusion
-lock and unlock (free)

spinlock_lock(s);
	// critical section
spinlock_unlock(s);
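
a minimal sketch of how the lock itself can be built on a hardware atomic test-and-set instruction (using C11 atomics; the spinlock_t layout is illustrative, not a standard API):

#include <stdatomic.h>

typedef struct { atomic_flag busy; } spinlock_t;
spinlock_t lock = { ATOMIC_FLAG_INIT };   // lock starts out free

void spinlock_lock(spinlock_t *s) {
	// test-and-set atomically sets the flag and returns its old value;
	// repeatedly check (spin) until the old value was "free"
	while (atomic_flag_test_and_set(&s->busy))
		; // busy-waiting burns CPU cycles, as noted above
}

void spinlock_unlock(spinlock_t *s) {
	atomic_flag_clear(&s->busy); // mark the lock free
}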

semaphores
-common sync construct in OS kernels
-like a traffic light: STOP and GO
-similar to a mutex… but more general

on init
– assigned a max value (positive int)
on try (wait)
– if non-zero => decrement and proceed; if zero => block
if initialized with 1
– semaphore == mutex (binary semaphore)
on exit (post)
– increment

#include <semaphore.h>

sem_t sem;
sem_init(sem_t *sem, int pshared, int count);
sem_wait(sem_t *sem);
sem_post(sem_t *sem);
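
e.g., a minimal usage sketch (unnamed POSIX semaphore shared between threads of one process; count 1 makes it behave like a mutex):

#include <semaphore.h>
#include <stdio.h>

sem_t sem;

int main(void) {
	sem_init(&sem, 0, 1); // pshared=0: threads of this process; count=1: binary semaphore
	sem_wait(&sem);       // count was non-zero => decrement and proceed
	printf("in critical section\n");
	sem_post(&sem);       // increment on exit; wakes a waiter if any
	sem_destroy(&sem);
	return 0;
}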

POSIX shared memory API

1. shm_open()
-returns file descriptor
-in *tmpfs*
2. mmap() and munmap()
– mapping virtual => physical addresses
3. shm_close()
4. shm_unlink()
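
a minimal lifecycle sketch (error checking omitted; the segment name "/demo_shm" is made up):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
	int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600); // fd backed by tmpfs
	ftruncate(fd, 4096);                                    // size the segment

	char *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,  // map into this
	                  MAP_SHARED, fd, 0);                  // process's VA space
	strcpy(addr, "hello");  // writes are visible to other processes mapping it

	munmap(addr, 4096);       // unmap
	close(fd);
	shm_unlink("/demo_shm");  // remove the name/segment
	return 0;
}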

shared memory and synchronization
“like threads accessing shared state in a single address space… but for processes”

synchronization method…
1. mechanisms supported by the process threading library
2. os-supported IPC for synchronization

// ... make shm data struct
typedef struct {
	pthread_mutex_t mutex;  // the mutex lives inside the shared segment itself
	char *data;
} shm_data_struct, *shm_data_struct_t;

// ...create shm segment
seg = shmget(ftok(argv[0], 120), 1024, IPC_CREAT|IPC_EXCL);
shm_address = shmat(seg, (void *) 0, 0);
shm_ptr = (shm_data_struct_t)shm_address;

// ...create and init mutex so it works across processes
pthread_mutexattr_init(&m_attr);
pthread_mutexattr_setpshared(&m_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(&shm_ptr->mutex, &m_attr);

Inter-Process Communication

IPC == OS-supported mechanisms for interaction among processes (coordination & communication)

– message passing
e.g. sockets, pipes, message queues
– memory-based IPC
shared memory, mem mapped files…
– higher-level semantics
files, RPC
– synchronization primitives

Message Passing
send/recv of messages
– buffer, FIFO queue…
OS provides interface to processes – a port
-processes send/write messages to a port
-processes recv/read messages from a port

send: system call + data copy
recv: system call + data copy

simplicity: kernel does channel management and synchronization
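
a sketch of the same pattern with a POSIX pipe as the port (message and buffer size are arbitrary); note each side pays a system call plus a copy through the kernel's buffer:

#include <stdio.h>
#include <unistd.h>

int main(void) {
	int fds[2];
	pipe(fds); // kernel creates and manages the channel

	if (fork() == 0) { // child: receiver
		char buf[32];
		ssize_t n = read(fds[0], buf, sizeof buf); // syscall + copy out of kernel
		printf("got: %.*s\n", (int)n, buf);
		return 0;
	}
	write(fds[1], "ping", 4); // parent: syscall + copy into kernel buffer
	return 0;
}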

Shared memory IPC
read and write to shared memory region
– OS establishes shared channel b/w the processes
1. physical pages mapped into virtual address space
2. VA(P1) and VA(P2) map to the same physical address
3. VA(P1) != VA(P2)
4. physical memory does not need to be contiguous

Copy(messages) vs. map(shared memory)
Goal: transfer data from one address space into the target address space
copy:
– cpu cycles to copy data to/from port
map:
– cpu cycles to map memory address space

SysV shared memory overview
1. create
– os assigns unique key
2. Attach
– map virtual => physical addresses
3. Detach
– invalidate addr. mappings
4. Destroy
– only remove when explicitly deleted (or reboot)

1. shmget(shmid, size, flag)
– create or open
ftok(pathname, proj_id)
– same args => same key
2. shmat(shmid, addr, flags)
– addr = NULL => arbitrary
– cast addr to arbitrary type
3. shmdt(shmaddr)
4. shmctl(shmid, cmd, buf)
– destroy with IPC_RMID
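
a minimal sketch of the four steps (pathname/proj-id and size are arbitrary; error checking omitted):

#include <sys/ipc.h>
#include <sys/shm.h>

int main(void) {
	key_t key = ftok("/tmp", 120);                   // same args => same key
	int shmid = shmget(key, 1024, IPC_CREAT | 0600); // 1. create or open
	char *addr = shmat(shmid, NULL, 0);              // 2. attach; NULL => kernel picks addr

	addr[0] = 'x'; // read/write like regular memory

	shmdt(addr);                   // 3. detach: invalidate this mapping
	shmctl(shmid, IPC_RMID, NULL); // 4. destroy explicitly
	return 0;
}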

Memory Management

physical and virtual memory management

physical memory (DRAM)
virtual vs. physical memory
allocate
– allocation & replacement
arbitrate
– address translation and validation

Page-based memory management
Allocate => pages -> page frames
Arbitrate => page tables
Segment-based memory management
Allocate => segments
Arbitrate => segment registers

CPU package (CPU + MMU) => main memory
MMU
-> translates virtual to physical addresses
-> reports faults: illegal access, permission violation

Cache – Translation Lookaside Buffer (TLB)
– caches valid VA->PA translations

pages (VPN:offset) – page table maps to (PFN:offset) – DRAM

Page Table Entry
– page frame number (PFN) + flag bits: x w r a d p
Flags
– Present (valid/invalid), Dirty (written to), Accessed (for read or write)
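
roughly how a PTE can be pictured as a C bitfield (x86-flavored; the exact field layout here is illustrative, not the hardware format):

typedef struct {
	unsigned int present  : 1;  // P: valid/invalid
	unsigned int writable : 1;  // W
	unsigned int user     : 1;  // accessible from user mode
	unsigned int accessed : 1;  // A: set on read or write
	unsigned int dirty    : 1;  // D: set on write
	unsigned int unused   : 7;
	unsigned int pfn      : 20; // page frame number (32-bit VA, 4 KB pages)
} pte_t;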
Page Fault
– raised by the MMU on an invalid entry or illegal access; the OS takes over to handle it

32-bit architecture
– page table entry
=> 4 bytes, including PFN + flags
– number of virtual page numbers (VPNs)
=> 2^32 / page size
– page size
=> 4 KB
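
working this out (derived from the figures above): 2^32 / 2^12 = 2^20 VPNs, so a flat page table needs 2^20 entries * 4 bytes = 4 MB, per process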

Linux O(1)

O(1) == constant time to select/add a task, regardless of task count

Preemptive, priority-based
-realtime (0-99)
-timesharing (100-139)

User processes
-default 120
-nice value(-20 to 19)

Timeslice Value
– depends on priority
– smallest for low priority
– highest for high priority
Feedback
– sleep time: waiting/idling
– longer sleep => interactive
-> priority -5 (boost)
– smaller sleep => compute-intensive
-> priority +5 (lowered)

Runqueue == 2 arrays of tasks
Active
-used to pick next task to run
-constant time to add/select

Expired
-inactive array: tasks land here when their timeslice expires
-when the active array is empty, swap the active and expired pointers
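
a rough sketch of the two-array idea in C (names are made up; the linear scan stands in for the real scheduler's bitmap + find-first-set, and enqueueing is omitted):

#include <stddef.h>

#define NUM_PRIOS 140

struct task { struct task *next; };

typedef struct {
	struct task *head[NUM_PRIOS]; // one FIFO list per priority level
} prio_array_t;

typedef struct {
	prio_array_t arrays[2];
	prio_array_t *active, *expired;
} runqueue_t;

struct task *pick_next(runqueue_t *rq) {
	for (int p = 0; p < NUM_PRIOS; p++)   // fixed bound => constant time
		if (rq->active->head[p])
			return rq->active->head[p];   // highest-priority runnable task
	// active array drained: swap active and expired in O(1)
	prio_array_t *tmp = rq->active;
	rq->active = rq->expired;
	rq->expired = tmp;
	return NULL; // caller retries after the swap
}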

multi-CPU system
shared memory multiprocessor (SMP)
multi-core: CPU w/ multiple cores; cores have private L1, shared LLC and memory

Hyper-threading (SMT)
– multiple hardware-supported execution contexts
– still 1 CPU, but with very fast context switch
-> hardware multithreading
-> hyperthreading
-> chip multithreading (CMT)
worth switching instead of spinning if (t_idle > 2 * t_ctx_switch)

TimeSlice

Timeslice == maximum amount of uninterrupted time given to a task
=> time quantum
Task may run less than timeslice time

balance benefits and overheads
– for I/O-bound tasks
– for CPU-bound tasks

CPU-bound tasks (example)
-2 tasks, exec time = 10s each
-ctx switch time = 0.1s

compare ts = 1s vs. ts = 5s (worked numbers below)
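
working the numbers out (derived from the setup above):
ts = 1s: 20 slices, 19 switches => total = 20s + 19 * 0.1s = 21.9s
	throughput = 2 / 21.9 ≈ 0.091 tasks/s
	avg. completion time = (20.8 + 21.9) / 2 = 21.35s
	avg. wait time = (0 + 1.1) / 2 = 0.55s
ts = 5s: 4 slices, 3 switches => total = 20s + 3 * 0.1s = 20.3s
	throughput = 2 / 20.3 ≈ 0.098 tasks/s
	avg. completion time = (15.2 + 20.3) / 2 = 17.75s
	avg. wait time = (0 + 5.1) / 2 = 2.55s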

CPU bound tasks prefer longer timeslices
-> limits context switching overheads
-> keeps CPU utilization and throughput high
I/O bound tasks prefer shorter timeslices
-> I/O requests issued sooner; keeps CPU and device utilization high

Visual Metaphor

like an OS scheduler, a toy shop manager dispatches orders to workers:
-dispatch orders immediately
=> scheduling is simple (FIFO)
-dispatch simple orders first
=> maximize # of orders processed over time
-dispatch complex orders first
=> keep workbenches busy

CPU scheduler
– decide how and when processes (and their threads) access shared CPUs
– schedules tasks running user-level processes/threads as well as kernel-level threads
– runs when
– cpu becomes idle
– new task becomes ready

context switch, enter user mode, set PC and go <- thread is dispatched

"run-to-completion" scheduling
initial assumptions
– group of tasks/jobs
– known execution times
– no preemption
– single CPU

Metrics
– throughput
– avg. job completion time
– avg. job wait time

First-Come First-Serve (FCFS)
-schedules tasks in order of arrival
runqueue == queue (FIFO)
T1 = 1s, T2 = 10s, T3 = 1s
Throughput = 3/12s = 0.25 tasks/s
Avg. completion time = (1+11+12)/3 = 8s
Avg. wait time = (0+1+11)/3 = 4s

Shortest Job First (SJF)
-schedules tasks in order of their execution time
T1(1s) -> T3(1s) -> T2(10s)
runqueue == ordered queue (or tree)

Throughput Formula:
jobs_completed / time_to_complete_all_jobs
Avg. Completion Time Formula:
sum_of_times_to_complete_each_job / jobs_completed
Avg. Wait Time Formula:
(t1_wait_time + t2_wait_time + t3_wait_time) / jobs_completed
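
applying these to the SJF order above (derived):
Throughput = 3/12s = 0.25 tasks/s (same total work as FCFS)
Avg. completion time = (1 + 2 + 12)/3 = 5s
Avg. wait time = (0 + 1 + 2)/3 = 1s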

critical section example: producer/consumer with mutex + condition variables

int in, out, buffer[BUFFERSIZE];
mutex_t m;
cond_var_t not_empty, not_full;

// producer
while (more_to_produce) {
	mutex_lock(&m);
	while (out == (in + 1) % BUFFERSIZE) // buffer full: re-check predicate after wakeup
		condition_wait(&not_full, &m);
	add_item(buffer[in]);
	in = (in + 1) % BUFFERSIZE;
	condition_broadcast(&not_empty);
	mutex_unlock(&m);
}

// consumer
while (more_to_consume) {
	mutex_lock(&m);
	while (out == in) // buffer empty
		condition_wait(&not_empty, &m);
	remove_item(buffer[out]);
	out = (out + 1) % BUFFERSIZE;
	condition_signal(&not_full); // frees a slot => wake a waiting producer
	mutex_unlock(&m);
}
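
the same pattern made runnable with the actual pthread API (buffer size and iteration counts are arbitrary):

#include <pthread.h>
#include <stdio.h>

#define BUFFERSIZE 8

int in, out, buffer[BUFFERSIZE];
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

void *producer(void *arg) {
	for (int i = 0; i < 100; i++) {
		pthread_mutex_lock(&m);
		while ((in + 1) % BUFFERSIZE == out)  // full: re-check after wakeup
			pthread_cond_wait(&not_full, &m); // atomically releases m while waiting
		buffer[in] = i;
		in = (in + 1) % BUFFERSIZE;
		pthread_cond_signal(&not_empty);
		pthread_mutex_unlock(&m);
	}
	return NULL;
}

void *consumer(void *arg) {
	for (int i = 0; i < 100; i++) {
		pthread_mutex_lock(&m);
		while (in == out)                     // empty
			pthread_cond_wait(&not_empty, &m);
		printf("%d\n", buffer[out]);
		out = (out + 1) % BUFFERSIZE;
		pthread_cond_signal(&not_full);       // a slot freed up
		pthread_mutex_unlock(&m);
	}
	return NULL;
}

int main(void) {
	pthread_t p, c;
	pthread_create(&p, NULL, producer, NULL);
	pthread_create(&c, NULL, consumer, NULL);
	pthread_join(p, NULL);
	pthread_join(c, NULL);
	return 0;
}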

Experimental Results

Best Case Numbers

Synthetic load:
-N requests for same file
=> best case

Measure Bandwidth
-BW = N * bytes(F) / time
-file size: 0-200 KB
=> vary work per request

Observations:
-all exhibit similar results
-SPED has best performance
– Flash (AMPED): extra check for memory presence
– Zeus has an anomaly

When data is in cache:
– SPED >> AMPED Flash
– SPED and AMPED Flash >> MT/MP
With disk-bound workload:
– AMPED Flash >> SPED

Flash web server

Flash: Event-Driven web server
– an event-driven web server
– with asymmetric helper processes (helpers handle blocking I/O)

Handlers, Event Dispatcher, Helpers

Flash: Additional Optimizations

request processing pipeline: Start -> accept conn -> read request -> parse request -> find file -> compute header -> send header -> read file, send data -> End

Apache web server
core = basic server skeleton
modules -> Apache core -> request/response flow (each request passes through all modules)

Setting up performance comparisons
-MP (each process single-threaded)
-MT (boss-worker)
-single-process event-driven (SPED)
-Zeus (SPED w/ 2 processes)
-Apache (v1.3.1, MP)
compare against Flash (AMPED model)

Realistic request workload
=> distribution of web page accesses over time
Controlled, reproducible workload
=> trace-based (from real web servers)

CS web server trace, owlnet trace, synthetic workload

Bandwidth == bytes/time
=> total bytes transferred from files / total time

Connection Rate == requests/time
=> total client connections / total time