POSIX shared memory API

1. shm_open()
-returns file descriptor
-in *tmpfs*
2. mmap() and munmap()
– mapping virtual => physical addresses
3. shm_close() – i.e., regular close() on the fd
4. shm_unlink()
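
A minimal sketch of the whole sequence (the name "/demo_shm" and the 4 KB size are arbitrary; error checks omitted; older glibc may need -lrt):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
	int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600); // returns an fd (backed by tmpfs)
	ftruncate(fd, 4096);                                    // size the region

	char *addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
	                  MAP_SHARED, fd, 0);                   // map virtual => physical
	strcpy(addr, "hello shared memory");                    // use like regular memory

	munmap(addr, 4096);                                     // unmap
	close(fd);                                              // close the fd
	shm_unlink("/demo_shm");                                // remove the name
	return 0;
}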

shared memory and synchronization
“like threads accessing shared state in a single address space… but for processes”

synchronization methods…
1. mechanisms supported by the process threading library (pthreads)
2. os-supported IPC for synchronization

// ... make shm data struct
typedef struct {
	pthread_mutex_t mutex;
	char *data;
} shm_data_struct, *shm_data_struct_t;

// ... create and attach shm segment
int seg = shmget(ftok(argv[0], 120), 1024, IPC_CREAT|IPC_EXCL);
void *shm_address = shmat(seg, (void *) 0, 0);
shm_data_struct_t shm_ptr = (shm_data_struct_t)shm_address;

// ... create and init mutex (must be PTHREAD_PROCESS_SHARED to work across processes)
pthread_mutexattr_t m_attr;
pthread_mutexattr_init(&m_attr);
pthread_mutexattr_setpshared(&m_attr, PTHREAD_PROCESS_SHARED);
pthread_mutex_init(&shm_ptr->mutex, &m_attr);
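
A second process can then attach the same segment (same ftok args => same key) and use the shared mutex; a sketch, reusing the names above:

// ... second process: same ftok args => same key => same segment
int seg2 = shmget(ftok(argv[0], 120), 1024, 0);
shm_data_struct_t ptr2 = (shm_data_struct_t)shmat(seg2, (void *) 0, 0);

pthread_mutex_lock(&ptr2->mutex);   // works across processes (PTHREAD_PROCESS_SHARED)
// ... read/write the shared data ...
pthread_mutex_unlock(&ptr2->mutex);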

Inter-Process Communication

IPC == OS-supported mechanisms for interaction among processes (coordination & communication)

– message passing
e.g. sockets, pipes, message queues
– memory-based IPC
shared memory, mem mapped files…
– higher-level semantics
files, RPC
synchronization primitives

Message Passing
send/recv of messages
– buffer, FIFO queue…
OS provides interface to processes – a port
-processes send/write messages to a port
-processes recv/read messages from a port

send: system call + data copy
recv: system call + data copy

simplicity: kernel does channel management and synchronization
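
For example, message passing through a pipe: each send/recv is a system call plus a data copy through the kernel's buffer (a minimal sketch, error checks mostly omitted):

#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
	int fd[2];
	if (pipe(fd) == -1) { perror("pipe"); return 1; }

	if (fork() == 0) {                 // child: reader
		char buf[64];
		close(fd[1]);
		ssize_t n = read(fd[0], buf, sizeof(buf) - 1);  // syscall + copy out of kernel
		buf[n > 0 ? n : 0] = '\0';
		printf("child received: %s\n", buf);
		return 0;
	}
	close(fd[0]);                      // parent: writer
	const char *msg = "hello via pipe";
	write(fd[1], msg, strlen(msg));    // syscall + copy into kernel buffer
	close(fd[1]);
	wait(NULL);
	return 0;
}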

Shared memory IPC
read and write to shared memory region
– OS establishes shared channel b/w the processes
1. physical pages mapped into virtual address space
2. VA(P1) and VA(P2) map to the same physical address
3. VA(P1) != VA(P2), in general
4. physical memory does not need to be contiguous

Copy (messages) vs. map (shared memory)
Goal: transfer data from one address space into the target address space
copy:
– CPU cycles to copy data to/from port
map:
– CPU cycles to map memory into address space
– mapping is set up once, then reused => cost amortized; a win for large data

SysV shared memory overview
1. create
– os assigns unique key
2. Attach
– map virtual => physical addresses
3. Detach
– invalidate addr. mappings
4. Destroy
– only remove when explicitly deleted (or reboot)

1. shmget(key, size, flags)
– create or open
ftok(pathname, proj_id)
– same args => same key
2. shmat(shmid, addr, flags)
– addr = NULL => OS picks an arbitrary address
– returned address can be cast to arbitrary type
3. shmdt(shmaddr)
4. shmctl(shmid, cmd, buf)
– destroy with IPC_RMID
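
Putting the four steps together (a sketch; the ftok args mirror the snippet earlier, error checks omitted):

#include <stdio.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(int argc, char *argv[]) {
	key_t key = ftok(argv[0], 120);              // same args => same key
	int shmid = shmget(key, 1024, IPC_CREAT | 0600);  // 1. create (or open)

	char *addr = shmat(shmid, NULL, 0);          // 2. attach; NULL => OS picks address
	sprintf(addr, "hello");                      // read/write like regular memory

	shmdt(addr);                                 // 3. detach (invalidate the mapping)
	shmctl(shmid, IPC_RMID, NULL);               // 4. destroy explicitly
	return 0;
}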

Memory Management

physical and virtual memory management

physical memory (DRAM)
virtual vs. physical memory
allocate
– allocation, replacement…
arbitrate
– address translation and validation

Page-based memory management
Allocate => pages -> page frames
Arbitrate => page tables
Segment-based memory management
Allocate => segments
Arbitrate => segment registers

CPU package (cpu, mmu) => main memory
MMU
-> translate virtual to physical addresses
-> reports faults: illegal accesses, permission violations

Cache: Translation Lookaside Buffer (TLB)
– caches valid VA->PA translations

virtual address (VPN : offset) – page table maps VPN => PFN – physical address (PFN : offset) in DRAM
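
The split is just bit arithmetic; a small sketch for 32-bit addresses and 4 KB pages (the address value is made up):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                 // 4 KB pages => 12 offset bits
#define PAGE_SIZE (1u << PAGE_SHIFT)

int main(void) {
	uint32_t va = 0xDEADBEEF;                   // a made-up virtual address
	uint32_t vpn = va >> PAGE_SHIFT;            // top 20 bits
	uint32_t offset = va & (PAGE_SIZE - 1);     // bottom 12 bits
	// the page table maps vpn => pfn; the offset is carried over unchanged
	printf("VPN = 0x%x, offset = 0x%x\n", (unsigned)vpn, (unsigned)offset);
	return 0;
}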

Page Table Entry
page frame number (PFN) | x | w | r | a | d | p
Flags
– present (valid/invalid), dirty (written to), accessed (for read or write)
Page fault => raised by the MMU on an invalid entry or a permission violation; handled by the OS

32-bit architecture
– page table entry
=> 4 bytes, including PFN + flags
– virtual page number (VPN)
=> 2^32 / page size
– page size
=> 4 KB
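
Putting these numbers together:
=> # of entries = 2^32 / 2^12 = 2^20
=> page table size = 2^20 entries * 4 bytes = 4 MB (per process!)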

Linux O(1)

O(1) == constant time to select/add task, regardless of task count

Preemptive, priority-based
-realtime (0-99)
-timesharing (100-139)

User processes
-default 120
-nice value(-20 to 19)

Timeslice Value
– depends on priority
– smallest for low priority
– highest for high priority
Feedback
– sleep time: waiting
– longer sleep => interactive
-> priority - 5 (boost)
– smaller sleep => compute-intensive
-> priority + 5 (lower)

Runqueue == 2 arrays of tasks
Active
-used to pick next task to run
-constant time to add/select (sketch below)

Expired
-inactive array: a task moves here when its timeslice expires
-when the active array is empty, the two arrays are swapped
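
The constant-time trick is a bitmap over per-priority queues; a simplified user-space sketch of the idea (not the actual kernel code; enqueue pushes at the head for brevity, where the kernel appends):

#include <stddef.h>
#include <stdint.h>
#include <strings.h>   // ffs()

#define NUM_PRIOS 140  // 0-99 realtime, 100-139 timesharing

typedef struct task {
	struct task *next;
	int prio;          // 0 = highest priority
} task_t;

typedef struct {
	uint32_t bitmap[(NUM_PRIOS + 31) / 32]; // bit set => that queue is non-empty
	task_t *queue[NUM_PRIOS];               // one run list per priority level
} prio_array_t;

// constant-time add: link into the task's priority list, set its bitmap bit
void enqueue(prio_array_t *a, task_t *t) {
	t->next = a->queue[t->prio];
	a->queue[t->prio] = t;
	a->bitmap[t->prio / 32] |= 1u << (t->prio % 32);
}

// constant-time select: scan a fixed number of words for the first set bit
task_t *pick_next(prio_array_t *a) {
	for (int w = 0; w < (NUM_PRIOS + 31) / 32; w++)
		if (a->bitmap[w] != 0) {
			int prio = w * 32 + ffs((int)a->bitmap[w]) - 1;
			return a->queue[prio];
		}
	return NULL;                            // no runnable task
}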

multi-CPU systems
shared memory multiprocessor (SMP)
multicore: CPU w/ multiple cores; cores have private L1 caches but share LLC and memory

Hyperthreading (SMT)
– multiple hardware-supported execution contexts
– still 1 CPU, but with very fast context switch
-> hardware multithreading
-> hyperthreading
-> chip multithreading (CMT)
if (t_idle > 2 * t_ctx_switch) => worth context switching to hide the idle time

TimeSlice

Timeslice == maximum amount of uninterrupted time given to a task
=> time quantum
A task may run for less than its timeslice (e.g., it blocks on I/O or is preempted)

balance benefits and overheads
– for I/O-bound tasks
– For CPU-bound tasks

CPU bound tasks
-2 tasks, exec time = 10s each
-ctx switch time = 0.1s

compare ts = 1s vs. ts = 5s
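
Working the example (assuming a context switch between every pair of back-to-back timeslices):
ts = 1s: 20s of work + 19 switches * 0.1s = 21.9s => throughput = 2/21.9 ≈ 0.091 tasks/s
ts = 5s: 20s of work + 3 switches * 0.1s = 20.3s => throughput = 2/20.3 ≈ 0.098 tasks/s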

CPU bound tasks prefer longer timeslices
-> limits context switching overheads
-> keeps CPU utilization and throughput high
I/O bound tasks prefer shorter timeslices
-> can issue I/O requests sooner; keeps both CPU and devices busy

Visual Metaphor

-dispatch orders immediately
=> scheduling is simple (FCFS)
-dispatch simple orders first
=> maximize # of orders processed over time (throughput)
-dispatch complex orders first
=> keep workbenches (CPUs) busy (utilization)

CPU scheduler
– decide how and when processes (and their threads) access shared CPUs
– schedules tasks running user-level processes/threads as well as kernel-level threads
– runs when
– CPU becomes idle
– new task becomes ready
– timeslice expires

context switch, enter user mode, set PC and go <- thread is dispatched

"Run-to-completion" scheduling
initial assumptions
- group of tasks/jobs
- known execution times
- no preemption
- single CPU

Metrics
- throughput
- avg. job completion time
- avg. job wait time

First-Come First-Served (FCFS)
-schedules tasks in order of arrival
runqueue == queue (FIFO)
T1 = 1s, T2 = 10s, T3 = 1s
Throughput = 3/12s = 0.25 tasks/s
Avg. completion time = (1+11+12)/3 = 8s
Avg. wait time = (0+1+11)/3 = 4s

Shortest Job First (SJF)
-schedules tasks in order of their execution time
T1(1s) -> T3(1s) -> T2(10s)
runqueue = ordered queue (or tree)

Throughput Formula:
jobs_completed / time_to_complete_all_jobs
Avg. Completion Time Formula:
sum_of_times_to_complete_each_job / jobs_completed
Avg. Wait Time Formula:
(t1_wait_time + t2_wait_time + t3_wait_time) / jobs_completed
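
Applying the same formulas to the SJF schedule T1(1s) -> T3(1s) -> T2(10s):
Throughput = 3/12s = 0.25 tasks/s (same total work as FCFS)
Avg. completion time = (1 + 2 + 12)/3 = 5s
Avg. wait time = (0 + 1 + 2)/3 = 1s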

critical section example: producer/consumer with mutex + condition variables

int in = 0, out = 0, buffer[BUFFERSIZE];
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
pthread_cond_t not_full = PTHREAD_COND_INITIALIZER;

// producer
while (more_to_produce) {
	pthread_mutex_lock(&m);
	while (out == (in + 1) % BUFFERSIZE)      // buffer full => wait on not_full
		pthread_cond_wait(&not_full, &m);     // atomically releases m, reacquires on wakeup
	add_item(buffer[in]);
	in = (in + 1) % BUFFERSIZE;
	pthread_cond_signal(&not_empty);          // an item is now available
	pthread_mutex_unlock(&m);
}

// consumer
while (more_to_consume) {
	pthread_mutex_lock(&m);
	while (out == in)                         // buffer empty => wait on not_empty
		pthread_cond_wait(&not_empty, &m);
	remove_item(buffer[out]);
	out = (out + 1) % BUFFERSIZE;
	pthread_cond_signal(&not_full);           // a slot is now free
	pthread_mutex_unlock(&m);
}
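
Note: the waits use while, not if — after a wakeup the condition must be re-checked, since another thread may have changed the buffer in the meantime (and pthread condition variables permit spurious wakeups). Each wait also names the mutex so it can be atomically released while waiting.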

Experimental Results

Best Case Numbers

Synthetic load:
-N requests for same file
=> best case

Measure Bandwidth
-BW = N * bytes(F) / time
-file size 0–200 KB
=> vary work per request

Observations:
-all exhibit similar results
-SPED has best performance
– Flash (AMPED) does an extra check for memory presence
– Zeus has an anomaly

When data is in cache:
– SPED >> AMPED Flash
– SPED and AMPED Flash >> MT/MP
With disk-bound workload
– AMPED Flash >> SPED (SPED blocks on disk I/O; the helpers avoid that)

Flash web server

Flash: Event-Driven web server
– an event driven webserver
– with asymmetric helper processes

Handlers, Event Dispatcher, Helpers

Flash: Additional Optimizations

Start, accept conn, read request, parse request, find file, compute header, send header, read file, send data, End

Apache web server
core = basic server skeleton
module -> Apache core -> request/response

Setting up performance comparisons
-MP (each process single-threaded)
-MT (boss-worker)
-single process event-driven (SPED)
-Zeus (SPED w/ 2 processes)
-Apache (v1.3.1, MP)
compare against Flash(AMPED model)

Realistic request workload
=> distribution of web page accesses over time
Controlled, reproducible workload
=> trace-based (from real web servers)

CS web server trace, owlnet trace, synthetic workload

Bandwidth == bytes/time
=> total bytes transferred from files / total time

Connection Rate == request/time
=> total client conn / total time

Tasks in Linux

task creation: clone
clone(function, stack_ptr, sharing_flags, args)

sharing_flags
CLONE_VM, create a new thread (share the address space)
CLONE_FS, share umask, root, and working dirs
CLONE_FILES, share the file descriptors
CLONE_SIGHAND, share the signal handler table
CLONE_PID, new thread gets old PID
CLONE_PARENT, new thread has same parent as caller
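
A minimal usage sketch (Linux-specific; the stack size and flag choice are illustrative — CLONE_VM alone gives a thread-like child that shares the address space):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int child_fn(void *arg) {
	printf("child sees value = %d\n", *(int *)arg); // shared via CLONE_VM
	return 0;
}

int main(void) {
	int value = 42;
	size_t stack_size = 1024 * 1024;
	char *stack = malloc(stack_size);      // the child needs its own stack

	// CLONE_VM => share the address space; SIGCHLD lets the parent wait
	pid_t pid = clone(child_fn, stack + stack_size, // stack grows down on x86
	                  CLONE_VM | SIGCHLD, &value);
	if (pid == -1) { perror("clone"); exit(1); }

	waitpid(pid, NULL, 0);
	free(stack);
	return 0;
}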

Native POSIX Thread Library(NPTL)
– 1:1 model: kernel sees each user-level thread
– kernel traps are cheaper
– more resources: memory, large range of IDs…

Older LinuxThreads (“M:M model”)
– similar issues to those described in the Solaris papers

Thread Performance Considerations
– performance comparisons
– multi-process vs. multi-threaded vs. event-driven
– Event-driven architectures
– “Flash: An Efficient and Portable Web server” vs Apache

Boss-worker “120ms per toy order”
Pipeline “20ms per pipeline stage”

compare: execution time, avg. time to complete an order
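
To make the comparison concrete — assuming, say, 6 workers / a 6-stage pipeline and 11 toy orders (these counts are not in the notes):
boss-worker: execution time = ceil(11/6) batches * 120ms = 240ms; avg. completion = (6*120 + 5*240)/11 ≈ 175ms
pipeline: execution time = 120ms + 10 * 20ms = 320ms; avg. completion = (120 + 140 + … + 320)/11 = 220ms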

Are threads useful?
threads are useful because of…
parallelization => speed up
specialization => hot cache!
efficiency => lower memory requirements and cheaper synchronization

For a matrix multiply application
=> execution time
-> whether threads help depends on the metrics

For a web service application …
=> number of client requests / time
=> response time
-> average, max, min, 95%

For hardware …
=> higher utilization (e.g. CPU)

Performance Metrics
metrics == a measurement standard
-measurable and/or quantifiable property