Linux O(1) Scheduler

O(1) == constant time to select/add a task, regardless of task count

Preemptive, priority-based
-realtime (0-99)
-timesharing (100-139)

User processes
-default 120
-nice value (-20 to 19)

Timeslice Value
– depends on priority
– smallest for low priority
– highest for high priority
Feedback
– sleep time: time spent waiting/idling
– longer sleep => interactive
-> priority - 5 (boost)
– smaller sleep => compute-intensive
-> priority + 5 (lowered)

Runqueue == 2 arrays of tasks
Active
-used to pick next task to run
-constant time to add/select

Expired
-tasks that have used up their timeslice
-arrays are swapped once the active array is empty

multi-CPU system
shared memory multiprocessor (SMP): multiple CPUs, shared LLC and memory
multicore: one CPU w/ multiple cores; cores have private L1, shared LLC and memory

Hyperthreading (SMT)
– multiple hardware-supported execution contexts
– still 1 CPU, but with very fast context switch between contexts
-> hardware multithreading
-> hyperthreading
-> chip multithreading (CMT)
hide memory latency: switch contexts if (t_idle > 2 * t_ctx_switch)

TimeSlice

Timeslice == maximum amount of uninterrupted time given to a task
=> time quantum
Task may run less than timeslice time

balance benefits and overheads
– for I/O-bound tasks
– For CPU-bound tasks

CPU-bound tasks example
-2 tasks, exec time = 10s each
-ctx switch time = 0.1s

compare ts=1s vs ts=5s

CPU bound tasks prefer longer timeslices
-> limits context switching overheads
-> keeps CPU utilization and throughput high
I/O bound tasks prefer shorter timeslices

Visual Metaphor

-dispatch orders immediately
	scheduling is simple (FIFO)
-dispatch simple orders first
	maximize # of orders processed over time (throughput)
-dispatch complex orders first
	keep workbenches busy (utilization)

CPU scheduler
– decide how and when processes (and their threads) access shared CPUs
– schedules tasks running user-level processes/threads as well as kernel-level threads
– runs when
– CPU becomes idle
– new task becomes ready
– timeslice expires

thread is dispatched: context switch, enter user mode, set PC and go

"run-to-completion" scheduling
initial assumptions
- group of tasks/jobs
- known execution times
- no preemption
- single CPU
Metrics
- throughput
- avg. job completion time
- avg. job wait time

First-Come First-Serve (FCFS)
-schedules tasks in order of arrival
runqueue == queue (FIFO)
T1 = 1s, T2 = 10s, T3 = 1s
Throughput = 3/12s = 0.25 tasks/s
Avg. completion time = (1+11+12)/3 = 8s
Avg. wait time = (0+1+11)/3 = 4s

Shortest Job First (SJF)
-schedules tasks in order of their execution time
T1(1s) -> T3(1s) -> T2(10s)
runqueue == ordered queue (or tree)

Throughput Formula:
jobs_completed / time_to_complete_all_jobs
Avg. Completion Time Formula:
sum_of_times_to_complete_each_job / jobs_completed
Avg. Wait Time Formula:
(t1_wait_time + t2_wait_time + t3_wait_time) / jobs_completed

Critical section: producer/consumer with condition variables (pseudocode)

int in, out, buffer[BUFFERSIZE];
mutex_t m;
cond_var_t not_empty, not_full;

// producer
while (more_to_produce) {
	mutex_lock(&m);
	while (out == (in + 1) % BUFFERSIZE) // buffer full
		condition_wait(&not_full);
	add_item(buffer[in]);
	in = (in + 1) % BUFFERSIZE;
	condition_broadcast(&not_empty);
	mutex_unlock(&m);
}

// consumer
while (more_to_consume) {
	mutex_lock(&m);
	while (out == in) // buffer empty
		condition_wait(&not_empty);
	remove_item(out);
	out = (out + 1) % BUFFERSIZE;
	condition_signal(&not_full);
	mutex_unlock(&m);
}

Experimental Results

Best Case Numbers

Synthetic load:
-N requests for same file
=> best case

Measure Bandwidth
-BW = N * bytes(F)/time
-File size 0-200kb
=> vary work per request

Observations:
-all exhibit similar results
-SPED has best performance
– Flash (AMPED) pays an extra check for memory presence
– Zeus has an anomaly

When data is in cache:
– SPED >> AMPED Flash
– SPED and AMPED Flash >> HT/MP
With disk-bound workload:
– AMPED Flash >> SPED

Flash web server

Flash: Event-Driven web server
– an event-driven web server
– with asymmetric helper processes

Handlers, Event Dispatcher, Helpers

Flash: Additional Optimizations

Request processing steps: Start -> accept conn -> read request -> parse request -> find file -> compute header -> send header -> read file, send data -> End

Apache web server
core = basic server skeleton
modules -> Apache core -> request/response handling

Setting up performance comparisons
-MP (each process single-threaded)
-MT (boss-worker)
-single process event-driven (SPED)
-Zeus (SPED w/ 2 processes)
-Apache (v1.3.1, MP)
compare against Flash (AMPED model)

Realistic request workload
=> distribution of web page accesses over time
Controlled, reproducible workload
=> trace-based (from real web servers)

CS web server trace, owlnet trace, synthetic workload

Bandwidth == bytes/time
=> total bytes transferred from files / total time

Connection Rate == request/time
=> total client conn / total time

Tasks in Linux

task creation: clone
clone(function, stack_ptr, sharing_flags, args)

sharing_flags
CLONE_VM, create a new thread (share address space)
CLONE_FS, share umask, root, and working dirs
CLONE_FILES, share the file descriptors
CLONE_SIGHAND, share the signal handler table
CLONE_PID, new thread gets old PID
CLONE_PARENT, new thread has same parent as caller

Native POSIX Threads Library (NPTL), a 1:1 model
– kernel sees each user-level thread's info
– kernel traps are cheaper
– more resources: memory, large range of IDs…

Older LinuxThreads, an M:M model
– similar issues to those described in the Solaris papers

Thread Performance Considerations
– performance comparisons
– multi-process vs. multi-threaded vs. event-driven
– Event-driven architectures
– “Flash: An Efficient and Portable Web Server” vs. Apache

Boss-worker: “120ms per toy order”
Pipeline: “20ms per pipeline stage”

execution time: avg time to complete an order

Are threads useful?
threads are useful because of:
parallelization => speed up
specialization => hot cache!
efficiency => lower memory requirements and cheaper inter-thread communication

For a matrix multiply application …
=> execution time
-> which metric matters depends on the application

For a web service application …
=> number of client requests / time
=> response time
-> average, max, min, 95%

For hardware …
=> higher utilization (e.g. CPU)

Performance Metrics
metrics == a measurement standard
-measurable and/or quantifiable property

Thread Design Considerations

dynamic thread creation is expensive

Dynamic decision
– if handler doesn’t lock => execute on the interrupted thread’s stack
– if handler can block => turn into a real thread

Optimization
– precreate & preinitialize thread structures for interrupt routines

Threads and Signal Handling
– disable => clear signal
– signal occurs – what to do with the signal?

signal handler table: SIGNAL-N -> handler-N-start-addr
masks: user-level (per thread) and kernel-level
Optimize Common Case
– signals less frequent than signal mask updates
– avoid system calls: cheaper to update the UL mask

Task struct in Linux
– main execution abstraction => task

struct task_struct {
	// ...
	pid_t pid;
	pid_t tgid;
	int prio;
	volatile long state;
	struct mm_struct *mm;
	struct files_struct *files;
	struct list_head tasks;
	int on_cpu;
	cpumask_t cpus_allowed;
	// ...
};

Producer and Consumer

void *producer (void *param){
	int i;
	for (i = 1; i <= 20; i++){
		pthread_mutex_lock (&m);
			if(num > BUF_SIZE){ /* overflow */
				exit(1);
			}
			while (num == BUF_SIZE){
				pthread_cond_wait (&c_prod, &m);
			}
			buffer[add] = i;
			add = (add + 1) % BUF_SIZE;
			num++;
		pthread_mutex_unlock (&m);

		pthread_cond_signal (&c_cons);
		printf ("producer: inserted %d\n", i); fflush(stdout);
	}

	printf ("producer quitting\n"); fflush(stdout);
	return 0;
}

Condition Variable Safety Tips

– Do not forget to notify waiting threads!
– predicate change =>
signal/broadcast correct condition variable
– When in doubt, broadcast
– but at a performance loss
– You do not need a mutex to signal/broadcast

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define BUF_SIZE 3 /* size of shared buffer */

int buffer[BUF_SIZE];
int add = 0;
int rem = 0;
int num = 0;

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t c_cons = PTHREAD_COND_INITIALIZER;
pthread_cond_t c_prod = PTHREAD_COND_INITIALIZER;

void *producer(void *param);
void *consumer(void *param);

int main(int argc, char *argv[]){
	pthread_t tid1, tid2;
	int i;

	if (pthread_create(&tid1, NULL, producer, NULL) != 0){
		fprintf(stderr, "Unable to create producer thread\n");
		exit(1);
	}

	if (pthread_create(&tid2, NULL, consumer, NULL) != 0){
		fprintf(stderr, "Unable to create consumer thread\n");
		exit(1);
	}

	pthread_join(tid1, NULL); /* wait for producer to exit */
	pthread_join(tid2, NULL); /* wait for consumer to exit */
	printf("Parent quitting\n");
	return 0;
}