Tip of the Imputation Iceberg

Imputation
– Just the tip of the iceberg; more sophisticated methods exist
– Fill in the mean: simple and relatively effective
– Linear regression: predict the missing value from the other variables

– Both have negative side effects and can obscure or oversimplify trends in the data
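
The two approaches above can be sketched roughly as follows. This is a minimal illustration, not a recommended pipeline: the `height` and `weight` columns and their values are made up, and `numpy.polyfit` stands in for whatever regression routine one would actually use.

```python
import numpy as np
import pandas as pd

# Made-up data: 'weight' has two missing values; 'height' is complete.
df = pd.DataFrame({
    'height': [60.0, 62.0, 64.0, 66.0, 68.0, 70.0],
    'weight': [120.0, np.nan, 140.0, 150.0, np.nan, 170.0],
})

# 1) Mean imputation: replace every missing value with the column mean.
mean_filled = df['weight'].fillna(df['weight'].mean())

# 2) Linear regression imputation: fit weight ~ height on the complete rows,
#    then predict weight for the rows where it is missing.
known = df['weight'].notna()
slope, intercept = np.polyfit(df.loc[known, 'height'], df.loc[known, 'weight'], 1)
regression_filled = df['weight'].copy()
regression_filled.loc[~known] = slope * df.loc[~known, 'height'] + intercept

print(mean_filled.tolist())
print(regression_filled.round(1).tolist())

# Mean imputation shrinks the spread of the column, one of its side effects.
print(mean_filled.std(), regression_filled.std())
```

In this toy data the mean-filled column has a smaller standard deviation than the regression-filled one, which hints at how filling in the mean can flatten a trend.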

Statistical Rigor
Significance tests
– Using data, can we disprove an assumption with a pre-defined level of confidence?

Normal Distribution, two parameters:
・μ (mean)
・σ (standard deviation)

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Mean = μ
Variance = σ^2
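
As a quick check of the density formula, here is a small sketch that evaluates it directly and compares it with `scipy.stats.norm.pdf`. The helper name `normal_pdf` and the parameter values are illustrative assumptions, not part of the notes.

```python
import math
import scipy.stats

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coefficient = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Illustrative parameters, not taken from any dataset.
mu, sigma = 0.0, 1.0
for x in (-1.0, 0.0, 1.0):
    # The hand-written value and the library value should agree.
    print(x, normal_pdf(x, mu, sigma), scipy.stats.norm.pdf(x, loc=mu, scale=sigma))
```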

Statistical Significance Tests
t-test
Reject or fail to reject a null hypothesis
NULL HYPOTHESIS: A statement we are trying to disprove by running the test

import numpy
import scipy.stats
import pandas

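def compare_averages(filename):
    """
    Run Welch's t-test on the batting averages of left- and right-handed
    players. Returns (True, result) if we fail to reject the null hypothesis
    that the two group means are equal at the 5% level, else (False, result).
    """
    baseball_data = pandas.read_csv(filename)

    # Split the players by handedness
    baseball_data_left = baseball_data[baseball_data['handedness'] == 'L']
    baseball_data_right = baseball_data[baseball_data['handedness'] == 'R']

    # equal_var=False selects Welch's t-test (no equal-variance assumption)
    result = scipy.stats.ttest_ind(baseball_data_left['avg'], baseball_data_right['avg'], equal_var=False)

    # result[1] is the p-value
    if result[1] <= .05:
        return (False, result)
    else:
        return (True, result)

if __name__ == '__main__':
    result = compare_averages('../data/baseball_data.csv')
    print(result)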
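
To see what `equal_var=False` computes, the sketch below works out Welch's t statistic and p-value by hand and compares them with `scipy.stats.ttest_ind`. The two arrays are made-up numbers, not values from baseball_data.csv, and `welch_t_test` is a hypothetical helper name.

```python
import numpy as np
import scipy.stats

# Made-up batting averages for two hypothetical groups of players.
left = np.array([0.290, 0.275, 0.310, 0.265, 0.300, 0.285])
right = np.array([0.260, 0.270, 0.255, 0.280, 0.250, 0.265, 0.275])

def welch_t_test(a, b):
    """Two-sided Welch's t-test: returns (t statistic, p-value)."""
    # Squared standard error of each group mean (sample variance / n).
    var_a = a.var(ddof=1) / len(a)
    var_b = b.var(ddof=1) / len(b)
    t_stat = (a.mean() - b.mean()) / np.sqrt(var_a + var_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    # Two-sided p-value from the t distribution's survival function.
    p_value = 2 * scipy.stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value

manual = welch_t_test(left, right)
library = scipy.stats.ttest_ind(left, right, equal_var=False)
print(manual)
print(library)
```

The two printed results should match to floating-point precision. The null hypothesis is rejected at the 5% level when the p-value is at most .05.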

Machine Learning: A branch of artificial intelligence focused on constructing systems that learn from large amounts of data to make predictions.

Statistics is focused on analyzing existing data and drawing valid conclusions.
Machine learning is focused on making predictions.