Imputation
– These are just the tip of the iceberg; more sophisticated methods exist
– Fill in the mean: simple and relatively effective
– Linear regression: predict the missing values from the other variables
– Both have negative side effects and can obscure or oversimplify trends in the data
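The two strategies above could be sketched as follows. This is only an illustration: the column names (height, weight) and the toy values are hypothetical, and the regression uses a single predictor fitted with numpy.polyfit rather than any particular method from the course.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'weight' has two missing values.
df = pd.DataFrame({
    "height": [60.0, 62.0, 64.0, 66.0, 68.0, 70.0],
    "weight": [110.0, 120.0, np.nan, 140.0, np.nan, 160.0],
})

# Strategy 1: fill missing values with the mean of the observed values.
mean_filled = df["weight"].fillna(df["weight"].mean())

# Strategy 2: fit weight ~ height on the complete rows, then predict the gaps.
known = df.dropna(subset=["weight"])
slope, intercept = np.polyfit(known["height"], known["weight"], deg=1)
predicted = slope * df["height"] + intercept
regression_filled = df["weight"].fillna(predicted)
```

Mean filling leaves the column mean unchanged but shrinks its variance, which is one of the side effects noted above; regression filling places every imputed point exactly on the fitted line, which can make a trend look cleaner than it is.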
Statistical Rigor
Significance tests
– Using data, can we disprove an assumption with a pre-defined level of confidence?
Normal distribution, two parameters:
・μ (mean)
・σ (standard deviation)
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Mean = μ
Variance = σ^2
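As a quick check of the density formula above, it can be evaluated directly and compared with SciPy. The helper name normal_pdf and the sample inputs are assumptions made for this sketch; scipy.stats.norm.pdf takes the mean as loc and the standard deviation as scale.

```python
import math
from scipy import stats

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Hypothetical inputs: evaluate at x = 1 for mu = 0, sigma = 2.
value = normal_pdf(1.0, mu=0.0, sigma=2.0)
reference = stats.norm.pdf(1.0, loc=0.0, scale=2.0)
```

The two values should agree up to floating-point rounding.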
Statistical Significance Tests
t-test
Reject or fail to reject a null hypothesis
NULL HYPOTHESIS: A statement we are trying to disprove by running the test
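import pandas
import scipy.stats

def compare_averages(filename):
    """Welch's t-test on the batting averages of left- and right-handed players.

    Returns (True, result) if the null hypothesis of equal means is not
    rejected at the 5% level, and (False, result) if it is rejected.
    """
    baseball_data = pandas.read_csv(filename)
    # Split the players by handedness.
    baseball_data_left = baseball_data[baseball_data['handedness'] == 'L']
    baseball_data_right = baseball_data[baseball_data['handedness'] == 'R']
    # equal_var=False selects Welch's t-test (unequal variances).
    result = scipy.stats.ttest_ind(baseball_data_left['avg'],
                                   baseball_data_right['avg'],
                                   equal_var=False)
    # result[1] is the p-value.
    if result[1] <= .05:
        return (False, result)
    else:
        return (True, result)

if __name__ == '__main__':
    result = compare_averages('../data/baseball_data.csv')
    print(result)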
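To try the same test without the CSV file, a minimal sketch on synthetic data may help. Everything here is made up for illustration: the sample sizes, means, spreads, and the random seed are assumptions, not values from the baseball data.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# Hypothetical samples standing in for the two groups of batting averages.
left = rng.normal(loc=0.28, scale=0.03, size=200)
right = rng.normal(loc=0.25, scale=0.03, size=200)

# Welch's t-test: does not assume the two groups share a variance.
result = scipy.stats.ttest_ind(left, right, equal_var=False)
t_statistic, p_value = result[0], result[1]

# Reject the null hypothesis of equal means at the 5% level?
reject_null = p_value <= 0.05
print(t_statistic, p_value, reject_null)
```

Because the two synthetic groups are drawn with clearly different means, the p-value comes out far below 0.05 and the null hypothesis is rejected. On real data the difference is rarely this clear-cut.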
Machine Learning: A branch of artificial intelligence focused on constructing systems that learn from large amounts of data to make predictions.
Statistics is focused on analyzing existing data and drawing valid conclusions.
Machine learning is focused on making predictions.