Machine Learning – Page 4 – ソフトウェアエンジニアの技術ブログ：Software engineer tech blog

machine learning anotation

Annotation is critical to AI development and operation.
It is important to create large amount of accurate annotation data
Analysis and improvement of annotation process by machine learning.

Annotations are processes located upstream of pipeline. Therefore, if there are many errors in the annotation, it may have a fatal effect on subsequent processes, including model learning and evaluation(in many cases, evaluation data is also generated by the annotation).

Why is annotation important?
To unify the content to be read from the data.
In the upstream process of the AI pipeline, it has a fatal impact on the leader processes such as model learning and evaluation.

アノテーションが重要そうだ、というのは直ぐに気づきますが、研究が活発な分野っぽいですね。

difference between deep learning and machine learning

先日、deep learningのサービスについて会話をしていたら、私が”machine learning”と言ったら、それは「machine learning ではなくdeep learning」と突っ込まれた。ということで、deep learning と machine learningの違いについて。

Deep learning is a further development of machine learning. The major difference from conventional machine learning is that the framework used to analyze information and data is different. This is a “neural network” created by imitating human nerves, making computer analysis and learning of data powerful.

Although there are AI mechanisms for “machine learning” and “deep learning”, it can be said that there is a difference that automation of functional enhancement is being promoted. In particular, it can be said that the system is evolving in that it automatically finds out where to look for when distinguishing the object of analysis.

メルカリ機械学習

tech系のイベント@六本木ヒルズ18F メルカリ

ビールとご飯を無料配布
ビールは久しぶりに飲んだわ

で、学んだこと
1. apple, amazon, googleをwatchしている
2. 学会の論文を読んで、それをモデリングに落として試している
3. mercariはawsを使っている
4. 機械学習ができて当然は本当だった
5. MLチーム構成は半分日本人、次にインド人
6. 全部出来るやつが強い(当たり前か。。)

どうすっかなー

Amazon ML を触ろう

ガンガン行きます。

というか、Amazon Redshiftって何だ。。。 dataを保存できるのか。

modeling process

モデリングしてる模様

completed

Try real-time prediction

ここで予想します。

うーん、なんとなく流れは分かったようなわからないような。
データセットが用意されていると、単なる作業で頭使わないので駄目ですね。

AWS machine learning

Amazon machine learning モデルの概念図

what is amazon machine learning
amazon ML can be used to make predictions for a variety of purposes. For example, you could build a model in Amazon ML that will predict whether a given customer is likely to respond to a marketing offer. Amazon ML creates models from supervised data sets. This means that the model is based on a set of previous observations. This set of observations consists of features or attributes as well as the target outcome. In the marketing offer example, the features might include the age, profession, and gender of the customer. The target outcome (also called the target variable) would be whether that particular customer responded to the marketing offer or not.

The process of creating a model from a set of known observations is called training. Once you have trained a model in Amazon ML, you can then use the model to predict outcomes from a set of attributes that matches the attributes used to train the model. Amazon ML scales so that you can make thousands of predictions concurrently. This is important, as today machine learning is often used to provide predictions in near real-time. In this lab, you will be using a machine learning model to predict which restaurants a customer is likely to favor based on the results of a search query.

data setをs3のバケットのuploadする。

なるほど、S3はこういう風に使うのね。

machine learningを選択する

流石にまだ翻訳されてないな。

scikit-learnのアルゴリズム

画像認識はこちらになりますね。
Classification : Identifying to which category an object belongs to.
SVM, nearest neighbors, random forest, …
Spam detection, Image recognition.

株価の予測はこちらでしょうか。
Regression：Predicting a continuous-valued attribute associated with an object.
SVR, ridge regression, Lasso, …
Drug response, Stock prices.

属性の分類はこちらです。Web解析でデモグラフィックをグルーピングするにはこちらでしょうね。
Clustering：Automatic grouping of similar objects into sets.
k-Means, spectral clustering, mean-shift, …
Customer segmentation, Grouping experiment outcomes

これは、よくわかりません。変数を減らしていく、ビジュアリゼーションと記載がありますね。。
Dimensionality reduction：Reducing the number of random variables to consider.
Algorithms: PCA, feature selection, non-negative matrix factorization.
Visualization, Increased efficiency

これもあまり馴染みがないですね。モデリングの調整ということは理解ができますが。
Model selection：Comparing, validating and choosing parameters and models.
grid search, cross validation, metrics.
Improved accuracy via parameter tuning

テキストを機械学習に組み込むと書いてますね。どういうことでしょうか。翻訳の精度を上げるとかでしょうか。これは、形態素解析と組み合わせるのでしょうか。面白そうな分野ではありますね。
Preprocessing：Feature extraction and normalization.
preprocessing, feature extraction.
Transforming input data such as text for use with machine learning algorithms.

ということで、Classification、 Regression、Clusteringは割と一般的なモデリングだと思います。

初期は辛いですね。
とにかく量をこなさないと、どうやって学習していったらいいかすらわかりません。

scikit-learnを学ぼう

ホームページから見ていきます
http://scikit-learn.org/stable/

-Simple and efficient tools for data mining and data analysis
-Accessible to everybody, and reusable in various contexts
-Built on NumPy, SciPy, and matplotlib
-Open source, commercially usable – BSD license

BSD license　ってあまりみませんね。
BSD license:カリフォルニア大学によって策定され、同大学のバークレー校内の研究グループ、Computer Systems Research Groupが開発したソフトウェア群であるBSDなどで採用されている。「無保証」であることの明記と著作権およびライセンス条文自身の表示を再頒布の条件とするライセンス規定である。この条件さえ満たせば、BSDライセンスのソースコードを複製・改変して作成したオブジェクトコードをソースコードを公開せずに頒布できる。

著作権・免責の表示が必要ということですね。

sklearn import pandas

import pandas as pd
from sklearn import svm, metrics

xor_input = [
	[0, 0, 0],
	[0, 1, 1],
	[1, 0, 1],
	[1, 1, 0]
]

xor_df = pd.DataFrame(xor_input)
xor_data = xor_df.ix[:,0:1]
xor_label = xor_df.ix[:,2]

clf = svm.SVC()
clf.fit(xor_data, xor_label)
pre = clf.predict(xor_data)

ac_score = metrics.accuracy_score(xor_label, pre)
print(" 正解率=", ac_score)

[vagrant@localhost python]$ python3 app.py
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
正解率= 1.0

こりゃあかん、プログラミングやり始めた時と全く同じだ。
何故こう動いているか理解できん。

scikit-learnのfit()メソッドを使う

いきなりコードから始めます。

from sklearn import svm

xor_data = [
		[0, 0, 0],
		[0, 1, 1],
		[1, 0, 1],
		[1, 1, 0]
] 

data = []
label = []
for row in xor_data:
	p = row[0]
	q = row[1]
	r = row[2]
	data.append([p, q])
	label.append(r)

clf = svm.SVC()
clf.fit(data, label)

pre = clf.predict(data)
print("予測結果:", pre)

ok = 0; total = 0
for idx, answer in enumerate(label):
	p = pre[idx]
	if p == answer: ok += 1
	total += 1
print("正解率:", ok, "/", total, "=", ok/total)

続いてコマンドライン
[vagrant@localhost python]$ python3 app.py
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/vagrant/.pyenv/versions/3.5.2/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
予測結果: [0 1 1 0]
正解率: 4 / 4 = 1.0

なんだこりゃ、いきなり一気に難易度が上がった？！
scikit-learn　－＞　Pythonの機械学習ライブラリ

scikit-learnを入れる

いよいよ来ました、pythonで機械学習 ^^
やっとここまできましたね(祝!)　ワクワクします。

とりあえず、scikit-learnを入れます。
[vagrant@localhost python]$ pip3 install -U scikit-learn scipy matplotlib scikit-image

それから、pandasも入れておきましょう。
[vagrant@localhost python]$ pip3 install pandas
Collecting pandas
Downloading https://files.pythonhosted.org/packages/5d/d4/6e9c56a561f1d27407bf29318ca43f36ccaa289271b805a30034eb3a8ec4/pandas-0.23.4-cp35-cp35m-manylinux1_x86_64.whl (8.7MB)
100% |████████████████████████████████| 8.7MB 735kB/s