Trying out Embulk

in:
  type: file
  path_prefix: "./ratings.csv"
out:
  type: stdout

[vagrant@localhost embulk]$ embulk guess seed.yml -o config.yml
2018-09-30 09:35:40.455 +0900: Embulk v0.9.7
2018-09-30 09:35:43.144 +0900 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2018-09-30 09:35:47.529 +0900 [INFO] (main): Gem's home and path are set by default: "/home/vagrant/.embulk/lib/gems"
2018-09-30 09:35:49.736 +0900 [INFO] (main): Started Embulk v0.9.7
2018-09-30 09:35:49.900 +0900 [INFO] (0001:guess): Listing local files at directory '.' filtering filename by prefix 'ratings.csv'
2018-09-30 09:35:49.912 +0900 [INFO] (0001:guess): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2018-09-30 09:35:49.932 +0900 [INFO] (0001:guess): Loading files [ratings.csv.bak, ratings.csv]
2018-09-30 09:35:49.983 +0900 [INFO] (0001:guess): Try to read 32,768 bytes from input source
2018-09-30 09:35:50.397 +0900 [INFO] (0001:guess): Loaded plugin embulk (0.9.7)
2018-09-30 09:35:50.469 +0900 [INFO] (0001:guess): Loaded plugin embulk (0.9.7)
2018-09-30 09:35:50.516 +0900 [INFO] (0001:guess): Loaded plugin embulk (0.9.7)
2018-09-30 09:35:50.535 +0900 [INFO] (0001:guess): Loaded plugin embulk (0.9.7)
in:
  type: file
  path_prefix: ./ratings.csv
  parser:
    charset: UTF-8
    newline: LF
    type: csv
    delimiter: ','
    quote: '"'
    escape: '"'
    trim_if_not_quoted: false
    skip_header_lines: 1
    allow_extra_columns: false
    allow_optional_columns: false
    columns:
    - {name: id, type: long}
    - {name: restaurant_id, type: long}
    - {name: user_id, type: string}
    - {name: total, type: long}
    - {name: food, type: long}
    - {name: service, type: long}
    - {name: atmosphere, type: long}
    - {name: cost_performance, type: long}
    - {name: title, type: string}
    - {name: body, type: string}
    - {name: purpose, type: long}
    - {name: created_on, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
out: {type: stdout}

Created 'config.yml' file.
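
Note that the guess log lists both ratings.csv.bak and ratings.csv: path_prefix is matched against the beginning of the file name, so a backup copy sitting next to the original gets picked up too. A minimal Python sketch of that kind of prefix matching, assuming a hypothetical helper match_prefix and a made-up file list (this is not Embulk's own code):

```python
def match_prefix(filenames, prefix):
    """Return the file names that start with prefix, sorted alphabetically.

    Hypothetical helper that only illustrates prefix matching.
    """
    return sorted(name for name in filenames if name.startswith(prefix))


# Made-up directory listing for illustration.
files = ["areas.csv", "ratings.csv", "ratings.csv.bak", "rating_votes.csv"]

# Both the CSV and its backup share the prefix "ratings.csv".
print(match_prefix(files, "ratings.csv"))
```

Moving or renaming the .bak file so that it no longer shares the prefix should keep it out of the load.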

[vagrant@localhost embulk]$ embulk preview config.yml
2018-09-30 09:39:04.618 +0900: Embulk v0.9.7
2018-09-30 09:39:07.007 +0900 [WARN] (main): DEPRECATION: JRuby org.jruby.embed.ScriptingContainer is directly injected.
2018-09-30 09:39:11.003 +0900 [INFO] (main): Gem's home and path are set by default: "/home/vagrant/.embulk/lib/gems"
2018-09-30 09:39:12.432 +0900 [INFO] (main): Started Embulk v0.9.7
2018-09-30 09:39:12.612 +0900 [INFO] (0001:preview): Listing local files at directory '.' filtering filename by prefix 'ratings.csv'
2018-09-30 09:39:12.614 +0900 [INFO] (0001:preview): "follow_symlinks" is set false. Note that symbolic links to directories are skipped.
2018-09-30 09:39:12.620 +0900 [INFO] (0001:preview): Loading files [ratings.csv.bak, ratings.csv]
2018-09-30 09:39:12.641 +0900 [INFO] (0001:preview): Try to read 32,768 bytes from input source
2018-09-30 09:39:13.337 +0900 [WARN] (0001:preview): Skipped line 48 (Unexpected end of line during parsing a quoted value): 72860,1076,4e11ad7b,1,0,0,0,0,,"なぜあの店はあんなに行列ができるのだろうと、車で通�

What on earth is this?!
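
The warning says a quoted value was still open when the line ended. One way to narrow this down is to look for lines where the quote characters do not pair up. The following Python sketch does that on an in-memory sample; the helper name lines_with_odd_quotes and the sample rows are made up for illustration, and whether this is the actual cause for ratings.csv is an assumption.

```python
def lines_with_odd_quotes(text, quote='"'):
    """Return 1-based numbers of lines holding an odd number of quote characters.

    On such a line a quoted value either opens without closing or closes
    without opening, which is where a line-oriented reader gets confused.
    """
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.count(quote) % 2 == 1
    ]


# Made-up sample: the last record has a line break inside its quoted body.
sample = (
    'id,title,body\n'
    '1,"ok","one line"\n'
    '2,,"first line\n'
    'second line"\n'
)

print(lines_with_odd_quotes(sample))
```

Running the same check against the real file (read with the right charset) would show whether line 48 falls into this pattern.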

Using Embulk

wget -O test.tar.gz "https://github.com/livedoor/datasets/blob/master/ldgourmet.tar.gz?raw=true"

[vagrant@localhost embulk]$ ls
test.tar.gz
[vagrant@localhost embulk]$ tar xvfz test.tar.gz
areas.csv
categories.csv
prefs.csv
ratings.csv
rating_votes.csv
restaurants.csv
stations.csv

Embulk

What is Embulk?
~Pluggable Bulk Data Loader~
-Parallel data transfer tool
-Developed by Mr. Furuhashi, the developer of Fluentd
-The batch counterpart of Fluentd
-Plugin architecture

An open-source plugin-based parallel bulk data loader that makes painful data integration work relaxed.
Founder & Software Architect, Treasure Data, inc.

CSV Files, S3, SequenceFile, HDFS, MySQL, Salesforce.com
=> bulk load =>
Hive, Elasticsearch, Cassandra, Redis

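Fluentd is for streams; Embulk is for storage
Handles huge data sets (parallel, distributed processing)
High speed, transaction control
Schema-based validation
Executed from the command line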
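
Schema-based validation means each column is declared with a type, and values that do not fit are rejected. A rough Python sketch of the idea, using a few of the column definitions that embulk guess produced for ratings.csv; the function cast_row and the sample values are hypothetical, and this is not how Embulk itself is implemented:

```python
from datetime import datetime

# A subset of the columns guessed for ratings.csv: (name, type, timestamp format).
SCHEMA = [
    ("id", "long", None),
    ("user_id", "string", None),
    ("total", "long", None),
    ("created_on", "timestamp", "%Y-%m-%d %H:%M:%S"),
]


def cast_row(values, schema=SCHEMA):
    """Convert raw CSV strings to typed values, raising ValueError on a mismatch."""
    if len(values) != len(schema):
        raise ValueError(f"expected {len(schema)} columns, got {len(values)}")
    record = {}
    for raw, (name, type_name, fmt) in zip(values, schema):
        if type_name == "long":
            record[name] = int(raw)
        elif type_name == "timestamp":
            record[name] = datetime.strptime(raw, fmt)
        else:  # string columns are passed through unchanged
            record[name] = raw
    return record


# Made-up row for illustration.
print(cast_row(["1", "4e11ad7b", "5", "2018-09-30 09:35:40"]))

# A value that does not fit the declared type is rejected.
try:
    cast_row(["1", "4e11ad7b", "five", "2018-09-30 09:35:40"])
except ValueError as error:
    print("rejected:", error)
```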

Input Plugin
RDBMS (mysql, postgres, jdbc…)
NoSQL (redis, mongodb)
Cloud Service (redshift, s3)
Files (CSV, JSON …)
Etc. (hdfs, http, Elasticsearch, slack-history, Google Analytics)

Output Plugin
RDBMS (mysql, postgres, oracle, jdbc…)
Cloud Service (redshift, s3, bigquery)
NoSQL (redis, hdfs)
Files
Etc. (Elasticsearch, hdfs, swift)

Filter Plugin
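column: drops columns
insert: adds a column, such as the host name, at a specified position
row: extracts only the rows that match the given conditions
rearrange: restructures one row of data into multiple rows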
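
To make the column, insert, and row filters concrete, here is a conceptual Python sketch that treats records as dictionaries. The function names and the sample data are made up; the real plugins are configured in the YAML config rather than written as code like this.

```python
def drop_columns(records, names):
    """column: return records without the listed columns."""
    return [{k: v for k, v in r.items() if k not in names} for r in records]


def insert_column(records, name, value):
    """insert: add a constant column (for example a host name) to every record.

    This sketch always appends the column at the end.
    """
    return [{**r, name: value} for r in records]


def select_rows(records, predicate):
    """row: keep only the records for which predicate(record) is true."""
    return [r for r in records if predicate(r)]


# Made-up records for illustration.
records = [
    {"id": 1, "total": 5, "body": "great"},
    {"id": 2, "total": 2, "body": "so-so"},
]

result = drop_columns(records, {"body"})
result = insert_column(result, "host", "localhost")
result = select_rows(result, lambda r: r["total"] >= 4)
print(result)
```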

File parser Plugin
json
xml
csv
apache log
query_string
regex

File formatter Plugin
json
Plugin that formats records as JSONL (one JSON object per line)
poi_excel
Plugin that converts data to Excel (xls, xlsx) format
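
For reference, the JSONL layout mentioned above is simply one JSON object per line. A small Python sketch of producing it, with a hypothetical helper to_jsonl and made-up records (this only illustrates the output shape, not the plugin itself):

```python
import json


def to_jsonl(records):
    """Serialize each record as one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Made-up records for illustration.
records = [
    {"id": 1, "title": "lunch"},
    {"id": 2, "title": "ランチ"},
]

print(to_jsonl(records), end="")
```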

Executor Plugin
mapreduce
Plugin for running Embulk tasks on Hadoop

First, let's install it.
[vagrant@localhost embulk]$ brew install embulk
[vagrant@localhost embulk]$ embulk --version
embulk 0.9.7

So it's 0.9.7.