dbtをBigQueryで使ってみる
TimelineでdbtとBigQueryに関する記事が流れてきて
そもそもdbtってなんやねんってなったので触ってみる
とりあえず公式のTutorialを触ってみる
dbtのインストール
CloudのEditerかCLIでdbt使えるみたいだけど
CLIで触りたいのでローカルにインストールしてみる
$ brew tap fishtown-analytics/dbt
Updating Homebrew...
...
Tapped 82 formulae (116 files, 1.9MB).
$ brew install dbt
Updating Homebrew...
...
so, you have instead: glibtool and glibtoolize.
$
$ dbt --version
installed version: 0.17.2
latest version: 0.17.2
Up to date!
Plugins:
- bigquery: 0.17.2
- snowflake: 0.17.2
- redshift: 0.17.2
- postgres: 0.17.2
プロジェクトを作成
GitHub Repository作成
今回の検証用にdbt Repositoryを作成
dbtプロジェクトを作成
作成したRepositoryをCloneしてくる
(私はghq使ってるので作成したrepositoryをghq getでcloneした)
$ ghq get git@github.com:shintaro-uchiyama/dbt.git
clone ssh://git@github.com/shintaro-uchiyama/dbt.git -> /Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt
git clone ssh://git@github.com/shintaro-uchiyama/dbt.git /Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt
Cloning into '/Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt'...
warning: You appear to have cloned an empty repository.
$
$ gl [fix/codecov]
$
$ dbt init jaffle-shop
Running with dbt=0.17.2
...
There's a link to our Slack group in the GitHub Readme. Happy modeling!
project設定内容修正
$ git diff dbt_project.yml [master]
diff --git a/jaffle-shop/dbt_project.yml b/jaffle-shop/dbt_project.yml
index 7e1458d..9b4750f 100644
--- a/jaffle-shop/dbt_project.yml
+++ b/jaffle-shop/dbt_project.yml
@@ -2,12 +2,12 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
-name: 'my_new_project'
+name: 'jaffle_shop'
version: '1.0.0'
config-version: 2
# This setting configures which "profile" dbt uses for this project.
-profile: 'default'
+profile: 'jaffle_shop'
# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that models in this project can be
@@ -32,7 +32,7 @@ clean-targets: # directories to be removed by `dbt clean`
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
- my_new_project:
+ jaffle_shop:
# Applies to all files under models/example/
example:
materialized: view
BigQueryに接続するための情報設定
defaultしか書いてないので今回のtutorial情報追記する
使用するGCP Projectでサービスアカウント を作成して
BigQuery管理者の役割設定&json KeyをDownloadしておく
$ cp ~/Downloads/[project_name]-xxx.json ~/.dbt/dbt-user-creds.json
$
$ cat ~/.dbt/profiles.yml
# For more information on how to configure this file, please see:
# https://docs.getdbt.com/docs/profile
default:
outputs:
dev:
type: redshift
threads: 1
host: 127.0.0.1
port: 5439
user: alice
pass: pa55word
dbname: warehouse
schema: dbt_alice
prod:
type: redshift
threads: 1
host: 127.0.0.1
port: 5439
user: alice
pass: pa55word
dbname: warehouse
schema: analytics
target: dev
jaffle_shop: # this needs to match the profile: in your dbt_project.yml file
target: dev
outputs:
dev:
type: bigquery
method: service-account
keyfile: /Users/s.uchiyama/.dbt/dbt-user-creds.json # replace this with the full path to your keyfile
project: [project id] # Replace this with your project id
dataset: dbt_uchiyama # Replace this with dbt_your_name, e.g. dbt_bob
threads: 1
timeout_seconds: 300
location: US
priority: interactive
BigQuery接続できるか確認
$ gl
$
$ cd jaffle-shop
$ dbt debug +[master]
Running with dbt=0.17.2
dbt version: 0.17.2
python version: 3.8.5
python path: /usr/local/Cellar/dbt/0.17.2_1/libexec/bin/python3.8
os info: macOS-10.15.4-x86_64-i386-64bit
Using profiles.yml file at /Users/s.uchiyama/.dbt/profiles.yml
Using dbt_project.yml file at /Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt/jaffle-shop/dbt_project.yml
Configuration:
profiles.yml file [OK found and valid]
dbt_project.yml file [OK found and valid]
Required dependencies:
- git [OK found]
Connection:
method: service-account
database: [project id]
schema: dbt_uchiyama
location: US
priority: interactive
timeout_seconds: 300
maximum_bytes_billed: None
Connection test: OK connection ok
「OK connection ok」てでてる!接続できてそう!
dbt 実行してみる
全く中身把握してないけどステップに実行っぽいコマンド出てきた
$ dbt run
Running with dbt=0.17.2
Found 2 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources
21:51:41 | Concurrency: 1 threads (target='dev')
21:51:41 |
21:51:41 | 1 of 2 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
21:51:45 | 1 of 2 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 3.58s]
21:51:45 | 2 of 2 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
21:51:47 | 2 of 2 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 1.73s]
21:51:47 |
21:51:47 | Finished running 1 table model, 1 view model in 9.34s.
Completed successfully
Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
なんか成功した!何が成功したんや!
指定したprojectのBigQueryにデータセットが増えてて
テーブルとビューができとる!何やら実行はうまくできたみたいじゃ!
モデルを作成する
これ進めて行ったら、さっき何が起きたか理解できそう。
とりあえず言われる通り進めてみる
ブランチ作成
$ git checkout -b add-customers-model
sqlファイル追加
実行したいsqlファイルを追加
$ cat models/customers.sql
with customers as (
select
id as customer_id,
first_name,
last_name
from `dbt-tutorial`.jaffle_shop.customers
),
orders as (
select
id as order_id,
user_id as customer_id,
order_date,
status
from `dbt-tutorial`.jaffle_shop.orders
),
customer_orders as (
select
customer_id,
min(order_date) as first_order_date,
max(order_date) as most_recent_order_date,
count(order_id) as number_of_orders
from orders
group by 1
),
final as (
select
customers.customer_id,
customers.first_name,
customers.last_name,
customer_orders.first_order_date,
customer_orders.most_recent_order_date,
coalesce(customer_orders.number_of_orders, 0) as number_of_orders
from customers
left join customer_orders using (customer_id)
)
select * from final
$
$ dbt run
Running with dbt=0.17.2
Found 3 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources
22:13:00 | Concurrency: 1 threads (target='dev')
22:13:00 |
22:13:00 | 1 of 3 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
22:13:04 | 1 of 3 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 4.18s]
22:13:04 | 2 of 3 START view model dbt_uchiyama.customers....................... [RUN]
22:13:06 | 2 of 3 OK created view model dbt_uchiyama.customers.................. [CREATE VIEW in 1.65s]
22:13:06 | 3 of 3 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
22:13:08 | 3 of 3 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 1.63s]
22:13:08 |
22:13:08 | Finished running 1 table model, 2 view models in 10.37s.
Completed successfully
Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
検証
ビューが追加されとる!
materializedを変える
今回ビューが追加されたけど、config変えるだけでテーブル生成もできる
$ git diff dbt_project.yml
diff --git a/jaffle-shop/dbt_project.yml b/jaffle-shop/dbt_project.yml
index 9b4750f..ed209f5 100644
--- a/jaffle-shop/dbt_project.yml
+++ b/jaffle-shop/dbt_project.yml
@@ -33,6 +33,7 @@ clean-targets: # directories to be removed by `dbt clean`
# using the `{{ config(...) }}` macro.
models:
jaffle_shop:
+ materialized: table
# Applies to all files under models/example/
example:
materialized: view
$
$ dbt run +[add-customers-model]
Running with dbt=0.17.2
Found 3 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources
22:24:27 | Concurrency: 1 threads (target='dev')
22:24:27 |
22:24:27 | 1 of 3 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
22:24:32 | 1 of 3 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 4.82s]
22:24:32 | 2 of 3 START table model dbt_uchiyama.customers...................... [RUN]
22:24:40 | 2 of 3 OK created table model dbt_uchiyama.customers................. [CREATE TABLE (100) in 8.12s]
22:24:40 | 3 of 3 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
22:24:42 | 3 of 3 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 2.01s]
22:24:42 |
22:24:42 | Finished running 2 table models, 1 view model in 17.23s.
Completed successfully
Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
BQにテーブルできてた!
sqlに設定書くこともできる!
$ git diff models/customers.sql +[add-customers-model]
diff --git a/jaffle-shop/models/customers.sql b/jaffle-shop/models/customers.sql
index 6874c30..b542329 100644
--- a/jaffle-shop/models/customers.sql
+++ b/jaffle-shop/models/customers.sql
@@ -1,3 +1,9 @@
+{{
+ config(
+ materialized='view'
+ )
+}}
+
with customers as (
$
$ dbt run +[add-customers-model]
Running with dbt=0.17.2
Found 3 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources
22:25:49 | Concurrency: 1 threads (target='dev')
22:25:49 |
22:25:49 | 1 of 3 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
22:25:52 | 1 of 3 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 3.03s]
22:25:52 | 2 of 3 START view model dbt_uchiyama.customers....................... [RUN]
22:25:52 | 2 of 3 ERROR creating view model dbt_uchiyama.customers.............. [ERROR in 0.08s]
22:25:52 | 3 of 3 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
22:25:54 | 3 of 3 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 2.23s]
22:25:54 |
22:25:54 | Finished running 1 table model, 2 view models in 6.82s.
Completed with 1 error and 0 warnings:
Compilation Error in model customers (models/customers.sql)
Trying to create view `[project id]`.`dbt_uchiyama`.`customers`, but it currently exists as a table. Either drop `[project id]`.`dbt_uchiyama`.`customers` manually, or run dbt with `--full-refresh` and dbt will drop it for you.
> in macro bigquery__handle_existing_table (macros/materializations/view.sql)
> called by macro adapter_macro (macros/adapters/common.sql)
> called by macro handle_existing_table (macros/materializations/view/create_or_replace_view.sql)
> called by macro create_or_replace_view (macros/materializations/view/create_or_replace_view.sql)
> called by macro materialization_view_bigquery (macros/materializations/view.sql)
> called by model customers (models/customers.sql)
Done. PASS=2 WARN=0 ERROR=1 SKIP=0 TOTAL=3
$
$ # tableがすでにいるからviewができないみたい
$ # --full-refresh付けたらできるみたい
$ dbt run --full-refresh
Running with dbt=0.17.2
Found 3 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources
22:31:53 | Concurrency: 1 threads (target='dev')
22:31:53 |
22:31:53 | 1 of 3 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
22:31:57 | 1 of 3 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 4.09s]
22:31:57 | 2 of 3 START view model dbt_uchiyama.customers....................... [RUN]
22:31:59 | 2 of 3 OK created view model dbt_uchiyama.customers.................. [CREATE VIEW in 2.32s]
22:31:59 | 3 of 3 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
22:32:02 | 3 of 3 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 2.38s]
22:32:02 |
22:32:02 | Finished running 1 table model, 2 view models in 11.03s.
Completed successfully
Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
$ # でけた!!
sql変えずに設定(yaml)ファイル変えるだけでいろいろ変えれそうでステキ
withで書いたCTEをファイル分割できる
上記1ファイルにwith句書かれてたけどファイル分割して実行できるみたい
$ cat models/stg_customers.sql
select
id as customer_id,
first_name,
last_name
from `dbt-tutorial`.jaffle_shop.customers
$
$ cat models/stg_orders.sql
select
id as order_id,
user_id as customer_id,
order_date,
status
from `dbt-tutorial`.jaffle_shop.orders
$
$ git diff models/customers.sql
diff --git a/jaffle-shop/models/customers.sql b/jaffle-shop/models/customers.sql
index b542329..2a0f64b 100644
--- a/jaffle-shop/models/customers.sql
+++ b/jaffle-shop/models/customers.sql
@@ -1,29 +1,12 @@
-{{
- config(
- materialized='view'
- )
-}}
-
with customers as (
- select
- id as customer_id,
- first_name,
- last_name
-
- from `dbt-tutorial`.jaffle_shop.customers
+ select * from {{ ref('stg_customers') }}
),
orders as (
- select
- id as order_id,
- user_id as customer_id,
- order_date,
- status
-
- from `dbt-tutorial`.jaffle_shop.orders
+ select * from {{ ref('stg_orders') }}
),
$
$ dbt run
Running with dbt=0.17.2
Found 3 models, 0 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources
22:44:49 | Concurrency: 1 threads (target='dev')
22:44:49 |
22:44:49 | 1 of 3 START table model dbt_uchiyama.stg_customers.................. [RUN]
22:44:53 | 1 of 3 OK created table model dbt_uchiyama.stg_customers............. [CREATE TABLE (100) in 3.90s]
22:44:53 | 2 of 3 START table model dbt_uchiyama.stg_orders..................... [RUN]
22:44:58 | 2 of 3 OK created table model dbt_uchiyama.stg_orders................ [CREATE TABLE (99) in 4.50s]
22:44:58 | 3 of 3 START table model dbt_uchiyama.customers...................... [RUN]
22:45:02 | 3 of 3 OK created table model dbt_uchiyama.customers................. [CREATE TABLE (100) in 4.13s]
22:45:02 |
22:45:02 | Finished running 3 table models in 14.96s.
Completed successfully
Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
テーブルできとる!
with句で指定してたsqlの再利用性高くなるし
設定ファイル等で明示的に依存関係定義しなくていいので素敵!
refで参照するファイル名間違えてたりしたら、エラー出してくれたり
--modelsオプションで実行対象絞ったりいろいろできそう
テストとドキュメント
sqlのテスト書けるっぽい。素敵!
schemaファイルを定義してテスト実行
$ cat models/schema.yml
version: 2
models:
- name: customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: stg_customers
columns:
- name: customer_id
tests:
- unique
- not_null
- name: stg_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: status
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- name: customer_id
tests:
- not_null
- relationships:
to: ref('stg_customers')
field: customer_id
$
$ dbt test
Running with dbt=0.17.2
Found 3 models, 9 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources
23:27:02 | Concurrency: 1 threads (target='dev')
23:27:02 |
23:27:02 | 1 of 9 START test accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [RUN]
23:27:05 | 1 of 9 PASS accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [PASS in 2.88s]
23:27:05 | 2 of 9 START test not_null_customers_customer_id..................... [RUN]
23:27:07 | 2 of 9 PASS not_null_customers_customer_id........................... [PASS in 1.93s]
23:27:07 | 3 of 9 START test not_null_stg_customers_customer_id................. [RUN]
23:27:09 | 3 of 9 PASS not_null_stg_customers_customer_id....................... [PASS in 1.87s]
23:27:09 | 4 of 9 START test not_null_stg_orders_customer_id.................... [RUN]
23:27:12 | 4 of 9 PASS not_null_stg_orders_customer_id.......................... [PASS in 3.07s]
23:27:12 | 5 of 9 START test not_null_stg_orders_order_id....................... [RUN]
23:27:15 | 5 of 9 PASS not_null_stg_orders_order_id............................. [PASS in 3.06s]
23:27:15 | 6 of 9 START test relationships_stg_orders_customer_id__customer_id__ref_stg_customers_ [RUN]
23:27:18 | 6 of 9 PASS relationships_stg_orders_customer_id__customer_id__ref_stg_customers_ [PASS in 3.07s]
23:27:18 | 7 of 9 START test unique_customers_customer_id....................... [RUN]
23:27:22 | 7 of 9 PASS unique_customers_customer_id............................. [PASS in 3.38s]
23:27:22 | 8 of 9 START test unique_stg_customers_customer_id................... [RUN]
23:27:24 | 8 of 9 PASS unique_stg_customers_customer_id......................... [PASS in 2.36s]
23:27:24 | 9 of 9 START test unique_stg_orders_order_id......................... [RUN]
23:27:27 | 9 of 9 PASS unique_stg_orders_order_id............................... [PASS in 2.87s]
23:27:27 |
23:27:27 | Finished running 9 tests in 26.70s.
Completed successfully
Done. PASS=9 WARN=0 ERROR=0 SKIP=0 TOTAL=9
unique, not null, テーブル間のrelation等テストが明示できる素敵!
ドキュメンテーション
ドキュメンテーションもできるみたい。
$ git diff models/schema.yml
diff --git a/jaffle-shop/models/schema.yml b/jaffle-shop/models/schema.yml
index f90cf41..5c72692 100644
--- a/jaffle-shop/models/schema.yml
+++ b/jaffle-shop/models/schema.yml
@@ -2,22 +2,30 @@ version: 2
models:
- name: customers
+ description: One record per customer
columns:
- name: customer_id
+ description: Primary key
tests:
- unique
- not_null
+ - name: first_order_date
+ description: NULL when a customer has not yet placed an order.
- name: stg_customers
+ description: This model cleans up customer data
columns:
- name: customer_id
+ description: Primary key
tests:
- unique
- not_null
- name: stg_orders
+ description: This model cleans up order data
columns:
- name: order_id
+ description: Primary key
tests:
- unique
- not_null
@@ -25,10 +33,3 @@ models:
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
- - name: customer_id
- tests:
- - not_null
- - relationships:
- to: ref('stg_customers')
- field: customer_id
-
$
$ dbt docs generate
Running with dbt=0.17.2
Found 3 models, 7 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources
23:32:52 | Concurrency: 1 threads (target='dev')
23:32:52 |
23:32:52 | Done.
23:32:52 | Building catalog
23:32:56 | Catalog written to /Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt/jaffle-shop/target/catalog.json
$
$ dbt docs serve +[add-customers-model]
Running with dbt=0.17.2
Serving docs at 0.0.0.0:8080
To access from your browser, navigate to: http://localhost:8080
Press Ctrl+C to exit.
http://localhost:8080 にアクセスすると素敵なドキュメントが見れる!
デプロイ
GitHubのPRやスケジューリングを契機に
テスト実行して、対象のBQデータセットデプロイできるみたいやってみる!
dbtアカウント作る
まずはアカウント作る
言われた通り進めていき、作成したGitHubリポジトリ指定すると
最終的にこんな感じのHome画面にたどり着ける
Environment作成
適当な情報入れてEnvironement作成
master branchへのPRでjobを実行したいので custom branchにmaster指定
Jobの作成
適当な情報入れてjob作成
Triggerでmaster branchへのPR作成にする!
PR作成する
GitHubリポジトリと連携がうまく行っていると
jobが走り出してくれる!がこけた。
ローカルではできたのに、deploymentではうまくいかない・・・
dbt_project.ymlファイルが見つけられてないのかと思って
dbt run --project-dir jaffle-shopとかオプション指定してみたが変わらず
公式の動画みてみるとjaffle-shopディレクトリとかなくて
リポジトリ直下にdbt_project.yml配置の構図になってるので真似してみる
成功した!ここだいぶ詰まった。
ディレクトリ深くしたらどうなるの説あるけど、一旦動いたからここまでにしとこう👶
生成されたドキュメントとかみると
依存関係も明示されてたりで素敵
料金
一人で開発してる分には無料なのかしら。
それ以上の場合1人/月50ドルみたい
https://www.getdbt.com/pricing/
まとめ
with句分離して、テストもできるし、GitHubとの連携、ドキュメントも綺麗でかなり使えそう
あとはお金との相談🌝