dbtをBigQueryで使ってみる

TimelineでdbtとBigQueryに関する記事が流れてきて
そもそもdbtってなんやねんってなったので触ってみる
とりあえず公式のTutorialを触ってみる

dbtのインストール

CloudのEditerかCLIでdbt使えるみたいだけど
CLIで触りたいのでローカルにインストールしてみる

$ brew tap fishtown-analytics/dbt
Updating Homebrew...
...
Tapped 82 formulae (116 files, 1.9MB).
$ brew install dbt
Updating Homebrew...
...
so, you have instead: glibtool and glibtoolize.
$
$ dbt --version
installed version: 0.17.2
  latest version: 0.17.2

Up to date!

Plugins:
 - bigquery: 0.17.2
 - snowflake: 0.17.2
 - redshift: 0.17.2
 - postgres: 0.17.2

プロジェクトを作成

GitHub Repository作成
今回の検証用にdbt Repositoryを作成

dbtプロジェクトを作成
作成したRepositoryをCloneしてくる
(私はghq使ってるので作成したrepositoryをghq getでcloneした)

$ ghq get git@github.com:shintaro-uchiyama/dbt.git
    clone ssh://git@github.com/shintaro-uchiyama/dbt.git -> /Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt
      git clone ssh://git@github.com/shintaro-uchiyama/dbt.git /Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt
Cloning into '/Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt'...
warning: You appear to have cloned an empty repository.
$
$ gl                                                                                                                                                                               [fix/codecov]
$
$ dbt init jaffle-shop
Running with dbt=0.17.2
...
There's a link to our Slack group in the GitHub Readme. Happy modeling!

project設定内容修正

$ git diff dbt_project.yml                                                                                                                                                                                 [master]
diff --git a/jaffle-shop/dbt_project.yml b/jaffle-shop/dbt_project.yml
index 7e1458d..9b4750f 100644
--- a/jaffle-shop/dbt_project.yml
+++ b/jaffle-shop/dbt_project.yml
@@ -2,12 +2,12 @@
# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
-name: 'my_new_project'
+name: 'jaffle_shop'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
-profile: 'default'
+profile: 'jaffle_shop'

# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that models in this project can be
@@ -32,7 +32,7 @@ clean-targets:         # directories to be removed by `dbt clean`
# as tables. These settings can be overridden in the individual model files
# using the `{{ config(...) }}` macro.
models:
-  my_new_project:
+  jaffle_shop:
      # Applies to all files under models/example/
      example:
          materialized: view

BigQueryに接続するための情報設定
defaultしか書いてないので今回のtutorial情報追記する
使用するGCP Projectでサービスアカウント を作成して
BigQuery管理者の役割設定&json KeyをDownloadしておく

$ cp ~/Downloads/[project_name]-xxx.json ~/.dbt/dbt-user-creds.json
$
$ cat ~/.dbt/profiles.yml

# For more information on how to configure this file, please see:
# https://docs.getdbt.com/docs/profile

default:
 outputs:
   dev:
     type: redshift
     threads: 1
     host: 127.0.0.1
     port: 5439
     user: alice
     pass: pa55word
     dbname: warehouse
     schema: dbt_alice
   prod:
     type: redshift
     threads: 1
     host: 127.0.0.1
     port: 5439
     user: alice
     pass: pa55word
     dbname: warehouse
     schema: analytics
 target: dev
jaffle_shop: # this needs to match the profile: in your dbt_project.yml file
 target: dev
 outputs:
   dev:
     type: bigquery
     method: service-account
     keyfile: /Users/s.uchiyama/.dbt/dbt-user-creds.json # replace this with the full path to your keyfile
     project: [project id] # Replace this with your project id
     dataset: dbt_uchiyama # Replace this with dbt_your_name, e.g. dbt_bob
     threads: 1
     timeout_seconds: 300
     location: US
     priority: interactive
     

BigQuery接続できるか確認

$ gl
$
$ cd jaffle-shop
$ dbt debug                                                                                                                                                                            +[master]
Running with dbt=0.17.2
dbt version: 0.17.2
python version: 3.8.5
python path: /usr/local/Cellar/dbt/0.17.2_1/libexec/bin/python3.8
os info: macOS-10.15.4-x86_64-i386-64bit
Using profiles.yml file at /Users/s.uchiyama/.dbt/profiles.yml
Using dbt_project.yml file at /Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt/jaffle-shop/dbt_project.yml

Configuration:
 profiles.yml file [OK found and valid]
 dbt_project.yml file [OK found and valid]

Required dependencies:
- git [OK found]

Connection:
 method: service-account
 database: [project id]
 schema: dbt_uchiyama
 location: US
 priority: interactive
 timeout_seconds: 300
 maximum_bytes_billed: None
 Connection test: OK connection ok

「OK connection ok」てでてる!接続できてそう!

dbt 実行してみる
全く中身把握してないけどステップに実行っぽいコマンド出てきた

$ dbt run
Running with dbt=0.17.2
Found 2 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources

21:51:41 | Concurrency: 1 threads (target='dev')
21:51:41 |
21:51:41 | 1 of 2 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
21:51:45 | 1 of 2 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 3.58s]
21:51:45 | 2 of 2 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
21:51:47 | 2 of 2 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 1.73s]
21:51:47 |
21:51:47 | Finished running 1 table model, 1 view model in 9.34s.

Completed successfully

Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2

なんか成功した!何が成功したんや!

スクリーンショット 2020-08-01 21.56.39

指定したprojectのBigQueryにデータセットが増えてて
テーブルとビューができとる!何やら実行はうまくできたみたいじゃ!

モデルを作成する

これ進めて行ったら、さっき何が起きたか理解できそう。
とりあえず言われる通り進めてみる

ブランチ作成

$ git checkout -b add-customers-model

sqlファイル追加
実行したいsqlファイルを追加

$ cat models/customers.sql
with customers as (

   select
       id as customer_id,
       first_name,
       last_name

   from `dbt-tutorial`.jaffle_shop.customers

),

orders as (

   select
       id as order_id,
       user_id as customer_id,
       order_date,
       status

   from `dbt-tutorial`.jaffle_shop.orders

),

customer_orders as (

   select
       customer_id,

       min(order_date) as first_order_date,
       max(order_date) as most_recent_order_date,
       count(order_id) as number_of_orders

   from orders

   group by 1

),


final as (

   select
       customers.customer_id,
       customers.first_name,
       customers.last_name,
       customer_orders.first_order_date,
       customer_orders.most_recent_order_date,
       coalesce(customer_orders.number_of_orders, 0) as number_of_orders

   from customers

   left join customer_orders using (customer_id)

)

select * from final
$
$ dbt run
Running with dbt=0.17.2
Found 3 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources

22:13:00 | Concurrency: 1 threads (target='dev')
22:13:00 |
22:13:00 | 1 of 3 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
22:13:04 | 1 of 3 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 4.18s]
22:13:04 | 2 of 3 START view model dbt_uchiyama.customers....................... [RUN]
22:13:06 | 2 of 3 OK created view model dbt_uchiyama.customers.................. [CREATE VIEW in 1.65s]
22:13:06 | 3 of 3 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
22:13:08 | 3 of 3 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 1.63s]
22:13:08 |
22:13:08 | Finished running 1 table model, 2 view models in 10.37s.

Completed successfully

Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

検証

スクリーンショット 2020-08-01 22.15.56

ビューが追加されとる!

materializedを変える
今回ビューが追加されたけど、config変えるだけでテーブル生成もできる

 $ git diff dbt_project.yml
diff --git a/jaffle-shop/dbt_project.yml b/jaffle-shop/dbt_project.yml
index 9b4750f..ed209f5 100644
--- a/jaffle-shop/dbt_project.yml
+++ b/jaffle-shop/dbt_project.yml
@@ -33,6 +33,7 @@ clean-targets:         # directories to be removed by `dbt clean`
# using the `{{ config(...) }}` macro.
models:
  jaffle_shop:
+      materialized: table
      # Applies to all files under models/example/
      example:
          materialized: view
$
$ dbt run                                                                                                                                                                 +[add-customers-model]
Running with dbt=0.17.2
Found 3 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources

22:24:27 | Concurrency: 1 threads (target='dev')
22:24:27 |
22:24:27 | 1 of 3 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
22:24:32 | 1 of 3 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 4.82s]
22:24:32 | 2 of 3 START table model dbt_uchiyama.customers...................... [RUN]
22:24:40 | 2 of 3 OK created table model dbt_uchiyama.customers................. [CREATE TABLE (100) in 8.12s]
22:24:40 | 3 of 3 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
22:24:42 | 3 of 3 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 2.01s]
22:24:42 |
22:24:42 | Finished running 2 table models, 1 view model in 17.23s.

Completed successfully

Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

BQにテーブルできてた!
sqlに設定書くこともできる!

$ git diff models/customers.sql                                                                                                                                           +[add-customers-model]
diff --git a/jaffle-shop/models/customers.sql b/jaffle-shop/models/customers.sql
index 6874c30..b542329 100644
--- a/jaffle-shop/models/customers.sql
+++ b/jaffle-shop/models/customers.sql
@@ -1,3 +1,9 @@
+{{
+  config(
+    materialized='view'
+  )
+}}
+
with customers as (
$
$ dbt run                                                                                                                                                                 +[add-customers-model]
Running with dbt=0.17.2
Found 3 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources

22:25:49 | Concurrency: 1 threads (target='dev')
22:25:49 |
22:25:49 | 1 of 3 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
22:25:52 | 1 of 3 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 3.03s]
22:25:52 | 2 of 3 START view model dbt_uchiyama.customers....................... [RUN]
22:25:52 | 2 of 3 ERROR creating view model dbt_uchiyama.customers.............. [ERROR in 0.08s]
22:25:52 | 3 of 3 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
22:25:54 | 3 of 3 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 2.23s]
22:25:54 |
22:25:54 | Finished running 1 table model, 2 view models in 6.82s.

Completed with 1 error and 0 warnings:

Compilation Error in model customers (models/customers.sql)
 Trying to create view `[project id]`.`dbt_uchiyama`.`customers`, but it currently exists as a table. Either drop `[project id]`.`dbt_uchiyama`.`customers` manually, or run dbt with `--full-refresh` and dbt will drop it for you.

 > in macro bigquery__handle_existing_table (macros/materializations/view.sql)
 > called by macro adapter_macro (macros/adapters/common.sql)
 > called by macro handle_existing_table (macros/materializations/view/create_or_replace_view.sql)
 > called by macro create_or_replace_view (macros/materializations/view/create_or_replace_view.sql)
 > called by macro materialization_view_bigquery (macros/materializations/view.sql)
 > called by model customers (models/customers.sql)

Done. PASS=2 WARN=0 ERROR=1 SKIP=0 TOTAL=3
$ 
$ # tableがすでにいるからviewができないみたい
$ # --full-refresh付けたらできるみたい
$ dbt run --full-refresh
Running with dbt=0.17.2
Found 3 models, 4 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources

22:31:53 | Concurrency: 1 threads (target='dev')
22:31:53 |
22:31:53 | 1 of 3 START table model dbt_uchiyama.my_first_dbt_model............. [RUN]
22:31:57 | 1 of 3 OK created table model dbt_uchiyama.my_first_dbt_model........ [CREATE TABLE (2) in 4.09s]
22:31:57 | 2 of 3 START view model dbt_uchiyama.customers....................... [RUN]
22:31:59 | 2 of 3 OK created view model dbt_uchiyama.customers.................. [CREATE VIEW in 2.32s]
22:31:59 | 3 of 3 START view model dbt_uchiyama.my_second_dbt_model............. [RUN]
22:32:02 | 3 of 3 OK created view model dbt_uchiyama.my_second_dbt_model........ [CREATE VIEW in 2.38s]
22:32:02 |
22:32:02 | Finished running 1 table model, 2 view models in 11.03s.

Completed successfully

Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3
$ # でけた!!

sql変えずに設定(yaml)ファイル変えるだけでいろいろ変えれそうでステキ

withで書いたCTEをファイル分割できる
上記1ファイルにwith句書かれてたけどファイル分割して実行できるみたい

$ cat models/stg_customers.sql
select
   id as customer_id,
   first_name,
   last_name

from `dbt-tutorial`.jaffle_shop.customers
$
$ cat models/stg_orders.sql
select
   id as order_id,
   user_id as customer_id,
   order_date,
   status

from `dbt-tutorial`.jaffle_shop.orders
$
$ git diff models/customers.sql
diff --git a/jaffle-shop/models/customers.sql b/jaffle-shop/models/customers.sql
index b542329..2a0f64b 100644
--- a/jaffle-shop/models/customers.sql
+++ b/jaffle-shop/models/customers.sql
@@ -1,29 +1,12 @@
-{{
-  config(
-    materialized='view'
-  )
-}}
-
with customers as (

-    select
-        id as customer_id,
-        first_name,
-        last_name
-
-    from `dbt-tutorial`.jaffle_shop.customers
+    select * from {{ ref('stg_customers') }}

),

orders as (

-    select
-        id as order_id,
-        user_id as customer_id,
-        order_date,
-        status
-
-    from `dbt-tutorial`.jaffle_shop.orders
+    select * from {{ ref('stg_orders') }}

),
$
$ dbt run
Running with dbt=0.17.2
Found 3 models, 0 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources

22:44:49 | Concurrency: 1 threads (target='dev')
22:44:49 |
22:44:49 | 1 of 3 START table model dbt_uchiyama.stg_customers.................. [RUN]
22:44:53 | 1 of 3 OK created table model dbt_uchiyama.stg_customers............. [CREATE TABLE (100) in 3.90s]
22:44:53 | 2 of 3 START table model dbt_uchiyama.stg_orders..................... [RUN]
22:44:58 | 2 of 3 OK created table model dbt_uchiyama.stg_orders................ [CREATE TABLE (99) in 4.50s]
22:44:58 | 3 of 3 START table model dbt_uchiyama.customers...................... [RUN]
22:45:02 | 3 of 3 OK created table model dbt_uchiyama.customers................. [CREATE TABLE (100) in 4.13s]
22:45:02 |
22:45:02 | Finished running 3 table models in 14.96s.

Completed successfully

Done. PASS=3 WARN=0 ERROR=0 SKIP=0 TOTAL=3

スクリーンショット 2020-08-01 23.03.14

テーブルできとる!
with句で指定してたsqlの再利用性高くなるし
設定ファイル等で明示的に依存関係定義しなくていいので素敵!

refで参照するファイル名間違えてたりしたら、エラー出してくれたり
--modelsオプションで実行対象絞ったりいろいろできそう

テストとドキュメント

sqlのテスト書けるっぽい。素敵!
schemaファイルを定義してテスト実行

$ cat models/schema.yml
version: 2

models:
 - name: customers
   columns:
     - name: customer_id
       tests:
         - unique
         - not_null

 - name: stg_customers
   columns:
     - name: customer_id
       tests:
         - unique
         - not_null

 - name: stg_orders
   columns:
     - name: order_id
       tests:
         - unique
         - not_null
     - name: status
       tests:
         - accepted_values:
             values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
     - name: customer_id
       tests:
         - not_null
         - relationships:
             to: ref('stg_customers')
             field: customer_id
$
$ dbt test
Running with dbt=0.17.2
Found 3 models, 9 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources

23:27:02 | Concurrency: 1 threads (target='dev')
23:27:02 |
23:27:02 | 1 of 9 START test accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [RUN]
23:27:05 | 1 of 9 PASS accepted_values_stg_orders_status__placed__shipped__completed__return_pending__returned [PASS in 2.88s]
23:27:05 | 2 of 9 START test not_null_customers_customer_id..................... [RUN]
23:27:07 | 2 of 9 PASS not_null_customers_customer_id........................... [PASS in 1.93s]
23:27:07 | 3 of 9 START test not_null_stg_customers_customer_id................. [RUN]
23:27:09 | 3 of 9 PASS not_null_stg_customers_customer_id....................... [PASS in 1.87s]
23:27:09 | 4 of 9 START test not_null_stg_orders_customer_id.................... [RUN]
23:27:12 | 4 of 9 PASS not_null_stg_orders_customer_id.......................... [PASS in 3.07s]
23:27:12 | 5 of 9 START test not_null_stg_orders_order_id....................... [RUN]
23:27:15 | 5 of 9 PASS not_null_stg_orders_order_id............................. [PASS in 3.06s]
23:27:15 | 6 of 9 START test relationships_stg_orders_customer_id__customer_id__ref_stg_customers_ [RUN]
23:27:18 | 6 of 9 PASS relationships_stg_orders_customer_id__customer_id__ref_stg_customers_ [PASS in 3.07s]
23:27:18 | 7 of 9 START test unique_customers_customer_id....................... [RUN]
23:27:22 | 7 of 9 PASS unique_customers_customer_id............................. [PASS in 3.38s]
23:27:22 | 8 of 9 START test unique_stg_customers_customer_id................... [RUN]
23:27:24 | 8 of 9 PASS unique_stg_customers_customer_id......................... [PASS in 2.36s]
23:27:24 | 9 of 9 START test unique_stg_orders_order_id......................... [RUN]
23:27:27 | 9 of 9 PASS unique_stg_orders_order_id............................... [PASS in 2.87s]
23:27:27 |
23:27:27 | Finished running 9 tests in 26.70s.

Completed successfully

Done. PASS=9 WARN=0 ERROR=0 SKIP=0 TOTAL=9

unique, not null, テーブル間のrelation等テストが明示できる素敵!

ドキュメンテーション

ドキュメンテーションもできるみたい。

$ git diff models/schema.yml
diff --git a/jaffle-shop/models/schema.yml b/jaffle-shop/models/schema.yml
index f90cf41..5c72692 100644
--- a/jaffle-shop/models/schema.yml
+++ b/jaffle-shop/models/schema.yml
@@ -2,22 +2,30 @@ version: 2

models:
  - name: customers
+    description: One record per customer
    columns:
      - name: customer_id
+        description: Primary key
        tests:
          - unique
          - not_null
+      - name: first_order_date
+        description: NULL when a customer has not yet placed an order.

  - name: stg_customers
+    description: This model cleans up customer data
    columns:
      - name: customer_id
+        description: Primary key
        tests:
          - unique
          - not_null

  - name: stg_orders
+    description: This model cleans up order data
    columns:
      - name: order_id
+        description: Primary key
        tests:
          - unique
          - not_null
@@ -25,10 +33,3 @@ models:
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'return_pending', 'returned']
-      - name: customer_id
-        tests:
-          - not_null
-          - relationships:
-              to: ref('stg_customers')
-              field: customer_id
-
$
$ dbt docs generate
Running with dbt=0.17.2
Found 3 models, 7 tests, 0 snapshots, 0 analyses, 147 macros, 0 operations, 0 seed files, 0 sources

23:32:52 | Concurrency: 1 threads (target='dev')
23:32:52 |
23:32:52 | Done.
23:32:52 | Building catalog
23:32:56 | Catalog written to /Users/s.uchiyama/src/github.com/shintaro-uchiyama/dbt/jaffle-shop/target/catalog.json
$
$ dbt docs serve                                                                                                                                                          +[add-customers-model]
Running with dbt=0.17.2
Serving docs at 0.0.0.0:8080
To access from your browser, navigate to:  http://localhost:8080
Press Ctrl+C to exit.

http://localhost:8080 にアクセスすると素敵なドキュメントが見れる!

画像4

デプロイ

GitHubのPRやスケジューリングを契機に
テスト実行して、対象のBQデータセットデプロイできるみたいやってみる!

dbtアカウント作る
まずはアカウント作る
言われた通り進めていき、作成したGitHubリポジトリ指定すると
最終的にこんな感じのHome画面にたどり着ける

画像5

Environment作成
適当な情報入れてEnvironement作成
master branchへのPRでjobを実行したいので custom branchにmaster指定

画像6

Jobの作成
適当な情報入れてjob作成
Triggerでmaster branchへのPR作成にする!

画像7

PR作成する
GitHubリポジトリと連携がうまく行っていると
jobが走り出してくれる!がこけた。

スクリーンショット 2020-08-02 10.10.52

ローカルではできたのに、deploymentではうまくいかない・・・
dbt_project.ymlファイルが見つけられてないのかと思って
dbt run --project-dir jaffle-shopとかオプション指定してみたが変わらず

公式の動画みてみるとjaffle-shopディレクトリとかなくて
リポジトリ直下にdbt_project.yml配置の構図になってるので真似してみる

スクリーンショット 2020-08-02 10.16.49

成功した!ここだいぶ詰まった。
ディレクトリ深くしたらどうなるの説あるけど、一旦動いたからここまでにしとこう👶

生成されたドキュメントとかみると
依存関係も明示されてたりで素敵

スクリーンショット 2020-08-02 10.23.36

料金

一人で開発してる分には無料なのかしら。
それ以上の場合1人/月50ドルみたい
https://www.getdbt.com/pricing/

まとめ  

with句分離して、テストもできるし、GitHubとの連携、ドキュメントも綺麗でかなり使えそう
あとはお金との相談🌝

いいなと思ったら応援しよう!