Image created with Midjourney by the author.

An Overview of dbt for Beginners

Transform data reliably with SQL and YAML files

Hennie de Harder
12 min read · Jul 1, 2023


Are you often dealing with data that is inconsistent and inaccurate? Are you working on complex data transformation pipelines and looking for ways to track changes to your code and easily roll back to previous versions? Do you want to collaborate on the data pipeline with data engineers, data analysts or other stakeholders?

If your answer to one of the questions above is yes and you aren’t using Data Build Tool (dbt) yet, this post can help you. It explains why dbt might be the tool for you and gives a short introduction to setting it up, the different components of dbt, and how to schedule dbt jobs.

What is Data Build Tool and how can it help you?

First things first. The modern data stack is a term you might have heard of. In essence, it refers to a collection of tools used to collect, process, store, and analyze data on a cloud-based data platform. It has major benefits: the tools are easy to scale, easy to deploy, and each tool is a configurable component. Instead of ETL, the data is loaded before it is transformed, in a fully automated ELT pipeline. The advantage of this is that you only need to transform the data when you are going to use it.

Modern data stack with commonly used tools. This post focuses on dbt (the transformation part). Image by author.

How does dbt relate to the modern data stack? As you can see in the image above, dbt is the go-to tool for the data transformation part of the ELT pipeline. Dbt uses SQL and YAML to define the transformations. It is really easy to use, even if you have limited programming experience. Dbt is designed to be modular and extensible, allowing users to create their own macros, custom scripts, and plugins. This makes it easy to customize dbt to meet specific needs and integrate it with other tools in the data stack. Another great part of dbt is that it comes with built-in testing and documentation functionality.

If you are still not convinced, a big pro of dbt is that it is built with software engineering principles in mind: you can manage your code with a version control system like Git, so it is possible to collaborate easily and test in a safe, isolated environment before deploying to production.

Setting up dbt

How can you set up dbt? Well, it’s easy! Below are the steps you should take to set it up locally:

Step 1. Set up a database you want to use
Dbt can run on many different data platforms, like Snowflake, BigQuery or Postgres. You can follow along with Snowflake: they have a free trial available, and you can create an account on their sign-up page. But feel free to use another platform that suits your use case.

Step 2. Create a local folder with the name of the project and install a virtual environment
Run the following lines at the terminal:

mkdir <project_name>
cd <project_name>
virtualenv venv
. venv/bin/activate

Step 3. Install dbt with pip
The next step is to install dbt with pip. If you don’t use Snowflake, but e.g. BigQuery, you have to replace dbt-snowflake with the correct adapter name.

pip install dbt-snowflake

Step 4. Create a dbt project and connect it to Snowflake
Almost done! The last step is to run dbt init <dbt_project_name> to set up the connection.
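In the terminal, with the virtual environment still active, this looks as follows (the project name is a placeholder):

dbt init <dbt_project_name>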

For this step, you need to provide the account name, database, password, role, schema, threads, user and warehouse. You might need to create a user, warehouse and role if you started with a fresh database.

You can test your connection with the dbt debug command. If it runs without errors you are good to go!

If this is your first project, dbt will create the .dbt folder (usually in your user folder). The .dbt folder contains the profiles.yml file, where you can check the profile and edit it if you like. This file is kept separate from the project, so no passwords will accidentally be sent to Git.
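For reference, a Snowflake profile in profiles.yml looks roughly like this; all values below are placeholders you have to replace with your own:

dbt_project_name:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: <account_identifier>
      user: <user>
      password: <password>
      role: <role>
      database: <database>
      warehouse: <warehouse>
      schema: <schema>
      threads: 1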

If you have trouble setting up a correct profile, make sure you created the right user, role and compute warehouse in the database itself, and grant access to the user. Here you can find more info about setting up a profile for dbt and different platforms.

A cloud setup is even easier. Sign up through this link and you can directly start setting up the connection to a database of your choice.

Components of dbt

Everything is set up, and inside your project folder a new folder is created that will contain all the dbt code. Inside it, you find different folders, like analyses, macros, models, and so on.

If you want to work with dbt in a nice and easy way, I recommend opening the folder in VS Code and installing the dbt Power User extension.

Folders and files created by dbt. Image by author.

Let’s cover these files and folders one by one.

dbt_project.yml

This YAML file serves as the project-level configuration file. It contains the name of your project, the version, and configurations for models and paths. You can change folder names or paths, and you can configure how you want to materialize specific files. There are four materialization types: view, table, incremental (useful for large fact tables) and ephemeral (the model is interpolated as a common table expression).

Below is an example of a dbt_project.yml file that is almost unchanged after initializing the project.

# Name your project! Project names should contain only lowercase characters
# and underscores. A good package name should reflect your organization's
# name or the intended use of these models
name: 'dbt_project_name'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'dbt_project_name'

# These configurations specify where dbt should look for different types of files.
# The `model-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

clean-targets: # directories to be removed by `dbt clean`
  - "target"
  - "dbt_packages"


# Configuring models
# Full documentation: https://docs.getdbt.com/docs/configuring-models

# In this example config, we tell dbt to build all models as views by default,
# and models in the dim/ directory as tables. These settings can be overridden
# in the individual model files using the `{{ config(...) }}` macro.
models:
  dbt_project_name:
    +materialized: view
    +post-hook:
      - "GRANT SELECT ON {{ this }} TO ROLE REPORTING"
    dim:
      +materialized: table

models

The models folder is the place where you will define your dbt models. A model is just a SQL file that defines how to transform and build your data. You can create subfolders within this folder to organize your models based on different logical groupings, e.g. to split your dimension and fact models.

Besides folders with SQL files that contain model definitions, you can add YAML files with configurations and properties.

When you run the dbt run command, dbt executes the SQL transformations defined in your models and creates or updates tables or views in your database accordingly. (Analyses are only compiled, and snapshots have their own command, as we will see below.)
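You don’t always have to build everything: with the --select flag you can run a subset of models. A few examples (the model name is a placeholder):

dbt run                          # run all models
dbt run --select <model_name>    # run a single model
dbt run --select <model_name>+   # run a model and everything downstream of it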

You can override the materialization directly in the SQL file of the model:

{{ config(materialized='table') }}

WITH raw_stores AS (
    SELECT * FROM {{ source('src', 'stores') }}
)

SELECT
    id AS store_id,
    name AS store_name,
    location_id,
    is_active,
    created_at,
    updated_at
FROM
    raw_stores

In this example, the code includes Jinja template syntax within the double curly braces ({{ }}). Jinja is a templating engine that allows for dynamic content generation within the SQL code. The configuration option (table) is dynamically set, and the source table (named stores) is referenced dynamically.
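Note that source('src', 'stores') only resolves if the source is defined in a YAML file in the models folder. A minimal sketch of such a definition (database and schema names are placeholders, and the locations table is assumed for dim_locations):

version: 2

sources:
  - name: src
    database: <database>
    schema: <schema>
    tables:
      - name: stores
      - name: locations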

In the models folder, you will often find YAML files that define the sources or schema. In these files you can specify tests, add documentation or define relationships between tables. Let’s call the file below schema.yml:

version: 2

models:
  - name: dim_stores
    description: Cleansed table which contains store info.
    tests:
      - dbt_expectations.expect_table_row_count_to_equal_other_table:
          compare_model: source('src', 'stores')
          factor: 1
    columns:

      - name: store_id
        description: Primary key for the store.
        tests:
          - unique
          - not_null

      - name: location_id
        description: The id of the location. References the locations table.
        tests:
          - not_null
          - relationships:
              to: ref('dim_locations')
              field: location_id

      - name: is_active
        description: '{{ doc("dim_stores__is_active") }}'
        tests:
          - accepted_values:
              values: ['true', 'false']

      - name: created_at
        tests:
          - dbt_expectations.expect_column_values_to_be_of_type:
              column_type: datetime
              config:
                severity: warn

  - name: dim_locations
    description: Cleansed table which contains location info.
    columns:

      - name: location_id
        description: Primary key for the location.
        tests:
          - not_null
          - unique

      - name: location_name
        description: Name of the location.

      - name: country
        description: '{{ doc("dim_locations__country") }}'

We define two models (dim_stores and dim_locations), add descriptions of tables and columns (used in the docs) and add tests at table and column level. The table-level test dbt_expectations.expect_table_row_count_to_equal_other_table on dim_stores uses the external package dbt_expectations.

macros

Here you can store custom SQL macros. Macros come in handy when you want to reuse pieces of SQL code: you can invoke them in your dbt models or tests. There are also many built-in macros in dbt.

Often, macros make use of Jinja templates (as we’ve seen before in the SQL file under models). This is an example of a generic test macro, no_negative_values.sql, which checks that a column doesn’t contain negative values:

{% test no_negative_values(model, column_name) %}
SELECT
    *
FROM
    {{ model }}
WHERE
    {{ column_name }} < 0
{% endtest %}

We’ll get back to this example under tests.
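Not every macro has to be a test. As a hypothetical example, a plain macro that converts cents to dollars could look like this (the cast syntax assumes Snowflake or Postgres):

-- macros/cents_to_dollars.sql (hypothetical example)
{% macro cents_to_dollars(column_name, precision=2) %}
    ({{ column_name }} / 100)::numeric(16, {{ precision }})
{% endmacro %}

In a model you would then write {{ cents_to_dollars('price_cents') }} AS price_dollars, and dbt renders the macro into plain SQL at compile time.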

snapshots

Snapshots record changes to a mutable table over time. This is important for “looking back in time” at previous states of the data. In the snapshots folder, you can define the snapshots as select statements and configure them. You can run only snapshots by running dbt snapshot.

In the following example we use the source table stores:

{% snapshot scd_stores %}

{{
    config(
        target_schema='dev',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at',
        invalidate_hard_deletes=True
    )
}}

SELECT * FROM {{ source('src', 'stores') }}

{% endsnapshot %}

analyses

The analyses folder is used for storing analytical queries. These SQL files are compiled by dbt but not materialized in your database, so they are not part of the core data transformation process! They are used for ad-hoc analysis or data exploration. So when all you need to do is save an ad-hoc query, add your SQL file here.
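An analysis file can use ref() just like a model. A small sketch, reusing the (example) models from this post:

-- analyses/stores_per_country.sql (hypothetical example)
SELECT
    l.country,
    COUNT(*) AS number_of_stores
FROM {{ ref('dim_stores') }} AS s
JOIN {{ ref('dim_locations') }} AS l ON s.location_id = l.location_id
GROUP BY l.country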

tests

As you might expect, here you can create your tests. Tests help ensure the quality and correctness of your data transformations. A test works by creating a select statement, and dbt expects that the result set should be empty. If it’s not, dbt will raise an error (or warning). You can structure your tests by creating subfolders within this folder.

To execute the macro we created earlier on the stores table and the n_employees column (as an example), we create a SQL file in the tests folder:

{{ no_negative_values(model=ref('stores'), column_name='n_employees') }}

By running dbt test, dbt checks the quality and correctness of your data transformations based on the defined tests and criteria.

This is not the only way to define tests! You can also specify tests in a <filename>.yml file in the models folder, as shown under models in the schema.yml file.
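Because no_negative_values was written as a generic test (with the {% test %} block), it can also be attached to a column in a YAML file. A minimal sketch, assuming a hypothetical n_employees column on dim_stores:

models:
  - name: dim_stores
    columns:
      - name: n_employees
        tests:
          - no_negative_values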

seeds

Within the seeds folder, you can put small data files, e.g. a CSV file, that will be added to the data warehouse. Each seed file typically corresponds to a specific table and contains the data rows to be inserted.

If you want to execute the seeds and populate your tables with initial data, you need to run the dbt seed command.
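As an example, a small (hypothetical) lookup file seeds/country_codes.csv would simply contain the raw rows; after running dbt seed it becomes a table you can reference with ref('country_codes') in your models:

country_code,country_name
NL,Netherlands
SE,Sweden
DE,Germany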

packages.yml

Often you will need external packages in your dbt projects. You can define them in this YAML file. The dependencies will be installed when you run the dbt deps command. In this hub you can find external packages.

There are different ways to specify packages, depending on where the package is stored:

packages:
  - package: calogica/dbt_expectations
    version: [">=0.8.0", "<0.9.0"]
  - git: "https://github.com/dbt-labs/dbt-utils.git" # git URL
    revision: 0.9.2 # tag or branch name

After installation, we can use these packages. You can find two examples in the schema.yml file under models, where we use dbt_expectations for tests (one at table level and one at column level).

logs

When you run dbt commands like dbt run and dbt test, dbt will create a log file to provide detailed information about the execution process and any errors or warnings encountered.

target

This folder is created when you run the dbt compile or dbt run commands. It contains the compiled SQL code for your models and other artifacts generated by dbt during the execution of your project.

documentation and assets

Other topics worth mentioning briefly are assets and documentation. In the assets folder you can put external files, like images, to use for example in the documentation. Speaking of documentation: you can specify it in YAML files or in standalone markdown files. As an example, you can take a look at the schema.yml file under models, where documentation is defined in the description properties. You can directly type the text in the description field, or choose to put all the documentation in a markdown file:

{% docs dim_stores__is_active %}
Boolean value that shows if a store is currently active.

{% enddocs %}

{% docs dim_locations__country %}
Country of the location.

Currently, we have stores in the following countries:
![colored countries](assets/countries.png)

{% enddocs %}

In the description property of a YAML file you can reference the first docs block with '{{ doc("dim_stores__is_active") }}' (as shown in the schema.yml file). The dim_locations__country example shows a reference to an image in the assets folder with the name countries.png.
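To generate and browse the documentation site locally, you can use the dbt docs commands:

dbt docs generate   # compile the project and build the documentation artifacts
dbt docs serve      # serve the documentation website locally in your browser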

hooks

You can also specify hooks in dbt. Hooks are SQL statements that run before or after a model, seed, or snapshot is built. You can configure hooks at project, subfolder or model level. There are different hook types, like on_run_start, on_run_end, pre-hook and post-hook.

In the dbt_project.yml file you can find an example of a post-hook for the project, where the reporting role is granted read access.
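At model level, you can set a hook in the config block of the SQL file itself. A minimal sketch (the role name is just an example):

{{
    config(
        materialized='table',
        post_hook=["GRANT SELECT ON {{ this }} TO ROLE REPORTING"]
    )
}}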

exposures

Exposures are a feature that allows you to describe downstream uses of your dbt models, such as dashboards, applications, or data science pipelines. An exposure declares which models it depends on, so it shows up in the documentation and lineage graph. This provides a convenient way to share how the results of your data transformations are consumed in a structured and controlled manner.
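A minimal sketch of an exposure definition in a YAML file under models (the name and owner are hypothetical):

version: 2

exposures:
  - name: store_dashboard
    type: dashboard
    owner:
      name: Analytics team
      email: analytics@example.com
    depends_on:
      - ref('dim_stores')
      - ref('dim_locations')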

Scheduling dbt jobs

How can you schedule dbt jobs? Of course, it depends on the infrastructure and tools available in your organization. To give you an idea, here are some common approaches:

  • A simple way to schedule dbt jobs is to use cron jobs. You can create a cron job that runs your dbt command at a specific time or interval (see the sketch after this list).
  • Apache Airflow is a popular open-source platform for managing and scheduling data pipelines. You can use Airflow to create a DAG that includes your dbt job and schedule it.
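For example, a crontab entry that runs the project every night at 2 AM could look like this (all paths are placeholders):

# run the dbt project every night at 2 AM
0 2 * * * cd /path/to/<project_name> && . venv/bin/activate && dbt run >> dbt_run.log 2>&1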

Other options are Jenkins, AWS Data Pipeline or custom scripts.

Are there any cons of using dbt?

To be honest, it’s quite hard to find convincing reasons for not using dbt. It’s easy to use, open source, and modular. But there are some limitations you should be aware of.

Perhaps the biggest con is the following: dbt is primarily a SQL-based tool. If you’re not comfortable writing SQL queries, you may find it challenging to work with. Especially if you’re new to SQL or the concepts of data modeling and transformation, dbt can be hard. You have to understand the best practices for structuring models, writing efficient SQL queries, and managing dependencies. This can take time and effort.

Besides this, there are some other limitations you should be aware of. There are limitations in versioning: you can utilize external version control systems like Git to manage your dbt project’s code, but there is no native support for tracking changes to individual models or managing conflicts in model definitions. Another limitation has to do with real-time data pipelines. Dbt is designed primarily for batch data processing and is not optimized for real-time data pipelines or streaming data. If you have needs that are less common, you have to create your own workarounds and develop custom integrations.

Conclusion

Dbt is a powerful and user-friendly tool for managing data transformations and building analytics pipelines. Its SQL-centric approach, combined with features like models, tests, macros, and documentation generation, empowers data practitioners to create scalable, maintainable, and reliable data workflows.

You can quickly get started with transforming and modeling data, even without extensive programming knowledge. Its intuitive structure and command-line interface make it accessible to users at various skill levels, enabling them to focus on the logic and quality of their data transformations.
