Running dbt on Databricks with the dbt_external_tables Package to Use Snowflake Tables

This article highlights a specific use case: running dbt on Databricks while relying on tables that live in Snowflake. Typically, dbt runs on top of the database where the project is instantiated. However, if a table your dbt models need does not exist in the Databricks hive metastore and instead lives in an external database like Snowflake, you can use the dbt_external_tables package to create a copy of that table in Databricks so dbt can run on top of it there.

dbt Setup

The dbt project directory should be set up like this:

[Image: dbt directory setup]

Set up your dbt_project.yml file like so:


name: 'databricks_setup'
version: '1.0.0'
config-version: 2

profile: 'databricks_setup'

asset-paths: ["assets"]
model-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
seed-paths: ["data"]
docs-paths: ["models", "data"]
macro-paths: ["macros"]
packages-install-path: "dbt_modules"

models:
  +bind: false
  databricks_setup:
    +enabled: true
    model_1:
      +schema: databricks_setup
      +materialized: table
    model_2:
      +schema: databricks_setup
      +materialized: table

seeds:
  databricks_setup:
    load:
      +schema: databricks_setup

quoting:
  database: true
  schema: false
  identifier: true
Seeds are optional; however, if you have seed files you’d like to create, store them in the ‘data’ folder within the dbt project directory (the seed-paths location configured above).
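
Note that dbt_project.yml points to a profile named ‘databricks_setup’. When the project runs as a Databricks dbt task, Databricks can generate the connection profile for you based on the SQL warehouse you select, but if you also want to run the project locally with the dbt-databricks adapter, a minimal profiles.yml sketch might look like the following (the host, HTTP path, and token values are placeholders for your own workspace):

databricks_setup:
  target: dev
  outputs:
    dev:
      type: databricks
      schema: databricks_setup
      host: your-workspace.cloud.databricks.com        # placeholder workspace hostname
      http_path: /sql/1.0/warehouses/your-warehouse-id # placeholder SQL warehouse HTTP path
      token: "{{ env_var('DATABRICKS_TOKEN') }}"       # personal access token read from an env var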

Set up your packages.yml file to install the dbt_external_tables package:

packages:
  - package: dbt-labs/dbt_external_tables
    version: 0.8.5

Next, navigate to the schema.yml file within the models/sources/ directory. This is where we will create the connection to the Snowflake tables. The schema.yml file should be organized like this:

version: 2

sources:

  - name: "sample_schema_name"
    tables:
      - name: table_name
        external:
          using: net.snowflake.spark.snowflake
          options:
            sfUrl: [unique_snowflake_url]
            sfUser: "{{ env_var('SPARK_SNOWFLAKE_USER_NAME') }}"
            sfPassword: "{{ env_var('SPARK_SNOWFLAKE_PASSWORD') }}"
            sfDatabase: "{{ env_var('DATABASE_NAME_1') }}"
            sfSchema: "schema_name"
            sfWarehouse: "{{ env_var('SPARK_SNOWFLAKE_WAREHOUSE') }}"
            sfRole: "{{ env_var('SPARK_SNOWFLAKE_ROLE') }}"
            dbtable: "{{ env_var('DATABASE_NAME_1') }}.schema_name.table_name"   
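
Once the source is defined, a dbt model can select from the staged Snowflake table like any other source. A minimal sketch of models/model_1.sql, assuming the sample_schema_name source and table_name table from the schema.yml above:

-- models/model_1.sql (illustrative sketch)
-- Reads from the Snowflake table that stage_external_sources creates in Databricks
select *
from {{ source('sample_schema_name', 'table_name') }}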

Databricks Setup

Next, we will set up the necessary configuration in the Databricks console. Databricks integrates seamlessly with GitHub repos. To access your GitHub repositories, it is best to create a GitHub service user and a service PAT (personal access token) scoped to only the repos that Databricks needs access to.

Navigate to the ‘Workflows’ tab and select ‘Create job’.

Give your task a name and choose ‘dbt’ as the type.

For ‘Source’, choose ‘Edit’ to set up the GitHub connection.

Fill in the appropriate information (i.e., GitHub repo URL, Git provider, Git branch) and choose ‘Confirm’.

Choose a SQL warehouse and a cluster to run your job on.

Depending on your workload, you may want to adjust the size of the SQL warehouse, since that is where the majority of the compute will run. The dbt CLI cluster can be relatively small, as it only runs the dbt CLI and installs packages such as dbt_external_tables.

Next, you will want to set up your dbt commands:

‘dbt deps’ will install all packages listed in packages.yml.

‘dbt run-operation stage_external_sources’ will stage the external Snowflake tables defined in schema.yml and create them in Databricks.

‘dbt seed’ will load the seed files.

‘dbt run’ will run the SQL models in your models/ directory.
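
Taken together, the commands entered in the Databricks dbt task run in this order:

dbt deps
dbt run-operation stage_external_sources
dbt seed
dbt run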

When your job is done running, verify that the correct tables were created in the Databricks hive metastore.
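
One quick way to check is to run a couple of queries in a Databricks notebook or the SQL editor. The schema name below is an assumption based on the configuration in this example; depending on your target and generate_schema_name macro, dbt may prefix the custom schema with your target schema:

-- List schemas in the hive metastore
SHOW SCHEMAS;

-- List the tables dbt created in the assumed target schema
SHOW TABLES IN databricks_setup;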