Who we are

Contacts

1815 W 14th St, Houston, TX 77008

281-817-6190

Databricks Python

DBT and Databricks Part 1: Setting up DBT profile for connecting to Azure Databricks using…

Photo by Art Institute of Chicago on Unsplash

This series of blog posts will illustrate how to use DBT with Azure Databricks: set up a connection profile, work with python models, and copy noSQL data into Databricks(from MongoDB). In the first part, we will talk about how to set up a profile when using dbt-databricks python package.

Install python package dbt-databricks using pip installor add to requirements.txt file. This package requires DBT profile (profiles.yml) to be created with all the credentials as described here:

This profile can be created automatically using dbt init command or manually. We use both: SQL (running on Databricks SQL Warehouse) and pyspark (running on Databricks cluster) DBT models. So, we will configure our profile against Databricks SQL endpoint and add cluster configuration to dbt_project.yml so new clusters are created for each model(discussed in the next part).

We will be using environment variables to update credentials in profiles.yml, so it will look like this:

Creating DBT profile:

  1. Create SQL Warehouse in Databricks and click Connection details

2. Copy the value of the Server hostname; it will be host in DBT profile (or DBT_DB_HOST env variable)

3. Copy the HTTP path; it will be http_path in DBT profile (or DBT_DB_PATH environment variable).

4. Set up database(schema) and catalog names, where DBT will save the results.

5. Token(DBT_DB_TOKEN env variable):

dbt-databricks is looking for a Personal Access Token(PAT) for this value. PAT tokens can be easily created for Databricks users, however, the best practice is to use Service Principals to run Databricks DBT models. Creating a Service Principal for DBT and PAT tokens is a bit trickier process. Let’s go through it step by step:

Create Service Principal for DBT:

Follow the process described here:

1. Go to Microsoft Entra ID(former Active Directory)-> App registrations and Click + New registration.

2. For Name, enter a name for the application.

3. In the Supported account types section, select Accounts in this organizational directory only (Single tenant).

4. Click Register.

5. Click on the newly created Application and choose Certificates and Secrets:

6. On the Client secrets tab, click New client secret.

7. Add Description and Expiry date and click Add.

8. Save secret value, click on Overview, and save the Application (client) ID and Directory (tenant) ID for this Application:

9. Go to Databricks, click on the menu in the top right corner → Admin settingsService Principles (this process is different if identity federation is enabled; please check the reference).

10. Click Add Service principalAdd New.

11. Enter the name of the service principal and the application’s Application (client) ID saved in step 8.

Grant DBT Service Principal all needed permissions to manage Databricks resources (out of scope of this article)

Create Entra ID access token for DBT Service Principal:

Follow the process as described here:

  1. Run the following commands to get the Microsoft Entra ID access token (replace with Application (client) ID saved in step 8, replace with Directory (tenant) ID saved in step 8, replace with the secret value saved in step 8) :
curl -X POST -H 'Content-Type: application/x-www-form-urlencoded' \
https://login.microsoftonline.com//oauth2/v2.0/token \
-d 'client_id=' \
-d 'grant_type=client_credentials' \
-d 'scope=2ff814a6-3304-4ab8-85cb-cd0e6f879c1d%2F.default' \
-d 'client_secret='

2. The Microsoft Entra ID access token is in the access_token value within the output of the call. This token expires after only 60 minutes. It can be used for DBT but needs to be refreshed every time before running DBT and will not allow any operation that lasts more than 60 minutes.

Create Databricks PAT token for DBT Service Principal

As mentioned at the bottom of this document, The Microsoft Entra ID access token can be used to call the Tokens API to create a Databricks PAT token for the Service Principal.

  1. First, allow Service Principals to create PAT tokens as described here:

2. In the Databricks right top menu, click on Admin SettingsWorkspace SettingsPersonal Access TokensPermission settings and add Service Principals:

3. Add Can Use permission for the Service Principal we are interested in.

4. Run the following command to generate the PAT token (

  • replace with the Entra ID access token we just generated,
  • replace with the name of your Databricks deployment, for example, XXXXXX.azuredatabricks.net,
  • set a lifetime for a token in seconds, or if the lifetime is not specified, this token remains valid indefinitely):
curl --request POST "https:///api/2.0/token/create" \
 --header "Authorization: Bearer " \
 --data '{ "lifetime_seconds": X, "comment": "DBT Service Principal access token"}'

5. This command will generate a PAT Token for the service principal, and it can be saved as a token (DBT_DB_TOKEN env variable in DBT profile).

The DBT profile is ready and can be used to run DBT on Azure Databricks!