Establishing a Connection
With the CData Cmdlets, users can install a data module, set the connection properties, and start scripting. This section provides examples of using our Databricks Cmdlets with native PowerShell cmdlets, such as the CSV import and export cmdlets.
Connecting to Databricks
To connect to a Databricks cluster, set the following properties:
- Database: The name of the Databricks database.
- Server: The Server Hostname of your Databricks cluster.
- HTTPPath: The HTTP Path of your Databricks cluster.
- Token: Your personal access token. You can obtain this value by navigating to the User Settings page of your Databricks instance and selecting the Access Tokens tab.
You can find the required values in your Databricks instance by navigating to Clusters, selecting the desired cluster, and selecting the JDBC/ODBC tab under Advanced Options.
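For example, a minimal connection might look like the following. The server, HTTP path, and token shown here are placeholder values, and the -Database parameter is assumed to mirror the Database connection property (the -Server, -HTTPPath, and -Token parameters appear in the Connect-Databricks examples later in this section):
$conn = Connect-Databricks -Server "adb-1234567890123456.7.azuredatabricks.net" -HTTPPath "sql/protocolv1/o/1234567890123456/0000-000000-xxxxxxxx" -Database "default" -Token "MyPersonalAccessToken"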
Configuring Cloud Storage
The cmdlets support DBFS, Azure Blob Storage, and AWS S3 for uploading CSV files.
DBFS Cloud Storage
To use DBFS for cloud storage, set the CloudStorageType property to DBFS.
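For example, a connection string that stages uploads through DBFS might include the property as follows (placeholder values; a sketch only):
Server=MyServer;HTTPPath=MyHTTPPath;Database=default;Token=MyToken;CloudStorageType=DBFS;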
Azure Blob Storage
Set the following properties:
- CloudStorageType: Azure Blob storage.
- StoreTableInCloud: True to store tables in cloud storage when creating a new table.
- AzureStorageAccount: The name of your Azure storage account.
- AzureAccessKey: The storage key associated with your Databricks account. You can find this value in the Azure portal (signed in with the root account): select your storage account and click Access Keys.
- AzureBlobContainer: Set to the name of your Azure Blob storage container.
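For example, the cloud storage portion of a connection string for Azure Blob storage might look like the following (all values are placeholders; a sketch, not a verified configuration):
CloudStorageType=Azure Blob storage;StoreTableInCloud=True;AzureStorageAccount=MyStorageAccount;AzureAccessKey=MyStorageKey;AzureBlobContainer=MyContainer;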
AWS S3 Storage
Set the following properties:
- CloudStorageType: AWS S3.
- StoreTableInCloud: True to store tables in cloud storage when creating a new table.
- AWSAccessKey: The AWS account access key. You can acquire this value from your AWS security credentials page.
- AWSSecretKey: Your AWS account secret key. You can acquire this value from your AWS security credentials page.
- AWSS3Bucket: The name of your AWS S3 bucket.
- AWSRegion: The hosting region for your Amazon Web Services, for example, us-east-1. You can obtain this value from the Buckets List page of your Amazon S3 service.
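For example, the cloud storage portion of a connection string for AWS S3 might look like the following (all values are placeholders; a sketch, not a verified configuration):
CloudStorageType=AWS S3;StoreTableInCloud=True;AWSAccessKey=MyAccessKey;AWSSecretKey=MySecretKey;AWSS3Bucket=MyBucket;AWSRegion=us-east-1;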
Authenticating to Databricks
CData supports the following authentication schemes:
- Personal Access Token
- Microsoft Entra ID (Azure AD)
- Azure Service Principal
- OAuthU2M
- OAuthM2M
Personal Access Token
To authenticate, set the following:
- AuthScheme: PersonalAccessToken.
- Token: The token used to access the Databricks server. It can be obtained by navigating to the User Settings page of your Databricks instance and selecting the Access Tokens tab.
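Example connection string (placeholder values; a sketch only):
Server=MyServer;HTTPPath=MyHTTPPath;Database=default;AuthScheme=PersonalAccessToken;Token=MyToken;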
Microsoft Entra ID (Azure AD)
Note: Microsoft has rebranded Azure AD as Entra ID. In topics that require the user to interact with the Entra ID Admin site, we use the same names Microsoft does. However, there are still CData connection properties whose names or values reference "Azure AD".
Before you can authenticate using Entra ID, you must first register an application with the Entra ID endpoint in the Azure portal, as described in Creating an Entra ID (Azure AD) Application.
(See also Microsoft's own Configure an app in Azure portal.)
Once the application has been registered, set these properties:
- AuthScheme: AzureAD.
- AzureTenant: The "Directory (tenant) ID" on the AzureAD application "Overview" page.
- OAuthClientId: The "Application (client) ID" on the AzureAD application "Overview" page.
- CallbackURL: The "Redirect URIs" entry on the AzureAD application "Authentication" page.
When connecting, a web page opens that prompts you to authenticate. After successful authentication, the connection is established.
Example connection string:
"Server=https://adb-8439982502599436.16.azuredatabricks.net;HTTPPath=sql/protocolv1/o/8439982502599436/0810-011933-odsz4s3r;database=default; AuthScheme=AzureAD;InitiateOAuth=GETANDREFRESH;AzureTenant=94be69e7-edb4-4fda-ab12-95bfc22b232f;OAuthClientId=f544a825-9b69-43d9-bec2-3e99727a1669;CallbackURL=http://localhost;"
Azure Service Principal
To authenticate, set the following properties:
- AuthScheme: AzureServicePrincipal.
- AzureTenantId: The tenant ID of your Microsoft Azure Active Directory.
- AzureClientId: The application (client) ID of your Microsoft Azure Active Directory application.
- AzureClientSecret: The application (client) secret of your Microsoft Azure Active Directory application.
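Example connection string (the tenant ID, client ID, and secret are placeholders; a sketch only):
Server=MyServer;HTTPPath=MyHTTPPath;Database=default;AuthScheme=AzureServicePrincipal;AzureTenantId=MyTenantId;AzureClientId=MyClientId;AzureClientSecret=MyClientSecret;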
OAuthU2M
OAuthU2M (User-to-Machine) authentication allows users to grant applications, such as a CLI or SDK, access to their workspace. It uses a secure OAuth token, eliminating the need to share the user's password. The following explains how OAuthU2M works:
After a user signs in and consents to the OAuthU2M authentication request, the tool or SDK receives an OAuth token. This token allows the tool or SDK to authenticate on the user's behalf.
By default, the cmdlet uses an embedded OAuth application with a redirect URL of http://localhost:8020 which requires no setup. However, to customize the redirect URL or scopes used during authentication, you can register a custom OAuth application in the Databricks Account Console.
For instructions on registering a custom OAuth application, see Creating a Custom OAuth Application.
The required settings are:
- AuthScheme: OAuthU2M
- OAuthLevel: Set to the level at which you want to request the token.
- OAuthClientId: Assigned when you register your application with an OAuth authorization server.
- CallbackURL: The redirect URL registered with your OAuth application.
- DatabricksAccountId: Required only when the OAuthLevel is set to AccountLevel.
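Example connection string for an account-level token (placeholder values; a sketch only — with the embedded OAuth application, OAuthClientId and CallbackURL may not need to be set explicitly):
Server=MyServer;HTTPPath=MyHTTPPath;Database=default;AuthScheme=OAuthU2M;OAuthLevel=AccountLevel;OAuthClientId=MyClientId;CallbackURL=http://localhost:8020;DatabricksAccountId=MyAccountId;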
OAuthM2M
OAuthM2M (Machine-to-Machine) authentication verifies the identity of devices or applications communicating over a network. It ensures that only authorized machines can securely exchange data and access resources without human intervention. The following explains how OAuthM2M works:
Register your application with the authorization server to obtain a client ID and secret. When accessing a protected resource, your machine sends a request with these credentials and desired scopes. The server verifies the provided information and, if valid, returns an access token. This token is included in the request header for API calls to access the resource.
The required settings are:
- AuthScheme: OAuthM2M
- OAuthLevel: Set to the level at which you want to request the token.
- OAuthClientId: Assigned when you register your application with an OAuth authorization server.
- OAuthClientSecret: Assigned when you register your application with an OAuth authorization server.
- DatabricksAccountId: Required only when the OAuthLevel is set to AccountLevel.
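Example connection string for an account-level token (placeholder values; a sketch only):
Server=MyServer;HTTPPath=MyHTTPPath;Database=default;AuthScheme=OAuthM2M;OAuthLevel=AccountLevel;OAuthClientId=MyClientId;OAuthClientSecret=MyClientSecret;DatabricksAccountId=MyAccountId;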
Creating a Connection Object
You can then use the Connect-Databricks cmdlet to create a connection object that can be passed to other cmdlets:
$conn = Connect-Databricks -Server "127.0.0.1" -HTTPPath "MyHTTPPath" -User "MyUser" -Token "MyToken"
Retrieving Data
The Select-Databricks cmdlet provides a native PowerShell interface for retrieving data:
$results = Select-Databricks -Connection $conn -Table "[CData].[Sample].Customers" -Columns @("City", "CompanyName") -Where "Country='US'"
The Invoke-Databricks cmdlet provides an SQL interface. This cmdlet can be used to execute an SQL query via the Query parameter.
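For example, the following sketch runs a query equivalent to the Select-Databricks call above (assuming Invoke-Databricks accepts the same -Connection object as the other cmdlets):
$results = Invoke-Databricks -Connection $conn -Query "SELECT City, CompanyName FROM [CData].[Sample].Customers WHERE Country = 'US'"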
Piping Cmdlet Output
The cmdlets return row objects to the pipeline one row at a time. The following line exports results to a CSV file:
Select-Databricks -Connection $conn -Table "[CData].[Sample].Customers" -Where "Country = 'US'" | Select -Property * -ExcludeProperty Connection,Table,Columns | Export-Csv -Path "c:\my[CData].[Sample].CustomersData.csv" -NoTypeInformation
You will notice that we piped the results from Select-Databricks into a Select-Object cmdlet and excluded some properties before piping them into an Export-CSV cmdlet. We do this because the CData Cmdlets append Connection, Table, and Columns information onto each row object in the result set, and we do not necessarily want that information in our CSV file.
However, this makes it easy to pipe the output of one cmdlet to another. The following is an example of converting a result set to JSON:
PS C:\> $conn = Connect-Databricks -Server "127.0.0.1" -HTTPPath "MyHTTPPath" -User "MyUser" -Token "MyToken"
PS C:\> $row = Select-Databricks -Connection $conn -Table "[CData].[Sample].Customers" -Columns @("City", "CompanyName") -Where "Country = 'US'" | select -first 1
PS C:\> $row | ConvertTo-Json
{
    "Connection": {
    },
    "Table": "[CData].[Sample].Customers",
    "Columns": [
    ],
    "City": "MyCity",
    "CompanyName": "MyCompanyName"
}
Deleting Data
The following line deletes any records that match the criteria:
Select-Databricks -Connection $conn -Table "[CData].[Sample].Customers" -Where "Country = 'US'" | Remove-Databricks
Modifying Data
The cmdlets make data transformation and data cleansing easy. The following example loads data from a CSV file into Databricks, checking first whether a record already exists and needs to be updated instead of inserted.
Import-Csv -Path "C:\My[CData].[Sample].CustomersUpdates.csv" | %{
  # Look for an existing record with the same _id
  $record = Select-Databricks -Connection $conn -Table "[CData].[Sample].Customers" -Where ("_id = '" + $_._id + "'")
  if($record){
    Update-Databricks -Connection $conn -Table "[CData].[Sample].Customers" -Columns @("City","CompanyName") -Values @($_.City, $_.CompanyName) -Where ("_id = '" + $_._id + "'")
  }else{
    Add-Databricks -Connection $conn -Table "[CData].[Sample].Customers" -Columns @("City","CompanyName") -Values @($_.City, $_.CompanyName)
  }
}