Databricks Advanced Configuration
Version 24.2.9013
Version 24.2.9013
Databricks Advanced Configuration
The Advanced tab allows you to configure advanced properties of the Databricks connector. The sections below detail the functionality and configuration of:
Many of the other settings on the Advanced tab are dynamically loaded from the data source you are working with, so they are not covered in this documentation. Settings that are not data source-specific are described in Advanced Tab.
You can find detailed information on all configuration settings for your data source on the Online Help Files page of the CData website. From this page, click the tile for your data source.
Tip: You can highlight any field in the application to display a tooltip with more information about the functionality of that field.
XML Models
The Databricks connector models Databricks tables in CData Arc as XML. The modelling follows this structure:
- A parent element identifies the table to insert into.
- Each child element corresponds to a column in the target table.
This establishes a connection between XML elements and Databricks values, which allows the connector to read values from incoming XML documents and use them in INSERT or UPDATE queries.
The XML Model for a table can be viewed by clicking </> Code above the configuration pane. The following XML is a simplified example model for a table called Account, which is modeled as the XML element <Account>:
<Items>
<Account action="upsert">
<Id key="true" />
<account_c />
<AccountNumber />
<AccountSource />
</Account>
</Items>
In the above example, Id, account_c, AccountNumber, and AccountSource are columns in the Account table. When the connector processes an XML file that matches this structure, it automatically inserts the values from that XML into the Account table. When the connector queries the Account table, it outputs XML that matches this structure.
Batching Input
The Databricks connector supports batching to improve performance when inserting large sets of data. Batching is configured via two fields under the Advanced Settings section: TransactionSize and BatchInputSize. When these fields are set to positive integers, the connector inserts [BatchInputSize] records in a single operation (a batch), and it inserts [TransactionSize] records in a single connection (a transaction).
An XML input file can contain more records than a single transaction size. If this happens, the connector inserts the data over multiple transactions.
If an error occurs during a batch insert, the connector rolls back the transaction and caches the index of the first record in the failed transaction. When the input file is retried, only records after the cached index are inserted into Databricks.
Batching Output
When querying for output, you can configure the connector to support batching records together into a single document. The Max Records setting governs how many records should be included in a single message, and the Batch Size setting indicates how many Batch Messages should be included in a single Batch Group.
Child Tables
The Databricks connector supports Upserting and Selecting multiple tables with a single action. When you create a table model with multiple tables, the tables should have a foreign key relationship where one table is the child of another table.
To interface with child tables, first create a model that targets the parent table. Then, use the +Add button to add child tables from the configuration panel. When you select the child table from the list of tables, the designer automatically populates the hierarchical structure of tables. Select which columns of the child table to include in the model.
Child Table XML
The Input Mapping XML structure is multi-layered when inserting into parent and child tables. The following XML snippet is an example Input Mapping for two tables related by a foreign key:
<Account action="upsert">
<Id key="true" />
<account_c />
<AccountNumber />
<AccountSource />
<AccountPartner>
<Id key="true" />
<AccountFromId ref="@Id" />
<IsPrimary />
<OpportunityId />
<Role />
</AccountPartner>
</Account>
In the above example, <Account> is the parent table and <AccountPartner> is the child table. When the Databricks connector processes an input XML file that matches this structure, it updates both tables and ensures that the keys linking the two records are the same.
REF
Some columns in child tables are marked with a ref tag. This tag indicates that these columns correspond to the primary key in the parent table. After making the insert into the parent table, and before making the insert into the child table, the connector queries the parent table for the primary key of the record it just inserted. The connector then uses this value when inserting into the child table.
This is useful when the primary key of the parent table is generated by Databricks. In this case, this column is not passed in as input XML, so the connector does not know the value of the column until after it has inserted the record into the parent table.
In the XML model, the ref attribute indicates that a column can be referenced later or that the value for a column should be read from another column in the mapping. In order to reference the value of columns that are generated by Databricks, the column element should include a ref attribute set to the reserved value LAST_INSERT_ID
:
<Item action="upsert">
<Id key="true" upsert="id" ref="@@LAST_INSERT_ID" />
This indicates that the id value will be retrieved after the insert and can be referenced later (usually in a child table). To reference this value later, set the ref element to the name of the relevant column:
<id key="true" ref="@id" />
In the above example, id is the name of the primary key in the parent table, and it is also the name of a column in the child table. The ref syntax instructs the connector to retrieve the id from the parent table after inserting, and then use that retrieved value as the id column in the child table.
Selecting From Child Tables
Typically, when Selecting from a parent table and a child table, records should only be pulled from the child table if a foreign key matches the primary key from the parent table.
For example, imagine a Select that pulls purchase order data from a PurchaseOrders parent table and a PurchaseOrderLineItems child table. Each record (order) pulled from PurchaseOrders should also pull the line items associated with that order. The two tables are related by a shared PONumber column which is the primary key of the PurchaseOrders table and a foreign key in the PurchaseOrdersLineItems table. Line item records should only be pulled from the child table if the line item’s PONumber matches the PONumber of the order being pulled from the parent table.
To establish this relationship in Arc, add a new Filter to the Select and configure it as follows:
- In the left-hand dropdown, select the column of the child table that is the foreign key to the parent table. In the above example, this is the PONumber column of the PurchaseOrderLineItems table.
- In the middle (operator) dropdown, select equal.
- In the right-hand input box, first use the dropdown arrow to the right to select REF. Afterwards, the input box should have the $ prefix, indicating that it is referencing the parent table.
- Use the right-hand dropdown to select the parent table column that should match the child table’s foreign key. In the above example, this is the PONumber column of the PurchaseOrders table.
Only Process New or Changed Records
Select actions can be configured to only retrieve records that have been updated or added recently. The connector can use two approaches for detecting which records are new or modified.
- The first approach requires a DateTime column that represents the last-modified time for records.
- The second approach requires a column that explicitly stores a value representing whether a record should be processed or not.
Both options are set on the Settings tab in the Advanced settings at the bottom of the configuration panel. To further refine your requirements, see the three additional Process Changes… settings in the Advanced Settings portion of the Advanced tab.
Using a Last Modified Column
If the table includes a DateTime column, a checkbox appears in the Advanced section to Use column for processing new or changed records. Set this to the DateTime column that represents the last-modified time for records.
Using a Custom Column
Select actions can be configured to update a column in the table for records that are successfully pulled from Databricks. You can combine this with a Filter rule that only selects records with a column value that indicates they have not yet been pulled (or should be pulled again).
To enable column updating after processing, use the Update column with value when rows are processed successfully checkbox in the Advanced section, then specify which column should be updated, and what value should be set in the column.
For example, you could configure an Output Mapping to update the Processed column to 1 after pulling a record. Then you could apply a Filter rule that restricts records to only those where Processed not equal 1.
In this example, when the connector processes the Select template, it creates a SELECT statement based on the template. The template executes SELECT Id, AccountId, Amount FROM INVOICE
, creates an XML file for every record in the result set based on the template, and places those files in the Output folder for this connector.
Pre-Processing and Post-Processing Queries
You can configure Databricks connectors to execute additional queries before and after executing their configured actions. To configure these additional queries you need to manually modify the query code that the connector generates. The sections below outline this process.
In UPSERT Mappings
In an UPSERT mapping, it is possible to issue queries that automatically take place before or after all records are upserted into a table.
To configure this feature, first click </> Code in the Upsert Configuration pane.
This opens a text editor that allows you to edit advanced settings for a table mapping. The text in the editor might resemble this snippet:
<Items>
<Names table="`dbo`.`Names`" action="upsert">
<ID key="true" />
<Name />
…
</Names>
</Items>
To issue a query before or after any columns are added to the table, define an element with the following attributes:
- requireInput: Set to false. This attribute value instructs the connector to issue the query even without any matching input element in the input file.
- insertQuery: Contains the query that you want to execute either before or after the UPSERT query.
- table (optional): Set to the name of the table that you want to target with the extra query. (In the examples below, the table name is updateTable.) If you issue the query at the root level of the mapping—that is, as a child of the <Items> element—this attribute is not required. If you issue a query as a child to another table, setting the table attribute ensures that the query is distinguished from a column of that table.
Place the newly defined element before or after the UPSERT query, depending on when you want it to execute.
In this example, the element <runFirst> executes the query EXEC [dbo.FlushTemp] before any UPSERT requests to the table:
<Items>
<runFirst table="updateTable" insertQuery="EXEC [dbo].[FlushTemp]" requireInput="false" />
<Names table="`dbo`.`Names`" action="upsert">
<ID key="true" />
<Name />
…
</Names>
</Items>
In this example, the element <runLast> executes the query EXEC [dbo].[CommitTemp] after all UPSERT requests to the table are completed.
<Items>
<Names table="`dbo`.`Names`" action="upsert">
<ID key="true" />
<Name />
…
</Names>
<runLast table="updateTable" insertQuery="EXEC [dbo].[CommitTemp]" requireInput="false" />
</Items>
The element names runFirst and runLast are placeholders. You can modify them to meet your requirements.
Note: PreProcessing and PostProcessing queries in an UPSERT mapping do not accept input elements that are not hardcoded in the query, and they do not generate output data.
In SELECT Mappings
In a SELECT mapping, it is possible to issue queries that automatically take place before or after queries are issued to the targeted table. To configure this feature, first click </> Code in the Select Configuration pane.
This opens a text editor that allows you to edit advanced settings for a table mapping.
To issue a query before or after any columns are added to the table, define an element with the following attributes:
- outputResult: Set to false. This attribute value instructs the connector not to return any data in the output file as a result of the query.
- selectQuery: Contains the query that you want to execute either before or after the SELECT query.
- table (optional): Set to the name of the table that you want to target with the extra query. (In the examples below, the table name is selectTable.) If you issue the query at the root level of the mapping—that is, as a child of the <Items> element—this attribute is not required. If you issue a query as a child to another table, setting the table attribute ensures that the query is distinguished from a column of that table.
Place the newly defined element before or after the SELECT query, depending on when you want it to execute.
In this example, the element <runFirst> executes the query EXEC [dbo].[prepTable] before the SELECT query is issued against the table:
<Items>
<runFirst table="selectTable" selectQuery="EXEC [dbo].[prepTable]" outputResult="false" />
<Names table="`dbo`.`Names`" action="select">
<ID key="true" />
<Name />
…
</Names>
</Items>
In this example, the element <runLast> executes the query EXEC [dbo].[cleanTable] after the SELECT query and any child queries are issued against the table:
<Items>
<Names table="`dbo`.`Names`" action="select">
<ID key="true" />
<Name />
…
</Names>
<runLast table="selectTable" selectQuery="EXEC [dbo].[cleanTable]" outputResult="false" />
</Items>
The element names runFirst and runLast are placeholders. You can modify them to meet your requirements.
Note: PreProcessing and PostProcessing queries in a SELECT mapping do not accept input elements that are not hardcoded in the query.