Advanced Integrations
The following sections detail connector settings that may be needed in advanced integrations.
Saving Result Sets
Large result sets must be saved in a temporary or permanent table. You can use the following properties to control table persistence:
Automatic Result Tables
Enable the AllowLargeResultSets property to make the connector automatically create destination tables when needed. If a query result is too large to fit in the BigQuery query cache, the connector creates a hidden dataset within the data project and re-executes the query with a destination table in that dataset. The dataset is configured so that all tables created within it expire after 24 hours.
In some situations you may want to change the name of the dataset created by the connector: for example, when multiple users share the connector but do not have permission to write to datasets created by other users. See TempTableDataset for details on how to do this.
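As a sketch, connector properties like these are typically supplied as key=value pairs in a connection string. The exact format depends on your driver, and the dataset name below is hypothetical:

```python
# Build a connection-string fragment enabling automatic result tables.
# AllowLargeResultSets and TempTableDataset are properties from this
# documentation; "connector_temp" is a hypothetical dataset name.
properties = {
    "AllowLargeResultSets": "true",        # auto-create destination tables
    "TempTableDataset": "connector_temp",  # per-user dataset override
}
connection_string = ";".join(f"{key}={value}" for key, value in properties.items())
print(connection_string)
```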
Explicit Result Tables
Set the DestinationTable property to make the connector write query results to the given table. Writing query results to a single table imposes several limitations that you should keep in mind when using this option:
- Two query results cannot be read at the same time on the same connection. If two queries are executed and their results are read at the same time, the last query to finish executing overwrites the data from the other query.
- The dataset must be created in the same region as your tables. BigQuery does not support writing query results to a destination table in a different region than the one where the query was executed.
- Do not rely on the connector to create a temporary table for every query. Some queries are processed internally or read directly from a table without executing a query job on BigQuery.
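For illustration, an explicit result table might be configured as follows. The dataset and table names are hypothetical, and the exact connection-string syntax depends on your driver:

```python
# DestinationTable is a property from this documentation; the value below is a
# hypothetical dataset.table identifier in the same region as the queried tables.
properties = {"DestinationTable": "analytics_dataset.query_results"}
connection_string = ";".join(f"{key}={value}" for key, value in properties.items())
print(connection_string)
```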
Limiting Billing
Set MaximumBillingTier to override your project limits on the maximum cost for any given query in a connection.
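For example, a per-connection billing cap might be set like this. The syntax is illustrative and the tier value is only an example:

```python
# MaximumBillingTier is a property from this documentation; higher tiers allow
# more expensive queries. The value 2 here is a hypothetical example.
properties = {"MaximumBillingTier": "2"}
connection_string = ";".join(f"{key}={value}" for key, value in properties.items())
print(connection_string)
```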
Bulk Modes
Google BigQuery provides several interfaces for operating on batches of rows. The connector supports these methods through the InsertMode option; each mode is specialized for different use cases:
- The Streaming API is intended for use where the most important factor is being able to insert quickly. However, rows inserted via this API are queued and only appear in the table after a delay, sometimes as long as 20-30 minutes, which makes this API unsuitable for cases where you want to insert data and then run other operations on it immediately. Avoid modifying the table while any rows are in the streaming queue: Google BigQuery prevents DML operations from running on the table during this time, and changing the table's metadata (name, schema, etc.) may cause streamed rows that have not yet been committed to be lost.
- The DML mode uses standard SQL INSERT statements to upload data. This is by far the most robust method of uploading data because any errors in the uploaded rows are reported immediately. The connector also uses this API synchronously, so once the INSERT is processed, the rows can be used by other operations without waiting. However, it is by far the slowest insert method and should only be used for small data volumes.
- The Upload mode uses the multipart upload API for uploading data. This method is intended for performing low-cost medium-to-large data loads within a reasonable time. In this mode the connector uploads the inserted rows to Google-managed storage and then creates a load job for them. The connector can either wait for the job (see WaitForBatchResults) or let it run asynchronously. Waiting for the job reports any errors the job encounters but takes more time. Determining whether the job failed without waiting for it requires manually checking the job status via the job stored procedures.
- The GCSStaging mode is the same as Upload except that it uses your Google Cloud Storage account to store staged data instead of Google-managed storage.
The connector cannot act asynchronously in this mode because it must delete the file after the load is complete, which means that WaitForBatchResults has no effect.
Because this mode depends on external storage, you must set GCSBucket to the name of your bucket and ensure that Scope (a space-delimited set of scopes) contains at least the scopes https://www.googleapis.com/auth/bigquery and https://www.googleapis.com/auth/devstorage.read_write. The devstorage scope used for GCS also requires connecting with a service account, because Google BigQuery does not allow user accounts to use this scope.
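Putting the GCSStaging requirements together, a configuration sketch might look like this. The bucket name is hypothetical and the connection-string syntax depends on your driver:

```python
# GCSStaging requires a bucket and both OAuth scopes named above, plus
# service-account authentication. "my-staging-bucket" is a hypothetical name.
properties = {
    "InsertMode": "GCSStaging",
    "GCSBucket": "my-staging-bucket",
    "Scope": "https://www.googleapis.com/auth/bigquery "
             "https://www.googleapis.com/auth/devstorage.read_write",
}
connection_string = ";".join(f"{key}={value}" for key, value in properties.items())
print(connection_string)
```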
In addition to bulk INSERTs, the connector supports bulk UPDATE and DELETE operations. These require the connector to upload the filter values and new row values to a temporary table in BigQuery, perform a MERGE between the two tables, and then drop the temporary table. InsertMode determines how the rows are inserted into the temporary table, but the Streaming and DML modes are not supported.
In most cases the connector can determine which columns belong in the SET clause versus the WHERE clause of a bulk UPDATE. If you receive an error like "Primary keys must be defined for bulk UPDATE support," use PrimaryKeyIdentifiers to tell the connector which columns to treat as keys. In an UPDATE, the values of key columns are used only to find matching rows and cannot themselves be updated.
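As an illustration, key columns for bulk UPDATE support might be declared as follows. The column name is hypothetical, and the exact connection-string syntax depends on your driver:

```python
# PrimaryKeyIdentifiers is a property from this documentation; the connector
# matches rows on these columns during the MERGE. "OrderId" is a hypothetical
# key column. Streaming and DML modes are not supported for bulk UPDATE/DELETE.
properties = {
    "InsertMode": "Upload",
    "PrimaryKeyIdentifiers": "OrderId",
}
connection_string = ";".join(f"{key}={value}" for key, value in properties.items())
print(connection_string)
```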