Advanced Integrations
The following sections detail driver settings that may be needed in advanced integrations.
Saving Result Sets
Large result sets must be saved in a temporary or permanent table. You can use the following properties to control table persistence:
Accessing Temporary Tables
You can use the following properties to manage temporary tables. These temporary tables are managed by Google BigQuery and automatically expires after 24 hours.
- AllowLargeResultSets: Store large result sets in temporary tables in a hidden dataset. If this is not set, an error is returned for result sets bigger than a certain size.
Accessing Persistent Tables
Persist result sets larger than 128MB in a permanent table. To do so, you can set DestinationTable to the qualified name of the table. For example, "DestinationProjectId:DestinationDataSet.TableName". Subsequent queries replace the table with the new result set.
Limiting Billing
Set MaximumBillingTier to override your project limits on the maximum cost for any given query in a connection.
Bulk Modes
Google BigQuery provides several interfaces for operating on batches of rows. The driver supports these methods through the InsertMode option, each of which are specialized to different use cases:
- The Streaming API is intended for use where the most important factor is being able to insert quickly. However, rows which are inserted via the API are queued and only appear in the table after a delay. Sometimes this delay can be as high as 20-30 minutes which makes this API incompatible with cases where you want to insert data and then run other operations on it immediately. You should avoid modifying the table while any rows are in the streaming queue: Google BigQuery prevents DML operations from running on the table while any rows are in the streaming queue, and changing the table's metadata (name, schema, etc.) may cause streamed rows that haven't been committed to be lost.
- The DML mode API uses Standard SQL INSERT queries to upload data. This is by the most robust method of uploading data because any errors in the uploaded rows will be reported immediately. The driver also uses this API in a synchronous way so once the INSERT is processed, any rows can be used by other operations without waiting. However, it is by far the slowest insert method and should only be used for small data volumes.
- The Upload mode uses the multipart upload API for uploading data. This method is intended for performing low-cost medium to large data loads within a reasonable time. When using this mode the driver will upload the inserted rows to Google-managed storage and then create a load job for them. This job will execute and the driver can either wait for it (see WaitForBatchResults) or let it run asyncronously. Waiting for the job will report any errors that the job enconters but will take more time. Determining if the job failed without waiting for it requires manually checking the job status via the job stored procedures.
- The GCSStaging mode is the same as Upload except that it uses your Google Cloud Storage acccount to store staged data instead of Google-managed storage.
The driver cannot act asynchronously in this mode because it must delete the file after the load is complete, which means that WaitForBatchResults has no effect.
Because this depends on external data, you must set the GCSBucket to the name of your bucket and ensure that Scope (a space delimited set of scopes) contains at least the scopes https://www.googleapis.com/auth/bigquery and https://www.googleapis.com/auth/devstorage.read_write. The devstorage scope used for GCS also requires that you connect using a service account because Google BigQuery does not allow user accounts to use this scope.
In addition to bulk inserts, the driver also supports performing bulk UPDATE and DELETE operations. This requires the driver to upload the data containing the filters and rows to set into a new table in BigQuery, then perform a MERGE between the two tables and drop the temporary table. InsertMode determines how the rows are inserted into the temporary table but the Streaming and DML modes are not supported.
In most cases the driver can determine what columns need to be part of the SET vs. WHERE clauses of a bulk update. If you receive an error like "Primary keys must be defined for bulk UPDATE support," you can use PrimaryKeyIdentifiers to tell the driver what columns to treat as keys. In an update the values of key columns are used only to find matching rows and cannot be updated.