CData Sync supports Incremental Updates, so rather than querying the entirety of your data every time, Sync incrementally updates your data warehouse by only querying for data that has been added or changed since the last time your job ran and merging that data into your data warehouse. This minimizes bandwidth usage and the latency of synchronization.
CData Sync pulls your data using two main methods of incremental update:
Incremental Check Column
It doesn’t make sense to retrieve all of your data every time an update runs. In almost every case, you only want the records that have been created or modified since the last run. Incremental Check Column works with most data sources to selectively update only the records that have changed since the previous run, using the table’s Last Modified column.
Updates by Incremental Check Column can work using two different data types:
- DateTime Incremental Check Columns: DateTime Incremental Check Columns are pre-configured for most sources; for other sources, you can select which column to use.
- Integer-based Incremental Check Columns: Sync supports Integer-based Incremental Check Columns (for example, an auto-incrementing Id).
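Conceptually, an incremental check column works by remembering the high-water mark from the previous run and querying only past it. The sketch below illustrates the idea; the table and column names are hypothetical, and this is not Sync’s actual implementation:

```python
from datetime import datetime

def build_incremental_query(table, check_column, last_run):
    # Only fetch rows modified since the previous run's high-water mark.
    # An integer-based check column works the same way, comparing against
    # the highest Id seen so far instead of a timestamp.
    return (
        f"SELECT * FROM {table} "
        f"WHERE {check_column} > '{last_run.isoformat()}' "
        f"ORDER BY {check_column}"
    )

query = build_incremental_query("Orders", "LastModified", datetime(2024, 1, 1))
```

After the run completes, the largest check-column value seen is saved and becomes the starting point for the next run.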
Change Tracking
In certain situations, databases don’t include a Last Modified column, and an alternative approach is necessary. Change Tracking is a great fit for SQL Server and Oracle databases because it can also capture deletes. Change Tracking records information whenever DML statements modify rows in a tracked table. Sync queries the internal change tracking table for records that have changed and updates the destination table accordingly.
- SQL Server uses the SQL Change Tracking feature to retrieve records with newer versions.
- Oracle uses Oracle Flashback Query to retrieve records with updated SCN values.
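The two mechanisms can be pictured as queries against each database’s change-history features. The statements below are illustrative only; the table names and version numbers are placeholders, but CHANGETABLE and VERSIONS BETWEEN SCN are the underlying SQL Server and Oracle constructs:

```python
# Values recorded at the end of the previous sync run (placeholders).
LAST_SYNC_VERSION = 42
LAST_SCN = 1_000_000

# SQL Server: SQL Change Tracking exposes changed rows, and the operation
# that changed them, through the CHANGETABLE function.
sqlserver_query = f"""
SELECT CT.Id, CT.SYS_CHANGE_OPERATION
FROM CHANGETABLE(CHANGES dbo.Orders, {LAST_SYNC_VERSION}) AS CT
"""

# Oracle: Flashback Query returns row versions newer than a given SCN.
oracle_query = f"""
SELECT Id, VERSIONS_OPERATION
FROM Orders VERSIONS BETWEEN SCN {LAST_SCN} AND MAXVALUE
"""
```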
CData Sync automatically captures deletes, ensuring accuracy in your destination. Sync retrieves a list of deleted records from the source either by calling the source API or by using the Change Tracking feature.
If the source allows CData Sync to detect data that has been deleted, you can control how CData Sync handles deletes using the Deletion Behavior option in the Advanced section.
- Hard Delete (Default) - If a delete is detected in the source, Sync will remove that record from the destination table.
- Soft Delete - Sync will add the “_cdata_deleted” column to your destination table. If a delete is detected in the source, Sync will set the value to “true” in the destination.
- Skip Delete - Sync will ignore deleted records in the source.
Note that some APIs do not allow Sync to detect deleted records, indicated in the Source Table Information. In these cases, Deletion Behavior will be ignored.
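The three behaviors map to three different actions in the destination. This sketch is a simplified illustration, not Sync’s implementation; the table and key names are hypothetical, while the “_cdata_deleted” column name matches the documented Soft Delete behavior:

```python
def delete_statement(table, key_column, key, behavior):
    # behavior mirrors the Deletion Behavior option in the Advanced section.
    if behavior == "Hard Delete":
        # Remove the record from the destination table entirely.
        return f"DELETE FROM {table} WHERE {key_column} = {key!r}"
    if behavior == "Soft Delete":
        # Keep the row but flag it via the _cdata_deleted column.
        return (f"UPDATE {table} SET _cdata_deleted = 'true' "
                f"WHERE {key_column} = {key!r}")
    return None  # Skip Delete: leave the destination row untouched
```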
CData Sync recognizes a large number of data types and, in situations where a data type isn’t strictly defined, can infer it from your data. Sync determines data types in two ways:
Known Data Types
There are many sources, mostly relational databases and some APIs, where Sync can automatically detect the data types for the schema. Examples of this include SQL Server, Oracle, and Salesforce. When the source column data types are known, Sync automatically creates matching data types in the destination.
Inferred Data Types
In situations where a data type isn’t specified, CData Sync can infer the data type by scanning the first few rows of data to determine what the column data type should be.
When Sync detects a String type with an unknown column size, the default size of the column is treated as 2000. In a relational database such as SQL Server, Sync would create a varchar(2000) field for this type.
Sync automatically selects the smallest possible type when assigning data types to fields that aren’t strictly defined. It reads the first row and picks the smallest data type for each column, then reads the next row and checks that the data still fits those types; if not, the data type is widened. Sync repeats this up to the RowScanDepth (50 or 100 rows), at which point the final data types are set.
For example, in a source like CSV, Sync uses “RowScan” to read through the first rows of the file and dynamically determine the datatypes for each column.
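The widen-only scan can be sketched as follows. The type names, numeric thresholds, and widening order are assumptions for illustration; the VARCHAR(2000) fallback for strings matches the documented default:

```python
def infer_type(values, row_scan_depth=100):
    # Rank types from narrowest to widest; inference only ever widens.
    widths = {"SMALLINT": 0, "INTEGER": 1, "BIGINT": 2,
              "DOUBLE": 3, "VARCHAR(2000)": 4}

    def narrowest(value):
        # Find the narrowest type this single value fits into.
        try:
            n = int(value)
            if -32768 <= n <= 32767:
                return "SMALLINT"
            if -2**31 <= n < 2**31:
                return "INTEGER"
            return "BIGINT"
        except ValueError:
            try:
                float(value)
                return "DOUBLE"
            except ValueError:
                return "VARCHAR(2000)"

    inferred = "SMALLINT"
    for value in values[:row_scan_depth]:
        candidate = narrowest(value)
        # A later row can widen the inferred type but never shrink it.
        if widths[candidate] > widths[inferred]:
            inferred = candidate
    return inferred
```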
In data pipelines, transformations are a way of reshaping, cleansing, or aggregating data in a way that makes it easier to use for reporting or data analysis. CData Sync supports two common approaches to data transformations when building data pipelines:
ETL: ETL (Extract, Transform, Load) has been the traditional approach in analytics for several decades. It was originally designed to work with relational databases, which have historically dominated the market. ETL requires the transformations to happen before the loading process. Data is extracted from data sources and then deposited into a staging area. Data is then cleaned, enriched, transformed, and finally loaded into the data warehouse. For more on ETL, refer to the ETL Documentation.
ELT: ELT (Extract, Load, Transform) is a method of data extraction where modifications to the data, such as Transformations, take place after the loading process. The vastly greater storage and scaling capabilities of modern cloud data warehouses allow data to be moved in its entirety and then modified after the fact. The primary difference between ETL and ELT is the order in which the steps take place.
ELT transformations are SQL scripts executed in the destination on your data. Transformations use the processing power of the data warehouse to quickly aggregate, join, and clean your data based on your analytics and reporting needs. By organizing your data with Transformations and Mapping, you can receive your data in the form most useful to you as it moves through your pipeline. Similar to Jobs, Transformations support multiple queries separated by ‘;’, executing on a schedule, and email notifications once complete. For more on ELT, refer to the ELT Documentation.
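A transformation is ultimately just SQL run inside the destination after the load. The example below is hypothetical (the table and column names are invented), but its shape matches the documented behavior: plain SQL executed in the warehouse, with multiple queries separated by ‘;’:

```python
# A hypothetical post-load transformation script.
transformation = """
CREATE TABLE daily_revenue AS
SELECT order_date, SUM(amount) AS revenue
FROM orders
GROUP BY order_date;
UPDATE daily_revenue SET revenue = ROUND(revenue, 2)
"""

# Like jobs, a transformation may contain multiple queries separated by ';'.
statements = [s.strip() for s in transformation.split(";") if s.strip()]
```

Because the SQL executes in the destination, the aggregation uses the warehouse’s own processing power rather than moving data back out.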
CData Sync’s built-in REST API provides a flexible way to manage the application. Everything that can be accomplished within the administration console UI can also be accomplished through RESTful API calls.
The Job Management API allows you to create, update, and execute jobs:
- Edit a Job
- Start a Job
- Retrieve Job Status
- Manage Connections
For more information on API Access, refer to the REST-API documentation.
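As a sketch of what such a call looks like, the snippet below shapes a request to start a job. The base URL, endpoint path, auth header, and payload field are all placeholders, not Sync’s actual routes; consult the REST-API documentation for the real ones:

```python
def build_start_job_request(base_url, auth_token, job_name):
    # Shapes a POST request for a hypothetical "execute job" endpoint.
    # Every name here (route, header, payload key) is an assumption.
    return {
        "method": "POST",
        "url": f"{base_url}/executeJob",
        "headers": {"x-auth-token": auth_token},
        "json": {"JobName": job_name},
    }

req = build_start_job_request("http://localhost:8019/api.rsc",
                              "<your-token>", "nightly_load")
```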
CData Sync can run anywhere. This makes it ideal for customers who have some systems in the cloud and others that reside inside their internal networks. In these scenarios, customers can install Sync to run inside their network, so there is no need to expose ports over the internet or open firewalls, create VPN connections, etc.
Sync’s ability to run anywhere also greatly reduces the amount of latency, as you can run CData Sync close to the source or destination, improving the performance of your ETL or ELT jobs.
Your data is constantly changing, and Sync makes sure those changes are accurately represented at all times. On every run, Sync compares the source schema to the destination schema to look for differences. If a structural difference between the two is detected, Sync modifies the destination schema, adding columns or increasing column sizes, to ensure the source data fits.
Sync can make a number of different changes. These include:
- If a column is detected in the source table that does not exist in the destination table, Sync will alter the destination table by adding the column.
- If a data type in the source increases in size, Sync will alter the destination table by updating the column. This includes increasing the column size of a string column (varchar(255) -> varchar(2000)) or the byte size of non-string columns (smallint -> integer).
- Sync will never delete a column from your destination table, even if that column was removed from your source table.
- Sync will never shrink the size of a destination column, even if the data type has been narrowed in the source (varchar(2000) -> varchar(255)).
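These rules amount to a one-directional schema diff: columns are only added and only widened. The sketch below illustrates that logic; the schema representation and the width ranking are assumptions, not Sync’s internals:

```python
def plan_schema_changes(source, destination):
    # Schemas map column name -> SQL type string. Within a type family,
    # a higher rank means a wider type; the ranking here is illustrative.
    WIDTH = {"smallint": 0, "integer": 1, "bigint": 2,
             "varchar(255)": 3, "varchar(2000)": 4}
    actions = []
    for column, src_type in source.items():
        if column not in destination:
            # Column exists in the source but not the destination: add it.
            actions.append(("ADD", column, src_type))
        elif WIDTH[src_type] > WIDTH[destination[column]]:
            # Source type is wider than the destination's: widen the column.
            actions.append(("WIDEN", column, src_type))
    # Columns present only in the destination are never dropped, and a
    # narrower source type never shrinks the destination column.
    return actions
```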
If you have questions or comments, we’d love to hear them! Go ahead and send us an email.