Change Data Capture
Version 23.4.8843
Version 23.4.8843
Change Data Capture
Some sources support change data capture (CDC), where a source uses a log file to log events (Insert
, Update
, or Delete
) that cause changes in the database. Rather than querying the source table for changes, CData Sync reads the log file for any change events. Then, the application extracts those changes in near real time for replication and stores the current log position for the next replication.
The following sources support CDC capability:
-
MariaDB - Uses binary logs.
-
MySQL - Uses binary logs.
-
Oracle - Uses Oracle Flashback.
-
PostgreSQL - Uses logical replication.
-
SQL Server - Uses either change tracking or change data capture.
This document explains the following concepts and tasks:
-
defining a CDC pipeline
-
determining when to use CDC
-
enabling CDC for a source database
-
creating a CDC job in Sync
-
adding tasks to your CDC job
-
adding a post-job transformation
What Is a CDC Pipeline?
Change data capture (CDC) creates a pipeline that enables you to stream data changes from a source to supported destinations so you can run transformations on that data. Supported destinations can include any of the following structures:
-
another relational database
-
a cloud storage system or a data lake
-
a data warehouse (for example, Snowflake, Amazon Redshift, and Google BigQuery)
-
a file storage system
-
a message queue (for example, Kafka or Kinesis)
Determining When to Use CDC
The CDC approach integrates data through a process of identification, capture, and delivery of changes that are made to a source database. Those data changes are stored in a transaction log.
This approach makes sense in the following situations:
-
You are using one of the sources that supports CDC. As mentioned earlier, Sync supports native, log-based change data capture for MySQL, Oracle, PostgreSQL, and SQL Server.
-
You need near, real-time data. The CDC process provides near real-time data transfer in ETL and ELT processing.
-
You want to limit or preserve resource use. CDC data integration has low impact on your system because it does not make changes at the application level, nor does it scan transactional tables.
Enabling CDC for the Source Database
The way in which you enable CDC is different for each of the four database sources. For information about how to enable CDC for your source, click the appropriate link below.
Creating a CDC Job in CData Sync
Creating a job requires pre-configured source and destination connections. See Connections for more information about creating your source and destination connections.
After you define a source connection to your data source and a destination connection to your database, follow these steps to create a new job.
-
Click Create Job on the Jobs tab in Sync. This action opens the Create New Job dialog box.
-
Enter your job name and select one of the supported CDC sources and a destination in the dialog box.
-
Select CDC as the replication type.
-
Click Create.
Adding Tasks to your CDC Job
Tasks control the data flow from a source into a destination table. In a standard replicate job, all source tables and views are available to be added as replicate tasks to the job.
To add tables and views as replicate tasks:
-
Click a job on the Jobs tab in Sync.
-
Click Add Tasks in the Job Settings section of the Jobs/YourJobName page. This action opens the Add Tasks dialog box.
-
Select the schema from the Schema list.
-
Select specific tables in the dialog box or select the Table Name checkbox to select all tables.
-
Click Add to add your new tasks.
For more information about creating tasks, see Tasks.
Adding a Post-Job Transformation
Sync supports data transformation processes after job completion. Using advanced SQL queries or leveraging your existing dbt Core and dbt Cloud projects enables you to meet all your data needs in a single platform.
For more information about SQL transformations, see SQL Transformations. For more information about DBT transformations, see DBT Transformations.