Change Data Capture

Version 26.1.9516

What Is a CDC Pipeline?
Determining When to Use CDC
Enabling CDC for the Source Database
Creating a CDC Job in CData Sync
Adding Tasks to Your CDC Job
CDC Support for Tables Without a Primary Key
Adding a Post-Job Transformation
Using the Sync CDC Engine

Change Data Capture

Some sources support change data capture (CDC), where a source uses a log file to log events (Insert, Update, or Delete) that cause changes in the database. Rather than querying the source table for changes, CData Sync reads the log file for any change events. Then, the application extracts those changes in near real time for replication and stores the current log position for the next replication.

The following sources support CDC capability. Some of these sources use the CDC Engine to capture and stream changes in near real time. (For a list of the sources that use the CDC Engine, see Enabling Change Capture Data for CDC Engine Sources).

DB2 (Native)—Uses supplemental logging.
DB2 for i (Native)—Uses journaling.
Informix (Native)—Uses the Change Data Capture API.
MariaDB—Uses binary logs.
Microsoft Dynamics 365—Uses change tracking.
MySQL—Uses binary logs.
Oracle (Native)—Uses Oracle LogMiner.
PostgreSQL (Native)—Uses logical replication.
SAP ERP—Uses the Change Data Capture API.
SQL Server—Uses either CDC or change tracking. If both methods are enabled on a table, Sync uses CDC.

This document explains the following concepts and tasks:

defining a CDC pipeline
determining when to use CDC
enabling CDC for a source database
creating a CDC job in Sync
adding tasks to your CDC job
adding a post-job transformation

What Is a CDC Pipeline?

Change data capture (CDC) creates a pipeline that enables you to stream data changes from a source to supported destinations so you can run transformations on that data. Supported destinations can include any of the following structures:

another relational database
a cloud storage system or a data lake
a data warehouse (for example, Snowflake, Amazon Redshift, and Google BigQuery)
a file storage system
a message queue (for example, Kafka or Kinesis)

Determining When to Use CDC

The CDC approach integrates data through a process of identification, capture, and delivery of changes that are made to a source database. Those data changes are stored in a transaction log.

This approach makes sense in the following situations:

You are using one of the sources that supports CDC. As mentioned earlier, Sync supports native, log-based change data capture for MySQL, Oracle, PostgreSQL, and SQL Server.
You need near, real-time data. The CDC process provides near real-time data transfer in ETL and ELT processing.
You want to limit or preserve resource use. CDC data integration has low impact on your system because it does not make changes at the application level, nor does it scan transactional tables.

Enabling CDC for the Source Database

The way in which you enable CDC is different for each of the database sources that support CDC. For information about how to enable CDC for your source, click the appropriate link below.

Creating a CDC Job in CData Sync

Creating a job requires pre-configured source and destination connections. See Connections for more information about creating your source and destination connections.

After you define connections to your data source and your destination database, follow these steps to create a new job.

Select Add Job > Add New Job on the Jobs tab in Sync. This action opens the Add Job dialog box.
Enter your job name and select one of the supported CDC sources and a destination in the dialog box.
Select Change Data Capture as the replication type.
Select a schema from the Destination Schema list.
Click Add Job to create your job.
Return to the Jobs page to access your job.

Adding Tasks to Your CDC Job

Tasks control the data flow from a source into a destination table. In a standard replication job, all source tables are available to be added as replication tasks to the job.

To add tables as replication tasks:

Click a job on the Jobs tab in Sync.
Click Add Tasks in the Job Settings section of the Jobs/YourJobName page. This action opens the Select Schema dialog box.
Select the schema (for example, public) from the schema list.
Select specific tables in the dialog box or select the checkbox next to the source name to select all tables.
Click Add Tasks to add your new tasks.
Return to the Jobs page to access and run your job.

For more information about creating tasks, see Tasks.

CDC Support for Tables Without a Primary Key

Sync supports CDC for source tables that lack a primary key by adding a new column (_cdatasync_id) into the destination table. This column stores a hash value that is generated from the full row content. This hash value acts as a pseudo-unique identifier for a row based on its content. If any value in the row changes, the hash values also changes, which enables Sync to know that the row has been updated, even without a primary key.

After Sync adds the _cdatasync_id column, the application handles the INSERT, UPDATE, and DELETE operations as follows:

INSERT: When a new row is detected in the source, Sync generates the _cdatasync_id value and inserts the row into the destination.
UPDATE: Sync performs a soft delete on the existing row by setting _cdatasync_deleted=true. Then, Sync inserts the new row.
DELETE: Sync performs a soft delete on the row by setting _cdatasync_deleted=true.

This approach (adding a new column) ensures accurate CDC for source tables without a primary key.

Adding a Post-Job Transformation

Sync supports data transformation processes after job completion. Using advanced SQL queries or leveraging your existing dbt Core and dbt Cloud projects enables you to meet all your data needs in a single platform.

For more information about SQL transformations, see SQL Transformations. For more information about DBT transformations, see DBT Transformations.

Using the Sync CDC Engine

Some Change Data Capture (CDC) sources in Sync use the CDC Engine to capture and stream data changes in near real time. The CDC Engine continuously monitors database transaction logs and stages change events for processing by Sync.

CDC Engine

The CDC Engine is a component of Sync that tracks and streams real-time data changes from a source database. By monitoring transaction logs (for example, write-ahead logs or redo logs), the CDC Engine identifies data updates and temporarily stores these change events as files in a staging area for further processing. This staging area is in the jobs folder of the application directory (C:\ProgramData\CData\sync\jobs).

Each CDC job that uses the CDC Engine has a dedicated engine instance with a unique subfolder in the staging area. For each table that is included in the job, Sync creates a corresponding subfolder that contains the staged change event files.

Managing the Stage File

To optimize the storage of change events, you can configure the stage.file.max.rows property, which specifies the maximum number of rows or lines that can be written to a single stage file. By default, this value is set to 100,000 rows. For jobs that involve tables with numerous columns or large objects (for example, binary data), CData recommends that you decrease the value of this property.

You can configure this property according to your table schema on the job’s Advanced tab, as shown in the following example:

Managing the Stage Limit

Because all CDC jobs share the same staging area, it is essential to manage the staging area’s size carefully. By default, the stage is limited to 10 GB to prevent disk overloading. You can modify this limit by adjusting the stagemaxsize property. To do so, navigate to Settings > Advanced > Additional Settings > Other Settings and configure the property to an appropriate value (for example, stagemaxsize=20), as shown below.

Note: The stagemaxsize property accepts only integer values, which are interpreted in gigabytes (GB) by default.

If the staging area reaches its size limit, all CDC engines stop automatically. In such cases, you either must run the jobs manually or wait for the scheduled jobs to run in order to replicate the accumulated change events to the destination and release the staging area. If one-third of the maximum stage size has been freed, all the CDC engines start automatically to resume the real-time data streaming process.

As an option, you can restart a CDC engine by clicking the Start button on the Overview tab.

Note: Because the CDC engine operates continuously and streams change events in real-time, you should schedule the job to run frequently as a best practice. This practice ensures near real-time replication while preventing the staging area from reaching its size limit.

Enabling Change Data Capture for CDC Engine Sources

The steps that are required to enable Change Data Capture (CDC) vary by source database. For sources that use the CDC Engine, additional configuration might be required to allow Sync to capture changes from transaction logs or native change tracking mechanisms.

For specific setup instructions for your source, select the appropriate link below.

Creating a Job that uses the CDC Engine

Creating a job that uses the CDC Engine requires pre-configured source and destination connections. See Connections for information about adding source and destination connections.

After you define connections to your data source and your destination database, follow these steps to create a new job.

Click Add Job > Add New Job on the Jobs tab in Sync. This action opens the Add Job dialog box.
Enter your job name and select a source that supports CDC and a destination in the dialog box. For sources that use the CDC Engine, Sync automatically configures the job to use engine-based CDC.
Depending on your source, fill in the appropriate properties.
Click Add Job.
Return to the Jobs page to access your job.

After you create a job that uses the CDC Engine, you add tasks in the same way as for any other CDC job.

For more information about creating tasks, see Tasks.