Change Data Capture

Version 26.2.9623


Change Data Capture


Some sources support change data capture (CDC), where a source uses a log file to log events (Insert, Update, or Delete) that cause changes in the database. Rather than querying the source table for changes, CData Sync reads the log file for any change events. Then, the application extracts those changes in near real time for replication and stores the current log position for the next replication.

The following sources support CDC capability. Some of these sources use the CDC Engine to capture and stream changes in near real time. (For a list of the sources that use the CDC Engine, see Enabling Change Capture Data for CDC Engine Sources).

  • DB2 (Native)—Uses supplemental logging.

  • DB2 for i (Native)—Uses journaling.

  • Informix (Native)—Uses the Change Data Capture API.

  • MariaDB—Uses binary logs.

  • Microsoft Dynamics 365—Uses change tracking.

  • MySQL—Uses binary logs.

  • Oracle (Native)—Uses Oracle LogMiner.

  • PostgreSQL (Native)—Uses logical replication.

  • SAP ERP—Uses the Change Data Capture API.

  • SAP HANA—Uses trigger-based CDC.

  • SQL Server—Uses either CDC or change tracking. If both methods are enabled on a table, Sync uses CDC.

    Note: When SQL Server source tables include Always Encrypted columns, CDC supports these columns when Sync connects via JDBC. The JDBC driver returns decrypted plaintext values instead of encrypted (binary or hexadecimal) representations, which allows encrypted columns to be included in CDC-based replication workflows without causing read failures.

This document explains the following concepts and tasks:

  • defining a CDC pipeline

  • determining when to use CDC

  • enabling CDC for a source database

  • creating a CDC job in Sync

  • adding tasks to your CDC job

  • adding a post-job transformation

What Is a CDC Pipeline?

Change data capture (CDC) creates a pipeline that enables you to stream data changes from a source to supported destinations so you can run transformations on that data. Supported destinations can include any of the following structures:

  • another relational database

  • a cloud storage system or a data lake

  • a data warehouse (for example, Snowflake, Amazon Redshift, and Google BigQuery)

  • a file storage system

  • a message queue (for example, Kafka or Kinesis)

Determining When to Use CDC

The CDC approach integrates data through a process of identification, capture, and delivery of changes that are made to a source database. Those data changes are stored in a transaction log.

This approach makes sense in the following situations:

  • You are using one of the sources that supports CDC. As mentioned earlier, Sync supports native, log-based change data capture for MySQL, Oracle, PostgreSQL, and SQL Server.

  • You need near, real-time data. The CDC process provides near real-time data transfer in ETL and ELT processing.

  • You want to limit or preserve resource use. CDC data integration has low impact on your system because it does not make changes at the application level, nor does it scan transactional tables.

Enabling CDC for the Source Database

The way in which you enable CDC is different for each of the database sources that support CDC. For information about how to enable CDC for your source, click the appropriate link below.

Creating a CDC Job in CData Sync

Creating a job requires pre-configured source and destination connections. See Connections for more information about creating your source and destination connections.

After you define connections to your data source and your destination database, follow these steps to create a new job.

  1. Select Add Job > Add New Job on the Jobs tab in Sync. This action opens the Add Job dialog box.

  2. Enter your job name and select one of the supported CDC sources and a destination in the dialog box.

  3. Select Change Data Capture as the replication type.

  4. Select a schema from the Destination Schema list.

  5. Click Add Job to create your job.

  6. Return to the Jobs page to access your job.

Adding Tasks to Your CDC Job

Tasks control the data flow from a source into a destination table. In a standard replication job, all source tables are available to be added as replication tasks to the job.

To add tables as replication tasks:

  1. Click a job on the Jobs tab in Sync.

  2. Click Add Tasks in the Job Settings section of the Jobs/YourJobName page. This action opens the Select Schema dialog box.

  3. Select the schema (for example, public) from the schema list.

  4. Select specific tables in the dialog box or select the checkbox next to the source name to select all tables.

  5. Click Add Tasks to add your new tasks.

  6. Return to the Jobs page to access and run your job.

For more information about creating tasks, see Tasks.

CDC Support for Tables Without a Primary Key

Sync supports CDC for source tables that lack a primary key by adding a new column (_cdatasync_id) into the destination table. This column stores a hash value that is generated from the full row content. This hash value acts as a pseudo-unique identifier for a row based on its content. If any value in the row changes, the hash values also changes, which enables Sync to know that the row has been updated, even without a primary key.

After Sync adds the _cdatasync_id column, the application handles the INSERT, UPDATE, and DELETE operations as follows:

  • INSERT: When a new row is detected in the source, Sync generates the _cdatasync_id value and inserts the row into the destination.

  • UPDATE: Sync performs a soft delete on the existing row by setting _cdatasync_deleted=true. Then, Sync inserts the new row.

  • DELETE: Sync performs a soft delete on the row by setting _cdatasync_deleted=true.

This approach (adding a new column) ensures accurate CDC for source tables without a primary key.

Adding a Post-Job Transformation

Sync supports data transformation processes after job completion. Using advanced SQL queries or leveraging your existing dbt Core and dbt Cloud projects enables you to meet all your data needs in a single platform.

For more information about SQL transformations, see SQL Transformations. For more information about DBT transformations, see DBT Transformations.

Using the Sync CDC Engine

Some Change Data Capture (CDC) sources in Sync use the CDC Engine to capture and stream data changes in near real time. The CDC Engine continuously monitors database transaction logs and stages change events for processing by Sync.

CDC Engine

The CDC Engine is a component of Sync that tracks and streams real-time data changes from a source database. By monitoring transaction logs (for example, write-ahead logs or redo logs), the CDC Engine identifies data updates and temporarily stores these change events as files in a staging area for further processing. This staging area is in the jobs folder of the application directory (C:\ProgramData\CData\sync\jobs).

Each CDC job that uses the CDC Engine has a dedicated engine instance with a unique subfolder in the staging area. For each table that is included in the job, Sync creates a corresponding subfolder that contains the staged change event files.

Managing the Stage File

To optimize the storage of change events, you can configure the stage.file.max.rows property, which specifies the maximum number of rows or lines that can be written to a single stage file. By default, this value is set to 100,000 rows. For jobs that involve tables with numerous columns or large objects (for example, binary data), CData recommends that you decrease the value of this property.

You can configure this property according to your table schema on the job’s Advanced tab, as shown in the following example:

Managing the Stage Limit

Because all CDC jobs share the same staging area, it is essential to manage the staging area’s size carefully. By default, the stage is limited to 10 GB to prevent disk overloading. You can modify this limit by adjusting the stagemaxsize property. To do so, navigate to Settings > Advanced > Additional Settings > Other Settings and configure the property to an appropriate value (for example, stagemaxsize=20), as shown below.

Note: The stagemaxsize property accepts only integer values, which are interpreted in gigabytes (GB) by default. 

If the staging area reaches its size limit, all CDC engines stop automatically. In such cases, you either must run the jobs manually or wait for the scheduled jobs to run in order to replicate the accumulated change events to the destination and release the staging area. If one-third of the maximum stage size has been freed, all the CDC engines start automatically to resume the real-time data streaming process.

As an option, you can restart a CDC engine by clicking the Start button on the Overview tab.

Note: Because the CDC engine operates continuously and streams change events in real-time, you should schedule the job to run frequently as a best practice. This practice ensures near real-time replication while preventing the staging area from reaching its size limit.

Enabling Change Data Capture for CDC Engine Sources

The steps that are required to enable Change Data Capture (CDC) vary by source database. For sources that use the CDC Engine, additional configuration might be required to allow Sync to capture changes from transaction logs or native change tracking mechanisms.

For specific setup instructions for your source, select the appropriate link below.

Creating a Job that uses the CDC Engine

Creating a job that uses the CDC Engine requires pre-configured source and destination connections. See Connections for information about adding source and destination connections.

After you define connections to your data source and your destination database, follow these steps to create a new job.

  1. Click Add Job > Add New Job on the Jobs tab in Sync. This action opens the Add Job dialog box.

  2. Enter your job name and select a source that supports CDC and a destination in the dialog box. For sources that use the CDC Engine, Sync automatically configures the job to use engine-based CDC.

  3. Depending on your source, fill in the appropriate properties.

  4. Click Add Job.

  5. Return to the Jobs page to access your job.

After you create a job that uses the CDC Engine, you add tasks in the same way as for any other CDC job.

For more information about creating tasks, see Tasks.

Understanding CDC Job Execution and Engine Behavior

Sync manages CDC job execution based on the state of the CDC engine and the availability of staged data. When you run a CDC job, Sync coordinates execution between the job and the CDC engine. Job behavior depends on the state of the CDC engine and the availability of staged change data.

CDC Engine Availability During Job Execution

Sync determines how a CDC job runs based on the state of the CDC engine and the availability of staged data:

  • CDC engine is running: The job runs normally and processes any available changes. If no changes are available, the job can complete successfully with no records affected.

  • CDC engine is not running, but stage files exist: The job runs and processes the existing stage files. This behavior enables previously captured changes to be applied even if the engine is not currently active.

  • CDC engine is not running and no stage files exist: The job fails because there is no data available to process, and the following error is displayed: There are no stage files to consume.

Automatic CDC Engine Restart

Before a CDC job runs, Sync attempts to ensure that the CDC engine is available. The following conditions determine how Sync responds:

  • If the CDC engine stops because of an error or an unexpected condition (and you did not stop it explicitly), Sync automatically attempts to restart the engine before the job runs.

  • If Sync cannot restart the CDC engine, job execution proceeds based on the availability of stage files. If no stage files are available, the job fails with the following error: There are no stage files to consume.

If these conditions prevent the CDC job from running successfully, you might need to reset the CDC engine, as explained in the next section.

Resetting the CDC Engine

In some cases, the Change Data Capture (CDC) engine cannot resume processing from its last recorded position. This situation can occur when required source database logs are no longer available or when the stored offset is no longer valid.

To simplify recovery, Sync provides a Reset option in the user interface that allows you to reset the CDC engine without manually accessing the server or deleting offset files. This Reset option is available for supported CDC sources.

You can reset the CDC engine for a job from the CDC Engine settings (JobName > Overview > CDC Engine), as follows:

  1. Open your job in Sync.

  2. On the Overview tab, locate the CDC Engine section.

  3. Click the Reset button.

    When you click the button, the Reset CDC Engine dialog box appears and prompts you to confirm the action.

    Click Reset in the dialog box to proceed. Be aware that resetting the CDC engine stops the engine and clears all staging files.

After You Reset the CDC Engine

When you reset the CDC engine, Sync performs the following actions:

  • stops the CDC engine if it is running

  • clears the stage folder of captured events

  • clears the stage table for the job

  • deletes the offset file

  • resets the number of processed rows to 0

  • removes CDC engine status records for the job’s tasks

When the reset is complete, Sync displays a confirmation message.

After you reset the CDC engine, the next job run performs a full snapshot of the source data. This behavior ensures that the destination is fully resynchronized with the source.

When to Reset the CDC Engine

You might need to reset the CDC engine in the following situations:

  • The CDC engine fails with errors indicating that required log files are no longer available.

  • Source database logs are removed because of retention policies or system changes.

  • The stored offset is no longer valid and the job cannot resume.

  • Automatic recovery does not resolve the issue.

Automatic Recovery for Out-of-Range Errors

For supported connectors, Sync can automatically recover from certain CDC failures. When the CDC engine detects an out-of-range condition (for example, when required logs are no longer available), Sync performs the following actions:

  • sets the CDC engine status to indicate the out-of-range condition

  • clears the stage

  • displays an error message in the user interface and temporarily disables the Start option while preparing for recovery

On the next job run, Sync restarts the CDC engine and refreshes the destination table instead of dropping and re-creating it. This approach preserves the destination schema and any downstream integrations that depend on the table.

Note: Automatic out-of-range recovery is currently supported only for the Oracle and Informix source connectors.

If automatic recovery does not resolve the issue, you can reset the CDC engine manually, as explained in the previous section.