Architecture

Overview

CData Sync allows you to easily move data from applications, relational databases, and many other data sources to a data warehouse or data lake. As the ecosystem of productivity applications grows and businesses increasingly make data-driven decisions, it is important to have a strategy for data management. Sync is designed to help you move data easily and efficiently to whatever data store you choose.

Core Concepts

  • Connections: To begin creating a data pipeline, you first need to configure your source and destination connections. Sync comes pre-installed with a select list of popular sources and destinations that make transferring data easy. In addition, many more connectors are available to download from the CData website. For more information on creating connections, refer to the Creating Connections documentation. For information on connectors and a list of available sources and destinations, refer to the Connectors documentation.
  • Jobs: A Sync job is a unit of execution that includes the source connection, the destination connection, and a set of one or more tasks that represent data flows for different tables. All tasks within a job are executed sequentially, allowing you to order data flows as you see fit. The Jobs page in the admin console lists each configured job in the Sync application. From there, you can manage and monitor existing jobs and add new jobs to the application. For more information, refer to the Jobs documentation.
  • Tasks: Tasks control the data flow from a source table into a destination table. For more information, refer to the Tasks documentation. A conceptual sketch of how connections, jobs, and tasks fit together follows this list.
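
The following Python sketch is a simplified conceptual model of how these pieces relate. The class names, fields, and the replicate helper are illustrative assumptions, not part of the Sync product or its API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Connection:
    """A configured source or destination."""
    name: str
    connector: str                      # e.g. "Salesforce", "Snowflake"
    properties: Dict[str, str] = field(default_factory=dict)

@dataclass
class Task:
    """One data flow: a source table replicated into a destination table."""
    source_table: str
    destination_table: str

@dataclass
class Job:
    """A unit of execution: one source, one destination, ordered tasks."""
    name: str
    source: Connection
    destination: Connection
    tasks: List[Task] = field(default_factory=list)

    def run(self) -> None:
        # Tasks within a job run sequentially, in the order they are defined.
        for task in self.tasks:
            replicate(self.source, self.destination, task)

def replicate(source: Connection, destination: Connection, task: Task) -> None:
    # Hypothetical stand-in for moving one table's data end to end.
    print(f"{source.name}: {task.source_table} -> "
          f"{destination.name}: {task.destination_table}")
```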

Initial Load

The first time you run a job, Sync processes the entirety of the source’s historical data. Since this can be a huge amount of information, Sync uses a number of strategies to maximize efficiency, performance, and integrity. Sync also provides a number of user-controlled options to optimize the synchronization strategy for your specific data set.

Sync goes through the following steps when processing an initial load:

  • Replicate Start Date: Sync’s initial load starts at the source’s minimum date, meaning Sync begins moving data from the source’s earliest available records. Some APIs do not provide a way to request the minimum date for an entity. In that case, you can set the start date manually with the ReplicateStartDate option. If no minimum date is available and none is configured manually, Sync defaults to moving all data.

    Note: To manually set the min date, navigate to the Job Settings page, click the Tables tab, and then click the table to open the Task Settings modal. In the Task Settings modal, click the Advanced tab. From this screen, you can use the Replicate Start Date option to set the minimum start date (min date) at which to begin replicating data. The accepted date formats are yyyy-MM-dd and yyyy-MM-dd HH:mm:ss.

  • Replicate Intervals: Once the min date is determined, Sync moves the remainder of your data in increments of the defined interval until it reaches the end of the data. For example, with the default 180-day interval, Sync processes data from the min date through the first 180 days, then continues in 180-day increments until it reaches the end (see the sketch after this list).

    • Replicate Interval: Paired with Replicate Interval Unit, this setting defines the size of the time window used to split the data during retrieval. The interval batches the updates so that, if a failure occurs or the replication is interrupted, the next attempt can pick up where the last run left off. By default, Sync uses 180 days, but you can make the interval larger or smaller depending on how much data you have and how it is spread out over time.
    • Replicate Interval Unit: Paired with Replicate Interval, this setting defines the unit of the time window. Accepted values are: minutes, hours, days, weeks, months, years.
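
The following Python sketch illustrates how a date range can be split into replicate intervals of this kind. It is an illustration of the batching idea only, with hypothetical function and parameter names; it is not Sync's implementation.

```python
from datetime import datetime, timedelta

def replicate_intervals(start_date: datetime, end_date: datetime,
                        interval: int = 180, unit: str = "days"):
    """Yield (window_start, window_end) pairs that cover the initial load.

    Mirrors the documented default of 180 days. timedelta only supports
    minutes, hours, days, and weeks directly; months and years would need
    extra handling.
    """
    step = timedelta(**{unit: interval})
    window_start = start_date
    while window_start < end_date:
        window_end = min(window_start + step, end_date)
        yield window_start, window_end
        window_start = window_end

# Example: an initial load beginning at a manually configured start date.
for lo, hi in replicate_intervals(datetime(2020, 1, 1), datetime(2021, 1, 1)):
    print(f"replicate rows changed between {lo:%Y-%m-%d} and {hi:%Y-%m-%d}")
```

Because each window is processed independently, an interrupted initial load can resume at the first window that has not yet completed.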

Incremental Loads

After the initial load, Sync moves data via incremental loads. Incremental loads include only data that has been added or changed since the last run. This greatly reduces the workload, especially when dealing with large data sets. Many cloud systems use APIs, and pulling full data from those APIs into a data warehouse can be a slow process. Many APIs also enforce daily quotas that make it impossible to pull all of the data every day, much less every hour or every 15 minutes. By moving data in increments, Sync gives the user tremendous flexibility when dealing with slow APIs or daily quotas.

Sync uses two primary methods to move only updated data: Incremental Check Column and Change Tracking. For more information on Incremental Updates, refer to the Features documentation.
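
As a rough illustration of the incremental check column approach, the sketch below pulls only rows whose check column is newer than the previous run's watermark. The table name, column name, and connection are hypothetical stand-ins for whatever the source exposes; this is not Sync's internal code.

```python
def incremental_load(conn, last_watermark: str):
    """Pull only rows added or changed since the previous job run.

    'ModifiedDate' stands in for the source table's incremental check column;
    'conn' is any DB-API connection. Illustrative only.
    """
    cur = conn.cursor()
    cur.execute(
        "SELECT Id, Name, ModifiedDate FROM Accounts WHERE ModifiedDate > ?",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # The largest check-column value seen becomes the watermark for the next run.
    new_watermark = max((r[2] for r in rows), default=last_watermark)
    return rows, new_watermark
```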

Data Integrity

As part of any data integration strategy, it is important to make sure the data is consistent between the original source and the destination. If an error occurs in your data pipeline, or if a job is interrupted, you need your data pipeline process to resume where it left off. This ensures that no data is lost between updates or in the event of an error. Sync automatically handles this for you without the need to configure any complex data loading scripts or processes.

Sync processes data in sync intervals. This means that, rather than attempting to move all of your data at once, Sync breaks the data into manageable intervals (or “chunks” of data) and processes one interval at a time. This greatly increases performance and allows Sync to maintain data integrity in the event of an error. Sync matches the source and destination tables and, if an error occurs, discards all data from the current interval so that processing can restart at that point.

For example, imagine a large sync job has nearly completed when an error occurs. Instead of starting the entire job from the beginning, Sync restarts from the last successful interval, saving time and resources.
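
A minimal sketch of this checkpointing behavior is shown below. The helper functions are hypothetical placeholders, not Sync APIs; the point is that progress is committed only after an interval succeeds, so a failed interval is simply retried on the next run.

```python
def run_with_checkpoints(intervals, process_interval, load_checkpoint, save_checkpoint):
    """Process intervals in order, committing a checkpoint after each success.

    'process_interval', 'load_checkpoint', and 'save_checkpoint' are
    placeholders for illustration.
    """
    done_through = load_checkpoint()      # end of the last completed interval, or None
    for lo, hi in intervals:
        if done_through is not None and hi <= done_through:
            continue                      # already replicated by a previous run
        # If this call fails, no checkpoint is written, so the partial interval
        # is discarded and the next run restarts it from 'lo'.
        process_interval(lo, hi)          # e.g. stage into a temp table, then merge
        save_checkpoint(hi)
```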

Note: Some APIs enforce access limits that restrict the number of times they can be called within a given period. These limits can cause errors. In the event of such an error, Sync discards the incomplete sync records and begins again from that point at the next scheduled job run. Users can set the interval size, allowing them to dictate how much data is pulled down in each interval and limiting the amount of data that must be retried if an error occurs.

Performance

Performance is critical to ensure business teams can access data in a timely fashion. When moving large amounts of data from sources that may have slow-responding APIs, Sync uses a number of features to improve data integrity and maximize job performance:

Best-in-Class Connectors

The demands created by analytics and cognitive computing mean applications must be able to retrieve and process massive data sets at speeds that were once inconceivable. Sync’s best-in-class connectors are over twice as fast as other SaaS, NoSQL & Big Data connectivity solutions, giving users the speed and reliability needed to operate at peak efficiency in the era of Big Data.

Batch Insert Into Temporary Table

Sync uses Batch Insert to insert multiple rows simultaneously. Rather than inserting one row at a time, Sync inserts a batch of rows into the database at once using a single statement, greatly reducing network overhead and increasing the speed at which jobs are processed.
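
The sketch below shows the batching idea using a DB-API executemany call. The staging table, column names, and placeholder style are illustrative assumptions, not the statements Sync actually issues.

```python
def batch_insert(conn, rows, batch_size=500):
    """Insert staged rows in batches instead of one statement per row.

    'conn' is any DB-API connection ('?' placeholders shown SQLite-style);
    the table and column names are illustrative.
    """
    sql = "INSERT INTO Accounts_TEMP (Id, Name, ModifiedDate) VALUES (?, ?, ?)"
    cur = conn.cursor()
    for i in range(0, len(rows), batch_size):
        cur.executemany(sql, rows[i:i + batch_size])  # one round trip per batch
    conn.commit()
```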

Merge from Temporary Table Into Destination Table

When Sync moves data from the source to the destination, the data is first inserted into a temporary table before being integrated into the destination table via a MERGE operation. This improves performance and prevents Sync from locking the production table during the job run. In the event of an error, Sync can simply drop the temporary table instead of potentially corrupting the production table.

Using a temporary table greatly improves Sync’s efficiency in moving data. Unlike row-by-row upserts, which require reading the target table once per row, a merge executes within a single transaction: it reads the target table once and determines which rows to update and which to insert.
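
As a rough illustration of the staging-and-merge pattern, the sketch below upserts staged rows in one statement and then drops the staging table. The SQL dialect (a SQL Server-style MERGE) and the table and column names are assumptions for illustration; the exact statements depend on the destination.

```python
def merge_temp_into_destination(conn):
    """Merge staged rows from the temporary table into the production table.

    Illustrative only; 'conn' is any DB-API connection to the destination.
    """
    cur = conn.cursor()
    cur.execute("""
        MERGE INTO Accounts AS dest
        USING Accounts_TEMP AS src
           ON dest.Id = src.Id
        WHEN MATCHED THEN
            UPDATE SET dest.Name = src.Name, dest.ModifiedDate = src.ModifiedDate
        WHEN NOT MATCHED THEN
            INSERT (Id, Name, ModifiedDate)
            VALUES (src.Id, src.Name, src.ModifiedDate);
    """)
    cur.execute("DROP TABLE Accounts_TEMP")  # the staging table is disposable
    conn.commit()
```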

Parallel Processing

Jobs can be configured to use parallel processing, meaning that Sync can assign multiple worker threads to the same job. By employing multiple workers, Sync can divide its workload so that it moves more than one table simultaneously. Parallel processing greatly increases job efficiency by allowing Sync to move more data at once, in less time. Sync users can assign as many workers as they want on a per-job basis.
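
The sketch below illustrates the per-job worker idea with a Python thread pool. The replicate_table callable is a hypothetical placeholder for whatever moves a single table end to end; this is not how Sync itself is implemented.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_job_in_parallel(tables, replicate_table, workers=4):
    """Replicate several tables at once with a pool of worker threads.

    'workers' plays the role of the per-job worker count; illustrative only.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(replicate_table, t): t for t in tables}
        for future in as_completed(futures):
            future.result()  # re-raises any error from that table's replication
            print(f"finished {futures[future]}")
```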

Multiple Concurrent Readers

For source connections that support it, Sync creates multiple concurrent readers to read data from the source API. This means that Sync can read data from the same table concurrently, further improving the performance of your data pipeline.
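
The sketch below illustrates the concurrent-reader idea: one table is split into slices (for example, key ranges or date ranges), and each slice is read by its own reader. The partitioning scheme and the read_partition callable are hypothetical; this is not Sync's source code.

```python
from concurrent.futures import ThreadPoolExecutor

def read_table_concurrently(partitions, read_partition, readers=3):
    """Read one source table with several readers, one slice per reader.

    'partitions' might be key or date ranges; 'read_partition' stands in for
    a call against the source API. Illustrative only.
    """
    rows = []
    with ThreadPoolExecutor(max_workers=readers) as pool:
        for chunk in pool.map(read_partition, partitions):
            rows.extend(chunk)
    return rows
```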