Amazon S3
Version 25.3.9396
You can use the Amazon S3 connector from the CData Sync application to move data from any supported source to the Amazon S3 destination. To do so, you need to add the connector, authenticate to the connector, and complete your connection.
Prerequisites
Before you configure the Amazon S3 destination with the Delta Parquet file format in the Microsoft Windows operating system (OS), make sure that your environment meets the requirements explained below. These prerequisites ensure that Sync can interact correctly with Delta Lake by locating the required Hadoop binaries under Windows.
Configure your Windows OS, as follows:
-
Download Hadoop binaries (recommended version: 2.8.1 or later).
-
Set the HADOOP_HOME environment variable to the Hadoop installation directory.
-
Ensure that %HADOOP_HOME%\bin is included in your PATH system variable (specifically, %HADOOP_HOME%\bin\winutils.exe must be accessible).
This configuration is necessary because Delta Lake (on Spark) uses Hadoop’s file system APIs to access local storage, the Hadoop Distributed File System (HDFS), and cloud object stores like Amazon S3. Under Windows, Spark must be able to locate the Hadoop binaries (including winutils.exe and other native libraries) to function correctly. Without this configuration, operations such as writing Delta tables, managing checkpoints, or accessing cloud storage can fail with permission or file-system errors.
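The checks described above can be sketched as a small Python script that you run on the Windows machine before you create the connection. This is a minimal sketch, not part of Sync: the function name check_hadoop_setup is hypothetical, and the script only confirms that HADOOP_HOME is set and that winutils.exe can be found, both in %HADOOP_HOME%\bin and through PATH.

```python
import os
import shutil
from pathlib import Path


def check_hadoop_setup(env=None):
    """Return a list of problems with the Hadoop setup (empty if none found).

    Hypothetical helper for illustration; it is not part of CData Sync.
    """
    env = os.environ if env is None else env
    problems = []

    hadoop_home = env.get("HADOOP_HOME")
    if not hadoop_home:
        # Nothing else can be checked without the installation directory.
        return ["HADOOP_HOME is not set"]

    # winutils.exe is expected in the bin folder of the Hadoop installation.
    winutils = Path(hadoop_home) / "bin" / "winutils.exe"
    if not winutils.is_file():
        problems.append(f"winutils.exe not found at {winutils}")

    # shutil.which searches the directories listed in the given PATH string.
    if shutil.which("winutils.exe", path=env.get("PATH", "")) is None:
        problems.append("winutils.exe is not reachable through PATH")

    return problems


if __name__ == "__main__":
    issues = check_hadoop_setup()
    print("Hadoop setup looks complete." if not issues else "\n".join(issues))
```

An empty result means only that the files are where Spark expects them. It does not confirm that the Hadoop version is compatible.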
Supported File Formats
When Sync writes data to Amazon S3, you can choose the file format for the exported data. The following file formats are supported for the Amazon S3 destination:
-
(Default) Delta Parquet—A format that uses a Delta Lake storage layer on top of the Parquet file format that is used by Sync to support delta processing. Delta processing is a method where, after your initial job run, only new or modified files are written or read in subsequent runs, which can reduce job times and resource use.
Limitations:
-
Naming restrictions: Table and column names cannot include special characters or reserved SQL and Delta Lake keywords. Examples of special characters include spaces, commas, semicolons, braces, parentheses, equal signs, and the newline (\n) and tab (\t) characters.
-
Primary keys: Primary key constraints are not supported. Sync uses the source primary keys for incremental replication.
-
Data types: Unlike traditional databases, Delta Lake does not support column-size definitions (for example, VARCHAR(100)). It supports only a fixed set of data types and allows type widening when necessary.
-
Schema changes: The ALTER TABLE command supports only adding new columns. Changing the data type of an existing column (for example, from INT to VARCHAR) is not supported.
-
Delete operations: In standard jobs, both hard and soft deletions are supported. In CDC and enhanced CDC jobs, only soft deletions are supported.
-
Parquet—A columnar storage format that is optimized for analytics.
-
CSV—Plain text comma-separated values.
-
Avro—A row-based binary format that supports schema evolution.
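The naming restrictions listed for the Delta Parquet format can be checked before a job runs. The following sketch assumes that the restricted characters are exactly the ones named in the limitations above. The source describes them as examples, so the real set might be larger. The reserved-word set SAMPLE_RESERVED and the function name_problems are hypothetical and deliberately tiny; the full list of reserved SQL and Delta Lake keywords is not reproduced here.

```python
# Characters named explicitly in the Delta Parquet naming restrictions.
# Assumption: the documentation calls these "examples", so this set may be incomplete.
RESTRICTED_CHARS = set(" ,;{}()=\n\t")

# Hypothetical sample of reserved words, for illustration only.
SAMPLE_RESERVED = {"select", "from", "table", "where"}


def name_problems(name):
    """Return a list of reasons why a table or column name might be rejected."""
    problems = []

    # Collect each restricted character that appears in the name.
    bad = sorted(c for c in set(name) if c in RESTRICTED_CHARS)
    if bad:
        problems.append(f"contains restricted characters: {bad!r}")

    # Keyword comparison is case-insensitive in this sketch.
    if name.lower() in SAMPLE_RESERVED:
        problems.append("matches a reserved keyword in the sample list")

    return problems
```

For example, name_problems("order_total") returns an empty list, while name_problems("order total") reports the space.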
Add the Amazon S3 Connector
To enable Sync to use data from Amazon S3, you first must add the connector, as follows:
-
Open the Connections page of the Sync dashboard.
-
Click Add Connection to open the Select Connectors page.
-
Click the Destinations tab and locate the Amazon S3 row.
-
Click the Configure Connection icon at the end of that row to open the New Connection page. If the Configure Connection icon is not available, click the Download Connector icon to install the Amazon S3 connector. For more information about installing new connectors, see Connections.
Authenticate to Amazon S3
After you add the connector, you need to set the required properties.
-
Connection Name: Enter a connection name of your choice.
-
File Format: Select the file format that you want to use: Delta Parquet (default), CSV, Avro, or Parquet.
-
URI: Enter the path of your bucket and folder (for example, s3://BucketName/FolderName).
-
AWS Region: Select the hosting region for Amazon Web Services. The default region is NORTHERNVIRGINIA.
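The URI property combines the bucket name and the folder path in a single value. As a rough illustration of how that value breaks down, the following sketch splits an s3:// URI into its two parts with the Python standard library. The function name split_s3_uri is hypothetical, and the sketch makes no claim about how Sync itself parses the property.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3://bucket/folder URI into (bucket, folder)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an s3://bucket/folder URI: {uri!r}")
    # The bucket is the authority part; the folder is the path without
    # its leading and trailing slashes.
    return parsed.netloc, parsed.path.strip("/")
```

With the example value from the URI property, split_s3_uri("s3://BucketName/FolderName") returns ("BucketName", "FolderName").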
CData Sync supports authenticating to Amazon S3 in several ways, based on the file format that you select.
- AWS Root Keys (default)
- AWS EC2 Roles
- AWS IAM Roles
- Active Directory Federation Services
- Okta
- PingFederate
- AWS Temporary Credentials
- AWS Credentials File
- Azure Active Directory
Note: The authentication methods above are for all file formats except Delta Parquet. That format uses only the AWS Root Keys method.
AWS Root Keys
To connect with your account root credentials, specify the following properties:
-
Auth Scheme: Select AwsRootKeys.
-
AWS Access Key: Enter your Amazon Web Services (AWS) account access key. You can locate this value on your AWS security credentials page.
-
AWS Secret Key: Enter your AWS account secret key. You can locate this value on your AWS security credentials page.
-
(Optional) MFA Serial Number: Enter the serial number for your multifactor authentication (MFA) device, if you are using such a device.
-
(Optional) MFA Token: Enter the temporary token that is available from your MFA device.
-
Temporary Token Duration: Enter the duration, in seconds, that you want for your temporary credentials. The default duration is 3600.
AWS EC2 Roles
When you run CData Sync on an EC2 instance, CData Sync can authenticate by using the IAM role that is assigned to the instance. Select AwsEC2Roles for Auth Scheme to use that role. No additional properties are required.
AWS IAM Roles
To connect with your IAM user credentials, specify the following properties:
-
Auth Scheme: Select AwsIAMRoles.
-
AWS Access Key: Enter your Amazon Web Services (AWS) account access key. You can locate this value on your AWS security credentials page.
-
AWS Secret Key: Enter your AWS account secret key. You can locate this value on your AWS security credentials page.
-
AWS Role ARN: Enter the Amazon Resource Name (ARN) for the role with which you want to authenticate.
-
(Optional) AWS External Id: Enter the unique identifier that is required when you assume a role in another account.
-
(Optional) MFA Serial Number: Enter the serial number for your multifactor authentication (MFA) device, if you are using such a device.
-
(Optional) MFA Token: Enter the temporary token that is available from your MFA device.
-
Temporary Token Duration: Enter the duration, in seconds, that you want for your temporary credentials. The default duration is 3600.
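A mistyped role ARN is an easy error to make when you fill in the AWS Role ARN property. The following sketch checks that a value has the general shape of an IAM role ARN (arn:partition:iam::account-id:role/role-name) and extracts its parts. The function name parse_role_arn is hypothetical, and the pattern is a loose format check only. It does not confirm that the role exists or that it can be assumed.

```python
import re

# Loose pattern for IAM role ARNs: partition, 12-digit account ID, and a
# role name that may include a path (for example, role/team/SyncWriter).
ROLE_ARN = re.compile(r"^arn:(aws[a-z-]*):iam::(\d{12}):role/(.+)$")


def parse_role_arn(arn):
    """Return the partition, account ID, and role name from a role ARN."""
    match = ROLE_ARN.match(arn)
    if match is None:
        raise ValueError(f"not an IAM role ARN: {arn!r}")
    partition, account_id, role = match.groups()
    return {"partition": partition, "account_id": account_id, "role": role}
```

For example, parse_role_arn("arn:aws:iam::123456789012:role/SyncWriter") returns the partition "aws", the account ID "123456789012", and the role "SyncWriter". The account ID and role name in this example are placeholders.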
Active Directory Federation Services
To connect with single sign-on (SSO) via Active Directory Federation Services (ADFS), specify the following properties:
-
Auth Scheme: Select ADFS.
-
User: Enter the username that you use to authenticate to your ADFS account.
-
Password: Enter the password that you use to authenticate to your ADFS account.
-
SSO Login URL: Enter the login URL that is used by your SSO provider.
-
Use Lake Formation: Select True if you want the AWS Lake Formation service to retrieve temporary credentials. These temporary credentials enforce access policies against the user based on the configured IAM role. You can use this service when you authenticate through AzureAD, Okta, ADFS, and PingFederate, while providing a Security Assertion Markup Language (SAML) assertion. The default setting for Use Lake Formation is False.
-
(Optional) SSO Properties: Enter a semicolon-separated list of the single sign-on (SSO) properties that you want to use (for example, SSOProperty1=Value1;SSOProperty2=Value2;…).
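The SSO Properties value is a semicolon-separated list of Name=Value pairs. The following sketch shows one way to read such a string into a dictionary, which can help you verify a value before you paste it into the property. The function name parse_sso_properties is hypothetical, and the property names in the example are the placeholders used above, not real SSO settings.

```python
def parse_sso_properties(text):
    """Parse 'Name1=Value1;Name2=Value2' into a dictionary."""
    props = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            # Ignore empty segments, such as the one after a trailing semicolon.
            continue
        key, sep, value = item.partition("=")
        if not sep or not key.strip():
            raise ValueError(f"expected Name=Value, got {item!r}")
        props[key.strip()] = value.strip()
    return props
```

For example, parse_sso_properties("SSOProperty1=Value1;SSOProperty2=Value2") returns {"SSOProperty1": "Value1", "SSOProperty2": "Value2"}.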
Okta
To connect with single sign-on (SSO) via Okta, specify the following properties:
-
Auth Scheme: Select Okta.
-
User: Enter the username that you use to authenticate to your Okta account.
-
Password: Enter the password that you use to authenticate to your Okta account.
-
SSO Login URL: Enter the login URL that is used by your SSO provider.
-
Use Lake Formation: Select True if you want the AWS Lake Formation service to retrieve temporary credentials. These temporary credentials enforce access policies against the user based on the configured IAM role. You can use this service when you authenticate through AzureAD, Okta, ADFS, and PingFederate, while providing a Security Assertion Markup Language (SAML) assertion. The default setting for Use Lake Formation is False.
-
(Optional) SSO Properties: Enter a semicolon-separated list of the single sign-on (SSO) properties that you want to use (for example, SSOProperty1=Value1;SSOProperty2=Value2;…).
PingFederate
To connect with single sign-on (SSO) via PingFederate, specify the following properties:
-
Auth Scheme: Select PingFederate.
-
User: Enter the username that you use to authenticate to your PingFederate account.
-
Password: Enter the password that you use to authenticate to your PingFederate account.
-
SSO Login URL: Enter the login URL that is used by your SSO provider.
-
SSO Exchange URL: Enter the Partner Service Identifier URI that is configured in your PingFederate server instance. The URI is available under SP Connections > SP Connection > WS-Trust > Protocol Settings.
-
Use Lake Formation: Select True if you want the AWS Lake Formation service to retrieve temporary credentials. These temporary credentials enforce access policies against the user based on the configured IAM role. You can use this service when you authenticate through AzureAD, Okta, ADFS, and PingFederate, while providing a Security Assertion Markup Language (SAML) assertion. The default setting for Use Lake Formation is False.
-
(Optional) AWS Principal ARN: Enter the Amazon Resource Name (ARN) of the Security Assertion Markup Language (SAML) identity provider in your AWS account.
-
(Optional) SSO Properties: Enter a semicolon-separated list of the single sign-on (SSO) properties that you want to use (for example, SSOProperty1=Value1;SSOProperty2=Value2;…).
AWS Temporary Credentials
To connect with AWS temporary credentials, specify the following properties:
-
Auth Scheme: Select AwsTempCredentials.
-
AWS Access Key: Enter the access key that is associated with your Amazon Web Services (AWS) account. This value is accessible from your AWS security credentials page.
-
AWS Secret Key: Enter the secret key that is associated with your AWS account. This value is accessible from your AWS security credentials page.
-
AWS Session Token: Enter your AWS session token. This token is provided with your temporary credentials. For more information, see AWS Identity and Access Management: User Guide.
AWS Credentials File
To connect with a credentials file, specify the following properties:
-
Auth Scheme: Select AwsCredentialsFile.
-
AWS Credentials File: Enter the location of your Amazon Web Services (AWS) credentials file.
-
(Optional) AWS Credentials File Profile: Enter the name of the AWS profile that you want to use from the credentials file that you specify. If you do not enter a profile name, Sync uses the profile named default.
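An AWS credentials file is an INI-style text file in which each section is a named profile. The following sketch reads one profile from such a file with the Python standard library and falls back to the profile named default, which mirrors the behavior described above. The function name read_aws_profile is hypothetical. The sketch illustrates the file layout only and is not how Sync reads the file.

```python
import configparser


def read_aws_profile(path, profile="default"):
    """Return the credential entries stored under one profile."""
    # Interpolation is disabled because secret keys can contain '%'.
    parser = configparser.ConfigParser(interpolation=None)
    if not parser.read(path):
        raise FileNotFoundError(path)
    if not parser.has_section(profile):
        raise KeyError(f"profile {profile!r} not found in {path}")

    section = parser[profile]
    return {
        "aws_access_key_id": section.get("aws_access_key_id"),
        "aws_secret_access_key": section.get("aws_secret_access_key"),
        # The session token is present only for temporary credentials.
        "aws_session_token": section.get("aws_session_token"),
    }
```

If the profile that you enter in AWS Credentials File Profile is missing from the file, this sketch raises a KeyError. A similar check is a quick way to catch a misspelled profile name.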
Azure Active Directory
To connect with an Azure Active Directory (AD) user account, specify the following properties:
-
Auth Scheme: Select AzureAD.
-
Use Lake Formation: Select True if you want the AWS Lake Formation service to retrieve temporary credentials. These temporary credentials enforce access policies against the user based on the configured IAM role. You can use this service when you authenticate through AzureAD, Okta, ADFS, and PingFederate, while providing a Security Assertion Markup Language (SAML) assertion. The default setting for Use Lake Formation is False.
-
OAuth Client Id: Enter the client Id that you were assigned when you registered your application with an OAuth authorization server.
-
OAuth Client Secret: Enter the client secret that you were assigned when you registered your application with an OAuth authorization server.
Complete Your Connection
To complete your connection:
-
Specify the following properties:
For all formats:
- (Optional) Storage Base URL: Enter the URL of your cloud-storage service provider.
For the Delta Parquet and CSV file formats only:
-
FMT: Enter the format that you want to use to parse all text files. The default format is CsvDelimited.
-
Aggregate Files: Specify whether you want to aggregate all the files that are located in the URI directory and that have the same schema into a single table named AggregatedFiles. The default option is False.
-
Include Column Headers: Specify whether you want to obtain column headers from the first lines of the specified files. The default option is True.
For all file formats except Delta Parquet:
-
Data Model: Select the data model that you want to use to parse documents for your format and to generate the database metadata. The default data model is Document.
-
Aggregate Files: Specify whether you want to aggregate all the files that are located in the URI directory and that have the same schema into a single table named AggregatedFiles. The default option is False.
-
Define advanced connection settings on the Advanced tab. (In most cases, though, you should not need these settings.)
-
Click Create & Test to create your connection.
More Information
For more information about interactions between CData Sync and Amazon S3, see Amazon S3 Connector for CData Sync.