The Parquet connector enables exporting data in Parquet format to the local filesystem or to one of the other supported storage types listed below.

Parquet Connector Data Source Creation

CALL SYSADMIN.createConnection(name => <parquetalias>, jbossCLITemplateName => 'ufile', connectionOrResourceAdapterProperties => 'ParentDirectory="directory"') ;;
CALL SYSADMIN.createDataSource(name => <parquetalias>, translator => 'parquet', modelProperties => null, translatorProperties => null) ;;

The parquet translator is compatible with several connectors, i.e. different storage types can be used for .parquet files:

  • ufile - local file storage
  • ftp - FTP file storage
  • sftp - SFTP file storage
  • scp - SCP file storage
  • s3 - Amazon S3 file storage
  • blob - Azure Blob file storage

ufile Connector

CALL SYSADMIN.createConnection(name => 'parquet_ufile', jbossCLITemplateName => 'ufile', connectionOrResourceAdapterProperties => 'ParentDirectory="D:/parquet"') ;;
CALL SYSADMIN.createDataSource(name => 'parquet_ufile', translator => 'parquet', modelProperties => 'importer.loadMetadata=true', translatorProperties => null) ;;

ftp Connector

CALL SYSADMIN.createConnection(name => 'parquet_ftp', jbossCLITemplateName => 'ftp', connectionOrResourceAdapterProperties => 'host=localhost,port=21,secure=false,explicitTls=false,passive=true,user=<ftpUser>,password=<password>') ;;
CALL SYSADMIN.createDataSource(name => 'parquet_ftp', translator => 'parquet', modelProperties => 'importer.loadMetadata=true', translatorProperties => null) ;;

sftp Connector

CALL SYSADMIN.createConnection(name => 'parquet_sftp', "jbossCLITemplateName" => 'sftp', "connectionOrResourceAdapterProperties" => 'host=localhost,port=2022,user=<ftpUser>,password=<password>', "encryptedProperties" => '') ;;
CALL SYSADMIN.createDataSource(name => 'parquet_sftp', translator => 'parquet', modelProperties => 'importer.loadMetadata=true', translatorProperties => '', encryptedModelProperties => '', encryptedTranslatorProperties => '') ;;

scp Connector

CALL SYSADMIN.createConnection(name => 'parquet_scp', "jbossCLITemplateName" => 'scp', "connectionOrResourceAdapterProperties" => 'port=2022,host=localhost,decompressCompressedFiles=false,user=<ftpUser>,password=<password>', "encryptedProperties" => '') ;;
CALL SYSADMIN.createDataSource(name => 'parquet_scp', translator => 'parquet', modelProperties => 'importer.loadMetadata=true', translatorProperties => '', encryptedModelProperties => '', encryptedTranslatorProperties => '') ;;

s3 Connector

CALL SYSADMIN.createConnection(name => 'parquet_s3', jbossCLITemplateName => 's3', connectionOrResourceAdapterProperties => 'region=<region>,keyId=<keyId>,secretKey=<secretKey>,bucketName=<bucketName>') ;;
CALL SYSADMIN.createDataSource(name => 'parquet_s3', translator => 'parquet', modelProperties => 'importer.loadMetadata=true', translatorProperties => '', encryptedModelProperties => '', encryptedTranslatorProperties => '') ;;

blob Connector

CALL SYSADMIN.createConnection(name => 'parquet_blob', jbossCLITemplateName => 'blob', connectionOrResourceAdapterProperties => 'accountName=<accountName>,accountKey=<accountKey>,defaultEndpointsProtocol=https,containerName=<containerName>') ;;
CALL SYSADMIN.createDataSource(name => 'parquet_blob', translator => 'parquet', modelProperties => 'importer.loadMetadata=true', translatorProperties => '', encryptedModelProperties => '', encryptedTranslatorProperties => '') ;;

Model Properties

Name: importer.loadMetadata
Description: When set to TRUE, the data source loads the metadata of the tables that were present in the folder prior to data source creation.
Default value: FALSE

Translator Properties

Name: compression
Description: Compression method for the Parquet file format. Possible values: UNCOMPRESSED, GZIP, SNAPPY, ZSTD.

Only applies when writing files. Files compressed with a method other than the configured one can still be read.
Default value: GZIP

Name: writeSingleFile
Description: When set to TRUE, tables created on the CData Virtuality Server side are stored as single files on the storage side. When new data is inserted into the table, the file is overwritten.

When set to FALSE, tables created on the CData Virtuality Server side are stored as directories containing multiple files with specific naming:

  • the directory is named <table_name>.parquet;
  • files within the directory are named <table_name>_<UID>.parquet.

When new data is inserted into the table, a new file is created.

This setting only applies to creating new tables/files or inserting data.
Default value: FALSE
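
As a minimal sketch of how the translator properties above might be set, the following statements create a ufile-backed Parquet data source that writes SNAPPY-compressed single files. The data source name parquet_snappy and the directory are hypothetical, and passing several translator properties as a comma-separated list of key=value pairs is an assumption based on how the connection properties are written elsewhere on this page.

```sql
-- Hypothetical name and directory; adjust to your environment
CALL SYSADMIN.createConnection(name => 'parquet_snappy', jbossCLITemplateName => 'ufile', connectionOrResourceAdapterProperties => 'ParentDirectory="/home/exportuser/snappy"') ;;
-- Assumption: translator properties are given as comma-separated key=value pairs
CALL SYSADMIN.createDataSource(name => 'parquet_snappy', translator => 'parquet', modelProperties => 'importer.loadMetadata=true', translatorProperties => 'compression=SNAPPY,writeSingleFile=true') ;;
```

With writeSingleFile set to TRUE, each insert overwrites the table's file, so a configuration like this suits full reloads rather than incremental inserts.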

Usage

The Parquet connector can manage data represented as single files or collections of files within a folder. Files created outside the CData Virtuality Server will still be handled as tables:

  • a single file with a .parquet extension is represented as a table; when new data is inserted, the file is overwritten;
  • multiple files with a .parquet extension inside a folder that itself has a .parquet extension are treated as a single table with the same name as the folder; file naming inside the folder does not matter, and new files are added upon insert.

New files will be created according to the writeSingleFile translator property.

Data is exported using the SELECT INTO command:

SELECT *
INTO <parquet data source name>.<table name>
FROM ...

The data is exported to the location defined in the connection properties (for the ufile connector, the ParentDirectory property). With writeSingleFile set to FALSE (the default), the table is represented by a folder named according to the following pattern: <parquet data source name>_<table name>.parquet. The folder contains files named like <table name>_<UID>.parquet. When new data is inserted into a table, a new file is created in the respective table folder; existing files are not rewritten.

You can also create a table using the CREATE TABLE statement. However, the physical file will only be created when some data is inserted into this table using the INSERT VALUES or INSERT SELECT statement.
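
The CREATE TABLE route described above might look like the following sketch. It assumes a Parquet data source named parquet_1 already exists; the table example_customers, its columns, and the source table adventureworks.customer are hypothetical and only serve as an illustration.

```sql
-- Define the table; no physical file exists yet
CREATE TABLE parquet_1.example_customers (id integer, name string) ;;

-- The first insert is what creates the physical Parquet file
INSERT INTO parquet_1.example_customers (id, name) VALUES (1, 'Alice') ;;

-- INSERT ... SELECT is the other way to add data (source table and columns are hypothetical)
INSERT INTO parquet_1.example_customers (id, name)
SELECT customerid, lastname FROM adventureworks.customer ;;
```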

Example

CALL SYSADMIN.createConnection(name => 'parquet_1', jbossCLITemplateName => 'ufile', connectionOrResourceAdapterProperties => 'ParentDirectory="/home/exportuser/examples"') ;;
CALL SYSADMIN.createDataSource(name => 'parquet_1', translator => 'parquet', modelProperties => 'importer.loadMetadata=true', translatorProperties => null) ;;
 
SELECT *
INTO parquet_1.example_salesorderdetail
FROM adventureworks.salesorderdetail ;;

As a result of this call, the content of the salesorderdetail table in the adventureworks schema will be exported into a file named something like example_salesorderdetail_1e04e8d5-f963-11ed-a1bc-0a0027000003.parquet in the /home/exportuser/examples/parquet_1.example_salesorderdetail.parquet folder.
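
Since reading from Parquet tables is supported, the exported table can be queried like any other table. A short sketch, reusing the parquet_1 data source from the example above; the column names in the second query are assumptions about the salesorderdetail table:

```sql
-- Read the exported data back from the Parquet table
SELECT * FROM parquet_1.example_salesorderdetail ;;

-- Aggregate over the exported table (column names are assumed)
SELECT salesorderid, SUM(orderqty) AS total_qty
FROM parquet_1.example_salesorderdetail
GROUP BY salesorderid ;;
```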

The following changes were introduced in v.3.9:

  • ufile jbossCLITemplateName is used for creating Parquet data sources;
  • importer.loadMetadata model property is available;
  • Tables are stored in dedicated folders;
  • Files are not re-written when inserting data;
  • Reading from Parquet tables is possible.

The compression translator property is available since v4.5.

See Also

Parquet File Creation and S3 Storage with Data Virtuality to learn how to take any data source table and create a local Parquet file.

Query Parquet Files in Data Virtuality Using Amazon Athena for information on how to read from Parquet.