ODBC Driver for Parquet

Build 24.0.9060

Linux DSN Configuration

This section describes how to set up ODBC connectivity and configure DSNs on several Linux distributions: Debian-based systems, like Ubuntu, and Red Hat Linux platforms, like Red Hat Enterprise Linux (RHEL) and Fedora.

Minimum Linux Versions

Here are the minimum supported versions for Red Hat-based and Debian-based systems:

OSMin. Version
Ubuntu18.04
Debian10
RHEL8
Fedora28
SUSE15

Installing the Driver Dependencies

Run the following commands as root or with sudo to install the necessary dependencies:

  • Debian/Ubuntu:
    apt-get install libc6 libstdc++6 zlib1g libgcc1
  • RHEL/Fedora:
    yum install glibc libstdc++ zlib libgcc

Installing the Driver

You can use standard package management systems to install the driver.

On Debian-based systems, like Ubuntu, run the following command with root or sudo:

dpkg -i /path/to/driver/setup/ParquetODBCDriverforUnix.deb 

On systems that support the RPM package format, run the following command with root or sudo:

rpm -ivh /path/to/driver/ParquetODBCDriverforUnix.rpm 

Licensing the Driver

Run the following commands to license the driver. To activate a trial, omit the <key> input.

cd /opt/cdata/cdata-odbc-driver-for-parquet/bin/
sudo ./install-license.sh <key>

Connecting through the Driver Manager

The driver manager loads the driver and passes function calls from the application to the driver. You need to register the driver with the driver manager and you define DSNs in the driver manager's configuration files.

The driver installation registers the driver with the unixODBC driver manager and creates a system DSN. The unixODBC driver manager can be used from Python and from many other applications. Your application may embed another driver manager.

Creating the DSN

See Using unixODBC to install unixODBC and configure DSNs. See Using the DataDirect Driver Manager to create a DSN to connect to OBIEE, Informatica, and SAS.

The CData ODBC Driver for Parquet allows connecting to local and remote Parquet resources. Set the URI property to the Parquet resource location, in addition to any other properties necessary to connect to your data source.

Connecting to Local Files

Set the ConnectionType to Local. Local files support SELECT\INSERT.

Set the URI to a folder containing Parquet files: C:\folder1.

You can also connect to multiple Parquet files which share the same schema. Below is an example connection string:

URI=C:\folder; AggregateFiles=True;

If you would prefer to expose all of the individual Parquet files as tables instead, leave this property False.

URI=C:\folder; AggregateFiles=False;

Connecting to Cloud-Hosted Parquet Files

While the driver is capable of pulling data from Parquet files hosted on a variety of cloud data stores, INSERT, UPDATE, and DELETE are not supported outside of local files in this driver.

If you need INSERT/UPDATE/DELETE cloud files, you can download the corresponding CData driver for that cloud host (supported via stored procedures), make changes with the local file's corresponding driver, then upload the file using the cloud source's stored procedures.

As an example, if you wanted to update a file stored on SharePoint, you could use the CData SharePoint driver's DownloadDocument procedure to download the Parquet file, update the local Parquet file with the CData Parquet driver, then use the SharePoint driver's UploadDocument procedure to upload the changed file to SharePoint.

A unique prefix at the beginning of the URI connection property is used to identify the cloud data store being targed by the driver and the remainder of the path is a relative path to the desired folder (one table per file) or single file (a single table).

Amazon S3

Set the following to identify your Parquet resources stored on Amazon S3:

  • ConnectionType: Set the ConnectionType to Amazon S3.
  • URI: Set this to the bucket and folder: s3://bucket1/folder1.
    • You can also connect to Parquet resources stored on Cloudera Ozone, after creating a volume and bucket and making a symbolic link to that bucket: s3://linktobucket/

See Connecting to Amazon S3 for more information regarding how to connect and authenticate to Parquet files hosted on Amazon S3.

Azure Blob Storage

Set the following to identify your Parquet resources stored on Azure Blob Storage:

  • ConnectionType: Set this to Azure Blob Storage.
  • URI: Set this to the name of your container and the name of the blob. For example: azureblob://mycontainer/myblob.

See Connecting to Azure Blob Storage for more information regarding how to connect and authenticate to Parquet files hosted on Amazon Blob Storage.

Azure Data Lake Storage

Set the following to identify your Parquet resources stored on Azure Data Lake Storage:

  • ConnectionType: Set this to Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, or Azure Data Lake Storage Gen2 SSL.
  • URI: Set this to the name of the file system and the name of the folder which contains your Parquet files. For example:
    • Gen 1: adl://myfilesystem/folder1
    • Gen 2: abfs://myfilesystem/folder1
    • Gen 2 SSL: abfss://myfilesystem/folder1

See Connecting to Azure Data Lake Storage for more information regarding how to connect and authenticate to Parquet files hosted on Azure Data Lake Storage.

Azure File Storage

Set the following properties to connect:

  • ConnectionType: Set this to Azure Files.
  • URI: Set this the name of your azure file share and the name of the resource. For example: azurefile://fileShare/remotePath.
  • AzureStorageAccount (Required): Set this to the account associated with the Azure file.

You can authenticate either an Azure access key or an Azure shared access signature. Set one of the following:

  • AzureAccessKey: Set this to the access key associated with the Azure file.
  • AzureSharedAccessSignature: Set this to the shared access signature associated with the Azure file.

Box

Set the following to identify your Parquet resources stored on Box:

  • ConnectionType: Set this to Box.
  • URI: Set this the name of the file system and the name of the folder which contains your Parquet files. For example: box://folder1.

See Connecting to Box for more information regarding how to connect and authenticate to Parquet files hosted on Box.

Dropbox

Set the following to identify your Parquet resources stored on Dropbox:

  • ConnectionType: Set this to Dropbox.
  • URI: Set this to the path to a folder containing Parquet files. For example: dropbox://folder1.

See Connecting to Dropbox for more information regarding how to connect and authenticate to Parquet files hosted on Dropbox.

FTP

The driver supports both plaintext and SSL/TLS connections to FTP servers.

Set the following connection properties to connect:

  • ConnectionType: Set this to either FTP or FTPS.
  • URI: Set this to the address of the server followed by the path to the folder to be used as the root folder. For example: ftp://localhost:990/folder1 or ftps://localhost:990/folder1.
  • User: Set this to your username on the FTP(S) server you want to connect to.
  • Password: Set this to your password on the FTP(S) server you want to connect to.

Google Cloud Storage

Set the following to identify your Parquet resources stored on Google Cloud Storage:

  • ConnectionType: Set this to Google Cloud Storage.
  • URI: Set this to the path to the name of the file system and the name of the folder which contains your Parquet files. For example: gs://bucket/remotePath.

See Connecting to Google Cloud Storage for more information regarding how to connect and authenticate to Parquet files hosted on Google Cloud Storage.

Google Drive

Set the following to identify your Parquet resources stored on Google Drive:

  • ConnectionType: Set this to Google Drive.
  • URI: Set to the path to the name of the file system and the name of the folder which contains your Parquet files. For example: gdrive://folder1.

See Connecting to Google Drive for more information regarding how to connect and authenticate to Parquet files hosted on Google Drive.

HDFS

Set the following to identify your Parquet resources stored on HDFS:

  • ConnectionType: Set this to HDFS or HDFS Secure.
  • URI: Set this to the path to a folder containing Parquet files. For example:
    • HDFS: webhdfs://host:port/remotePath
    • HDFS Secure: webhdfss://host:port/remotePath
    • Cloudera Ozone (via the HttpFS gateway): webhdfs://<Ozone server>:<port>/user/myuser
      • You must use Kerberos authentication to access Parquet files stored on Ozone.
      • Ensure that you have Ozone 718.2.x on the Ozone cluster.
      • Cloudera Manager version 7.10.1 is required.

There are two authentication methods available for connecting to HDFS data source, Anonymous Authentication and Negotiate (Kerberos) Authentication.

Anonymous Authentication

In some situations, you can connect to HDFS without any authentication connection properties. To do so, set the AuthScheme property to None (default).

Authenticate using Kerberos

When authentication credentials are required, you can use Kerberos for authentication. See Using Kerberos for details on how to authenticate with Kerberos.

HTTP Streams

Set the following to identify your Parquet resources stored on HTTP streams:

  • ConnectionType: Set this to HTTP or HTTPS.
  • URI: Set this to the URI of your HTTP(S) stream. For example:
    • HTTP: http://remoteStream
    • HTTPS: https://remoteStream

See Connecting to HTTP Streams for more information regarding how to connect and authenticate to Parquet files hosted on HTTP Streams.

IBM Cloud Object Storage

Set the following to identify your Parquet resources stored on IBM Cloud Object Storage:

  • ConnectionType: Set this to IBM Object Storage Source.
  • URI: Set this to the bucket and folder. For example: ibmobjectstorage://bucket1/remotePath.
  • Region: Set this property to your IBM instance region. For example: eu-gb.

See Connecting to IBM Object Storage for more information regarding how to connect and authenticate to Parquet files hosted on IBM Cloud Object Storage.

OneDrive

Set the following to identify your Parquet resources stored on OneDrive:

  • ConnectionType: Set this to OneDrive.
  • URI: Set this to the path to a folder containing Parquet files. For example: onedrive://remotePath.

See Connecting to OneDrive for more information regarding how to connect and authenticate to Parquet files hosted on OneDrive.

OneLake

Set the following to identify your Parquet resources stored on OneLake:

  • ConnectionType: Set this to OneLake.
  • URI: Set this to the name of the workspace, followed by the item and item type. Optionally, include the folder path to be used as the root folder. For example: onelake://Workspace/Test.LakeHouse/Files/CustomFolder.

See Connecting to OneLake for more information regarding how to connect and authenticate to Parquet files hosted on OneLake.

Oracle Cloud Storage

Set the following properties to authenticate with HMAC:

  • ConnectionType: Set the ConnectionType to Oracle Cloud Storage.
  • URI: Set this to the bucket and folder: os://bucket/remotePath.
  • AccessKey: Set this to an Oracle Cloud Access Key.
  • SecretKey: Set this to an Oracle Cloud Secret Key.
  • OracleNamespace: Set this to an Oracle cloud namespace.
  • Region (optional): Set this to the hosting region for your S3-like Web Services.

SFTP

Set the following to identify your Parquet resources stored on SFTP:

  • ConnectionType: Set this to SFTP.
  • URI: Set this to the address of the server followed by the path to the folder to be used as the root folder. For example: sftp://server:port/remotePath.

See Connecting to SFTP for more information regarding how to connect and authenticate to Parquet files hosted on SFTP.

SharePoint Online

Set the following to identify your Parquet resources stored on SharePoint Online:

  • ConnectionType: Set this to SharePoint REST or SharePoint SOAP.
  • URI: Set this to a document library containing Parquet files. For example:
    • SharePoint Online REST: sprest://remotePath
    • SharePoint Online SOAP: sp://remotePath

See Connecting to SharePoint Online for more information regarding how to connect and authenticate to Parquet files hosted on SharePoint Online.

Refreshing OAuth Values

The driver can refresh the temporary OAuth access tokens obtained during the browser-based OAuth authentication exchange. By default, the driver saves the encrypted tokens in the odbc.ini file corresponding to the DSN. Access to this odbc.ini file can be restricted in the case of System DSNs.

To enable the automatic token exchange, you can give the driver write access to the system odbc.ini. Or, you can set the OAuthSettingsLocation connection property to an alternate file path, to which the driver would have read and write access.

    OAuthSettingsLocation=/tmp/oauthsettings.txt
    

Installing Dependencies for OAuth Authentication

The OAuth authentication standard requires the authenticating user to interact with Parquet, using a web-browser. If the first OAuth interaction is to be done on the same machine the driver is installed on, for example, a desktop application, the driver needs access to the xdg-open program, which opens the default browser.

To satisfy this dependency, install the corresponding package with your package manager:

Debian/Ubuntu PackageRHEL/Fedora PackageFile
xdg-utilsxdg-utilsxdg-open

Set the Driver Encoding

The ODBC drivers need to specify which encoding to use with the ODBC Driver Manager. By default, the CData ODBC Drivers for Unix are configured to use UTF-16 which is compatible with unixODBC, but other Driver Managers may require alternative encoding.

Alternatively, if you are using the ODBC driver from an application that uses the ANSI ODBC API it may be necessary to set the ANSI code page. For example, to import Japanese characters in an ANSI application, you can specify the code page in the config file '/opt/cdata/cdata-odbc-driver-for-parquet/lib/cdata.odbc.parquet.ini':

[Driver]
AnsiCodePage = 932

Copyright (c) 2024 CData Software, Inc. - All rights reserved.
Build 24.0.9060