Fine-Tuning Data Access
The CData JDBC Driver for XML enables the granular control useful in more complex deployments or network topologies; you can use the following connection properties to fine-tune data access, connect through a firewall, or troubleshoot connections.
Parsing and Merging Multiple Files Into A Single Table
The driver supports reading and parsing multiple files within a single directory and merging the data into a single result set.
To enable this feature, the URI property can be set to a directory with a file mask (e.g. C:\MyDataFiles\*.xml) or a directory (e.g. C:\MyDataFiles). When a directory is specified as the URI, IncludeFiles will be used to identify the file types to include.
This functionality is also available on cloud based service providers such as Google Drive (e.g. gdrive://remotepath/*.xml), Amazon S3 (e.g. s3://remotepath/*.xml), FTP (ftp://server:port/remotepath/*.xml), etc.
When setting URI to a directory, be sure to include an ending forward slash (e.g. s3://remotepath/) to denote that its a directory.
A comma-separated list of URIs is supported to include multiple files.
To retrieve all files (including those without a file extension), you can use a file mask of '*' (e.g. s3://remotepath/*) or set IncludeFiles to include a '*' entry.
To include just files without an extension, you can set IncludeFiles to include a 'NOEXT' entry.
When parsing a batch of files, the files are put into a queue.
Each file will be retrieved and parsed in a streaming fashion with each row pushed as it is parsed.
Therefore files will never be stored in memory or stored in a temporary location on disk,
thus limiting the amount of memory usage required.
Metadata Discovery
When reading multiple files, the driver will use the first file identified to discovery metadata.
A single file is used for metadata discovery for performance reasons.
In cases where the first identified file contains partial XML data
(as compared to the other files in the directory), columns/data for the other files may not be returned.
This is due to the driver not being able to properly identify all the columns available in the selected files from the single file used for
metadata discovery.
To work around these cases, a "MetadataDiscoveryURI" option can be set via the Other property (e.g. MetadataDiscoveryURI=file:///C:\MyDataFiles\main.xml).
When a MetadataDiscoveryURI is specified, the driver will
use the specified URI to discover tables and columns.
Once the metadata has been discovered, the URI value will be used to retrieve and parse the XML data.
Error Handling
When parsing multiple files, there may be cases where a file is not able to be parsed (such as invalid XML, a network connectivity issue, etc.).
In such cases, the driver will not return an exception message but rather will log the error in a
temporary table called ErrorInfo#TEMP.
This is done so that failures don't occur in the middle of reading a large batch of files and having to start over.
Instead the temporary table (ErrorInfo#TEMP) can be queried after the initial query has finished.
This will allow you to identify if there were any files that failed to parse, thus allowing you to retry the failed requests and merging them with your primary result set.
To retrieve the error list, you can issue the following query: SELECT * FROM ErrorInfo#TEMP.
This table contains two columns: URI and Description.
URI contains the URI for the single file that failed (which can be set via the URI property to retry).
Description contains the description as to why the file failed to parse. In the case that no errors occurred, an empty result set will be returned.
Navigating Subdirectories
The driver supports navigating subdirectories when parsing local files.
This functionality is exposed by using the '*' wildcard character in the directory name of the URI value.
For example, suppose you have a 'Data' directory that contains folders with unique names (such as a date value) each of which contain similar XML data files.
You can read all these files into a single table by setting URI
to 'file:///C:\Data\*\*.xml' or you can set it to the directory without a file mask 'file:///C:\Data\*'.
Partial directory matching is also supported.
For example, to retrieve all XML files within folders starting with '2018' the following URI value can be used: file:///C:\Data\2018*\*.xml.
Note: Using wildcards in directory names is not applicable when connecting to cloud resources.