Fine-Tuning Data Access
The CData Python Connector for XML enables the kind of granular control that is useful in more complex deployments or network topologies; you can use the following connection properties to fine-tune data access, connect through a firewall, or troubleshoot connections.
Resource location
The URI should be used to specify an XML resource location. Set the URI property to specify one of the following sources:- An empty value automatically assigns the URI to a reference to the current directory, "./". The explicit path to the XML folders depends on the environment of the running application.
- A path to a folder.
- A path to a .zip, .tar. or .gz archive file.
- Include the file, not just the containing directory. For example: C:\Users\Public\Documents\CSVdata.zip
- A path to a file or stream - in this case you can query the file by executing SELECT * FROM streamedtable.
Modeling tables
Set the following properties to control how the connector models XML as tables:
- IncludeFiles: Set this to a comma-separated list of file extensions to include into the set of files modeled as tables. (By default, .xml and .txt files are modeled.)
- Specify files by their file extensions in all-caps, without the '.'. For example: "XML,TXT".
- Archive files are supported (ZIP, TAR, and GZ) and are modeled as if they were folders.
- RowScanDepth: Set this to automatically determine data types by scanning rows up to the specified depth.
Parsing and Merging Multiple Files Into A Single Table
The connector supports reading and parsing multiple files within a single directory and merging the data into a single result set.
To enable this feature, the URI property can be set to a directory with a file mask (e.g. C:\MyDataFiles\*.xml) or a directory (e.g. C:\MyDataFiles). When a directory is specified as the URI, IncludeFiles will be used to identify the file types to include.
This functionality is also available on cloud based service providers such as Google Drive (e.g. gdrive://remotepath/*.xml), Amazon S3 (e.g. s3://remotepath/*.xml), FTP (ftp://server:port/remotepath/*.xml), etc.
When setting URI to a directory, be sure to include an ending forward slash (e.g. s3://remotepath/) to denote that its a directory.
A comma-separated list of URIs is supported to include multiple files.
To retrieve all files (including those without a file extension), you can use a file mask of '*' (e.g. s3://remotepath/*) or set IncludeFiles to include a '*' entry.
To include just files without an extension, you can set IncludeFiles to include a 'NOEXT' entry.
When parsing a batch of files, the files are put into a queue.
Each file will be retrieved and parsed in a streaming fashion with each row pushed as it is parsed.
Therefore files will never be stored in memory or stored in a temporary location on disk,
thus limiting the amount of memory usage required.
Metadata Discovery
When reading multiple files, the connector will use the first file identified to discovery metadata.
A single file is used for metadata discovery for performance reasons.
In cases where the first identified file contains partial XML data
(as compared to the other files in the directory), columns/data for the other files may not be returned.
This is due to the connector not being able to properly identify all the columns available in the selected files from the single file used for
metadata discovery.
To work around these cases, a "MetadataDiscoveryURI" option can be set via the Other property (e.g. MetadataDiscoveryURI=file:///C:\MyDataFiles\main.xml).
When a MetadataDiscoveryURI is specified, the connector will
use the specified URI to discover tables and columns.
Once the metadata has been discovered, the URI value will be used to retrieve and parse the XML data.
Error Handling
When parsing multiple files, there may be cases where a file is not able to be parsed (such as invalid XML, a network connectivity issue, etc.).
In such cases, the connector will not return an exception message but rather will log the error in a
temporary table called ErrorInfo#TEMP.
This is done so that failures don't occur in the middle of reading a large batch of files and having to start over.
Instead the temporary table (ErrorInfo#TEMP) can be queried after the initial query has finished.
This will allow you to identify if there were any files that failed to parse, thus allowing you to retry the failed requests and merging them with your primary result set.
To retrieve the error list, you can issue the following query: SELECT * FROM ErrorInfo#TEMP.
This table contains two columns: URI and Description.
URI contains the URI for the single file that failed (which can be set via the URI property to retry).
Description contains the description as to why the file failed to parse. In the case that no errors occurred, an empty result set will be returned.
Navigating Subdirectories
The connector supports navigating subdirectories when parsing local files.
This functionality is exposed by using the '*' wildcard character in the directory name of the URI value.
For example, suppose you have a 'Data' directory that contains folders with unique names (such as a date value) each of which contain similar XML data files.
You can read all these files into a single table by setting URI
to 'file:///C:\Data\*\*.xml' or you can set it to the directory without a file mask 'file:///C:\Data\*'.
Partial directory matching is also supported.
For example, to retrieve all XML files within folders starting with '2018' the following URI value can be used: file:///C:\Data\2018*\*.xml.
Note: Using wildcards in directory names is not applicable when connecting to cloud resources.