Extracting Metadata From Topics
Reading Apache Kafka Data
Reads in Apache Kafka don't have a natural stopping point. To avoid perpetual read operations, items are read until the ReadDuration or Timeout expires. ReadDuration is set to 30 seconds by default.
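The idea of a time-bounded read can be sketched in plain Python. This is only an illustration of the concept, not the driver's implementation: `read_bounded`, its `clock` parameter, and the in-memory `source` are hypothetical names introduced here, and only the 30-second default comes from the text above.

```python
import time
from typing import Any, Callable, Iterable, List


def read_bounded(
    source: Iterable[Any],
    read_duration: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
) -> List[Any]:
    """Collect items from an unbounded source until read_duration seconds pass.

    The deadline is checked each time an item arrives, so a source that blocks
    while waiting for the next item would also need its own timeout.
    """
    deadline = clock() + read_duration
    items: List[Any] = []
    for item in source:
        if clock() >= deadline:
            break  # stop instead of reading forever
        items.append(item)
    return items


# A short in-memory source finishes well before the deadline.
messages = read_bounded(["m1", "m2", "m3"], read_duration=30.0)
```

Passing the clock as a parameter keeps the sketch easy to test without waiting for real time to elapse.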
The driver models topics as tables and messages as rows.
The driver determines the schema of each topic in one of two ways:
- For services that contain a schema registry, such as Confluent and AWS hosted instances, the schema is read directly from the schema registry.
- For services that do not contain a schema registry, the schema is inferred by the driver.
Schema Registry
Set the following to connect to a service with a schema registry:
- BootstrapServers: The server (hostname or IP address) and port (in the format server:port) of the Apache Kafka bootstrap servers.
- TypeDetectionScheme: Set to SchemaRegistry.
- RegistryAuthScheme: Set to the appropriate authentication method. See the following sections for details.
- RegistryService: The schema registry service used to read topic schemas. The options are Confluent and AWSGlue.
- RegistryUrl: Set to the server for the schema registry.
The schema registry contains a list of topics that have registered schemas. The list of tables and columns is read directly from the schema registry.
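As a rough sketch, the properties listed above could be assembled into a connection string like this. The semicolon-delimited `Key=Value` format is an assumption about how the driver accepts its settings, `build_connection_string` is a hypothetical helper, and the host names are placeholders.

```python
from typing import Dict


def build_connection_string(props: Dict[str, str]) -> str:
    """Join properties into an assumed 'Key=Value;' connection string."""
    for key, value in props.items():
        if ";" in value:
            raise ValueError(f"{key} must not contain ';'")
    return ";".join(f"{key}={value}" for key, value in props.items()) + ";"


# Placeholder values; replace them with the details of your own deployment.
registry_props = {
    "BootstrapServers": "broker.example.com:9092",
    "TypeDetectionScheme": "SchemaRegistry",
    "RegistryAuthScheme": "None",
    "RegistryService": "Confluent",
    "RegistryUrl": "https://registry.example.com",
}

connection_string = build_connection_string(registry_props)
```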
Confluent Schema Registry
When you connect to Confluent Cloud, the RegistryUrl corresponds to the Schema Registry endpoint value in Schemas > Schema Registry > Instructions.
The Confluent schema registry supports several authentication options. Confluent Cloud deployments will typically require RegistryAuthScheme to be set to Basic, along with a RegistryUser and RegistryPassword. These can be found by navigating to Schemas > Schema Registry > API Access and finding the access key and secret key values.
On-premise deployments may not require authentication; in these configurations, set RegistryAuthScheme to None. They may also require SSL client certificates, which can be configured by setting RegistryAuthScheme to SSLCertificate and supplying the RegistryClientCert and RegistryClientCertType options.
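The three Confluent cases described above can be summarized as a small lookup. This is a sketch only: `confluent_registry_auth` is a hypothetical helper, the property names are taken from the text, and the credential values are placeholders.

```python
from typing import Dict, Optional


def confluent_registry_auth(
    scheme: str,
    user: Optional[str] = None,
    password: Optional[str] = None,
    client_cert: Optional[str] = None,
    client_cert_type: Optional[str] = None,
) -> Dict[str, str]:
    """Return the registry authentication properties for a given scheme."""
    if scheme == "None":
        # On-premise deployment without authentication.
        return {"RegistryAuthScheme": "None"}
    if scheme == "Basic":
        # Typical for Confluent Cloud: access key and secret key.
        if not user or not password:
            raise ValueError("Basic requires RegistryUser and RegistryPassword")
        return {
            "RegistryAuthScheme": "Basic",
            "RegistryUser": user,
            "RegistryPassword": password,
        }
    if scheme == "SSLCertificate":
        # On-premise deployment that requires an SSL client certificate.
        if not client_cert or not client_cert_type:
            raise ValueError(
                "SSLCertificate requires RegistryClientCert and RegistryClientCertType"
            )
        return {
            "RegistryAuthScheme": "SSLCertificate",
            "RegistryClientCert": client_cert,
            "RegistryClientCertType": client_cert_type,
        }
    raise ValueError(f"Unsupported RegistryAuthScheme: {scheme}")


# Placeholder credentials for a Confluent Cloud style setup.
cloud_auth = confluent_registry_auth("Basic", user="ACCESS_KEY", password="SECRET_KEY")
```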
AWS Glue Schema Registry
When connecting to AWS Glue, the RegistryUrl corresponds to the ARN value of the registry.
The AWS Glue schema registry only supports the Basic RegistryAuthScheme. RegistryUser and RegistryPassword should be set to the access key and secret key of a user with access to the registry.
No Schema Registry
Set the following to connect to a service without a schema registry:
- BootstrapServers: The server (hostname or IP address) and port (in the format server:port) of the Apache Kafka bootstrap servers.
- TypeDetectionScheme: Set to RowScan.
Schema discovery is performed as follows:
- An attempt is made to autodetect the format (AVRO/JSON/XML/CSV). This can also be set explicitly with the SerializationFormat property.
- Once the format is known, rows from the topic are analyzed. A higher RowScanDepth increases accuracy but may decrease performance. The driver begins reading at the current offset (configurable via the OffsetResetStrategy property). From this point, future SELECTs will start from the beginning.
- Deserialization is performed based on the determined serialization format, completing schema discovery.
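To illustrate why scan depth affects accuracy, here is a minimal sketch of row-scan inference over JSON messages. It is not the driver's algorithm: `infer_schema`, the type names, and the widening rules are all assumptions made for this example, and only the notion of a scan depth comes from the text above.

```python
import json
from typing import Any, Dict, Iterable


def _type_name(value: Any) -> str:
    """Map a decoded JSON value to an illustrative column type name."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_schema(messages: Iterable[str], row_scan_depth: int) -> Dict[str, str]:
    """Infer column types from at most row_scan_depth JSON object messages."""
    columns: Dict[str, str] = {}
    for index, raw in enumerate(messages):
        if index >= row_scan_depth:
            break
        for name, value in json.loads(raw).items():
            if value is None:
                columns.setdefault(name, "unknown")
                continue
            observed = _type_name(value)
            previous = columns.get(name, "unknown")
            if previous in ("unknown", observed):
                columns[name] = observed
            elif {previous, observed} == {"integer", "double"}:
                columns[name] = "double"  # widen mixed numeric columns
            else:
                columns[name] = "string"  # fall back on conflicting types
    return columns


sample = [
    '{"id": 1, "price": 2}',
    '{"id": 2, "price": 2.5, "note": "x"}',
    '{"id": "abc"}',
]
shallow = infer_schema(sample, row_scan_depth=2)
deep = infer_schema(sample, row_scan_depth=3)
```

In this sample, the shallow scan types `id` as an integer, while the deeper scan sees the third message and falls back to a string. That is the accuracy and performance trade-off in miniature.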