ADO.NET Provider for Apache Kafka

Build 24.0.9060

Data Model

Tables

The CData ADO.NET Provider for Apache Kafka dynamically models Apache Kafka topics as tables. A complete list of discovered topics can be obtained from the sys_tables system table.
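
For example, the following C# sketch lists the discovered topics. The ApacheKafkaConnection and ApacheKafkaCommand class names and the System.Data.CData.ApacheKafka namespace are assumptions based on typical CData provider naming; consult the provider's API reference for the exact types.

using System;
using System.Data.CData.ApacheKafka; // assumed namespace

class ListTopics
{
    static void Main()
    {
        // The BootstrapServers value is a placeholder for your broker address.
        using (var conn = new ApacheKafkaConnection("BootstrapServers=localhost:9092;"))
        {
            conn.Open();
            var cmd = new ApacheKafkaCommand("SELECT * FROM sys_tables", conn);
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // The TableName column name is an assumption.
                    Console.WriteLine(reader["TableName"]);
                }
            }
        }
    }
}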

SELECTing from a topic returns the messages already on the topic, as well as live messages posted before the number of seconds specified by ReadDuration has elapsed.
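
As a sketch, the query below reads the existing messages on a hypothetical SampleTopic topic and then listens for live messages for 30 seconds (class names assumed as in the previous sketch):

// Assumes the usings and class names from the previous sketch.
// ReadDuration=30 keeps the consumer listening for 30 seconds of live messages.
using (var conn = new ApacheKafkaConnection(
    "BootstrapServers=localhost:9092;ReadDuration=30;"))
{
    conn.Open();
    var cmd = new ApacheKafkaCommand("SELECT * FROM SampleTopic", conn);
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // The Message column name is an assumption; the actual columns
            // depend on the topic's serialization format.
            Console.WriteLine(reader["Message"]);
        }
    }
}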

Stored Procedures

Stored Procedures are function-like interfaces to Apache Kafka. They can be used to create schema files, commit messages, and more.
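
For example, a stored procedure can be invoked through the standard ADO.NET command interface. The sketch below is hypothetical: the CreateSchema procedure name and its TableName parameter are placeholders, so check the provider's stored procedure reference for the actual names.

// Assumes conn is an open connection from the earlier sketches.
var cmd = conn.CreateCommand();
cmd.CommandType = System.Data.CommandType.StoredProcedure;
cmd.CommandText = "CreateSchema";      // hypothetical procedure name
var param = cmd.CreateParameter();
param.ParameterName = "@TableName";    // hypothetical parameter name
param.Value = "SampleTopic";
cmd.Parameters.Add(param);
cmd.ExecuteNonQuery();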

Consumer Groups

Connections that the provider makes to Apache Kafka are always part of a consumer group. You can control the consumer group by setting a value for the ConsumerGroupId connection property. Using the same consumer group ID across multiple connections puts those connections into the same consumer group. The provider generates a random consumer group ID if one is not provided.
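
For instance, here is a sketch of two connections joining the same consumer group (the group ID is a placeholder, and class names are assumed as above):

// Both connections use ConsumerGroupId=reporting-group, so they consume
// as members of one consumer group and share its committed offsets.
var connStr = "BootstrapServers=localhost:9092;ConsumerGroupId=reporting-group;";
using (var connA = new ApacheKafkaConnection(connStr))
using (var connB = new ApacheKafkaConnection(connStr))
{
    connA.Open();
    connB.Open();
    // Queries issued on either connection advance the group's offsets.
}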

All members of a consumer group share an offset for each topic and partition that determines which messages are read next. The provider supports two ways of updating the offset:

  • If AutoCommit is enabled, the provider periodically commits the offset for any topics and partitions that have been read by SELECT queries. The exact interval is determined by the auto-commit properties in the native library. See ConsumerProperties for details on how to configure these properties.
  • The CommitOffset stored procedure stores the offset of the last item read by the current query. Note that this must be called while the query resultset is still open; the provider resets the offset when the resultset is closed. A sketch of this call appears after this list.
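
Below is a sketch of the CommitOffset call made while the resultset is still open. The EXEC CommitOffset syntax and the ability to issue a second command on the same open connection are assumptions; class names are assumed as above.

// AutoCommit=false so the offset is only stored by the explicit call.
using (var conn = new ApacheKafkaConnection(
    "BootstrapServers=localhost:9092;ConsumerGroupId=reporting-group;AutoCommit=false;"))
{
    conn.Open();
    var query = new ApacheKafkaCommand("SELECT * FROM SampleTopic", conn);
    using (var reader = query.ExecuteReader())
    {
        while (reader.Read())
        {
            // ...process each row...
        }
        // The reader is still open here, so the offset of the last row
        // read can be committed.
        new ApacheKafkaCommand("EXEC CommitOffset", conn).ExecuteNonQuery();
    }
}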

If there is no existing offset, the provider uses the OffsetResetStrategy to determine what the offset should be. This may happen if the broker does not recognize the consumer group or if the consumer group never committed an offset.
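
For example, a connection string like this sketch would start a new consumer group from the earliest available message (the Earliest value is an assumption; check the documented values for OffsetResetStrategy):

// With no committed offset for new-group, reads begin at the earliest message.
var connStr = "BootstrapServers=localhost:9092;" +
              "ConsumerGroupId=new-group;OffsetResetStrategy=Earliest;";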

Bulk Messages

The provider supports reading bulk messages from topics using the CSV, JSON, or XML SerializationFormat. When the provider reads CSV data like the following block, it splits the CSV and outputs each line as a separate row. The values of other columns, such as the partition, timestamp, and key, are the same across all rows produced from the message; see the sketch after the sample data.

"1","alpha"
"2","beta"
"3","gamma"

Bulk messages are not supported for key values. When MessageKeyType is set to a bulk format, the provider reads only the first row of the key and ignores the rest. For example, when the provider reads the above CSV data as a message key, the entries on the alpha row are repeated across every bulk row from the message value; the entries on the beta and gamma rows are lost.

Bulk Limitations

Apache Kafka does not natively support bulk messages, which can lead to rows being skipped in some circumstances. For example:

  1. A provider connection is created with ConsumerGroupId=x.
  2. The connection executes the query SELECT * FROM topic LIMIT 3.
  3. The connection commits its offset and closes.
  4. Another connection is created with the same ConsumerGroupId.
  5. The connection executes the query SELECT * FROM topic.

Consider what happens if this procedure is performed on the following topic. The first connection consumes both rows of the first message and one row of the second. However, the provider has no way to report to Apache Kafka that only part of the second message was read. As a result, step 3 commits offset 3 and the second connection starts on row 5, skipping row 4. The scenario is sketched in code after the topic listing.

"row 1"
"row 2"
/* End of message 1 */

"row 3"
"row 4"
/* End of message 2 */

"row 5"
"row 6"
/* End of message 3 */
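
The following sketch walks through the same five steps in code (class names are assumptions as in the earlier sketches):

var connStr = "BootstrapServers=localhost:9092;ConsumerGroupId=x;SerializationFormat=CSV;";

// Steps 1-3: read three rows, then commit the offset and close.
using (var first = new ApacheKafkaConnection(connStr))
{
    first.Open();
    var cmd = new ApacheKafkaCommand("SELECT * FROM topic LIMIT 3", first);
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read()) { /* rows 1, 2, and 3 */ }
    }
    // Offset 3 is committed even though message 2 was only partially read.
}

// Steps 4-5: a second connection in the same group resumes at message 3.
using (var second = new ApacheKafkaConnection(connStr))
{
    second.Open();
    var cmd = new ApacheKafkaCommand("SELECT * FROM topic", second);
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read()) { /* rows 5 and 6; row 4 is skipped */ }
    }
}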
