The provider can be used to create ETL applications and pipelines for HDFS data in Python using Petl.
Install Required Modules
Install the Petl module using the pip utility.
pip install petl
Import the modules, including the CData Python Connector for HDFS. You can then use the provider's connect function to create a connection using a valid HDFS connection string. A SQLAlchemy engine may also be used instead of a direct connection.
import petl as etl
import cdata.hdfs as mod
cnxn = mod.connect("Host=sandbox-hdp.hortonworks.com;Port=50070;Path=/user/root;")
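The connection string above is a series of Name=Value pairs, each terminated by a semicolon. When the host, port, and path come from configuration rather than being hard-coded, a small helper can assemble the string. The sketch below is only an illustration: build_connection_string is a hypothetical name, not part of the connector, and the final connect call is left commented out because it requires the CData connector to be installed.

```python
def build_connection_string(properties):
    """Join properties into 'Name=Value;' pairs (hypothetical helper, not a connector API)."""
    return "".join(f"{name}={value};" for name, value in properties.items())


# The property names and values are the ones used in the example above.
conn_str = build_connection_string({
    "Host": "sandbox-hdp.hortonworks.com",
    "Port": 50070,
    "Path": "/user/root",
})
print(conn_str)

# With the connector installed, the string would be passed to connect:
# cnxn = mod.connect(conn_str)
```

The helper does no escaping, so it assumes that property values do not themselves contain semicolons.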
Extract, Transform, and Load the HDFS Data
Create a SQL query string and store the query results in a Petl table.
sql = "SELECT FileId, ChildrenNum FROM Files"
table1 = etl.fromdb(cnxn, sql)
With the query results stored in a Petl table, you can load your data into any supported Petl destination. The following example loads the data into a CSV file.