From Pandas
When combined with the connector, Pandas can be used to generate data frames that contain your HDFS data. Once created, a data frame can be passed to various other Python packages.
Connecting
Pandas relies on an SQLAlchemy engine to execute queries. Before you can query with Pandas, import it together with create_engine from SQLAlchemy, then create an engine from your connection string:

    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("hdfs:///?Host=sandbox-hdp.hortonworks.com;Port=50070;Path=/user/root;")
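The connection string above appears to be a list of semicolon-terminated Key=Value properties following the hdfs:///? prefix. As a minimal sketch, it could be assembled from separate settings as shown here. The helper name build_hdfs_url is hypothetical and not part of the connector, and the sketch assumes the values need no escaping.

```python
def build_hdfs_url(host: str, port: int, path: str) -> str:
    """Assemble a connection string in the shape shown above (hypothetical helper)."""
    properties = {"Host": host, "Port": port, "Path": path}
    # Each property is written as Key=Value and terminated with a semicolon.
    joined = "".join(f"{key}={value};" for key, value in properties.items())
    return "hdfs:///?" + joined


url = build_hdfs_url("sandbox-hdp.hortonworks.com", 50070, "/user/root")
print(url)
```

The resulting string can then be passed to create_engine() in place of the literal.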
Querying Data
In Pandas, SELECT queries are passed to the read_sql() function together with a connection object, here the engine created above. Pandas executes the query on that connection and returns the results as a data frame.

    df = pd.read_sql("""
        SELECT FileId, ChildrenNum
        FROM Files;""", engine)
    print(df)
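To illustrate what can be done with the returned data frame, the sketch below runs the same kind of query and then filters, sorts, and aggregates the result. So that it runs without the connector, it assumes an in-memory SQLite database as a stand-in for the HDFS engine. The Files table contents are made-up sample values, not real HDFS metadata.

```python
import sqlite3

import pandas as pd

# Stand-in for the HDFS engine: an in-memory SQLite database with a Files table.
# The rows below are invented sample values for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Files (FileId INTEGER, ChildrenNum INTEGER)")
conn.executemany(
    "INSERT INTO Files (FileId, ChildrenNum) VALUES (?, ?)",
    [(101, 0), (102, 3), (103, 7)],
)
conn.commit()

# Same call shape as in the connector example, with conn in place of engine.
df = pd.read_sql("SELECT FileId, ChildrenNum FROM Files", conn)

# Ordinary data frame operations apply to the result.
non_empty = df[df["ChildrenNum"] > 0].sort_values("ChildrenNum", ascending=False)
total_children = int(df["ChildrenNum"].sum())

print(non_empty)
print(total_children)
```

With the connector installed, passing the engine from the Connecting section instead of conn should leave the rest of the code unchanged.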