Cmdlets for Apache Hive

Build 23.0.8839

Establishing a Connection

With the CData Cmdlets users can install a data module, set the connection properties, and start scripting. This section provides examples of using our ApacheHive Cmdlets with native PowerShell cmdlets, like the CSV import and export cmdlets.

Installing and Connecting

If you have PSGet, installing the cmdlets can be accomplished from the PowerShell Gallery with the following command. You can also obtain a setup from the CData site.

Install-Module ApacheHiveCmdlets

The following line is then added to your profile, loading the cmdlets on the next session:

Import-Module ApacheHiveCmdlets;

You can then use the Connect-ApacheHive cmdlet to create a connection object that can be passed to other cmdlets:

$conn = Connect-ApacheHive -Server "127.0.0.1" -Port "10000" -Transportmode "BINARY"

Connecting to Apache Hive

Apache Hive supports connections from instances that are self-hosted, hosted on Amazon EMR, hosted on Azure HDInsight, or hosted on Google DataProc.

Self-hosted Instance

To connect to Apache Hive from a self-hosted instance, set these connection parameters:

  • TransportMode: The transport mode to use to communicate with the Hive server. Legal entries are BINARY (default) and HTTP.
  • Server: The host name or IP address of the server hosting HiveServer2.
  • Port: The port for the connection to the HiveServer2 instance.
  • UseSSL (optional): Set to enable TLS/SSL.

Amazon EMR

To connect from an Amazon EMR-hosted Apache Hive instance, you must first create an SSH tunnel to EMR. When that is done, you are ready to connect.

Do the following:

  1. Create the SSH tunnel to EMR:
    1. Ensure that you have access to an active EMR cluster and an EC2 key pair. The key pair can be in .ppk or .pem format.
    2. Authorize inbound traffic in your cluster settings.
  2. After an active tunnel session to EMR is established, set these connection parameters:
    • Server: The master node (master-public-dns-name) where the Apache Hive server is running.
    • Port: The port required to connect to Apache Hive.
    • UseSSH: True.
    • SSHServer: The master node (master-public-dns-name).
    • SSHPort: 22.
    • SSHAuthMode: PUBLIC_KEY.
    • SSHUser: hadoop.
    • SSHClientCert: The full path to the key file.
    • SSHClientCertType: The type that corresponds to the key file. Typically either PEMKEY_FILE or PPKFILE.

Hadoop Cluster on Azure HDInsight

To establish a connection to a Hadoop cluster hosted on Azure HDInsight, set these connection parameters:

  • User: The cluster username you specified when creating the cluster on Azure HDInsight.
  • Password: The cluster password you specified when creating the cluster on Azure HDInsight.
  • Server: The server corresponding to your cluster. For example: myclustername.azurehdinsight.net.
  • Port: The port running HiveServer2 (443 is the default).
  • HTTPPath: The HTTP path for the hive2 service (hive2 by default).
  • TransportMode: HTTP.
  • UseSSL: True.
  • QueryPassthrough (optional): True to bypass the SQL engine of the cmdlet and execute HiveQL queries directly to Apache Hive.

Google DataProc

To connect from an Google DataProc instance, you must ensure that the Apache Hive server on DataProc is properly configured, then build an SSH tunnel to the hive cluster web interface. After those things are done, you are ready to connect.

Do the following:

  1. Ensure that the Apache Hive server on DataProc was created with the DataProc Component Gateway enabled.
  2. To obtain the external IP address of the Hive Cluster, load up the Cloud Shell and list the instances.
    gcloud compute instances list

    Note the external IP of the relevant machine.

  3. Build an SSH tunnel to the Hive cluster web interface:
    1. Navigate to the Hive cluster on DataProc and select the WEB INTERFACES tab.
    2. Select Create an SSH tunnel to connect to a web interface. A cloud console command displays.
      Use this console to create an SSH key pair. Download the private key from the directory specified in the console.
    3. Configure the SSH tunnel in an SSH utility:
      • Host Name: The external IP noted above.
      • Port: 22
      • Point the tool to your private SSH key.
      • For the Tunnel, map an open port to localhost:10000. The server properly resolves localhost.
  4. Now you are ready to connect to Apache Hive on Google DataProc. Set these connection parameters:
    • TransportMode: BINARY.
    • AuthScheme: Plain.
    • Port: The chosen SSH Tunnel port on the local machine.

Authenticating to Apache Hive

Apache Hive supports PLAIN SASL, LDAP, NOSASL, and Kerberos authentication.

PLAIN SASL

To authenticate to Apache Hive with PLAIN SASL, set the hive.server2.authentication property in your Hive configuration file (hive-site.xml) to None and set the following cmdlet connection properties:

  • AuthScheme: PLAIN
  • User The user logging in. If no name is supplied, the login user is set to "anonymous".
  • Password The user's password. If no password is supplied, the password is set to "anonymous".

LDAP

To authenticate to Apache Hive with LDAP, set the hive.server2.authentication property in your Hive configuration file (hive-site.xml) to LDAP and set the following connection properties in the cmdlet.
  • AuthScheme: LDAP
  • User The user logging in.
  • Password The user's password.

NOSASL

Authenticate to Apache Hive using NOSASL by setting both the AuthScheme cmdlet connection property and the hive.server2.authentication property in your Hive configuration file (hive-site.xml) to NOSASL

Kerberos

To authenticate to Apache Hive with Kerberos, set AuthScheme to KERBEROS.

Authenticating to Apache Hive via Kerberos requires you to define authentication properties and to choose how Kerberos should retrieve authentication tickets.

Retrieve Kerberos Tickets

Kerberos tickets are used to authenticate the requester's identity. The use of tickets instead of formal logins/passwords eliminates the need to store passwords locally or send them over a network. Users are reauthenticated (tickets are refreshed) whenever they log in at their local computer or enter kinit USER at the command prompt.

The cmdlet provides three ways to retrieve the required Kerberos ticket, depending on whether or not the KRB5CCNAME and/or KerberosKeytabFile variables exist in your environment.

MIT Kerberos Credential Cache File

This option enables you to use the MIT Kerberos Ticket Manager or kinit command to get tickets. With this option there is no need to set the User or Password connection properties.

This option requires that KRB5CCNAME has been created in your system.

To enable ticket retrieval via MIT Cerberos Credential Cache Files:

  1. Ensure that the KRB5CCNAME variable is present in your environment.
  2. Set KRB5CCNAME to a path that points to your credential cache file. (For example, C:\krb_cache\krb5cc_0 or /tmp/krb5cc_0.) The credential cache file is created when you use the MIT Kerberos Ticket Manager to generate your ticket.
  3. To obtain a ticket:
    1. Open the MIT Kerberos Ticket Manager application.
    2. Click Get Ticket.
    3. Enter your principal name and password.
    4. Click OK.

    If the ticket is successfully obtained, the ticket information appears in Kerberos Ticket Manager and is stored in the credential cache file.

The cmdlet uses the cache file to obtain the Kerberos ticket to connect to Apache Hive.

Note: If you would prefer not to edit KRB5CCNAME, you can use the KerberosTicketCache property to set the file path manually. After this is set, the cmdlet uses the specified cache file to obtain the Kerberos ticket to connect to Apache Hive.

Keytab File

If your environment lacks the KRB5CCNAME environment variable, you can retrieve a Kerberos ticket using a Keytab File.

To use this method, set the User property to the desired username, and set the KerberosKeytabFile property to a file path pointing to the keytab file associated with the user.

User and Password

If your environment lacks the KRB5CCNAME environment variable and the KerberosKeytabFile property has not been set, you can retrieve a ticket using a user and password combination.

To use this method, set the User and Password properties to the user/password combination that you use to authenticate with Apache Hive.

Enabling Cross-Realm Authentication

More complex Kerberos environments can require cross-realm authentication where multiple realms and KDC servers are used. For example, they might use one realm/KDC for user authentication, and another realm/KDC for obtaining the service ticket.

To enable this kind of cross-realm authentication, set the KerberosRealm and KerberosKDC properties to the values required for user authentication. Also, set the KerberosServiceRealm and KerberosServiceKDC properties to the values required to obtain the service ticket.

Retrieving Data

The Select-ApacheHive cmdlet provides a native PowerShell interface for retrieving data:

$results = Select-ApacheHive -Connection $conn -Table "[CData].[Default].Customers" -Columns @("City, CompanyName") -Where "Country='US'"
The Invoke-ApacheHive cmdlet provides an SQL interface. This cmdlet can be used to execute an SQL query via the Query parameter.

Piping Cmdlet Output

The cmdlets return row objects to the pipeline one row at a time. The following line exports results to a CSV file:

Select-ApacheHive -Connection $conn -Table [CData].[Default].Customers -Where "Country = 'US'" | Select -Property * -ExcludeProperty Connection,Table,Columns | Export-Csv -Path c:\my[CData].[Default].CustomersData.csv -NoTypeInformation

You will notice that we piped the results from Select-ApacheHive into a Select-Object cmdlet and excluded some properties before piping them into an Export-CSV cmdlet. We do this because the CData Cmdlets append Connection, Table, and Columns information onto each row object in the result set, and we do not necessarily want that information in our CSV file.

However, this makes it easy to pipe the output of one cmdlet to another. The following is an example of converting a result set to JSON:

 
PS C:\> $conn  = Connect-ApacheHive -Server "127.0.0.1" -Port "10000" -Transportmode "BINARY"
PS C:\> $row = Select-ApacheHive -Connection $conn -Table "[CData].[Default].Customers" -Columns (City, CompanyName) -Where "Country = 'US'" | select -first 1
PS C:\> $row | ConvertTo-Json
{
  "Connection":  {

  },
  "Table":  "[CData].[Default].Customers",
  "Columns":  [

  ],
  "City":  "MyCity",
  "CompanyName":  "MyCompanyName"
} 

Deleting Data

The following line deletes any records that match the criteria:

Select-ApacheHive -Connection $conn -Table [CData].[Default].Customers -Where "Country = 'US'" | Remove-ApacheHive

Modifying Data

The cmdlets make data transformation easy as well as data cleansing. The following example loads data from a CSV file into Apache Hive, checking first whether a record already exists and needs to be updated instead of inserted.

Import-Csv -Path C:\My[CData].[Default].CustomersUpdates.csv | %{
  $record = Select-ApacheHive -Connection $conn -Table [CData].[Default].Customers -Where ("_id = `'"+$_._id+"`'")
  if($record){
    Update-ApacheHive -Connection $conn -Table [CData].[Default].Customers -Columns @("City","CompanyName") -Values @($_.City, $_.CompanyName) -Where "_id  = `'$_._id`'"
  }else{
    Add-ApacheHive -Connection $conn -Table [CData].[Default].Customers -Columns @("City","CompanyName") -Values @($_.City, $_.CompanyName)
  }
}

Copyright (c) 2024 CData Software, Inc. - All rights reserved.
Build 23.0.8839