Cmdlets for Spark SQL

Build 24.0.9060

Establishing a Connection

With the CData Cmdlets, you can install a data module, set the connection properties, and start scripting. This section provides examples of using the SparkSQL Cmdlets with native PowerShell cmdlets, such as the CSV import and export cmdlets.

Installing and Connecting

If you have PSGet, you can install the cmdlets from the PowerShell Gallery with the following command. You can also obtain a setup from the CData site.

Install-Module SparkSQLCmdlets

The following line is then added to your profile, loading the cmdlets on the next session:

Import-Module SparkSQLCmdlets;

You can then use the Connect-SparkSQL cmdlet to create a connection object that can be passed to other cmdlets:

$conn = Connect-SparkSQL -Server '127.0.0.1'

Connecting to Spark SQL

Specify the following to establish a connection with Spark SQL:

  • Server: The host name or IP address of the server hosting SparkSQL.
  • Port: The port for the connection to the SparkSQL instance.
  • TransportMode: The transport mode to use to communicate with the SparkSQL server. Accepted entries are BINARY and HTTP. BINARY is selected by default.
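
As a minimal sketch, the three properties above can be collected in a hashtable and passed with PowerShell splatting. This assumes Connect-SparkSQL accepts Port and TransportMode as parameters in the same way it accepts Server. The port value 10000 is only a placeholder (it is the usual default for the Spark Thrift Server); substitute the port your instance listens on.

```powershell
# Connection settings gathered in one place for splatting.
# Assumption: the parameter names mirror the connection property names.
$connParams = @{
    Server        = '127.0.0.1'
    Port          = 10000       # placeholder; use your server's port
    TransportMode = 'BINARY'    # or 'HTTP'
}

# Splat the hashtable onto the cmdlet.
$conn = Connect-SparkSQL @connParams
```

Keeping the settings in a hashtable makes it easy to add or remove properties without rewriting a long command line.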

Securing Spark SQL Connections

To enable TLS/SSL in the cmdlet, set UseSSL to True.
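
A hedged example of what this might look like on the command line, assuming UseSSL is exposed as a parameter of Connect-SparkSQL. The colon syntax binds the value explicitly, which works regardless of whether the parameter is declared as a switch, a Boolean, or a string.

```powershell
# Assumption: UseSSL is available as a cmdlet parameter named after the connection property.
$conn = Connect-SparkSQL -Server '127.0.0.1' -UseSSL:$true
```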

Authenticating to Spark SQL

You can authenticate to the service using the PLAIN, LDAP, NOSASL, or KERBEROS auth schemes.

PLAIN

To authenticate with PLAIN, set the following connection properties:

  • AuthScheme: PLAIN.
  • User: The user to log in as.
  • Password: The password of the user.
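
The following sketch shows one way to supply PLAIN credentials without hard-coding the password in a script. It assumes Connect-SparkSQL accepts AuthScheme, User, and Password as parameters and that Password is passed as plain text. Get-Credential prompts interactively.

```powershell
# Prompt for the user name and password instead of storing them in the script.
$cred = Get-Credential

# Extract the plain-text password from the credential object.
$plainPassword = $cred.GetNetworkCredential().Password

# Assumption: the parameter names mirror the connection property names.
$conn = Connect-SparkSQL -Server '127.0.0.1' -AuthScheme 'PLAIN' -User $cred.UserName -Password $plainPassword
```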

LDAP

To authenticate with LDAP, set the following connection properties:

  • AuthScheme: LDAP.
  • User: The user to log in as.
  • Password: The password of the user.

NOSASL

When using NOSASL, no authentication is performed. Set the following connection properties:

  • AuthScheme: NOSASL.
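
The requirements of the PLAIN, LDAP, and NOSASL schemes can be collected into a small helper. The sketch below is illustrative only: New-SparkSQLAuthParams is a hypothetical function, not part of the CData module, and the user name and password are placeholders. Passing the resulting hashtable to Connect-SparkSQL by splatting assumes the cmdlet accepts AuthScheme, User, and Password as parameters, so that call is left commented out.

```powershell
# Hypothetical helper: builds the authentication settings for a given scheme
# and fails early when a required value is missing.
function New-SparkSQLAuthParams {
    param(
        [Parameter(Mandatory)]
        [ValidateSet('PLAIN', 'LDAP', 'NOSASL')]
        [string] $AuthScheme,
        [string] $User,
        [string] $Password
    )

    $params = @{ AuthScheme = $AuthScheme.ToUpper() }

    # PLAIN and LDAP need credentials; NOSASL performs no authentication.
    if ($AuthScheme -ne 'NOSASL') {
        if (-not $User -or -not $Password) {
            throw "AuthScheme $AuthScheme requires both User and Password."
        }
        $params.User     = $User
        $params.Password = $Password
    }

    $params
}

# Placeholder credentials for illustration.
$authParams = New-SparkSQLAuthParams -AuthScheme 'LDAP' -User 'myuser' -Password 'mypassword'

# Assumed usage with the module loaded:
# $conn = Connect-SparkSQL -Server '127.0.0.1' @authParams
```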

Kerberos

For details on how to authenticate with Kerberos, see Using Kerberos.

Retrieving Data

The Select-SparkSQL cmdlet provides a native PowerShell interface for retrieving data:

$results = Select-SparkSQL -Connection $conn -Table "Customers" -Columns @("City", "CompanyName") -Where "Country='US'"

The Invoke-SparkSQL cmdlet provides an SQL interface. This cmdlet can be used to execute an SQL query via the Query parameter.
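
A minimal sketch of such a call is shown below. The Query parameter comes from the description above; the Connection parameter is assumed by analogy with Select-SparkSQL, and $conn is the connection object created earlier.

```powershell
# Run a SQL statement directly instead of composing -Table, -Columns, and -Where.
# Assumption: Invoke-SparkSQL takes -Connection like the other cmdlets.
$results = Invoke-SparkSQL -Connection $conn -Query "SELECT City, CompanyName FROM Customers WHERE Country = 'US'"

# Display the returned rows as a table.
$results | Format-Table -Property City, CompanyName
```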

Piping Cmdlet Output

The cmdlets return row objects to the pipeline one row at a time. The following line exports results to a CSV file:

Select-SparkSQL -Connection $conn -Table Customers -Where "Country = 'US'" | Select-Object -Property * -ExcludeProperty Connection,Table,Columns | Export-Csv -Path c:\myCustomersData.csv -NoTypeInformation

Note that the results from Select-SparkSQL are piped into the Select-Object cmdlet, which excludes some properties, before they reach the Export-Csv cmdlet. This is because the CData Cmdlets append Connection, Table, and Columns information to each row object in the result set, and that information is usually not wanted in the CSV file.
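
The effect of the exclusion can be tried without a live connection. The sketch below uses a mock row object, built by hand to resemble the shape described above (data columns plus Connection, Table, and Columns); the values are invented for illustration and do not come from a real server.

```powershell
# Mock row shaped like the cmdlet output: metadata properties plus data columns.
$rows = @(
    [pscustomobject]@{
        Connection  = $null
        Table       = 'Customers'
        Columns     = @()
        City        = 'MyCity'
        CompanyName = 'MyCompanyName'
    }
)

# Drop the metadata properties and keep only the data columns.
$clean = $rows | Select-Object -Property * -ExcludeProperty Connection, Table, Columns

# Convert to CSV text; Export-Csv would write the same content to a file.
$csv = $clean | ConvertTo-Csv -NoTypeInformation
$csv
```

Only the City and CompanyName columns remain in the CSV header and data row.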

However, this extra information makes it easy to pipe the output of one cmdlet to another. The following is an example of converting a result set to JSON:

 
PS C:\> $conn = Connect-SparkSQL -Server '127.0.0.1'
PS C:\> $row = Select-SparkSQL -Connection $conn -Table "Customers" -Columns @("City", "CompanyName") -Where "Country = 'US'" | select -first 1
PS C:\> $row | ConvertTo-Json
{
  "Connection":  {

  },
  "Table":  "Customers",
  "Columns":  [

  ],
  "City":  "MyCity",
  "CompanyName":  "MyCompanyName"
} 

Modifying Data

The cmdlets make data transformation and data cleansing easy. The following example loads data from a CSV file into Spark SQL, checking first whether a record already exists and needs to be updated instead of inserted.

Import-Csv -Path C:\MyCustomersUpdates.csv | ForEach-Object {
  # Look up the incoming row by its _id to decide between update and insert.
  $record = Select-SparkSQL -Connection $conn -Table Customers -Where "_id = '$($_._id)'"
  if ($record) {
    # The record exists: update its City and CompanyName.
    Update-SparkSQL -Connection $conn -Table Customers -Columns @("City","CompanyName") -Values @($_.City, $_.CompanyName) -Where "_id = '$($_._id)'"
  } else {
    # No match: insert a new record.
    Add-SparkSQL -Connection $conn -Table Customers -Columns @("City","CompanyName") -Values @($_.City, $_.CompanyName)
  }
}
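
The loop above issues one lookup per CSV row. As an alternative sketch, the rows can be partitioned up front against a set of IDs that is fetched once. Split-UpsertRows is a hypothetical helper, not part of the CData module, and the sample rows and IDs are invented for illustration. In practice the existing IDs might come from a single Select-SparkSQL call on the _id column, and the two resulting groups would be passed to Update-SparkSQL and Add-SparkSQL as in the example above.

```powershell
# Hypothetical helper: split incoming rows into updates and inserts,
# based on the IDs that already exist in the target table.
function Split-UpsertRows {
    param(
        [Parameter(Mandatory)]
        [object[]] $Rows,
        [Parameter(Mandatory)]
        [AllowEmptyCollection()]
        [string[]] $ExistingIds
    )

    # A hash set gives constant-time membership checks (case-sensitive).
    $known = [System.Collections.Generic.HashSet[string]]::new($ExistingIds)

    [pscustomobject]@{
        ToUpdate = @($Rows | Where-Object { $known.Contains([string]$_._id) })
        ToInsert = @($Rows | Where-Object { -not $known.Contains([string]$_._id) })
    }
}

# Invented sample data standing in for Import-Csv output.
$rows = @(
    [pscustomobject]@{ _id = '1'; City = 'CityA'; CompanyName = 'CompanyA' }
    [pscustomobject]@{ _id = '2'; City = 'CityB'; CompanyName = 'CompanyB' }
)

# Pretend that only _id 1 already exists in the Customers table.
$plan = Split-UpsertRows -Rows $rows -ExistingIds @('1')
```

This trades one query per row for a single query plus an in-memory lookup, which may matter for large CSV files.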

Copyright (c) 2024 CData Software, Inc. - All rights reserved.