07 March 2017 spark scala

You might be used to running Python scripts through the spark-submit command. That is straightforward for prototyping and testing, although running Python applications through 'pyspark' is no longer supported as of Spark 2.0. Scala scripts, on the other hand, must be packaged as standalone applications before spark-submit will accept them, which can feel like overkill for a simple script even with build tools such as sbt or Gradle. Fortunately, spark-shell provides ways to load and evaluate a Scala script directly.

Method 1: load within the Spark shell

Inside the shell, use the built-in command starting with ':' to load and evaluate the Scala script:

scala> :load script.scala
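
For illustration, a minimal script.scala might look like the sketch below; the file name input.txt and the word-count logic are assumptions made for the example. The shell already provides the sc (SparkContext) object, so the script can use it directly.

// script.scala - a minimal sketch; assumes a text file named input.txt exists
val lines = sc.textFile("input.txt")
val wordCounts = lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.take(10).foreach(println)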

Method 2: run spark-shell with options to specify the script and arguments

Run the following command line to execute a Scala script with command-line arguments.

$> spark-shell -i script.scala --conf spark.driver.args="arg1 arg2 ..."

Option -i specifies the script path. Option --conf allows you to pass properties to the SparkConf instance. spark.driver.args is not a special property; it is simply a conventional name under the spark. prefix, and any such property passed via --conf ends up in the SparkConf where the script can read it back.

Method 3: define shell function to run with command line arguments

A shortcut for running a Spark script in Scala with command-line arguments is a small shell function. Edit .bash_profile and add a spark-scala function that takes the target Scala script followed by its arguments:

~/.bash_profile
function spark-scala {
  # First argument is the script path; the remaining arguments become spark.driver.args
  local script="$1"
  shift
  spark-shell -i "$script" --conf spark.driver.args="$*"
}

The Scala script then retrieves the command-line arguments via the SparkConf property spark.driver.args.

// Read the arguments back from the SparkConf and split on whitespace
val args = sc.getConf.get("spark.driver.args").split("\\s+")
printf("args=%s\n", args.mkString(", "))
// ...
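
From there the script can interpret the arguments however it likes. As a hypothetical follow-up, the sketch below pulls an integer value out of a -k flag like the one used in the invocation shown further down; the flag name and the default value are assumptions made for the example.

// Hypothetical: extract the value following "-k", defaulting to 1 if the flag is absent
val k = args.sliding(2).collectFirst { case Array("-k", v) => v.toInt }.getOrElse(1)
println(s"k = $k")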

In this way, a Scala script can be executed by spark-shell with command-line arguments, much like a Python script submitted through spark-submit.

$> spark-scala script.scala -k 3
args=-k, 3

