30 March 2017 · Spark, Scala, AWS S3 bucket

Assume hadoop-2.7.3 is installed via brew on macOS. For single-machine use, the configuration may be set statically; for multiple users, it may be set programmatically. The latest URL scheme for accessing an S3 bucket begins with s3a://.

Static Configuration

Step 1

Tell Spark where the Hadoop configuration files are.

/usr/local/Cellar/apache-spark/2.1.0/libexec/conf/spark-env.sh
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.3/libexec
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
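
A quick sanity check: after restarting spark-shell, the variable should be visible from the Scala prompt (a sketch only; assumes spark-env.sh is sourced by the launcher scripts).

// Prints the Hadoop configuration directory set in spark-env.sh.
sys.env.get("HADOOP_CONF_DIR").foreach(println)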
Step 2

Set the S3 endpoint and access keys for Hadoop.

/usr/local/Cellar/hadoop/2.7.3/libexec/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>fs.s3a.endpoint</name>
        <description>AWS S3 endpoint to connect to. An up-to-date list is
            provided in the AWS Documentation: regions and endpoints. Without this
            property, the standard region (s3.amazonaws.com) is assumed.
        </description>
        <value>s3.us-east-2.amazonaws.com</value>
    </property>
    <property>
        <name>fs.s3a.impl</name>
        <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
        <description>The implementation class of the S3A Filesystem</description>
    </property>
    <property>
        <name>fs.s3a.access.key</name>
        <description>AWS access key ID.</description>
        <value>**********</value>
    </property>
    <property>
        <name>fs.s3a.secret.key</name>
        <description>AWS secret key.</description>
        <value>**********</value>
    </property>
</configuration>
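
After restarting spark-shell, the values from core-site.xml should show up on the SparkContext's Hadoop configuration; a quick check from the Scala prompt:

// Both should print the values configured above, not null.
println(sc.hadoopConfiguration.get("fs.s3a.endpoint"))
println(sc.hadoopConfiguration.get("fs.s3a.impl"))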

Alternatively, set the access keys in the user’s environment.

~/.bash_profile
export AWS_ACCESS_KEY_ID=***************
export AWS_SECRET_ACCESS_KEY=************
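
If only the environment variables are set, the keys can also be copied onto the Hadoop configuration from inside spark-shell; a minimal sketch, assuming both variables above are exported:

// Copy AWS credentials from the environment into the S3A configuration.
sys.env.get("AWS_ACCESS_KEY_ID").foreach(key => sc.hadoopConfiguration.set("fs.s3a.access.key", key))
sys.env.get("AWS_SECRET_ACCESS_KEY").foreach(secret => sc.hadoopConfiguration.set("fs.s3a.secret.key", secret))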
Step 3

Set the CLASSPATH for the AWS client libraries and enable Signature Version 4 (V4) support for spark-shell.

V4 support is required for newer S3 regions such as Ohio (us-east-2).

~/.bash_profile
export HADOOP_HOME=/usr/local/Cellar/hadoop/2.7.3/libexec
export AWS_CLASSPATH="$HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-2.7.3.jar"
export AWS_CLASSPATH="$AWS_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-1.7.4.jar"
export AWS_CLASSPATH="$AWS_CLASSPATH:$HADOOP_HOME/share/hadoop/tools/lib/guava-11.0.2.jar"
alias spark-shell='spark-shell --driver-class-path "$AWS_CLASSPATH" --driver-java-options "-Dcom.amazonaws.services.s3.enableV4=true"'
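
To confirm that the jars actually reached the driver classpath, the relevant classes can be loaded from the spark-shell prompt (a sanity check only):

// Each call throws ClassNotFoundException if hadoop-aws or aws-java-sdk is missing.
Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")
Class.forName("com.amazonaws.services.s3.AmazonS3Client")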

Programmatic Configuration

The AWS_CLASSPATH above still needs to be passed to spark-shell. Run spark-shell, then enable V4 support and set the S3 access/secret keys and endpoint as follows.

System.setProperty("com.amazonaws.services.s3.enableV4", "true")
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
sc.hadoopConfiguration.set("fs.s3a.access.key", "**********")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "**********")

Read RDD from S3 Bucket

Finally, count an RDD created by reading from an existing S3 bucket.

val rdd = sc.textFile("s3a://cs5630s17/part00001.gz")
rdd.count
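
The same object can also be read with the Dataset API in Spark 2.x, using the spark session that spark-shell provides and the same example path:

// Gzip files are decompressed transparently, as with sc.textFile.
val lines = spark.read.textFile("s3a://cs5630s17/part00001.gz")
lines.count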

