Install pyspark

We need to choose the spark version. it could be 2.4 or bigger. In our case it is 2.4.6.

The installation method is with conda:

conda install -c conda-forge pyspark=2.4.6

Install java

We need to have java. The right version for java. There is a problem with java 272 which comes with Amazon Linux 2. So we have to first remove that version and install the older version.

Query for the current installed openjdk:

rpm -qa | grep java will see something like
java-1.8.0-openjdk.x86_64 1:
...then remove by
yum remove jdk1.8

Going for Java 265

yum -v list java-1.8.0-openjdk-headless  --show-duplicates
yum -v list java-1.8.0-openjdk  --show-duplicates
yum install java-1.8.0-openjdk-
.. headless will be installed by the upper command.

Update alternatives

alternatives --config java

AWS Enable Local Spark

Check what version of hadoom-common you have

ls -l /opt/anaconda3/envs/advanced/lib/python2.7/site-packages/pyspark/jars/hadoop*

That means that we have to stick to aws sdk for hadoop 2.7.3 Download hadoop-aws-2.7.3.jar and its dependency aws-java-sdk-1.7.4.jar. Great tutorial found here

So the final code to get the spark running is

def create_local_spark():
    jars = [

    aws_1 = [

    jars_string = ",".join(jars + aws_1)
    pyspark_shell = "--jars {} --driver-memory 4G pyspark-shell".format(jars_string)

    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_shell
    os.environ["PYSPARK_PYTHON"] = "/opt/anaconda3/envs/advanced/bin/python"

    spark_session = SparkSession.builder.appName("ZZZZZ").getOrCreate()
    hadoop_conf = spark_session._jsc.hadoopConfiguration()

    hadoop_conf.set("", "true")
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
    hadoop_conf.set("", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    hadoop_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")

    spark_context = spark_session.sparkContext
    sql_context = SQLContext(spark_context)
    # df ="s3a://hello/world/")
    return spark_context, sql_context