Install pyspark

We need to choose the Spark version; it should be 2.4 or newer. In our case it is 2.4.6.

We install it with conda:

conda install -c conda-forge pyspark=2.4.6
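A quick sanity check from Python (with the conda environment activated) confirms the version that landed:

import pyspark
print(pyspark.__version__)  # should print 2.4.6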

Install java

We also need Java, and it has to be the right version. There is a problem with OpenJDK 1.8.0.272, which ships with Amazon Linux 2, so we first remove that build and install an older one.

Query the currently installed OpenJDK packages:

rpm -qa | grep java
...you will see something like:
java-1.8.0-openjdk-headless-1.8.0.272.b10-1.amzn2.0.1.x86_64
java-1.8.0-openjdk.x86_64 1:1.8.0.272.b10-1.amzn2.0.1
...then remove both packages:
yum remove java-1.8.0-openjdk java-1.8.0-openjdk-headless

Install Java 1.8.0.265

yum -v list java-1.8.0-openjdk-headless  --show-duplicates
yum -v list java-1.8.0-openjdk  --show-duplicates
...
yum install java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1
...the headless package is pulled in as a dependency by the command above.

Update alternatives

alternatives --config java
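If Spark is started from a notebook or a script rather than a login shell, it can help to point JAVA_HOME at the chosen JDK explicitly and double-check the runtime version. A minimal sketch; the JDK path below is an assumption and should be taken from the output of alternatives --display java:

import os
import subprocess

# Hypothetical JDK location; confirm it with `alternatives --display java`.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.265.b01-1.amzn2.0.1.x86_64"

# `java -version` writes to stderr, so capture both streams; expect "1.8.0_265".
print(subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT).decode())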

Enable AWS S3 Access for Local Spark

Check which version of hadoop-common you have:

ls -l /opt/anaconda3/envs/advanced/lib/python2.7/site-packages/pyspark/jars/hadoop*
....
hadoop-common-2.7.3.jar
...

That means we have to stick to the AWS SDK build that matches Hadoop 2.7.3: download hadoop-aws-2.7.3.jar and its dependency aws-java-sdk-1.7.4.jar into /opt/jars. A good tutorial on this setup can be found here.
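Both jars are plain Maven artifacts, so one way to fetch them into /opt/jars is straight from Maven Central. A sketch assuming a Python 3 interpreter and the standard Maven Central repository layout (verify the URLs before relying on them):

import urllib.request

# Assumed Maven Central paths for the two artifacts; check before use.
artifacts = {
    "hadoop-aws-2.7.3.jar": "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar",
    "aws-java-sdk-1.7.4.jar": "https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar",
}
for name, url in artifacts.items():
    urllib.request.urlretrieve(url, "/opt/jars/" + name)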

So the final code to get Spark running is:

import os

from pyspark.sql import SparkSession, SQLContext


def create_local_spark():
    # Extra jars copied over from an EMR cluster (LZO, EMRFS, s3-dist-cp,
    # MySQL connector, ...).
    jars = [
        "/opt/jars/hadoop-lzo-0.4.21-SNAPSHOT.jar",
        "/opt/jars/aopalliance-1.0.jar",
        "/opt/jars/bcprov-jdk15on-1.51.jar",
        "/opt/jars/ion-java-1.0.2.jar",
        "/opt/jars/jcl-over-slf4j-1.7.21.jar",
        "/opt/jars/slf4j-api-1.7.21.jar",
        "/opt/jars/bcpkix-jdk15on-1.51.jar",
        "/opt/jars/emrfs-hadoop-assembly-2.19.0.jar",
        "/opt/jars/javax.inject-1.jar",
        "/opt/jars/jmespath-java-1.11.129.jar",
        "/opt/jars/s3-dist-cp-2.7.0.jar",
        "/opt/jars/s3-dist-cp.jar",
        "/opt/jars/mysql-connector-java-5.1.39.jar",
    ]

    # hadoop-aws has to match the hadoop-common bundled with pyspark (2.7.3);
    # aws-java-sdk-1.7.4 is its matching dependency.
    aws_1 = [
        "/opt/jars/hadoop-aws-2.7.3.jar",
        "/opt/jars/aws-java-sdk-1.7.4.jar",
    ]

    # The jars and driver memory must be passed via PYSPARK_SUBMIT_ARGS
    # before the SparkSession is created.
    jars_string = ",".join(jars + aws_1)
    pyspark_shell = "--jars {} --driver-memory 4G pyspark-shell".format(jars_string)

    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_shell
    os.environ["PYSPARK_PYTHON"] = "/opt/anaconda3/envs/advanced/bin/python"

    spark_session = SparkSession.builder.appName("ZZZZZ").getOrCreate()
    hadoop_conf = spark_session._jsc.hadoopConfiguration()

    hadoop_conf.set("com.amazonaws.services.s3.enableV4", "true")
    hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    hadoop_conf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
    hadoop_conf.set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    hadoop_conf.set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")

    spark_context = spark_session.sparkContext
    sql_context = SQLContext(spark_context)
    # df = spark_session.read.json("s3a://hello/world/")
    return spark_context, sql_context
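A minimal usage sketch; the S3 path is a placeholder and assumes the instance profile (or default credential chain) can read the bucket:

spark_context, sql_context = create_local_spark()

# Hypothetical bucket/prefix; replace with data the instance role can read.
df = sql_context.read.json("s3a://my-bucket/some/prefix/")
df.printSchema()
print(df.count())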