Apache Spark Application
In this post, we will look at how to submit a Spark job on AWS EMR. For this, we have an application that converts any given file to Avro or Parquet format using Apache Spark.
Project Folder Structure
|-- project/
| |-- build.properties
| |-- plugins.sbt
|-- target/
|-- src/
| |-- main/
| |-- test/
|-- build.sbt
Based on the above structure, the project folder contains all the configuration, including the build properties and sbt setup. The target folder contains the build artifacts, including the assembly JAR. The src folder holds all the source code: the business logic to convert file types and the unit test suites written with the ScalaTest framework.
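The repository ships with its own build definition, but for context, a minimal sbt setup for a project like this might look as follows. The versions and names here are illustrative assumptions rather than the repository's exact files:
// project/plugins.sbt - hypothetical; the repository pins its own version
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")

// build.sbt - hypothetical sketch of the dependencies this project relies on
name := "converter"
version := "0.1"
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"  % "2.3.2" % "provided",
  "com.databricks"   %% "spark-avro" % "4.0.0",
  "org.scalatest"    %% "scalatest"  % "3.0.5" % Test
)
Marking spark-sql as "provided" keeps Spark itself out of the assembly JAR, since EMR already supplies it at runtime.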
Scala Functions
Spark session initialisation
To run any Spark application, we first need to initialise a Spark session. This is done in the src/main/scala/au/fayaz/Utils.scala file:
import org.apache.spark.sql.SparkSession

// Creates (or reuses) a SparkSession for the application
def initiateSpark(): SparkSession = {
  val spark = SparkSession.builder.master("local[*]").appName("File Converter").getOrCreate()
  spark
}
The initiateSpark function returns a session that we use for our DataFrame operations. In this example, we convert the input file to Parquet or Avro based on the format parameter.
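Before anything can be written out, the input file has to be loaded into a DataFrame through that session. The repository has its own read logic; a minimal sketch, assuming a CSV input with a header row, could look like this:
// Hypothetical read step - the actual application may detect the input format differently
val spark = initiateSpark()
val inputDF = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("s3://yourbucket/inputFiles/sample.csv")
The resulting DataFrame is then handed to the writeOutput function shown next.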
import org.apache.spark.sql.DataFrame

// Writes the DataFrame to the output path as Parquet or Avro
def writeOutput(dataframe: DataFrame, outputPath: String, format: String): Unit = {
  if (format == "parquet") {
    dataframe
      .write
      .option("compression", "snappy")
      .mode("overwrite")
      .parquet(outputPath)
  } else {
    dataframe
      .write
      .format("com.databricks.spark.avro")
      .mode("overwrite")
      .save(outputPath)
  }
}
The writeOutput function takes a DataFrame, an output path and a format. Based on the format, it performs a simple write operation to save the file in the appropriate format with the right compression.
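Later on, the spark-submit command references the class au.fayaz.Converter. Its real implementation is in the repository; purely as a hypothetical sketch, an entry point that wires the two functions above together (assuming they sit on a Utils object and that the arguments arrive as input path, output path and format) could look like this:
package au.fayaz

// Hypothetical entry point - the repository's actual Converter may differ
object Converter {
  def main(args: Array[String]): Unit = {
    val inputPath  = args(0)  // e.g. s3://yourbucket/inputFiles/sample.csv
    val outputPath = args(1)  // e.g. s3://yourbucket/outputFiles/20190524/
    val format     = args(2)  // "parquet" or "avro"

    val spark = Utils.initiateSpark()
    val inputDF = spark.read.option("header", "true").csv(inputPath)

    Utils.writeOutput(inputDF, outputPath, format)
    spark.stop()
  }
}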
To Use this Code
To make use of this code, simply follow the steps below to run it in your EMR environment.
Prerequisites
- Git installed on your local machine
- Scala and sbt installed on your local machine
- An EMR cluster ready for use
Once you have these in place, all you need to do is clone this repository and run through the steps below:
Step 1: Clone the repository: git clone https://github.com/fayaz92/file-converter-spark-scala.git
Step 2: Open your IDE and import the project; my choice is IntelliJ in this case.
Step 3: From the command line, run sbt test, which performs the unit tests to ensure the code works as expected.
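The actual test suite ships with the repository under src/test. Purely as an illustration, a ScalaTest spec exercising writeOutput might look something like this (the class name, the Utils object and the assertions are assumptions, not the repository's real tests):
import org.scalatest.FunSuite
import au.fayaz.Utils  // assuming the functions shown above live on a Utils object

// Hypothetical test - the repository's actual suite may differ
class ConverterSpec extends FunSuite {

  test("writeOutput writes a readable Parquet file") {
    val spark = Utils.initiateSpark()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2)).toDF("key", "value")
    val outputPath = "target/test-output/parquet"

    Utils.writeOutput(df, outputPath, "parquet")
    assert(spark.read.parquet(outputPath).count() == 2)
  }
}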
Step 4: From the command line, run sbt assembly, which builds the artifact as a JAR file.
Step 5: To run this code on your EMR cluster, copy the JAR file to S3.
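For example, the upload can be done with the AWS CLI (the local path below assumes Scala 2.11 and an assembly named converter-assembly-0.1.jar; adjust it to match your build output):
aws s3 cp target/scala-2.11/converter-assembly-0.1.jar s3://yourbucket/converter-assembly-0.1.jar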
Step 6: Once the JAR file is in S3, you are ready to run the code. SSH into the EMR cluster and run the command below.
spark-submit --class au.fayaz.Converter \
  s3://yourbucket/converter-assembly-0.1.jar \
  s3://yourbucket/inputFiles/sample.csv \
  s3://yourbucket/outputFiles/20190524/ \
  parquet
Now you should get a converted Parquet file in the output location!
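If you want to sanity-check the result, you can read it back from a spark-shell session on the cluster (the path below matches the example above):
val converted = spark.read.parquet("s3://yourbucket/outputFiles/20190524/")
converted.printSchema()
converted.show(5)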
Thank you for reading. You can download this application from the GitHub repository linked in Step 1.