
Spark on Windows

4 Feb 2017, CPOL, 2 min read

This article shows how to set up Apache Spark on Windows in a few easy steps.

Introduction

Apache Spark is designed to run on Linux in production environments. However, to learn Spark programming, we can use a Windows machine. In this article, I'll explain how to set up Spark in a few simple steps, and we'll also run our "Hello World" Spark program.

Background

Apache Spark is a fast and general-purpose cluster computing platform. Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. You can find more information at http://spark.apache.org/ and https://en.wikipedia.org/wiki/Apache_Spark

Software required

Apache Spark is built using Scala and runs on the JVM. The latest Spark release, 2.0.2, runs on Java 1.7 or later.

Step-1

So first, we need to set up Java 1.7 if it isn't installed already. You can download it from http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html#jre-7u76-oth-JPR

You can use either the installer or the binaries. Once the Java setup is complete, open your command prompt and check the Java version using the command "java -version". It will display output like the following:

Image 1
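As a cross-check, the same version string is visible from any Scala REPL through the java.version system property, which mirrors what "java -version" prints. A small sketch (the object and method names below are my own illustration, not part of Spark or the JDK):

```scala
// Illustrative helper (names are mine): reads the JVM's java.version
// system property, the same value that "java -version" reports.
object JavaCheck {
  def javaVersion: String = sys.props("java.version")

  def main(args: Array[String]): Unit =
    println(s"Running on Java $javaVersion")
}
```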

 

Step-2

Spark depends on winutils.exe, which is usually installed along with Hadoop. Since we are not going to deploy Hadoop, we need to download this program separately and set up an environment variable for it.

Download winutils.exe from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe

Create a folder called hadoop\bin wherever you want, and place winutils.exe inside it. I chose c:\backup\hadoop\bin

Create an environment variable called HADOOP_HOME with the path c:\backup\hadoop
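Spark looks for winutils.exe under the bin subfolder of whatever HADOOP_HOME points to, so the variable must name the parent folder, not bin itself. A sketch of that path resolution, with a check you could paste into any Scala REPL (the object and helper names are my own illustration, not a Spark utility):

```scala
// Illustrative sketch (names are mine, not Spark's): HADOOP_HOME must
// point at the parent folder, and winutils.exe must sit in its bin\
// subfolder for Spark to find it on Windows.
object WinutilsCheck {
  // Builds the path Spark effectively expects: <HADOOP_HOME>\bin\winutils.exe
  def winutilsPath(hadoopHome: String): String =
    hadoopHome + "\\bin\\winutils.exe"

  def main(args: Array[String]): Unit =
    sys.env.get("HADOOP_HOME") match {
      case Some(home) => println("Expecting winutils at: " + winutilsPath(home))
      case None       => println("HADOOP_HOME is not set")
    }
}
```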

Image 2

Step-3

Now download Apache Spark from http://spark.apache.org/downloads.html

Unzip it to your preferred location. The extracted folder looks like this:

Image 3

Update the "Path" environment variable with the Spark bin location; in my case, it's C:\backup\spark-2.0.2-bin-hadoop2.7\bin

Image 4
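To verify the update took effect, the Path value can be split on semicolons and searched for the Spark bin folder. This is my own illustrative sketch (the object name and check are not a Spark utility); on Windows the comparison should be case-insensitive:

```scala
// Illustrative helper (not part of Spark): checks whether a directory
// appears in a Windows PATH-style value (semicolon-separated entries,
// compared case-insensitively).
object PathCheck {
  def onPath(pathValue: String, dir: String): Boolean =
    pathValue.split(";").exists(_.trim.equalsIgnoreCase(dir))

  def main(args: Array[String]): Unit = {
    // Example location from this article; adjust to your own unzip folder.
    val sparkBin = "C:\\backup\\spark-2.0.2-bin-hadoop2.7\\bin"
    println(onPath(sys.env.getOrElse("Path", ""), sparkBin))
  }
}
```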

Test Spark

Spark comes with interactive shells for executing the Spark APIs. The available shells are:

spark-shell --> Works with the Scala APIs

pyspark --> Works with the Python APIs

Open your command prompt, type spark-shell, and press Enter. You should see the Spark shell if all the configurations are set correctly.

 

Image 5

Congrats! You have successfully set up Spark on Windows. Now let's try the "hello world" of the Hadoop world, which is the simple word count program :). If you know how to write it using Java MapReduce, Hive SQL, or a Pig script, then you'll really appreciate Spark, where we can achieve the same with a few simple APIs.

A. Make sure you have a sample text file to count words in. Assume it's at c:\temp\test.txt

B. Let's write the Spark program for our hello world (press Enter after each line):

Scala
scala> val file = sc.textFile("c:\\temp\\test.txt")
scala> val words = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
scala> words.collect

 

Image 6
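To see why those three lines count words, here is the same pipeline sketched with plain Scala collections instead of RDDs (my own illustration: a groupBy-and-sum stage stands in for Spark's reduceByKey, and the sample lines stand in for the contents of c:\temp\test.txt):

```scala
// Plain-Scala sketch of the Spark word-count pipeline, using the
// collections API instead of RDDs so it runs without spark-shell.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split(" "))   // like RDD.flatMap: lines -> words
      .map(word => (word, 1))             // like RDD.map: word -> (word, 1)
      .groupBy(_._1)                      // gathers pairs by key (Spark shuffles here)
      .map { case (word, pairs) => (word, pairs.map(_._2).sum) } // like reduceByKey(_ + _)

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("hello spark", "hello world")))
}
```

The chain mirrors the shell session step for step, which is why `words.collect` in spark-shell returns the same (word, count) pairs this sketch prints.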

 


License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect CA Technologies
India
I'm an enthusiastic software developer with experience in open source and MS technologies. I'm passionate about Big Data technologies, particularly HDFS, YARN, Spark, and Kafka. I love learning new things and sharing knowledge. I like to travel, listen to music, watch thriller movies, and play chess.
