Tip/Trick

Don’t Use Elephant for Your Garden Work

20 Jun 2014 · CPL · 2 min read

While learning the new Tez engine and query vectorization concepts in Hadoop 2.0, I came to know that query vectorization is claimed to be 3x faster and to consume less CPU time on an actual Hadoop cluster. The Hortonworks tutorial imports sample sensor data from a CSV file into Hive, then uses a sample query to demonstrate the performance.

The intention of this post is to explain neither the Tez engine and query vectorization nor Hive queries. Before getting to the purpose of this post, let me introduce the problem I worked on. :)

One sample CSV file, 'HVAC.csv', contains 8,000 records of temperature readings for different buildings on different days. Part of the file content:

Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID
6/1/13,0:00:01,66,58,13,20,4
6/2/13,1:00:01,69,68,3,20,17
6/3/13,2:00:01,70,73,17,20,18
6/4/13,3:00:01,67,63,2,23,15
6/5/13,4:00:01,68,74,16,9,3
… 

In Hive, the following settings enable the Tez engine and query vectorization:

hive> set hive.execution.engine=tez;
hive> set hive.vectorized.execution.enabled;
      hive.vectorized.execution.enabled=true

I executed the following query in my sandbox; surprisingly, it took 48 seconds for a 'group by' and 'count' over 8,000 records:

select date, count(buildingid) from hvac_orc group by date;

This query groups the sensor data by date and counts the number of building readings for each date. It produces 30 rows, as shown below:

Status: Finished successfully
OK
6/1/13  267
6/10/13 267
6/11/13 267
...
Time taken: 48.261 seconds, Fetched: 30 row(s)

Since it is just 8,000 records, I then planned to write a simple program without the whole MapReduce castle. I created an F# script that reads the CSV (note that I did not use a CSV type provider) and aggregates it with the Deedle exploratory data library (again, LINQ could also help). It achieved the same result, as shown below.

F#
module ft

#I @"..\packages\Deedle.1.0.0"
#load "Deedle.fsx"
open System
open System.IO
open System.Globalization
open System.Diagnostics
open Deedle

type hvac = { Date : DateTime; BuildingID : int }

let execute =
    let stopwatch = Stopwatch.StartNew()

    let enus = CultureInfo("en-US")
    let lines = File.ReadAllLines(@"..\ml\SensorFiles\HVAC.csv")

    // Skip the header row, then parse Date and BuildingID from each record
    let ohvac =
        lines.[1..]
        |> Array.map (fun s -> s.Split(','))
        |> Array.map (fun s -> { Date = DateTime.Parse(s.[0], enus); BuildingID = int s.[6] })
        |> Frame.ofRecords

    // Group rows by Date and count the BuildingID values in each group
    let result =
        ohvac.GroupRowsBy("Date")
        |> Frame.getNumericCols
        |> Series.mapValues (Stats.levelCount fst)
        |> Frame.ofColumns

    stopwatch.Stop()
    (stopwatch.ElapsedMilliseconds, result)

In FSI:

F#
> #load "finalTouch.fsx";;
> open ft;;
> ft.execute;;
val it : int64 * Deedle.Frame =
(83L,
BuildingID
01-06-2013 12:00:00 AM -> 267
02-06-2013 12:00:00 AM -> 267
03-06-2013 12:00:00 AM -> 267
04-06-2013 12:00:00 AM -> 267
...

The whole run completed within 83 milliseconds. You may argue that I am comparing apples with oranges. No! My intention is to understand when MapReduce is the savior. The moral of the above exercise is: be cautious and analyze well before moving your data processing into MapReduce clusters.
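To emphasize how small this workload really is, the same group-by-and-count fits in a few lines of plain Python as well. This is a minimal sketch, not the article's F# solution; the inline sample rows are illustrative stand-ins that follow the HVAC.csv schema, not the real file:

```python
import csv
import io
from collections import Counter

# Illustrative sample rows using the HVAC.csv column layout from the article
sample = """Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID
6/1/13,0:00:01,66,58,13,20,4
6/1/13,1:00:01,69,68,3,20,17
6/2/13,2:00:01,70,73,17,20,18
"""

def count_buildings_per_date(text):
    """Group records by Date and count the BuildingID entries per date."""
    reader = csv.DictReader(io.StringIO(text))
    return Counter(row["Date"] for row in reader)

print(count_buildings_per_date(sample))  # Counter({'6/1/13': 2, '6/2/13': 1})
```

Reading the full 8,000-row file with `csv.DictReader` instead of the inline string would take milliseconds on any laptop, which is exactly the point: no cluster is needed at this scale.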

Elephants are very effective at labor requiring hard slogging and heavy lifting. Not in your garden!! :)

Note that the sample CSV file from Hortonworks is clearly meant for training purposes. This post just takes it as an example of the maximum data a small or medium-sized application might generate over a period. The above script may not scale and will not perform well at volumes much larger than this. Hence, this is not an anti-MapReduce proposal.

License

This article, along with any associated source code and files, is licensed under The Common Public License Version 1.0 (CPL)


Written By
Architect Aditi
India
Working as Architect for Aditi, Chennai, India.

My Website: www.udooz.net

My Blog: www.udooz.net/blog
