MongoDB Hadoop Hortonworks in Windows Server 2008

LM Heah

Jul 11, 2014

CPOL

Introduction

MongoDB is good for storage and querying when the data set is not too big (roughly under 1 million rows?). But if your data grows by 10 million rows daily and reaches hundreds of millions of rows, complex calculations such as aggregation take ages.

With 100 million rows (about 200 GB), I tried writing a multithreaded program to perform aggregation using Map/Reduce and the MongoDB aggregation framework. Neither approach turned out to be workable, and while either one is running it also degrades MongoDB performance (e.g., streaming record inserts get slower and slower).

Hence, MongoDB needs to work with Hadoop for complex calculations. Hadoop basically acts as a processing engine that performs the aggregation work for MongoDB.

I was looking for a decent Hadoop distribution that works well with MongoDB and found Hortonworks Hadoop. The good thing about it is that it provides Windows setup packages for ease of installation.

More good news: the MongoDB Hadoop connector is certified on Hortonworks Hadoop.

So, let's get started running Hadoop with MongoDB.

Prerequisites

MongoDB Installation

Download MongoDB from here

Hortonworks Hadoop Installation

Download the installation package from here
For the installation steps, refer to here
Some notes:
- I tried to install Hadoop on the "D" drive, but that seems to cause problems, I suspect because of some hardcoded paths: some of the Hadoop services could not be started. I was only able to start all Hadoop services (except Hadoop HWI) by installing the program files on the "C" drive and the data files on the "D" drive.
- Force the name node to leave safe mode by executing the following command in a command prompt: bin/hadoop dfsadmin -safemode leave
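
Forcing the name node out of safe mode bypasses its own startup checks, so it may be worth querying the current state first. The Python sketch below is only an illustration and not part of the original setup. It assumes that `dfsadmin -safemode get` prints a line containing "Safe mode is ON" or "Safe mode is OFF". The helper names and the `hadoop_cmd` argument are hypothetical and need to be adapted to your installation.

```python
import subprocess


def is_safe_mode_on(report):
    """Return True when the status text reports that safe mode is ON."""
    return "safe mode is on" in report.lower()


def leave_safe_mode_if_needed(hadoop_cmd, run=subprocess.run):
    """Force the name node out of safe mode only if it is still in it.

    hadoop_cmd: path to the hadoop launcher (an assumption; on Windows this
    is likely a .cmd file inside the Hadoop bin directory).
    run: injectable command runner, so the logic can be tried without a cluster.
    Returns True if a "leave" command was issued, False otherwise.
    """
    status = run([hadoop_cmd, "dfsadmin", "-safemode", "get"],
                 capture_output=True, text=True, check=True)
    if not is_safe_mode_on(status.stdout):
        return False  # already out of safe mode, nothing to do
    run([hadoop_cmd, "dfsadmin", "-safemode", "leave"], check=True)
    return True
```

Passing a fake `run` function makes it possible to try the decision logic on a machine where Hadoop is not installed.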

MongoDB Hadoop Connector Installation

Download the MongoDB Hadoop connector from here
Extract it and copy the following files to both "<drive>:\hdp\hive-0.13.0.2.1.1.0-1621\lib" and "<drive>:\hdp\pig-0.12.1.2.1.1.0-1621\lib":
- mongo-hadoop-core-1.3.0
- mongo-hadoop-hive-1.3.0
- mongo-java-driver-2.12.2
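
The copy step can be scripted so that both lib folders always receive the same set of files. The following Python sketch is a hedged illustration rather than part of the original procedure. It assumes the extracted files carry a `.jar` extension, and the function name `copy_connector_jars` and its parameters are made up for this example.

```python
import shutil
from pathlib import Path

# File names from the list above; the ".jar" extension is an assumption.
CONNECTOR_JARS = [
    "mongo-hadoop-core-1.3.0.jar",
    "mongo-hadoop-hive-1.3.0.jar",
    "mongo-java-driver-2.12.2.jar",
]


def copy_connector_jars(source_dir, lib_dirs, jar_names=CONNECTOR_JARS):
    """Copy every connector jar from source_dir into each directory in lib_dirs.

    Fails before copying anything if a jar or a target directory is missing.
    Returns the list of destination paths that were written.
    """
    source = Path(source_dir)
    targets = [Path(d) for d in lib_dirs]

    missing = [name for name in jar_names if not (source / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing connector jars: {missing}")
    for target in targets:
        if not target.is_dir():
            raise NotADirectoryError(f"lib directory not found: {target}")

    copied = []
    for target in targets:
        for name in jar_names:
            shutil.copy2(source / name, target / name)
            copied.append(target / name)
    return copied
```

A call would pass the folder where the connector was extracted plus the Hive and Pig lib paths given above, with `<drive>` replaced by the actual drive letter.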

Using the Code

Hadoop Hive

-- Connect to input source
CREATE EXTERNAL TABLE InputCollection(id STRING, RecordDate TIMESTAMP, SourceName STRING) 
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES('mongo.columns.mapping'='{
    "id":"_id",
    "RecordDate":"RecordDate",
    "SourceName":"SourceName"
}') 
TBLPROPERTIES('mongo.uri'='mongodb://userid:password@server:27017/DatabaseName.InputCollection');

-- Create summary table
CREATE TABLE SummaryTable ( RecordDate STRING, SourceName STRING, TOTAL INT );

-- Aggregate and insert into summary table
INSERT INTO TABLE SummaryTable 
SELECT to_date(RecordDate), SourceName, COUNT(*) FROM InputCollection 
GROUP BY to_date(RecordDate), SourceName;
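
To sanity-check the Hive result on a small sample, the same grouping can be reproduced in plain Python. This is a minimal sketch, not the Hive implementation: it assumes each record is a dict holding a `datetime` under `RecordDate` and a string under `SourceName`, and the function name `summarize` is hypothetical.

```python
from collections import Counter
from datetime import datetime


def summarize(records):
    """Count records per (day, SourceName), mirroring the GROUP BY above.

    The day is formatted as yyyy-MM-dd, the same shape Hive's to_date() returns.
    Returns (RecordDate, SourceName, TOTAL) tuples sorted by day, then source.
    """
    counts = Counter(
        (rec["RecordDate"].strftime("%Y-%m-%d"), rec["SourceName"])
        for rec in records
    )
    return [(day, source, total)
            for (day, source), total in sorted(counts.items())]


if __name__ == "__main__":
    sample = [
        {"RecordDate": datetime(2014, 7, 11, 8, 30), "SourceName": "A"},
        {"RecordDate": datetime(2014, 7, 11, 21, 5), "SourceName": "A"},
        {"RecordDate": datetime(2014, 7, 12, 9, 0), "SourceName": "B"},
    ]
    for row in summarize(sample):
        print(row)
```

Running a handful of known documents through this function and comparing the rows with SummaryTable is a cheap way to confirm that the column mapping is correct before processing the full collection.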

Hadoop PIG

A sample Pig script that aggregates records by SourceName and by day, where the day is derived from a datetime. I spent 3 days learning Pig and came up with the following script. The most challenging parts were the input/output to MongoDB and aggregating records by the day of a date.

rawData = LOAD 'mongodb://userid:password@server:27017/DatabaseName.InputCollection' 
USING com.mongodb.hadoop.pig.MongoLoader('id,RecordDate,SourceName','id'); 

RecordDateConversion = FOREACH rawData GENERATE SourceName, 
ToDate(UnixToISO(RecordDate),'yyyy-MM-dd\'T\'HH:mm:ss.SSSZ') AS RecordDateDT;

DataGetSrcAndDateOnly = FOREACH RecordDateConversion GENERATE SourceName, 
CONCAT(CONCAT(CONCAT((chararray)GetYear(RecordDateDT), '-'), 
CONCAT((chararray)GetMonth(RecordDateDT), '-')),(chararray)GetDay(RecordDateDT)) AS RecordDayOnly;

DataGetSrcAndDateOnlyGroup = GROUP DataGetSrcAndDateOnly BY (SourceName, RecordDayOnly);

result = FOREACH DataGetSrcAndDateOnlyGroup GENERATE group.SourceName AS SourceName, 
group.RecordDayOnly AS RecordDayOnly, COUNT(DataGetSrcAndDateOnly) AS Total;

STORE result INTO 'mongodb://userid:password@server:27017/DatabaseName.OutputCollection' 
USING com.mongodb.hadoop.pig.MongoInsertStorage('', '' );
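
One detail of the script above is easy to miss. The day key is built by concatenating the integer year, month and day, so the month and day are not zero-padded (for example "2014-7-11"), unlike the "2014-07-11" that Hive's to_date returns. Unpadded keys also do not sort chronologically as strings. The Python sketch below only illustrates that difference. It assumes RecordDate is a Unix timestamp in milliseconds interpreted in UTC, and both function names are hypothetical.

```python
from datetime import datetime, timezone


def _to_utc(epoch_millis):
    """Convert epoch milliseconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)


def pig_style_day_key(epoch_millis):
    """Mimic the CONCAT of GetYear/GetMonth/GetDay: no zero padding."""
    dt = _to_utc(epoch_millis)
    return f"{dt.year}-{dt.month}-{dt.day}"


def padded_day_key(epoch_millis):
    """Zero-padded yyyy-MM-dd key, which sorts chronologically as a string."""
    return _to_utc(epoch_millis).strftime("%Y-%m-%d")


if __name__ == "__main__":
    ts = 1405036800000  # 2014-07-11T00:00:00Z in milliseconds
    print(pig_style_day_key(ts))
    print(padded_day_key(ts))
```

If the Pig output is meant to be compared or joined with the Hive summary, producing the padded form in both places avoids mismatched keys.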

Points of Interest

Hadoop Server Configuration

How to Plan and Configure YARN and MapReduce 2 in HDP 2.0. Refer here.

Hadoop Performance Tuning Best Practices

Refer here

More to Come...

Query time comparison using Map/Reduce, Mongo aggregation and Hadoop
Running a Hadoop cluster with multiple nodes
Controlling the Mongo split size
Configuring the number of Hadoop mappers and reducers
Processing BSON documents
Output from Hadoop Hive to Mongo
Input BSON from Mongo in Pig, output back to Mongo storage
Running Hadoop HWI

History

11th July, 2014: Initial version