Suppose a user wants to run a job on a Hadoop cluster with a primary data set of 10 petabytes. How and when does the client node break this data into blocks?
I mean, since the client has limited resources, the user can't upload such a big file onto it directly. He would have to copy it part by part, wait for the client to store those parts as blocks, and then send the remaining parts.
But such segmentation is not mentioned in any of the documents I've read.

How is this process done?

Okay, you need to distinguish between two things:
1. Building a 1-petabyte data set
Usually you don't build a 1 PB data set by importing a single 1 PB file. A PB-scale data set normally gets built up over time, one small piece at a time, each piece streamed into HDFS as sketched below.
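To make that concrete: even a single large file never has to fit on the client. The HDFS client writes the file as a stream; the client library cuts that stream into fixed-size blocks (128 MB by default) and ships each block to DataNodes as it fills, so the block-by-block segmentation the question asks about happens inside the write pipeline. Below is a minimal Java sketch of such a streaming copy; the NameNode URI, paths and class name are placeholder assumptions, not something from the original answer.

// Minimal sketch: stream one local file (or one piece of the data set) into HDFS.
// The client only ever holds a small buffer in memory; block splitting is done
// by the HDFS client library as the stream is written.
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamingCopy {
    public static void main(String[] args) throws Exception {
        String localFile = args[0];   // e.g. one chunk of the big data set
        String hdfsDest  = args[1];   // e.g. "/data/part-00001"

        Configuration conf = new Configuration();
        // "hdfs://namenode:8020" is a placeholder for the real NameNode address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        try (InputStream in = new BufferedInputStream(new FileInputStream(localFile));
             FSDataOutputStream out = fs.create(new Path(hdfsDest))) {
            // Copy through a 4 KB buffer; the HDFS client cuts the stream into
            // blocks and sends each block to DataNodes as it is filled.
            IOUtils.copyBytes(in, out, 4096, false);
        }
    }
}

The command-line equivalent is simply hdfs dfs -put, which does the same streaming copy under the hood.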

2. Running an analysis over a 1-petabyte data set
The data lives in HDFS (Hadoop Distributed File System), spread across the cluster - for example, each of 1000 slave nodes stores 1 TB of the total 1 PB. Say you want to find Max(x) over that data set. Hadoop runs the max computation on each slave against its local 1 TB (each on a separate machine), and then a final max is taken over the 1000 partial results (one per slave). This way you never need to assemble the entire 1 PB in one place.
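As an illustration of point 2, here is a minimal MapReduce sketch of that distributed max - my own example, not part of the original answer. Each mapper keeps the maximum of its own input split and emits one partial maximum in cleanup(); a single reducer (also used as a combiner) then takes the max of the partial maxima. It assumes plain-text input with one number per line; class names and paths are illustrative.

// Distributed max over a large data set: one partial max per mapper,
// one global max from a single reducer.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DistributedMax {

    public static class MaxMapper
            extends Mapper<LongWritable, Text, NullWritable, LongWritable> {
        private long max = Long.MIN_VALUE;

        @Override
        protected void map(LongWritable key, Text value, Context context) {
            // Track the running maximum for this split only.
            max = Math.max(max, Long.parseLong(value.toString().trim()));
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // Emit a single partial maximum per mapper.
            context.write(NullWritable.get(), new LongWritable(max));
        }
    }

    public static class MaxReducer
            extends Reducer<NullWritable, LongWritable, NullWritable, LongWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<LongWritable> values,
                              Context context) throws IOException, InterruptedException {
            long max = Long.MIN_VALUE;
            for (LongWritable v : values) {
                max = Math.max(max, v.get());
            }
            // The global maximum of all partial maxima.
            context.write(NullWritable.get(), new LongWritable(max));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "distributed max");
        job.setJarByClass(DistributedMax.class);
        job.setMapperClass(MaxMapper.class);
        job.setCombinerClass(MaxReducer.class);   // combine partial maxima locally
        job.setReducerClass(MaxReducer.class);
        job.setNumReduceTasks(1);                 // one reducer computes the global max
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would submit it with something like hadoop jar distributedmax.jar DistributedMax /data/input /data/output; only the tiny partial maxima ever leave the slaves.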

This is what makes Hadoop scale almost linearly.
Comments
CPallini 21-May-18 9:21am
5. Maybe this link could help you:
http://www.youtube.com/watch?v=ziqx2hJY8Hg