Hadoop — Parallelism or Serialism?


According to popular articles, Hadoop uses parallelism to upload split data, thereby addressing the Velocity problem of Big Data. Is that a fact or a myth?

Let’s research and conclude with proper proof (hint: tcpdump).

Let’s start our research:

  1. Create an account on AWS.
  2. Launch four EC2 instances on AWS.
  3. Configure one instance as the Name Node, one as the Client, and the remaining two as Data Nodes.
  4. Install the JDK and the Hadoop package on all instances.
  5. Configure the “hdfs-site.xml” and “core-site.xml” files on both Data Nodes and on the Name Node. (Reminder: there is no need to configure “hdfs-site.xml” on the Hadoop client; configure only the “core-site.xml” file.)
  6. Format the Name Node.
  7. Start the Hadoop daemon services on both Data Nodes and the Name Node, and verify them with the “jps” command.
  8. Check the Data Nodes available to the Hadoop cluster with the command:

hadoop dfsadmin -report
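For reference, here is a minimal configuration sketch for the files named in step 5. The Name Node address, port, and directory paths below are assumptions for illustration; substitute the values for your own instances:

```xml
<!-- core-site.xml (on Name Node, Data Nodes, and Client) -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- hypothetical Name Node address; use your instance's IP and chosen port -->
    <value>hdfs://namenode-ip:9001</value>
  </property>
</configuration>

<!-- hdfs-site.xml (on the Name Node) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>

<!-- hdfs-site.xml (on each Data Node) -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>
```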

  9. The Hadoop client uploads a file to the Hadoop cluster with the command:

hadoop fs -put <file_name> /

10. Check the file in the Hadoop cluster with the command:

hadoop fs -ls /

11. While the file is uploading, run tcpdump on the NAME NODE and on both DATA NODES:

First, install the tcpdump package:

yum install tcpdump

Run tcpdump to inspect the packets transferred between the client, the master, and the slaves. Note that filtering on port 22 would capture only your own SSH session; to see the Hadoop RPC traffic, filter on the Name Node port you set in “core-site.xml” (commonly 9000 or 9001):

tcpdump -i eth0 -n -x

tcpdump -i eth0 -n tcp port 9001

  • The capture shows which node is requesting and which is replying. On the Name Node you will see the Client requesting the IP addresses of the Data Nodes from the Master (Name Node), because the Client is the one that uploads the data directly to the Data Nodes; the Master replies with packets containing those IP addresses.
  • To trace the DATA PACKETS (the actual data flow), capture on port 50010 (the default Data Node data-transfer port):

tcpdump -i eth0 port 50010 -n -x

  • Running this command on both DATA NODES and on the NAME NODE shows data packets arriving at the Data Nodes in an alternating pattern: first some packets arrive at DataNode1 and then stop; next some packets arrive at DataNode2 and stop; then DN1 again, then DN2, and so on.
  • This process continues until the whole file has been uploaded to the Hadoop cluster. You can also compare the packet timestamps on both slaves during the upload; they clearly differ, which shows the data is not being transferred in parallel.
  • Thus, the data flows in packets from the CLIENT to the DATA NODES in serial order.
  • So we can say that Hadoop uses “serialism”, not parallelism, to upload the split data while addressing the Velocity problem.
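The alternating pattern observed above can be sketched with a toy simulation. This is not Hadoop code — just an illustration of serial, one-destination-at-a-time transfer of split data, with all names, block counts, and timings made up:

```python
from datetime import datetime, timedelta

def upload_serially(blocks, data_nodes, start,
                    per_packet=timedelta(milliseconds=5)):
    """Send each block's packets to one Data Node at a time,
    mimicking the alternating tcpdump pattern described above."""
    log = []                 # (timestamp, data_node, packet_no)
    clock = start
    for i, packets in enumerate(blocks):
        node = data_nodes[i % len(data_nodes)]   # DN1, DN2, DN1, DN2, ...
        for p in range(packets):
            clock += per_packet                  # time advances packet by packet
            log.append((clock, node, p))
    return log

# a file split into 4 blocks of 3 packets each, sent to two Data Nodes
log = upload_serially([3, 3, 3, 3], ["DN1", "DN2"],
                      start=datetime(2021, 1, 1))

# at any instant only one Data Node is receiving, and the timestamps
# seen on DN1 and DN2 never overlap -- exactly the tcpdump observation
for ts, node, p in log:
    print(ts.time(), node, f"packet {p}")
```

Because every timestamp is strictly later than the previous one and the receiving node changes only between blocks, the printed trace reproduces the “DN1 … stop … DN2 … stop …” rhythm seen in the real capture.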


