Hadoop — Parallelism or Serialism?
According to popular articles, Hadoop uses the concept of parallelism to upload the split data while solving the Velocity problem. Is that fact or myth?
Let’s research and conclude with proper proof (hint: tcpdump).
- tcpdump is a powerful and widely used command-line packet analyzer that captures and filters TCP/IP packets received or transmitted over a network on a specific interface. It also gives us the option to save captured packets to a file for later analysis.
Let’s start our research with the following steps:
- Create an account on AWS.
- Launch four EC2 instances on AWS.
- Configure one instance as the Name Node, one as the Client, and the remaining two as Data Nodes.
- Install the JDK and the Hadoop package on all instances.
- Configure the “hdfs-site.xml” and “core-site.xml” files on both Data Nodes and the Name Node. (Reminder: there is no need to configure “hdfs-site.xml” on the Hadoop client; only configure “core-site.xml” there.)
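For reference, a minimal Hadoop 1.x-style configuration could look like the snippets below. The storage directories (/nn, /dn), the placeholder NAME_NODE_IP, and the port 9001 are assumptions for illustration; substitute the values from your own cluster.

```xml
<!-- hdfs-site.xml on the Name Node (assumed metadata directory: /nn) -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/nn</value>
  </property>
</configuration>

<!-- hdfs-site.xml on each Data Node (assumed storage directory: /dn) -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/dn</value>
  </property>
</configuration>

<!-- core-site.xml on every node, including the client -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://NAME_NODE_IP:9001</value>
  </property>
</configuration>
```

Note that the client only needs core-site.xml, because it only has to know where the Name Node is; it stores no HDFS data of its own.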
- Format the Name Node.
- Start the Hadoop daemon services on both Data Nodes and the Name Node, and verify them with the “jps” command.
- Check the Data Nodes available to the Hadoop cluster by using the command:
hadoop dfsadmin -report
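If you want to check the node count programmatically, you can parse the report text. The sketch below assumes the report contains a Hadoop 1.x-style summary line such as `Datanodes available: 2 (2 total, 0 dead)`; adjust the pattern if your version prints it differently.

```python
import re

def live_datanodes(report: str) -> int:
    """Extract the number of available Data Nodes from `hadoop dfsadmin -report` output.

    Assumes a summary line of the form:
        Datanodes available: 2 (2 total, 0 dead)
    """
    match = re.search(r"Datanodes available:\s*(\d+)", report)
    if match is None:
        raise ValueError("no 'Datanodes available' line found in report")
    return int(match.group(1))

# Illustrative (assumed) report fragment:
sample_report = """Configured Capacity: 21464350720 (19.99 GB)
Datanodes available: 2 (2 total, 0 dead)
"""
print(live_datanodes(sample_report))  # 2
```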
- Upload a file from the Hadoop client to the Hadoop cluster by using the command:
hadoop fs -put <file_name> /
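Before the upload starts, HDFS splits the file into fixed-size blocks (64 MB by default in Hadoop 1.x). As a quick sketch, the number of blocks for a given file size is just a ceiling division:

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size in Hadoop 1.x (64 MB)

def num_blocks(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    return max(1, math.ceil(file_size_bytes / block_size))

# A 200 MB file splits into 4 blocks: 64 + 64 + 64 + 8 MB.
print(num_blocks(200 * 1024 * 1024))  # 4
```

These blocks are exactly the units of data we will watch travelling to the Data Nodes with tcpdump below.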
- Check the file in the Hadoop cluster by using the command:
hadoop fs -ls /
- While uploading the file, run the tcpdump command on the Name Node and on both Data Nodes:
First, install the tcpdump package:
yum install tcpdump
Run tcpdump to check the packets transferred between the client, the master, and the slaves:
tcpdump -i eth0 -n -x
tcpdump -i eth0 -n tcp port 22
- The output shows which node is requesting and which one is replying. Running the command on the Name Node, you will see the client first asking the master (Name Node) for the IP addresses of the Data Nodes, since the client is the one that uploads the data directly to the Data Nodes; the master replies with packets containing those IP addresses.
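The exchange visible in tcpdump can be sketched as a toy simulation. All class and method names here are hypothetical, purely to illustrate the protocol: the client asks the Name Node for Data Node addresses, then writes blocks directly to those Data Nodes while the Name Node never touches the data itself.

```python
class NameNode:
    """Toy Name Node: knows the Data Node addresses, never carries file data."""
    def __init__(self, datanode_addresses):
        self.datanode_addresses = datanode_addresses

    def get_datanodes(self):
        # The reply packets seen on the Name Node: just metadata (IP addresses).
        return list(self.datanode_addresses)

class Client:
    """Toy client: looks up Data Nodes once, then sends blocks directly to them."""
    def __init__(self, namenode):
        self.namenode = namenode

    def upload(self, blocks):
        datanodes = self.namenode.get_datanodes()   # metadata request to the master
        transfers = []
        for i, block in enumerate(blocks):
            target = datanodes[i % len(datanodes)]  # alternate between Data Nodes
            transfers.append((target, block))       # data traffic seen on port 50010
        return transfers

nn = NameNode(["10.0.0.11", "10.0.0.12"])  # assumed private IPs of the two Data Nodes
print(Client(nn).upload(["b0", "b1", "b2"]))
# [('10.0.0.11', 'b0'), ('10.0.0.12', 'b1'), ('10.0.0.11', 'b2')]
```

This is why the heavy traffic in the captures below appears between the client and the Data Nodes, not between the client and the master.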
- To trace the DATA PACKETS (the actual data flow), use port 50010 and run:
tcpdump -i eth0 -n -x port 50010
- Running this command on both Data Nodes and on the Name Node, you will see data packets arriving at the Data Nodes in a distinctive pattern: first some packets arrive at DataNode1 and then stop; next, some packets arrive at DataNode2 and then stop; then packets arrive at DN1 again, stop, arrive at DN2, and so on.
- This process continues until the whole file has been uploaded to the Hadoop cluster. You can also compare the timestamps recorded on the two slaves during the upload; they clearly differ, which shows that the data is not being transferred in parallel.
- Thus, the data flows in packets from the CLIENT to the DATA NODES in serial order.
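A minimal sketch of this observed serial behaviour: each transfer starts only when the previous one ends, so the intervals on the two Data Nodes never overlap. The transfer times and node names below are made up purely for illustration.

```python
def serial_upload_timeline(num_blocks, transfer_time=1.0, datanodes=("DN1", "DN2")):
    """Simulate a strictly serial upload: block i starts only after block i-1 finishes."""
    timeline, clock = [], 0.0
    for i in range(num_blocks):
        target = datanodes[i % len(datanodes)]        # alternate DN1, DN2, DN1, ...
        timeline.append((target, clock, clock + transfer_time))  # (node, start, end)
        clock += transfer_time                        # next block waits for this one
    return timeline

timeline = serial_upload_timeline(4)
for node, start, end in timeline:
    print(f"{node}: {start:.1f}s -> {end:.1f}s")
# Each transfer starts exactly when the previous one ends; none overlap in time.
```

Under a truly parallel upload, the intervals for DN1 and DN2 would overlap instead, which is exactly what the tcpdump timestamps rule out.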
- So we can say that Hadoop uses the concept of “serialism” to upload the split data while fulfilling the Velocity problem.