Complete Apache Hadoop Troubleshooting


Since it is pretty hard to find answers when facing Apache Hadoop problems, I wrote this article as a complete Apache Hadoop troubleshooting guide.


Apache Hadoop Troubleshooting

Note: at the time of this writing, Apache Hadoop 3.2.1 is the latest version, so I will use it as the reference version for troubleshooting; some solutions might not work with prior versions.

Question 1: How to install Apache Hadoop version 3 and above on Mac?

Try following my complete guide to install and configure Apache Hadoop on Mac.

Question 2: How to debug Apache Hadoop on development machine?

Open mapred-site.xml, update the value of mapreduce.framework.name to local.

It should become like below.

<property>
    <name>mapreduce.framework.name</name>
    <value>local</value>
</property>
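With mapreduce.framework.name set to local, MapReduce jobs run in a single local JVM instead of on YARN, so you can attach a debugger or step through your mapper and reducer code. As a quick sanity check, you can run the bundled word-count example from the Hadoop installation directory (the JAR path below matches the 3.2.1 distribution layout; input and output are placeholder paths for illustration):

$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount input output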

Question 3: Is it possible to debug Apache Hadoop using yarn map-reduce framework?

The answer is no, or at least it is not possible for now. Someone might figure it out in the future.

Question 4: How to monitor clusters and jobs?

On Apache Hadoop v3+, the ResourceManager web UI is at localhost:8088, assuming localhost is the machine running the ResourceManager (on a single-node setup, that is the same machine as the NameNode).
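If you prefer the command line over the web UI, the ResourceManager exposes the same information through its REST API; a quick sanity check (assuming the default port 8088):

$ curl http://localhost:8088/ws/v1/cluster/metrics

It returns a JSON summary of the cluster: number of nodes, running applications, available memory, and so on.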

Question 5: After setting up Apache Hadoop, any hadoop command issued shows the following error: ERROR: Invalid HADOOP_COMMON_HOME. What happened?

If I’m not mistaken, it looks like you followed a wrong or out-of-date Hadoop installation tutorial. Commonly, prior to Hadoop v3, installation tutorials required exporting the HADOOP_HOME environment variable. However, this is a no-no for Hadoop v3.

The solution is to drop that HADOOP_HOME, which means unsetting it:

$ unset HADOOP_HOME

Once it is unset, you’re done.
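If the variable keeps coming back in new terminal sessions, it is probably exported in one of your shell startup files. Something like the following can help you find the offending line (the file names are just the usual suspects; check whichever ones you actually use), so you can remove or comment it out:

$ grep -n "HADOOP_HOME" ~/.bash_profile ~/.bashrc ~/.zshrc 2>/dev/null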

If you’re on a Mac and the problem still happens, try to follow my guide on setting up Apache Hadoop on Mac.

Question 6: A warning about being unable to load the native-hadoop library always displays. Is it a problem?

The warning is like this:

2019-11-17 14:33:48,664 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

To answer in short: no, it’s nothing critical and it is not an error.

By default, the Hadoop binary checks whether a native library is available (like the Linux shared library libhadoop.so); otherwise, it falls back to the Java implementation shipped in the Hadoop distribution package. That implementation lives somewhere inside HADOOP_INSTALLATION_DIR/libexec/share/hadoop, as plain JAR packages.

It is just a warning. If you find it annoying, you can turn it off by appending the following line to libexec/etc/hadoop/log4j.properties.

log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR

It won’t bother you anymore.
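If you are curious which native libraries Hadoop can actually find on your platform, there is also a built-in check:

$ hadoop checknative -a

It prints whether libhadoop and the optional compression libraries (zlib, snappy, and friends) were located.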

Question 7: How to know if all Hadoop servers like NameNode, DataNode… are running?

Use jps. It will show all running server processes and their names. For example:

38529 Jps
34116 DataNode
34473 ResourceManager
34011 NameNode
34252 SecondaryNameNode
34575 NodeManager

So in pseudo-distributed (single-node) mode, you should see at least the five Hadoop daemons shown in the output above.

Question 8: What are the differences between fs.default.name and fs.defaultFS?

Well, nothing much; it is just a rename. In other words, fs.default.name is deprecated (you still see it in old Hadoop configurations), and fs.defaultFS is the correct one in recent versions, including Hadoop v3.
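For example, a minimal core-site.xml entry using the current key looks like this (hdfs://localhost:9000 is just a common single-node value; use your own NameNode address and port):

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>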

Question 9: How to know if some configuration properties are deprecated, and what their updated equivalents are?

Just visit the Apache Hadoop Deprecated Properties page.

Question 10: How to configure custom directories for DataNode and NameNode storage instead of the defaults?

Just update the following two properties in hdfs-site.xml: dfs.datanode.data.dir and dfs.namenode.name.dir.
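As a sketch, assuming you want to keep everything under /opt/hadoop/dfs (a made-up path; pick your own), hdfs-site.xml would contain:

<property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/opt/hadoop/dfs/data</value>
</property>

On a fresh setup, format the NameNode afterwards (hdfs namenode -format); on an existing cluster, move the contents of the old directories over instead, or the NameNode will not find its metadata.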

Question 11: What is hadoop.tmp.dir in core-site.xml for?

Hmmm, it is where Hadoop stores temporary files locally. If you don’t configure directories for the DataNode and NameNode, they will be created under hadoop.tmp.dir.

Basically, anything that you don’t configure manually (cache, DataNode directory, NameNode directory, and so on) uses the Hadoop default configuration and ends up under hadoop.tmp.dir.
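For example, to move it off the default (/tmp/hadoop-${user.name} on most systems), set it in core-site.xml; the path below is just an illustration:

<property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/tmp</value>
</property>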

Question 12: I don’t see anything in my FS checkpoint directory. What’s wrong?

Nothing is wrong; you are probably using the wrong configuration key. It looks like you’re using fs.checkpoint.dir, which is deprecated. The updated one is dfs.namenode.checkpoint.dir.
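So in hdfs-site.xml it should look something like this (the path is only an example):

<property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/opt/hadoop/dfs/namesecondary</value>
</property>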

Question 13: When I execute any command, it throws an exception with something like Name node is in safe mode. What’s wrong?

There are two possible cases that I know of.

The first one is that the NameNode is still bootstrapping and not ready yet, and you ran the command too soon. So give it a moment, wait for a while, like 30 seconds, then try again.

The second case is that something happened that made the NameNode go into safemode. You can turn it off by issuing the following command:

$ sudo hdfs dfsadmin -safemode leave

Question 14: How to know if the NameNode is in safemode?

Issue this command:

$ sudo hdfs dfsadmin -safemode get

Question 15: How to enter safemode?

Nothing fancy,

$ sudo hdfs dfsadmin -safemode enter

Question 16: When executing JAR packages, I get this error: Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster. What’s the problem?

The detail of the error is as follows.

Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>

To fix it, follow the error’s suggestion: edit mapred-site.xml, add those properties, and point HADOOP_MAPRED_HOME to your Hadoop installation directory.

On Linux, it is probably /opt/hadoop.

On Mac, it should be under /usr/local/Cellar/hadoop/3.2.1/libexec. If you installed Hadoop in a different directory, make sure to point it to the libexec directory. Otherwise, the error will show up again.
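For instance, on a Linux install under /opt/hadoop (adjust this to wherever your distribution actually lives), the three properties would look like:

<property>
    <name>yarn.app.mapreduce.am.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
    <name>mapreduce.map.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>
<property>
    <name>mapreduce.reduce.env</name>
    <value>HADOOP_MAPRED_HOME=/opt/hadoop</value>
</property>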

Question 17: How to know if all cluster nodes are registered on YARN?

Try this command:

$ yarn node -list

If no node IPs show up, then you have a problem with your configuration.

Question 18: It says localhost: Permission denied (publickey,password) when I try to run start-dfs.sh. What’s wrong?

It looks like you haven’t configured passwordless SSH to localhost yet. Try to set it up.

The following guide is for Mac; Linux can be done similarly.

First, check your home directory to see whether there are id_rsa and id_rsa.pub files inside ~/.ssh.

If not, then do the following; otherwise, skip to the last step with authorized_keys.

Generate a key pair:

$ ssh-keygen -t rsa

Make sure to use an empty passphrase, i.e. don’t provide a password for the key. However, you should not do this in a production environment.

Add the public key to the allowed list:

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
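Also make sure the permissions on these files are strict enough, otherwise sshd will refuse to use the key:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys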

On Mac, go to System Preferences > Sharing. Turn ON Remote Login, and choose All Users to make sure it works. Or, if you’re in Terminal already, use this command to avoid the UI navigation:

$ sudo systemsetup -setremotelogin on

Verify it by opening a new Terminal window and typing:

$ ssh localhost

If the Permission denied error doesn’t show, then you can start Hadoop now with start-dfs.sh.

Question 19: I submit a job to YARN, it gets stuck in the ACCEPTED / RUNNING state, and there is no unhealthy node. What’s wrong?

It might be a shuffle misconfiguration. Add this to yarn-site.xml:

<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>

If the above property doesn’t help, then check whether you have the following two properties in yarn-site.xml:

  • yarn.nodemanager.resource.memory-mb
  • yarn.scheduler.minimum-allocation-mb

If you have them, remove them; values that are too small there can prevent YARN from ever allocating containers for the job.

Question 20: I have unhealthy nodes, what should I do?

Most likely, the NodeManager’s local disk is nearly full, so the disk health checker marks the node unhealthy and there is no resource for the AM to allocate for the job.

You might want to increase the value of the following property in yarn-site.xml.

<property>
    <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
    <value>99.0</value>
</property>

By default, it is 90%. This threshold is the maximum percentage of disk usage allowed before a node is marked unhealthy; the maximum value is 100%.

I prefer using 99%.
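To see which nodes are currently marked unhealthy, you can list all nodes regardless of state (the health report with the exact reason is also shown in the ResourceManager web UI):

$ yarn node -list -all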

Additionally, make sure to provide the env-whitelist property.

<property>
    <name>yarn.nodemanager.env-whitelist</name>
    <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>

Just copy it exactly as above into your yarn-site.xml file.

Question 21: How to kill a job?

You might want to do this if you have some stuck jobs that hang forever.

Use the following commands to list and kill:

$ mapred job -list

# the list of jobs, with their Job IDs, will be displayed here

$ mapred job -kill JOB_ID
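Alternatively, since every MapReduce job runs as a YARN application, you can do the same thing with the yarn CLI (APPLICATION_ID is whatever ID shows up in the list, e.g. something starting with application_):

$ yarn application -list

$ yarn application -kill APPLICATION_ID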

Conclusion

I hope this complete Apache Hadoop troubleshooting guide helps fix your problems. I will try to update it as I run into more Hadoop troubles, or others I learn of.

If your problem isn’t on this Apache Hadoop troubleshooting list, tell me, and I will try to figure it out for you.