Since it is pretty hard to find answers when facing Apache Hadoop problems, I wrote this article as a complete Apache Hadoop troubleshooting guide.
Apache Hadoop Troubleshooting
Note: at the time of this writing, Apache Hadoop 3.2.1 is the latest version. I will use it as the standard version for troubleshooting; therefore, some solutions might not work with prior versions.
Question 1: How to install Apache Hadoop version 3 and above on Mac?
Follow my complete guide to install and configure Apache Hadoop on Mac.
Question 2: How to debug Apache Hadoop on a development machine?
Open mapred-site.xml and update the value of mapreduce.framework.name to local. It should look like this:
<property>
<name>mapreduce.framework.name</name>
<value>local</value>
</property>
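With the framework set to local, the job runs through the LocalJobRunner in the client JVM, so you can attach a debugger or read the logs directly. A quick way to verify, using the examples JAR bundled with the distribution; HADOOP_INSTALLATION_DIR, input and output are placeholders you need to adjust:
$ hadoop jar HADOOP_INSTALLATION_DIR/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.2.1.jar wordcount input output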
Question 3: Is it possible to debug Apache Hadoop using the yarn map-reduce framework?
The answer is no, or at least it is not possible for now. Someone might figure it out in the future.
Question 4: How to monitor clusters and jobs?
On Apache Hadoop v3+, it is at localhost:8088 (the ResourceManager web UI), assuming localhost is the node running your Hadoop cluster.
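If you prefer the command line, there are a couple of quick checks as well; and in Hadoop 3+ the NameNode serves its own web UI, by default at localhost:9870:
$ yarn application -list   # applications known to the ResourceManager
$ hdfs dfsadmin -report    # DataNode and capacity summary from the NameNode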
Question 5: After setting up Apache Hadoop, any hadoop command issued shows the following error: ERROR: Invalid HADOOP_COMMON_HOME. What happened?
If I’m not mistaken, it looks like you followed a wrong or out-of-date Hadoop installation tutorial. Prior to Hadoop v3, installation often required exporting the HADOOP_HOME environment variable. However, this is a no-no for Hadoop v3.
The solution is to drop that HADOOP_HOME, which means:
$ export HADOOP_HOME=
Unset it, then it’s done.
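Alternatively, unset works too; and if the export comes from your shell profile (for example ~/.bash_profile or ~/.zshrc, depending on your shell), remove that line there as well so it doesn't come back in new sessions:
$ unset HADOOP_HOME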
If you’re on a Mac and the problem still happens, follow my guide on setting up Apache Hadoop on Mac.
Question 6: A warning about being unable to load the native-hadoop library always displays. Is it a problem?
The warning is like this:
2019-11-17 14:33:48,664 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
In short: no, it's nothing critical and it is not an error.
By default, the Hadoop binary looks for a native library (such as the Linux shared library libhadoop.so); if none is found, it falls back to the Java implementation bundled with the Hadoop distribution package, which should be somewhere inside HADOOP_INSTALLATION_DIR/libexec/share/hadoop. They're all JAR packages.
It is just a warning. If you find it annoying, you can turn it off by appending the following line to libexec/etc/hadoop/log4j.properties:
log4j.logger.org.apache.hadoop.util.NativeCodeLoader=ERROR
It won’t bother you anymore.
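Before silencing it, if you're curious which native libraries Hadoop actually managed to load, there is a built-in check that prints, per native library, whether it could be found:
$ hadoop checknative -a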
Question 7: How to know if all Hadoop servers like NameNode, DataNode… are running?
Use jps. It will show all server processes and their names. For example:
38529 Jps
34116 DataNode
34473 ResourceManager
34011 NameNode
34252 SecondaryNameNode
34575 NodeManager
So for pseudo-distributed, single-node cluster mode, you should see at least the 5 server processes shown in the output above.
Question 8: What are the differences between fs.default.name and fs.defaultFS?
Well, nothing much, it is just a rename. In other words, fs.default.name is deprecated (you see it in Hadoop v2), and fs.defaultFS is the correct one from Hadoop v3+.
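For reference, a minimal core-site.xml entry with the non-deprecated key; hdfs://localhost:9000 is just the usual single-node example value, adjust it to your own setup:
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>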
Question 9: How to know if some configuration properties are deprecated and what their updated equivalents are?
Just visit the Apache Hadoop Deprecated Properties page.
Question 10: How to configure custom directories for DataNode and NameNode storage instead of the default directory?
Just update the following two properties in hdfs-site.xml: dfs.datanode.data.dir and dfs.namenode.name.dir.
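For example, a sketch of hdfs-site.xml; the file:// paths are example locations I picked, not defaults:
<property>
<name>dfs.namenode.name.dir</name>
<value>file:///path/to/hadoop/data/namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:///path/to/hadoop/data/datanode</value>
</property>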
Question 11: What is hadoop.tmp.dir in core-site.xml for?
Hmmm, it is used to store temporary files locally. If you don't configure directories for the DataNode and NameNode, they will be created under hadoop.tmp.dir.
Basically, anything that you don't configure manually (cache, datanode directory, namenode directory…) uses the Hadoop default configuration and will be put under hadoop.tmp.dir.
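If you'd rather keep those temporary files in a deliberate location, here is a sketch for core-site.xml; the path is just an example of your choosing:
<property>
<name>hadoop.tmp.dir</name>
<value>/path/to/hadoop/tmp</value>
</property>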
Question 12: I don’t see anything in my FS checkpoint directory. What’s wrong?
Nothing is wrong; you probably used the wrong configuration key. It looks like you're using fs.checkpoint.dir, which is deprecated. The updated one is dfs.namenode.checkpoint.dir.
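In hdfs-site.xml it would look roughly like this (the path is only an example):
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>file:///path/to/hadoop/data/namesecondary</value>
</property>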
Question 13: When I execute any command, it throws an exception with something like Name node is in safe mode. What's wrong?
There are two possible cases that I know of.
The first one is that the NameNode is still bootstrapping and not ready yet, and you executed the command too fast. So restart it, wait for a while (around 30s), then try again.
The second case is that something did happen and it put the NameNode into safemode. You can turn it off by issuing the following command:
$ sudo hdfs dfsadmin -safemode leave
Question 14: How to know if NameNode is in safemode?
Issue this command:
$ sudo hdfs dfsadmin -safemode get
Question 15: How to enter safemode?
Nothing fancy,
$ sudo hdfs dfsadmin -safemode enter
Question 16: When executing JAR packages, I get this error: Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster. What's the problem?
The detail of the error is as follows:
Container exited with a non-zero exit code 1. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
Error: Could not find or load main class org.apache.hadoop.mapreduce.v2.app.MRAppMaster
Please check whether your etc/hadoop/mapred-site.xml contains the below configuration:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=${full path of your hadoop distribution directory}</value>
</property>
To fix it, follow the error suggestion: edit mapred-site.xml, add those properties, and point HADOOP_MAPRED_HOME to your Hadoop installation directory.
On Linux, it is probably /opt/hadoop.
On Mac, it should be /usr/local/Cellar/hadoop/3.2.1/libexec. If you installed Hadoop in a different directory, make sure to point it to the libexec directory. Otherwise, the error will show up again.
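As a concrete example, on a Homebrew-installed Mac the first property from the error message would be filled in like this; the other two take the same value:
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/usr/local/Cellar/hadoop/3.2.1/libexec</value>
</property>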
Question 17: How to know if all cluster nodes are registered on YARN?
Try this command,
$ yarn node -list
If no nodes show up in the list, then you have a problem with your configuration.
Question 18: It says localhost: Permission denied (publickey,password) when I try to run start-dfs.sh. What's wrong?
It looks like you haven't configured SSH to localhost yet. Try to set it up.
The following guide is for Mac; Linux can be done similarly.
First, check your home directory to see if there are id_rsa and id_rsa.pub files inside the ~/.ssh directory.
If not, then do the following; otherwise, skip to the last step with authorized_keys.
Generate a key pair:
$ ssh-keygen -t rsa
Make sure to use an empty passphrase, i.e., don't provide a password for the key. However, you should not do this in a production environment.
Add public key to allowed list:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
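If SSH still refuses the key, overly open file permissions are a common culprit; tightening them doesn't hurt (paths assume the defaults above):
$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys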
On Mac, go to System Preferences > Sharing. Turn on Remote Login, and choose All Users to make sure it works. Or, if you're in the Terminal already, try this command to avoid the UI navigation:
$ sudo systemsetup -setremotelogin on
Verify it by opening a new Terminal window, and type:
$ ssh localhost
If the Permission denied error doesn't show, then you can start Hadoop now with start-dfs.sh.
Question 19: I submit a job to YARN, it gets stuck in the ACCEPTED / RUNNING state, and there is no unhealthy node. What's wrong?
It might be a shuffle misconfiguration. Add this to yarn-site.xml:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
If the above property doesn't help, then check whether you have the following two properties in yarn-site.xml:
yarn.nodemanager.resource.memory-mb
yarn.scheduler.minimum-allocation-mb
If you have them, remove them.
Question 20: I have unhealthy nodes, what should I do?
It looks like there are not enough resources on the node for the AM to allocate to the job.
You might want to increase the value of the following property in yarn-site.xml:
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>99.0</value>
</property>
By default, it is 90%. You might want to increase this disk utilization percentage, which determines whether a node is healthy or not. The max is 100%; I prefer 99%.
Additionally, make sure to provide the env-whitelist property:
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
Just copy it exactly as above into your yarn-site.xml file.
Question 21: How to kill a job?
You might want to do this if you have some stuck jobs that hang forever.
Use the following commands to list and kill them:
$ mapred job -list
# the list of jobs will be displayed here with their Job IDs
$ mapred job -kill JOB_ID
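If the job was submitted to YARN, you can also kill it by its application ID; APPLICATION_ID below is a placeholder just like JOB_ID above:
$ yarn application -list
# the application list will be displayed here with Application IDs
$ yarn application -kill APPLICATION_ID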
Conclusion
I hope this Apache Hadoop troubleshooting guide helps fix your problems. I will try to add more as I run into other Hadoop troubles or learn of new ones.
If your problem isn't on this Apache Hadoop troubleshooting list, tell me, and I will try to figure it out for you.