Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out
Answer: with the right JVM size in your hadoop-site.xml, you will have to copy this <property> to all the nodes and restart the cluster.

7: Fixing the Hadoop OutOfMemoryError: give the task JVMs more heap, e.g. -Xms1024m -Xmx4096m. As a rule of thumb the maximum JVM heap should be about half of total memory; our machines have 8 GB of RAM, so we set 4096m, which may still not be the optimal value. (A configuration sketch follows after item 13 below.)

8: NameNode is in safe mode
Solution: bin/hadoop dfsadmin -safemode leave

9: Reduce progress exceeds 100%
"Reduce Task Progress shows > 100% when the total size of map outputs (for a single reducer) is high."
Cause: during the reduce-side merge the progress check is slightly off, so the reported status can exceed 100%, and the statistics code then fails with:
java.lang.ArrayIndexOutOfBoundsException: 3
at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.getReduceAvarageProgresses(StatusHttpServer.java:228)
at org.apache.hadoop.mapred.StatusHttpServer$TaskGraphServlet.doGet(StatusHttpServer.java:159)
......

10: java.net.NoRouteToHostException: No route to host
Solution: sudo /etc/init.d/iptables stop

11: After changing the NameNode, SELECT queries run in Hive still point to the old NameNode address
This is because when you create a table, Hive actually stores the location of the table (e.g. hdfs://ip:port/user/root/...) in the SDS and DBS tables in the metastore. So when I bring up a new cluster the master has a new IP, but Hive's metastore is still pointing to the locations within the old cluster. I could modify the metastore to update with the new IP every time I bring up a cluster. But the easier and simpler solution was to just use an elastic IP for the master.
So every occurrence of the old NameNode address in the metastore has to be replaced with the current NameNode address.

12: Your DataNode is started and you can create directories with bin/hadoop dfs -mkdir, but you get an error message when you try to put files into HDFS (e.g., when you run a command like bin/hadoop dfs -put).
Solution: go to the HDFS info web page (open your web browser and go to http://namenode:dfs_info_port, where namenode is the hostname of your NameNode and dfs_info_port is the port you chose for dfs.info.port; if you followed the QuickStart on your personal computer then this URL will be http://localhost:50070). Once at that page, click on the number that tells you how many DataNodes you have to see a list of the DataNodes in your cluster. If it says you have used 100% of your space, then you need to free up room on the local disk(s) of the DataNode(s). If you are on Windows then this number will not be accurate (there is some kind of bug either in Cygwin's df.exe or in Windows). Just free up some more space and you should be okay. On one Windows machine we tried, the disk had 1 GB free but Hadoop reported that it was 100% full; after we freed up another 1 GB it said the disk was 99.15% full and started writing data into HDFS again. We encountered this bug on Windows XP SP2.

13: Your DataNodes won't start, and you see something like this in logs/*datanode*:
Incompatible namespaceIDs in /tmp/hadoop-ross/dfs/data
Cause: your Hadoop namespaceID became corrupted. Unfortunately the easiest thing to do is to reformat HDFS.
Solution: you need to do something like this:
bin/stop-all.sh
rm -Rf /tmp/hadoop-your-username/*
bin/hadoop namenode -format
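For item 7, the per-task JVM heap is normally controlled by the mapred.child.java.opts property. Below is a minimal sketch of setting it per job from the driver, assuming the old JobConf API used elsewhere in these notes; the class name and heap values are illustrative, not taken from the original cluster.

    // Hedged sketch for item 7: raise the task JVM heap via mapred.child.java.opts.
    // The job class and heap sizes are illustrative; pick values that fit your nodes.
    import org.apache.hadoop.mapred.JobConf;

    public class HeapConfigSketch {
      public static JobConf configure(Class<?> jobClass) {
        JobConf conf = new JobConf(jobClass);
        // Equivalent to a <property> entry in hadoop-site.xml, but scoped to this job.
        conf.set("mapred.child.java.opts", "-Xms1024m -Xmx4096m");
        return conf;
      }
    }

The same value can of course be set cluster-wide in hadoop-site.xml instead; the per-job form is just convenient when only one memory-hungry job needs the larger heap.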
14: You can run Hadoop jobs written in Java (like the grep example), but your Hadoop Streaming jobs (such as the Python example that fetches web page titles) won't work.
Cause: you might have given only a relative path to the mapper and reducer programs. The tutorial originally just specified relative paths, but absolute paths are required if you are running in a real cluster.
Solution: use absolute paths, like this example from the tutorial:
bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
  -mapper $HOME/proj/hadoop/multifetch.py \
  -reducer $HOME/proj/hadoop/reducer.py \
  -input urls/* \
  -output titles

15: 09/08/31 18:25:45 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.1.11:50010
> 09/08/31 18:25:45 INFO hdfs.DFSClient: Abandoning block blk_-8575812198227241296_1001
> 09/08/31 18:25:51 INFO hdfs.DFSClient: Exception in createBlockOutputStream java.io.IOException: Bad connect ack with firstBadLink 192.168.1.16:50010
...... to create new block.
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2731)
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1996)
> at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2182)
>
> 09/08/31 18:26:09 WARN hdfs.DFSClient: Error Recovery for block blk_7193173823538206978_1001 bad datanode[2] nodes == null
> 09/08/31 18:26:09 WARN hdfs.DFSClient: Could not get block locations. Source file "/user/umer/8GB_input" - Aborting...
> put: Bad connect ack with firstBadLink 192.168.1.16:50010
Solution:
1) '/etc/init.d/iptables stop' --> stop the firewall
2) set SELINUX=disabled in '/etc/selinux/config' --> disable SELinux
It worked for me after these two changes.

16: Making jline.ConsoleReader.readLine work on Windows
In the main() function of CliDriver.java there is a call to reader.readLine, which reads from standard input, but on Windows this call always returns null. The reader is a jline.ConsoleReader instance, which makes debugging from Eclipse on Windows inconvenient. We can replace it with java.util.Scanner: change the original
while ((line = reader.readLine(curPrompt + "> ")) != null)
to:
Scanner sc = new Scanner(System.in);
while ((line = sc.nextLine()) != null)
Recompile and redeploy, and SQL statements can be read from standard input normally. (Note that Scanner.nextLine() never actually returns null; a more defensive variant is sketched after item 18 below.)

17: IO write failures
0-1246359584298, infoPort=50075, ipcPort=50020): Got exception while serving blk_-5911099437886836280_1292 to /172.16.100.165:
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.16.100.165:50010 remote=/172.16.100.165:50930]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
......
It seems there are many reasons why it can time out; the example given in HADOOP-3831 is a slow reading client.
Solution: try setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml.
My understanding is that this issue should be fixed in Hadoop 0.19.1 so that we should leave the standard timeout. However, until then this can help resolve issues like the one you're seeing.

18: exit status 255 error
Error: java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)
Cause: set mapred.jobtracker.retirejob.interval and mapred.userlog.retain.hours to higher values. By default, their values are 24 hours. These might be the reason for the failure, though I'm not sure.
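On item 16: java.util.Scanner.nextLine() never returns null; at end of input it throws NoSuchElementException, so the null check in the workaround above never ends the loop cleanly. A slightly more defensive sketch of the same idea, guarding with hasNextLine(); the prompt string stands in for curPrompt and is illustrative:

    // Hedged variant of the item 16 workaround: guard with hasNextLine() instead of
    // testing for null, since Scanner.nextLine() throws at end of input.
    import java.util.Scanner;

    public class StdinReaderSketch {
      public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        System.out.print("hive> ");              // illustrative prompt
        while (sc.hasNextLine()) {
          String line = sc.nextLine();
          // hand the line to the CLI driver here
          System.out.println("read: " + line);
          System.out.print("hive> ");
        }
      }
    }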
Restarting a single DataNode / my actual procedure for adding a node:
1. Set up the environment on the new slave first, including ssh, the JDK, and copies of the relevant config, lib and bin directories;
2. Add the new DataNode's hostname to the NameNode and the other DataNodes in the cluster;
3. Add the new DataNode's IP to conf/slaves on the master;
4. Restart the cluster; the new DataNode should now show up in the cluster;
5. Run bin/start-balancer.sh; this can take a long time.
Notes:
1. If you do not rebalance, the cluster places all new data on the new node, which lowers MapReduce efficiency;
2. bin/start-balancer.sh can also be run with the -threshold parameter, e.g. -threshold 5. The threshold is the balancing threshold, 10% by default; a lower value makes the nodes more evenly balanced but takes longer;
3. The balancer can also run while MapReduce jobs are active on the cluster, but the default dfs.balance.bandwidthPerSec is very low, 1 MB/s. When no MapReduce jobs are running you can raise this setting to speed up rebalancing.
Other notes:
1. Make sure the firewall on the slave is turned off;
2. Make sure the new slave's IP has been added to /etc/hosts on the master and the other slaves, and conversely add the master's and other slaves' IPs to /etc/hosts on the new slave.

Number of mappers and reducers
URL: http://wiki.apache.org/hadoop/HowManyMapsAndReduces
The number of mappers depends on the input files and on the file splits: the upper bound of a split is dfs.block.size, the lower bound can be set via mapred.min.split.size, and the final decision is made by the InputFormat.
A good rule of thumb: the right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
  <description>The maximum number of reduce tasks that will be run simultaneously by a task tracker.</description>
</property>

Adding a disk to a single node
1. Edit dfs.data.dir on the node that gets the new disk, separating the old and new data directories with a comma;
2. Restart DFS.

Syncing the Hadoop code
In hadoop-env.sh:
# host:path where hadoop code should be rsync'd from. Unset by default.
# export HADOOP_MASTER=master:/home/$USER/src/hadoop

Merging small HDFS files with a single command
hadoop fs -getmerge <src> <dest>

Restarting reduce jobs
Introduced recovery of jobs when JobTracker restarts. This facility is off by default. Introduced config parameters "mapred.jobtracker.restart.recover", "mapred.jobtracker.job.history.block.size", and "mapred.jobtracker.job.history.buffer.size".
Not verified yet.

Decommissioning HDFS nodes
The help text of the current dfsadmin version does not explain this clearly (a bug has already been filed); the correct procedure is:
1. Point dfs.hosts at the current slaves file, using the full path; note that the hostnames in the list must be the full names, i.e. what uname -n returns.
2. Put the full names of the nodes to be decommissioned in another file, e.g. slaves.ex, and point the dfs.hosts.exclude parameter at the full path of that file.
3. Run bin/hadoop dfsadmin -refreshNodes.
4. On the web UI, or via bin/hadoop dfsadmin -report, you can see the decommissioning nodes in the state "Decommission in progress" until all the blocks that need re-replication have been copied.
5. When it finishes, remove the decommissioned nodes from the slaves file (i.e. the file dfs.hosts points to).
Three other uses of the -refreshNodes command:
2. Adding allowed nodes to the list (add the hostname to dfs.hosts);
3. Removing a node directly, without re-replicating its data (remove the hostname from dfs.hosts);
4. The reverse of decommissioning: for a node listed both in the exclude file and in dfs.hosts that is currently decommissioning, stop the decommission, i.e. turn a "Decommission in progress" node back to "Normal" (shown as "in service" in the web UI).

Using the distributed cache
The distributed cache behaves like a global variable, but because the data is too large to put in the config file, the distributed cache is used instead. Usage (see "The Definitive Guide", p. 240):
1. On the command line, pass -files to ship the files to be looked up (local files or HDFS files via hdfs://...), or -archives for JAR, ZIP, tar, etc.:
% hadoop jar job.jar MaxTemperatureByStationNameUsingDistributedCacheFile \
  -files input/ncdc/metadata/stations-fixed-width.txt input/ncdc/all output
2. In the program:
public void configure(JobConf conf) {
  metadata = new NcdcStationMetadata();
  try {
    metadata.initialize(new File("stations-fixed-width.txt"));
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}
There is also an indirect way to use it (which does not seem to be available in hadoop-0.19.0): call addCacheFile() or addCacheArchive() to add files, and use getLocalCacheFiles() or getLocalCacheArchives() to retrieve them (a sketch is given after the monitoring note below).

Hadoop job web UIs
There are web-based interfaces to both the JobTracker (MapReduce master) and NameNode (HDFS master) which display status pages about the state of the entire system. By default, these are located at http://job.dr:50030/ and http://name.dr:50070/.

Hadoop monitoring
Use Nagios for alerting and Ganglia for monitoring graphs.
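As referenced in the distributed cache notes above, here is a minimal sketch of the addCacheFile()/getLocalCacheFiles() route, assuming the old mapred API; the NameNode URI, port and file name are illustrative, not taken from the book example:

    // Hedged sketch of the "indirect" distributed cache usage mentioned above
    // (old mapred API). The HDFS URI and file name are illustrative.
    import java.io.IOException;
    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;

    public class CacheFileSketch {
      // In the driver, before submitting the job: register the file to distribute.
      public static void addMetadataFile(JobConf conf) {
        DistributedCache.addCacheFile(
            URI.create("hdfs://namenode:9000/meta/stations-fixed-width.txt"), conf);
      }

      // In the mapper's or reducer's configure(): locate the tasktracker-local copy.
      public static Path findLocalCopy(JobConf conf) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        if (cached == null) {
          return null;                     // nothing was cached for this job
        }
        for (Path p : cached) {
          if (p.getName().equals("stations-fixed-width.txt")) {
            return p;                      // read this local path instead of HDFS
          }
        }
        return null;
      }
    }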
Split size
FileInputFormat input splits (see "The Definitive Guide", p. 190):
mapred.min.split.size: default = 1, the smallest valid size in bytes for a file split.
mapred.max.split.size: default = Long.MAX_VALUE, the largest valid size.
dfs.block.size: default = 64M, set to 128M on our system.
If you set minimum split size > block size, the split size grows beyond a block (and the number of splits drops) at the cost of locality, because a split then spans more than one block and part of its data may have to be fetched from other nodes.
If you set maximum split size < block size, blocks are split up further.
split size = max(minimumSize, min(maximumSize, blockSize));
where minimumSize < blockSize < maximumSize.

Sort by value
Hadoop does not provide a direct sort-by-value mechanism, because it would hurt MapReduce performance.
It can be implemented with a composite-key approach; see "The Definitive Guide", p. 250, for the full implementation. The basic idea (a minimal sketch follows after the counters notes below):
1. Combine key and value into a new composite key;
2. Override the partitioner so it partitions on the old key only: conf.setPartitionerClass(FirstPartitioner.class);
3. Define a key comparator that sorts by the old key first, then by the old value: conf.setOutputKeyComparatorClass(KeyComparator.class);
4. Override the grouping comparator so it also groups on the old key only: conf.setOutputValueGroupingComparator(GroupComparator.class);

Handling small input files
A large number of small files as input lowers Hadoop's efficiency. There are three ways to handle them:
1. Merge the small files into a SequenceFile to speed up MapReduce; see WholeFileInputFormat and SmallFilesToSequenceFileConverter, "The Definitive Guide", p. 194;
2. Use CombineFileInputFormat, which extends FileInputFormat (not tried yet);
3. Use Hadoop archives (similar to packing the files), which reduces the NameNode memory consumed by the small files' metadata. (This does not always work well, so it is not recommended.)
How to: archive the directory /my/files and its subdirectories into files.har and put it under /my:
bin/hadoop archive -archiveName files.har /my/files /my
List the files in the archive:
bin/hadoop fs -lsr har://my/files.har

Skipping bad records
JobConf conf = new JobConf(ProductMR.class);
conf.setJobName("ProductMR");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Product.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
conf.setMapOutputCompressorClass(DefaultCodec.class);
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
String objpath = "abc1";
SequenceFileInputFormat.addInputPath(conf, new Path(objpath));
SkipBadRecords.setMapperMaxSkipRecords(conf, Long.MAX_VALUE);
SkipBadRecords.setAttemptsToStartSkipping(conf, 0);
SkipBadRecords.setSkipOutputPath(conf, new Path("data/product/skip/"));
String output = "abc";
SequenceFileOutputFormat.setOutputPath(conf, new Path(output));
JobClient.runJob(conf);
For skipping failed tasks, try mapred.max.map.failures.percent.

Counters
Three kinds of counters:
1. Built-in counters: Map input bytes, Map output records, ...
2. Enum counters, used like this:
enum Temperature { MISSING, MALFORMED }
reporter.incrCounter(Temperature.MISSING, 1);
Output:
09/04/20 06:33:36 INFO mapred.JobClient: Air Temperature Recor
09/04/20 06:33:36 INFO mapred.JobClient: Malformed=3
09/04/20 06:33:36 INFO mapred.JobClient: Missing=66136856
3. Dynamic counters, used like this:
reporter.incrCounter("TemperatureQuality", parser.getQuality(), 1);
Output:
09/04/20 06:33:36 INFO mapred.JobClient: TemperatureQuality
09/04/20 06:33:36 INFO mapred.JobClient: 2=1246032
09/04/20 06:33:36 INFO mapred.JobClient: 1=973422173
09/04/20 06:33:36 INFO mapred.JobClient: 0=1
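The sort-by-value recipe referenced above, as a minimal sketch. It assumes the mapper already emits a composite Text key of the form "realKey<TAB>realValue"; the class and helper names are illustrative and the comparisons are plain string comparisons, so adapt them to the real key and value types:

    // Minimal sketch of the sort-by-value outline above (old mapred API).
    // Composite key is assumed to be a Text of the form "realKey\trealValue".
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class SortByValueSketch {
      private static String first(Text t)  { return t.toString().split("\t", 2)[0]; }
      private static String second(Text t) { String[] p = t.toString().split("\t", 2); return p.length > 1 ? p[1] : ""; }

      // 2. Partition on the original key only, so all values of one key meet in one reducer.
      public static class FirstPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) {}
        public int getPartition(Text key, Text value, int numPartitions) {
          return (first(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
      }

      // 3. Sort by original key, then by original value.
      public static class KeyComparator extends WritableComparator {
        protected KeyComparator() { super(Text.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
          Text t1 = (Text) a, t2 = (Text) b;
          int cmp = first(t1).compareTo(first(t2));
          return cmp != 0 ? cmp : second(t1).compareTo(second(t2));
        }
      }

      // 4. Group reducer input by the original key only.
      public static class GroupComparator extends WritableComparator {
        protected GroupComparator() { super(Text.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
          return first((Text) a).compareTo(first((Text) b));
        }
      }

      public static void configure(JobConf conf) {
        conf.setPartitionerClass(FirstPartitioner.class);
        conf.setOutputKeyComparatorClass(KeyComparator.class);
        conf.setOutputValueGroupingComparator(GroupComparator.class);
      }
    }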
Possible causes of the "does not have a scheme" error when debugging Hive from Eclipse on Windows
1. The "hive.metastore.local" item in the Hive configuration file is set to false; it must be set to true, because this is a single-machine setup.
2. The HIVE_HOME environment variable is not set, or is set incorrectly.
3. "does not have a scheme" very likely means hive-default.xml cannot be found. For a fix for the missing hive-default.xml when debugging Hive with Eclipse, see: http://bbs./thread-292-1-1.html

1. Chinese characters
Chinese text parsed out of URLs still prints as garbage in Hadoop? We used to think Hadoop simply did not support Chinese; after reading the source code it turns out Hadoop merely cannot output Chinese in GBK. The code below is from TextOutputFormat.class. Hadoop's default outputs all inherit from FileOutputFormat, which has two subclasses: one for binary-stream output and one for text output, TextOutputFormat.
public class TextOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V> implements RecordWriter<K, V> {
    private static final String utf8 = "UTF-8"; // hard-coded to UTF-8
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    ...
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + utf8 + " encoding");
      }
    }
    ...
    private void writeObject(Object o) throws IOException {
      if (o instanceof Text) {
        Text to = (Text) o;
        out.write(to.getBytes(), 0, to.getLength()); // this also needs to be changed
      } else {
        out.write(o.toString().getBytes(utf8));
      }
    }
    ...
  }
}
As you can see, Hadoop's default output is hard-wired to UTF-8, so as long as the Chinese text was decoded correctly, setting the Linux client's character set to UTF-8 will display the Chinese properly, because Hadoop wrote it out as UTF-8.
Most databases, however, define their fields in GBK. If you want Hadoop to output Chinese in GBK to stay compatible with the database, you can define a new class:
public class GbkOutputFormat<K, V> extends FileOutputFormat<K, V> {
  protected static class LineRecordWriter<K, V> implements RecordWriter<K, V> {
    // just change the encoding to gbk
    private static final String gbk = "gbk";
    private static final byte[] newline;
    static {
      try {
        newline = "\n".getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    ...
    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
      this.out = out;
      try {
        this.keyValueSeparator = keyValueSeparator.getBytes(gbk);
      } catch (UnsupportedEncodingException uee) {
        throw new IllegalArgumentException("can't find " + gbk + " encoding");
      }
    }
    ...
    private void writeObject(Object o) throws IOException {
      // both Text and non-Text values are converted to a String and re-encoded as GBK
      out.write(o.toString().getBytes(gbk));
    }
    ...
  }
}
Then add conf1.setOutputFormat(GbkOutputFormat.class) to the MapReduce job and the Chinese output will be written in GBK.
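A small standalone illustration of the encoding point above: the same Chinese string has different byte representations in UTF-8 and GBK, which is why a GBK-expecting database misreads Hadoop's UTF-8 output. The sample string is arbitrary.

    // Standalone demo: the same string encodes to different bytes in UTF-8 and GBK,
    // and decoding UTF-8 bytes as GBK produces mojibake.
    import java.util.Arrays;

    public class EncodingDemo {
      public static void main(String[] args) throws Exception {
        String s = "中文";
        byte[] utf8 = s.getBytes("UTF-8");   // E4 B8 AD E6 96 87 -> 3 bytes per character
        byte[] gbk  = s.getBytes("GBK");     // D6 D0 CE C4       -> 2 bytes per character
        System.out.println(Arrays.toString(utf8));
        System.out.println(Arrays.toString(gbk));
        System.out.println(new String(utf8, "GBK"));   // prints garbage, not the original text
      }
    }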
2. A normally running MapReduce job suddenly throws
java.io.IOException: All datanodes xxx.xxx.xxx.xxx:xxx are bad. Aborting...
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2158)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
java.io.IOException: Could not get block locations. Aborting...
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2143)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1735)
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1889)
The cause turned out to be that the Linux machines had too many open files. ulimit -n shows that the default open-file limit on Linux is 1024; edit /etc/security/limits.conf and add a line such as
hadoop soft nofile 65535
then rerun the program (ideally change this on all the DataNodes) and the problem is solved. (A defensive client-side sketch is given near the end of these notes.)

3. After running for a while, Hadoop cannot be stopped with stop-all.sh; it reports
no tasktracker to stop, no datanode to stop
The reason is that when stopping, Hadoop looks up the mapred and dfs process IDs recorded on the DataNodes. By default these PID files live under /tmp, and Linux periodically (typically every month or every 7 days or so) cleans out files in that directory, so once hadoop-hadoop-jobtracker.pid and hadoop-hadoop-namenode.pid have been deleted, the NameNode can no longer find those processes on the DataNodes.
Setting export HADOOP_PID_DIR in the configuration file solves this problem.

Problem: Incompatible namespaceIDs in /usr/local/hadoop/dfs/data: namenode namespaceID = 405233244966; datanode namespaceID = 33333244
Cause: every time hadoop namenode -format is executed, a new namespaceID is generated for the NameNode, but the DataNode data under hadoop.tmp.dir still keeps the previous namespaceID. Because of the mismatch the DataNode cannot start, so before each hadoop namenode -format, delete the hadoop.tmp.dir directory first and the DataNode will start successfully. Note that this means deleting the local directory that hadoop.tmp.dir points to, not an HDFS directory.

Problem: Storage directory not exist
2010-02-09 21:37:53,203 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory D:\hadoop\run\dfs_name_dir does not exist.
2010-02-09 21:37:53,203 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
org.apache.hadoop.hdfs.server.common.InconsistentFSStateException: Directory D:\hadoop\run\dfs_name_dir is in an inconsistent state: storage directory does not exist or is not accessible.
Solution: the storage directory D:\hadoop\run\dfs_name_dir does not exist, so just create that directory by hand.

Problem: NameNode is not formatted
Solution: HDFS has not been formatted yet; just run hadoop namenode -format and then start it again.
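Related to the "All datanodes are bad" / too many open files problem above: besides raising the OS limit, it is worth making sure HDFS client code is not leaking file descriptors by leaving streams open. A generic, hedged sketch of defensive stream handling; the path is illustrative:

    // Hedged sketch: close HDFS streams promptly so the per-process descriptor count
    // (ulimit -n) stays low. The path below is illustrative.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteAndCloseSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path("/tmp/example.txt"));
        try {
          out.writeBytes("hello\n");
        } finally {
          out.close();   // release the socket and file descriptor even on error
        }
      }
    }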
Running bin/hadoop jps reported the following exception:
Source: http://blog.csdn.net/zyj8170/archive/2010/11/26/6037934.aspx