Notes on some problems encountered installing Nutch 1.6 on Windows

1. Since Nutch 1.2 there is no crawl-urlfilter.txt file any more; instead there is regex-urlfilter.txt. Add the following line to it:

+^http://([a-z0-9]*.)*163.com

This rule makes the crawler accept only URLs on 163.com hosts (such as www.163.com).
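For reference, after the edit the relevant part of regex-urlfilter.txt might look roughly like this (a sketch: the stock file usually ends with a catch-all +. rule that should be commented out, otherwise everything is still accepted; escaping the dots as \. also makes the pattern a bit stricter than the line above):

# accept only hosts under 163.com
+^http://([a-z0-9]*\.)*163\.com

# the default "accept anything else" rule, commented out
# +.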

Then add the following properties to nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>MSpider</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>www.163.com</value>
</property>
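Note that in nutch-site.xml these <property> blocks have to sit inside the top-level <configuration> element; http.agent.name in particular is mandatory in Nutch 1.x, since the fetcher refuses to run without an agent name. Roughly:

<?xml version="1.0"?>
<configuration>
  <!-- the two <property> blocks above go here -->
</configuration>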

2. Set the Java run parameters

In Eclipse, open Run > Run Configurations... and create a new Java Application configuration:

Main class: org.apache.nutch.crawl.Crawl
Program arguments: urls -dir myPages -depth 2 -topN 50 (or, for example, urls -dir crawl -depth 5 -threads 4 -topN 10)

Then enter -Xms64m -Xmx512m under VM arguments.
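The same crawl can also be started outside Eclipse with the nutch script (a sketch; on Windows this normally means running it under Cygwin, since bin/nutch is a shell script):

bin/nutch crawl urls -dir crawl -depth 5 -threads 4 -topN 10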

3. When remotely debugging the Linux program from Windows, the following error is reported:

Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Administrator\mapred\staging\Administrator-4954228\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:50)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:191)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

This is a file-permission problem specific to Windows; on Linux the same code runs normally and the problem does not occur.

The fix is to edit checkReturnValue in /hadoop-1.0.2/src/core/org/apache/hadoop/fs/FileUtil.java and comment out its body (somewhat crude, but on Windows the check can safely be skipped):

...
private static void checkReturnValue(boolean rv, File p, FsPermission permission)
    throws IOException {
  /*
  if (!rv) {
    throw new IOException("Failed to set permissions of path: " + p +
                          " to " + String.format("%04o", permission.toShort()));
  }
  */
}
...

Recompile and repackage hadoop-core-1.0.2.jar, then use it to replace the hadoop-core-1.0.2.jar in the hadoop-1.0.2 root directory.
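A rough sketch of the rebuild, assuming Apache Ant is available and the commands are run from the Hadoop 1.0.2 source root (the exact name of the jar produced under build/ may vary, so check it before copying):

cd hadoop-1.0.2
ant jar                                             # rebuild; the jar lands under build/
mv hadoop-core-1.0.2.jar hadoop-core-1.0.2.jar.bak  # keep a backup of the original
cp build/hadoop-core-*.jar hadoop-core-1.0.2.jar    # drop in the rebuilt jar under the original name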

A modified hadoop-core-1.0.2-modified.jar is also provided here; simply replace the original hadoop-core-1.0.2.jar with it.

After swapping in the jar, refresh the project, make sure the jar dependencies are set correctly, and then run WordCountTest again.

Once it succeeds, refresh the HDFS directory in Eclipse and you can see that the ouput2 directory has been generated:

[screenshot of the Eclipse HDFS view]

https://skydrive.live.com/?cid=cf7746837803bc50&id=CF7746837803BC50%211276
