Category Archives: Apache

Apache Kylin Usage Notes

Most problems with deployment or cube building come down to environment issues or Hadoop/HBase version mismatches.

First, deploy with the versions listed in the documentation:

  • Hadoop: 2.4 – 2.7
  • Hive: 0.13 – 0.14
  • HBase: 0.98 – 0.99
  • JDK: 1.7+

One gotcha: HBase 0.9x does not support Hadoop 2.7. If your Hadoop is 2.7.x you need to deploy HBase 1.x, and for HBase 1.x you have to download the separately built Kylin binary package:

Binary Package (for running on HBase 1.1.3 or above)

Errors when running jobs during cube creation

1.
native snappy library not available: SnappyCompressor has not been loaded.

The cause is that the Hadoop native libraries are missing the Snappy compression library:

sudo yum install snappy snappy-devel
sudo ln -s /usr/lib64/libsnappy.so $HADOOP_HOME/lib/native/libsnappy.so

Add the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

export JAVA_LIBRARY_PATH="/usr/local/hadoop/lib/native"

Then restart:

$HADOOP_HOME/sbin/stop-all.sh
$HADOOP_HOME/sbin/start-all.sh
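
To confirm the native library is now picked up, Hadoop's built-in checknative tool is a quick sanity check (output formatting varies slightly between Hadoop versions):

# snappy should now show up as "true" together with the path to libsnappy.so
$HADOOP_HOME/bin/hadoop checknative -a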

2.

2016-02-22 16:24:16,740 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
	at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:749)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
	... 8 more

Solution:
http://stackoverflow.com/questions/34449561/hadoop-map-reduce-job-class-org-apache-hive-hcatalog-mapreduce-hcatinputformat

The issue here is that Kylin assumes the same Hive jars exist on all Hadoop nodes. When a node is missing the Hive jars (or has them in a different location), you get the ClassNotFoundException on HCatInputFormat.

By the way, you should be able to see a clear error message in the YARN job console. This is a known issue.

Deploying Hive to all cluster nodes, as you have tried, certainly fixes the problem.

Another (cleaner) workaround is to manually configure Kylin to submit the Hive jars as additional job dependencies. See https://issues.apache.org/jira/browse/KYLIN-1021

Finally, there is also an open JIRA suggesting that Kylin should submit the Hive jars by default. See https://issues.apache.org/jira/browse/KYLIN-1082
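
Before touching the configuration, it can help to confirm which nodes are actually missing the jar; a minimal sketch, where the host names are placeholders and the /opt/hive path is the one used elsewhere in these notes:

# check every worker host for the HCatalog core jar (hosts are placeholders)
for host in node1 node2 node3; do
  ssh "$host" 'ls /opt/hive/hcatalog/share/hcatalog/hive-hcatalog-core-*.jar || echo MISSING'
done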

3.

org.apache.kylin.job.exception.ExecuteException: org.apache.kylin.job.exception.ExecuteException: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.conf.YarnConfiguration.getServiceAddressConfKeys(Lorg/apache/hadoop/conf/Configuration;)Ljava/util/List;
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:111)
        at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:130)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kylin.job.exception.ExecuteException: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.conf.YarnConfiguration.getServiceAddressConfKeys(Lorg/apache/hadoop/conf/Configuration;)Ljava/util/List;
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:111)
        at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:51)
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
        ... 4 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.conf.YarnConfiguration.getServiceAddressConfKeys(Lorg/apache/hadoop/conf/Configuration;)Ljava/util/List;
        at org.apache.hadoop.yarn.conf.HAUtil.getConfKeyForRMInstance(HAUtil.java:239)
        at org.apache.hadoop.yarn.conf.HAUtil.getConfValueForRMInstance(HAUtil.java:250)
        at org.apache.hadoop.yarn.conf.HAUtil.getConfValueForRMInstance(HAUtil.java:262)
        at org.apache.kylin.job.common.MapReduceExecutable.getRestStatusCheckUrl(MapReduceExecutable.java:191)
        at org.apache.kylin.job.common.MapReduceExecutable.doWork(MapReduceExecutable.java:135)
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
        ... 6 more

The HBase version doesn't match; copy the newer hadoop-yarn-api-*.jar into hbase/lib.
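
A sketch of that jar swap, assuming a plain tarball layout with $HADOOP_HOME and $HBASE_HOME set:

# move the YARN API jar bundled with HBase aside and copy in the one from Hadoop
mv $HBASE_HOME/lib/hadoop-yarn-api-*.jar /tmp/
cp $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-api-*.jar $HBASE_HOME/lib/
# restart HBase so the replacement jar is picked up
$HBASE_HOME/bin/stop-hbase.sh && $HBASE_HOME/bin/start-hbase.sh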

4.

Kylin error: org.apache.hadoop.hbase.TableNotFoundException: Table KYLIN_* is not currently available.
             Load HFile to HBase Table failed

Checking the HBase log showed it was again a Snappy problem. After switching Kylin's compression to gzip and restarting, cubes built successfully.
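
For reference, the codec is switched in Kylin's configuration; the property name below is the one I have seen in Kylin 1.x releases, so treat it as an assumption and check your own kylin.properties (and kylin_job_conf.xml) before relying on it:

# $KYLIN_HOME/conf/kylin.properties
# was: kylin.hbase.default.compression.codec=snappy
kylin.hbase.default.compression.codec=gzip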

5. Kylin reports that kylin_metadata already exists and the table cannot be created in HBase, yet the table is nowhere to be seen in HBase.

Delete that table's znode under the HBase path in ZooKeeper:

./zkCli.sh

rmr /hbase-unsecure/table/kylin_metadata
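
Before deleting anything, it is worth listing the znodes inside the ZooKeeper CLI to confirm the stale entry is really there (the /hbase-unsecure root is specific to Ambari/HDP installs; a plain HBase setup usually uses /hbase):

# list the table znodes first, then remove the stale one as above
ls /hbase-unsecure/table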

Hive/Beeline Usage Notes

Hive:

Connecting to Hive with SQuirreL SQL

Add Driver -> name & example URL (jdbc:hive2://xxx:10000) -> Extra Class Path -> Add

hive/lib/hive-common-*.jar
hive/lib/hive-contrib-*.jar
hive/lib/hive-jdbc-*.jar
hive/lib/libthrift-*.jar
hive/lib/hive-service-*.jar
hive/lib/httpclient-*.jar
hive/lib/httpcore-*.jar
hadoop/share/hadoop/common/hadoop-common-*.jar
hadoop/share/hadoop/common/lib/commons-configuration-*.jar
hadoop/share/hadoop/common/lib/log4j-*.jar
hadoop/share/hadoop/common/lib/slf4j-api-*.jar
hadoop/share/hadoop/common/lib/slf4j-log4j-*.jar

-> List Drivers (wait a moment; the class name should then be set automatically to org.apache.hive.jdbc.HiveDriver) -> OK -> Add Alias -> choose the Hive driver -> done

 

Hive Data Migration

1. Export the table

EXPORT TABLE <table_name> TO 'path/to/hdfs';

2. Copy the data to the other HDFS cluster

hadoop distcp hdfs://<source-namenode>:8020/path/to/hdfs hdfs://<target-namenode>/path/to/hdfs

3. Import the table

IMPORT TABLE <table_name> FROM 'path/to/another/hdfs';
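
Putting the three steps together, a minimal end-to-end sketch; clusterA/clusterB and mydb.mytable are placeholders:

# 1. on the source cluster: export data plus metadata to HDFS
hive -e "EXPORT TABLE mydb.mytable TO '/tmp/export/mytable';"
# 2. copy the exported directory across clusters
hadoop distcp hdfs://clusterA:8020/tmp/export/mytable hdfs://clusterB:8020/tmp/export/mytable
# 3. on the target cluster: import it
hive -e "IMPORT TABLE mytable FROM '/tmp/export/mytable';"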

Writing Hive query results to a file

To a local file:

insert overwrite local directory './test-04'
row format delimited
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
select * from src;

To HDFS:

Writing to HDFS does not seem to support ROW FORMAT, so a workaround is needed:

INSERT OVERWRITE DIRECTORY '/outputable.txt'
select concat(col1, ',', col2, ',', col3) from myoutputtable;

The default field delimiter is \001 (Ctrl-A).

If you want to process the output files directly, you can pipe them through stdin, e.g.:

hadoop fs -cat ../000000_0 | python doSomeThing.py

#!/usr/bin/env python
import sys

# Hive writes \001 (Ctrl-A) as the default field delimiter
for line in sys.stdin:
    (a, b, c) = line.strip().split('\001')
    # ... process the three columns here

 

Hive syntax:

Hive does not seem to support: select distinct col1 as col1 from table group by col1

You need to use GROUPING SETS instead:

select col1 as col1 from table group by col1 grouping sets((col1))

Beeline:

Documentation: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients

Connecting to Hive via JDBC:

hive2='JAVA_HOME=/opt/java7 HADOOP_HOME=/opt/hadoop /opt/hive/bin/beeline -u jdbc:hive2://n1.hd2.host.dxy:10000 -n hadoop -p fake -d org.apache.hive.jdbc.HiveDriver --color=true --silent=false --fastConnect=false --verbose=true'

To run multiple statements over JDBC with Beeline, pass several -e options:

hive2 -e "xxx" -e "yyy" -e ...
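
For example, a hypothetical two-statement run, where mydb and mytable are placeholders:

hive2 -e "USE mydb;" -e "SELECT count(*) FROM mytable;"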

Hadoop 2.5.1 Usage Notes

Just a record of problems encountered and how they were solved:

1. Installation tutorial: http://dogdogfish.com/2014/04/26/installing-hadoop-2-4-on-ubuntu-14-04/

Problems encountered while configuring Hadoop:

Running hadoop version gives:
localhost: Error: JAVA_HOME is not set and could not be found.

Running export JAVA_HOME=/PATH/TO/JDK directly in the shell does not fix it.

Grepping showed that the highlighted text from the error message comes from libexec/hadoop-config.sh, so I set export JAVA_HOME=/PATH/TO/JDK directly before that spot and the error went away.

Running start-dfs.sh gives:
Error: Could not find or load main class org.apache.hadoop.util.VersionInfo

This usually means the required jars are missing from the new installation. Check whether anything under hadoop/share/hadoop/* (common, ...) is absent; if so, copy the jars in from an older release. Also check that the HADOOP_HOME, HADOOP_COMMON_HOME and HADOOP_PREFIX environment variables are set.
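
As a sketch, they can be exported in hadoop-env.sh or ~/.bashrc; /usr/local/hadoop is just the path used elsewhere in these notes, substitute your own:

# point all three variables at the Hadoop installation directory
export HADOOP_PREFIX=/usr/local/hadoop
export HADOOP_HOME=$HADOOP_PREFIX
export HADOOP_COMMON_HOME=$HADOOP_PREFIX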

2. In newer releases the Hadoop Streaming jar lives at: $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar

3. With Hadoop Streaming, values that share a key are not merged for you; the reducer has to aggregate them itself (see the sketch after point 4).

4. Hadoop Core (hadoop-core-*.jar) has been renamed Hadoop Common. Location: $HADOOP_HOME/share/hadoop/common/hadoop-common-2.5.1.jar
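
A minimal streaming invocation tying points 2 and 3 together; the input/output paths and the mapper/reducer scripts are placeholders, and the reducer must do the per-key aggregation itself since it only receives the values of one key as consecutive lines:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar \
  -input /user/hadoop/input \
  -output /user/hadoop/output \
  -mapper mapper.py \
  -reducer reducer.py \
  -file mapper.py -file reducer.py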

Errors:

1. java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

    Solution: the mapper/reducer program itself has a bug. Check the logs under http://localhost:50070/logs/userlogs/, look at stderr in the container logs, and fix the specific problem.
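
If YARN log aggregation is enabled, the same container logs can also be fetched from the command line; the application id below is a placeholder taken from the job submission output:

yarn logs -applicationId application_1456000000000_0001 | less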


Using a custom Chinese tokenizer in Nutch

Original article: http://xtfncel.iteye.com/blog/756763


For Chinese queries Nutch uses NutchAnalyzer by default, which splits Chinese text character by character. The results are not great, so we can plug in a custom tokenizer to get proper Chinese support.

There are two common ways to add Chinese support:

1. The plugin approach: without modifying the core code, write a Chinese word-segmentation plugin.

2. Modify Nutch's source directly, changing the default analyzer code to call a custom Chinese segmenter.


A summary of Chinese word segmentation in Nutch

1 Introduction to Chinese word segmentation

Chinese word segmentation is a key consideration when building a retrieval system, and localizing Nutch requires changing how it segments Chinese. Currently there are roughly two approaches:

One is to modify the source code: change Nutch's analysis classes directly so they call an existing segmentation component.

The other is to write a segmentation plugin: follow Nutch's plugin conventions to write or add a Chinese segmentation plugin.

Both approaches are viable. Thanks to an active open-source community there are now many segmentation components available, and both the source-modification and the plugin approach rely on them. The main components are listed below:


Notes on some problems installing Nutch 1.6 on Windows

1. Since Nutch 1.2 there is no crawl-urlfilter.txt file;

instead there is regex-urlfilter.txt. Add the following to it:

+^http://([a-z0-9]*.)*163.com

This means only content from www.163.com will be crawled.

Then add the following to nutch-site.xml:

<property>
  <name>http.agent.name</name>
  <value>MSpider</value>
</property>

<property>
  <name>http.agent.url</name>
  <value>www.163.com</value>
</property>

2. Adjust the Java run configuration

Open Run Configurations:

Run > Run Configurations... -> New Java Application
Main class -> org.apache.nutch.crawl.Crawl
Program arguments -> urls -dir myPages -depth 2 -topN 50, or urls -dir crawl -depth 5 -threads 4 -topN 10

Then enter -Xms64m -Xmx512m under VM arguments.

3. When remotely debugging the Linux program from Windows, this error appears:

Exception in thread "main" java.io.IOException: Failed to set permissions of path: \tmp\hadoop-Administrator\mapred\staging\Administrator-4954228\.staging to 0700
at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689)
at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662)
at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:344)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:189)
at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:116)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Unknown Source)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:850)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:500)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:530)
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:50)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:191)
at org.apache.nutch.crawl.Crawler.runTool(Crawler.java:68)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:152)
at org.apache.nutch.crawl.Crawler.run(Crawler.java:250)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawler.main(Crawler.java:257)

This is a file-permission issue specific to Windows; the same code runs fine on Linux.

The fix is to edit checkReturnValue in /hadoop-1.0.2/src/core/org/apache/hadoop/fs/FileUtil.java and comment out its body (a bit crude, but on Windows the check can be skipped):

...
private static void checkReturnValue(boolean rv, File p, FsPermission permission)
    throws IOException {
  /**
  if (!rv) {
    throw new IOException("Failed to set permissions of path: " + p +
        " to " + String.format("%04o", permission.toShort()));
  }
  **/
}
...

Recompile and repackage hadoop-core-1.0.2.jar, then replace the hadoop-core-1.0.2.jar in the hadoop-1.0.2 root directory.

A modified hadoop-core-1.0.2-modified.jar is provided here; just replace the original hadoop-core-1.0.2.jar with it.

After replacing it, refresh the project, set up the correct jar dependencies, and run WordCountTest again.

Once it succeeds, refresh the HDFS directory in Eclipse and you will see the generated output2 directory:

 

https://skydrive.live.com/?cid=cf7746837803bc50&id=CF7746837803BC50%211276

Some Nutch learning resources

1) Nutch introduction and installation

(1) Nutch 1.4 installation and testing

URL: http://blog.csdn.net/lidexin2003/article/details/7316188

(2) Single-machine Nutch configuration on Linux

URL: http://blog.csdn.net/zklth/article/details/5618948

(3) Installing Tomcat on Linux

URL: http://blog.sina.com.cn/s/blog_59b976a10100a3w1.html

(4) Installing and configuring Tomcat on CentOS to run as a non-root user

URL: http://os.51cto.com/art/201002/184813.htm

(5) Installing Tomcat on Linux

URL: http://wanwentao.blog.51cto.com/2406488/548796

(6) Nutch on Baidu Baike

URL: http://baike.baidu.com/view/46642.htm

2) Solr introduction and installation

(1) A good introduction to installing and using Solr

URL: http://www.cnblogs.com/cy163/archive/2009/09/18/1569681.html

(2) Deploying Solr on Linux

URL: http://www.cnblogs.com/TerryLiang/archive/2009/09/12/1565359.html

(3) Lucene & Solr

URL: http://www.cnblogs.com/limei/archive/2011/10/09/2204024.html

(4) Installing Apache Solr on Ubuntu

URL: http://blog.csdn.net/hzcyclone/article/details/7026958

(5) Solr 3.5 configuration and usage (part 1)

URL: http://blog.csdn.net/chunming8302/article/details/7321367

(6) Solr 3.5 + Tomcat 7 + mmseg4j + Sogou dictionary: configuration and running

URL: http://www.cnblogs.com/hamy/archive/2012/02/17/2355413.html

(7) Notes on building a single-node Nutch 1.3 + Solr 3.4.0 + Tomcat 6.0 search engine on Linux

URL: http://blog.csdn.net/sikangtina/article/details/7104599

3) Basic Nutch usage

(1) The various Nutch 1.0 commands

URL: http://blog.csdn.net/ninjuli/article/details/4201055

(2) Nutch commands explained

URL: http://hi.baidu.com/867862605/blog/item/bf65f51898460105403417c0.html

(3) Nutch 1.3 study notes, part 2

URL: http://blog.csdn.net/amuseme_lu/article/details/6710221

4) Eclipse development setup

(1) Installing the IvyDE plugin in Eclipse

URL: http://blog.csdn.net/lunwenbiji/article/details/7259150

(2) Nutch study notes

URL: http://www.cnblogs.com/streamhope/category/310177.html

(3) Memo: Nutch + Eclipse configuration

URL: http://blog.csdn.net/wuyamony/article/details/7181491

(4) Configuring Nutch 1.4 with Eclipse on Windows, illustrated step by step

URL: http://wenku.baidu.com/view/d3b61414f18583d049645982.html

(5) Configuring Nutch 1.4 inside Eclipse

URL: http://www.haogongju.net/art/1225307

(6) Deploying and configuring Nutch 1.3 in Eclipse

URL: http://wangwei3.iteye.com/blog/816056

(7) WinXP + Cygwin + Eclipse with Nutch 1.2, adding the IKAnalyzer Chinese tokenizer

URL: http://blog.csdn.net/neo_2011/article/details/6943622

(8) Importing Nutch 1.1 into Eclipse and running it

URL: http://xtfncel.iteye.com/blog/756735

(9) Installing and configuring Cygwin + Eclipse + Nutch 1.2 + Tomcat 6 on Win7

URL: http://blog.csdn.net/eaglex/article/details/6314450

Repost: Study notes on Nutch secondary development

Original article: http://ileo.me/2012/03/18/nutch1-2/

These notes are divided into four parts:

1. A detailed guide to Nutch 1.2 secondary development (part 1, illustrated): setting up the Cygwin environment on Windows

2. A detailed guide to Nutch 1.2 secondary development (part 2, illustrated): setting up Nutch 1.2 on Windows

3. A detailed guide to Nutch 1.2 secondary development (part 3, illustrated): secondary development of Nutch 1.2 (UI modifications)

4. A detailed guide to Nutch 1.2 secondary development (part 4, illustrated): secondary development of Nutch 1.2 (Chinese word segmentation)

 

The author compiled and published these on the original blog (http://www.cnblogs.com/streamhope) on 2011-07-27; because of image and formatting issues they were not reposted to the new blog.

Of course, you can also download the PDF the author put together:

Click to download the Nutch 1.2 secondary development guide

The author has also prepared some introductory documents to help you better understand how Nutch works.

Click to download: Getting started with Nutch

Click to download: Analysis of the Nutch plugin mechanism

Click to download: Nutch 0.8.1 API reference

I hope these materials are helpful; feel free to get in touch and discuss.