
Using Apache Kylin

Most problems with deployment or cube builds come down to environment issues or Hadoop/HBase version mismatches.

First, deploy with the versions specified in the documentation:

  • Hadoop: 2.4 – 2.7
  • Hive: 0.13 – 0.14
  • HBase: 0.98 – 0.99
  • JDK: 1.7+

One gotcha: HBase 0.9x does not support Hadoop 2.7. If your Hadoop is 2.7.x you need to deploy HBase 1.x, and for HBase 1.x you need the separately compiled Kylin binary package:

Binary Package (for running on HBase 1.1.3 or above)
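Before picking a binary, it is worth confirming what the cluster is actually running; both commands below are standard:

hadoop version
hbase version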

Job errors while building a cube

1.
native snappy library not available: SnappyCompressor has not been loaded.

The cause is that the Hadoop native libraries are missing the snappy compression library:

sudo yum install snappy snappy-devel
sudo ln -s /usr/lib64/libsnappy.so $HADOOP_HOME/lib/native/libsnappy.so

Add the following to $HADOOP_HOME/etc/hadoop/hadoop-env.sh:

export JAVA_LIBRARY_PATH="/usr/local/hadoop/lib/native"

Then restart:

$HADOOP_HOME/sbin/stop-all.sh
$HADOOP_HOME/sbin/start-all.sh
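To verify the fix, Hadoop's built-in checknative command reports whether each native codec is loaded; after the restart, the snappy line should show true:

hadoop checknative -a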

2.

2016-02-22 16:24:16,740 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
	at org.apache.hadoop.mapreduce.task.JobContextImpl.getInputFormatClass(JobContextImpl.java:174)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:749)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
	at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:415)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hive.hcatalog.mapreduce.HCatInputFormat not found
	at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
	at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
	... 8 more

Solution:
http://stackoverflow.com/questions/34449561/hadoop-map-reduce-job-class-org-apache-hive-hcatalog-mapreduce-hcatinputformat

The issue here is that Kylin assumes the same Hive jars exist on all Hadoop nodes. When a node is missing the Hive jars (or has them in a different location), you get the ClassNotFoundException on HCatInputFormat.

Btw, you should be able to get a clear error message from the Yarn job console. This is a known issue.

Deploying Hive to all cluster nodes will certainly fix the problem, as you have tried.

Another (cleaner) workaround is to manually configure Kylin to submit the Hive jars as additional job dependencies. See https://issues.apache.org/jira/browse/KYLIN-1021

Finally, there is also an open JIRA suggesting that Kylin should submit the Hive jars by default. See https://issues.apache.org/jira/browse/KYLIN-1082
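A minimal sketch of the KYLIN-1021 workaround, assuming a Kylin release from this era that reads kylin.job.mr.lib.dir from kylin.properties (verify the exact key against your version's docs). Every jar under that directory gets shipped with each MR job, so worker nodes no longer need a local Hive install:

# kylin.properties
# directory on the Kylin server holding the Hive/HCatalog jars (path is an example)
kylin.job.mr.lib.dir=/usr/local/hive/lib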

3.

org.apache.kylin.job.exception.ExecuteException: org.apache.kylin.job.exception.ExecuteException: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.conf.YarnConfiguration.getServiceAddressConfKeys(Lorg/apache/hadoop/conf/Configuration;)Ljava/util/List;
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:111)
        at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:130)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kylin.job.exception.ExecuteException: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.conf.YarnConfiguration.getServiceAddressConfKeys(Lorg/apache/hadoop/conf/Configuration;)Ljava/util/List;
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:111)
        at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:51)
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
        ... 4 more
Caused by: java.lang.NoSuchMethodError: org.apache.hadoop.yarn.conf.YarnConfiguration.getServiceAddressConfKeys(Lorg/apache/hadoop/conf/Configuration;)Ljava/util/List;
        at org.apache.hadoop.yarn.conf.HAUtil.getConfKeyForRMInstance(HAUtil.java:239)
        at org.apache.hadoop.yarn.conf.HAUtil.getConfValueForRMInstance(HAUtil.java:250)
        at org.apache.hadoop.yarn.conf.HAUtil.getConfValueForRMInstance(HAUtil.java:262)
        at org.apache.kylin.job.common.MapReduceExecutable.getRestStatusCheckUrl(MapReduceExecutable.java:191)
        at org.apache.kylin.job.common.MapReduceExecutable.doWork(MapReduceExecutable.java:135)
        at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:107)
        ... 6 more

This one is an HBase version mismatch: the YARN API jar bundled with HBase is too old. Copy the newer hadoop-yarn-api-*.jar into hbase/lib.
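For example (a sketch; paths assume the usual tarball layout):

cp $HADOOP_HOME/share/hadoop/yarn/hadoop-yarn-api-*.jar $HBASE_HOME/lib/
# move the old hadoop-yarn-api jar out of hbase/lib, then restart HBase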

4.

Kylin error: org.apache.hadoop.hbase.TableNotFoundException: Table KYLIN_* is not currently available.
Load HFile to HBase Table failed

The HBase log showed it was a snappy problem again. Switching Kylin's compression to gzip and restarting fixed it; cube builds then went through.
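In Kylin 1.x the HTable codec is set in kylin.properties; the key below matches that era's docs, but check it against your version. Restart Kylin afterwards:

# kylin.properties
kylin.hbase.default.compression.codec=gzip

$KYLIN_HOME/bin/kylin.sh stop
$KYLIN_HOME/bin/kylin.sh start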

5. Kylin reports that kylin_metadata already exists so the table cannot be created in HBase, yet the table is nowhere to be seen in HBase.

Delete that table's node under HBase's ZooKeeper path (the parent znode is /hbase by default; /hbase-unsecure on some distributions):

./zkCli.sh

rmr /hbase-unsecure/table/kylin_metadata

Hive/Beeline Usage Notes

Hive:

Connecting to Hive with SQuirreL SQL

Add Driver -> name & example URL (jdbc:hive2://xxx:10000) -> Extra Class Path -> Add, then include these jars:

  • hive/lib/hive-common-*.jar
  • hive/lib/hive-contrib-*.jar
  • hive/lib/hive-jdbc-*.jar
  • hive/lib/libthrift-*.jar
  • hive/lib/hive-service-*.jar
  • hive/lib/httpclient-*.jar
  • hive/lib/httpcore-*.jar
  • hadoop/share/hadoop/common/hadoop-common-*.jar
  • hadoop/share/hadoop/common/lib/commons-configuration-*.jar
  • hadoop/share/hadoop/common/lib/log4j-*.jar
  • hadoop/share/hadoop/common/lib/slf4j-api-*.jar
  • hadoop/share/hadoop/common/lib/slf4j-log4j12-*.jar

Then List Drivers (wait a moment; the class name is auto-detected as org.apache.hive.jdbc.HiveDriver) -> OK -> Add Alias -> choose the Hive driver -> done.

Hive Data Migration

1. Export the table:

EXPORT TABLE <table_name> TO 'path/to/hdfs';

2. Copy the data to the other HDFS cluster:

hadoop distcp hdfs://<source-nn>:8020/path/to/hdfs hdfs://<target-nn>:8020/path/to/hdfs

3. Import the table on the target cluster:

IMPORT TABLE <table_name> FROM 'path/to/another/hdfs';
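End to end, the three steps look like this (the table name, paths, and namenode hosts are placeholders, not values from the original):

# on the source cluster
hive -e "EXPORT TABLE my_table TO '/tmp/export/my_table';"

# copy the exported files across clusters
hadoop distcp hdfs://source-nn:8020/tmp/export/my_table hdfs://target-nn:8020/tmp/export/my_table

# on the target cluster
hive -e "IMPORT TABLE my_table FROM '/tmp/export/my_table';"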

Hive: writing query results to a file

To a local file:

insert overwrite local directory './test-04'
row format delimited
fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':'
select * from src;

To HDFS:

Writing to HDFS does not seem to support row format, so another route is needed:

INSERT OVERWRITE DIRECTORY '/outputable.txt'
select concat(col1, ',', col2, ',', col3) from myoutputtable;

The default delimiter, of course, is \001.

To process the output files directly, you can read them from stdin:

e.g. hadoop fs -cat ../000000_0 | python doSomeThing.py

#!/usr/bin/env python
import sys

# each input line holds three \001-separated fields
for line in sys.stdin:
    (a, b, c) = line.strip().split('\001')
    # ... do something with the fields here

Hive syntax:

Hive does not seem to support select distinct col1 as col1 from table group by col1;

you need grouping sets instead:

select col1 as col1 from table group by col1 grouping sets((col1))

Beeline:

Docs: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients

Connecting to Hive over JDBC:

alias hive2='JAVA_HOME=/opt/java7 HADOOP_HOME=/opt/hadoop /opt/hive/bin/beeline -u jdbc:hive2://n1.hd2.host.dxy:10000 -n hadoop -p fake -d org.apache.hive.jdbc.HiveDriver --color=true --silent=false --fastConnect=false --verbose=true'

If you need to run multiple statements through beeline over JDBC, pass multiple -e options:

hive2 -e "xxx" -e "yyy" -e ...

Hadoop 2.5.1 Usage Notes

Just a record of problems encountered and how they were solved:

1. Installation tutorial: http://dogdogfish.com/2014/04/26/installing-hadoop-2-4-on-ubuntu-14-04/

Problems hit while configuring Hadoop:

Running hadoop version fails with:
localhost: Error: JAVA_HOME is not set and could not be found.

Running export JAVA_HOME=/PATH/TO/JDK on the command line does not fix it.

grep shows that the quoted error string comes from libexec/hadoop-config.sh, so set export JAVA_HOME=/PATH/TO/JDK directly before the corresponding check in that file.

The error disappears.

Running start-dfs.sh then fails with:
Error: Could not find or load main class org.apache.hadoop.util.VersionInfo

This usually means the new release is missing the corresponding jars. Check whether anything under hadoop/share/hadoop/* (common, ...) is absent; if so, copy the jars in from an older release. Also check that the HADOOP_HOME, HADOOP_COMMON_HOME and HADOOP_PREFIX environment variables are set.
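A quick sanity check along those lines (assumes the standard tarball layout):

echo "HADOOP_HOME=$HADOOP_HOME"
echo "HADOOP_COMMON_HOME=$HADOOP_COMMON_HOME HADOOP_PREFIX=$HADOOP_PREFIX"
ls $HADOOP_HOME/share/hadoop/common/hadoop-common-*.jar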

2. In newer releases the hadoop streaming jar lives at: $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar
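A typical streaming invocation, for reference (mapper.py and reducer.py are placeholder scripts, and the HDFS paths are examples):

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.5.1.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py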

3. With hadoop streaming, values for the same key have to be merged by your own reducer; the framework only sorts the input so that lines with the same key arrive consecutively.

4. Hadoop Core (hadoop-core-*.jar) has been renamed to Hadoop Common. Location: $HADOOP_HOME/share/hadoop/common/hadoop-common-2.5.1.jar
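When compiling against it, the hadoop classpath command prints the full dependency classpath, which saves pointing at jars one by one (MyJob.java is a placeholder source file):

javac -cp "$(hadoop classpath)" MyJob.java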

Errors:

1.java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

    Fix: the program itself is at fault. Check the logs under http://localhost:50070/logs/userlogs/, read stderr in the container logs, then fix whatever the specific problem is.
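If log aggregation is enabled, the same container stderr can also be fetched from the command line (the application id below is a placeholder):

yarn logs -applicationId application_1400000000000_0001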