
Hive: creating tables and loading data from HDFS


1 Description


The Hive table pms.cross_sale_path is partitioned by date. The data under the HDFS directory /user/pms/workspace/ouyangyewei/testUsertrack/job1Output/crossSale is written into the table's $yesterday partition. Table definition:

hive -e "  
set mapred.job.queue.name=pms;  
  
drop table if exists pms.cross_sale_path;  
create external table pms.cross_sale_path  
(  
track_id string,  
track_time string,  
session_id string,  
gu_id string,  
end_user_id string,  
page_category_id bigint,  
algorithm_id int,  
is_add_cart int,  
rcmd_product_id bigint,  
product_id bigint,  
category_id bigint,  
path_id string,  
path_type int,  
path_length int,  
path_list string,  
order_code string  
)  
PARTITIONED BY (ds string)  
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'  
LINES TERMINATED BY '\n'  
STORED AS TEXTFILE;"  

Load statement:

yesterday=`date -d -1days +%Y-%m-%d`  

hive -e "load data inpath '/user/pms/workspace/ouyangyewei/testUsertrack/job1Output/crossSale' into table pms.cross_sale_path partition(ds='$yesterday');" 

2 Loading data into partitions of an external table

An external table is still a table: just as an ordinary table can be partitioned, an external table can be partitioned as well. So if an external table is created on top of a partitioned table, you must run ALTER TABLE table_name ADD PARTITION on the external table; otherwise the data cannot be accessed at all.

An example makes this clearer.

The original table:

hive> desc shaka_test_external;

OK

id bigint

gender string (partition column)

Create an external table on top of the data of the table above:

create external table ext_shaka_test_external

like shaka_test_external

location 'hdfs://hdpnn:9000/group/hive/shaka_test_external';

If you query the data directly at this point, no rows are returned:

hive> show partitions ext_shaka_test_external;

hive> select * from ext_shaka_test_external;

OK

Time taken: 0.229 seconds


Here is the trap: only after adding the partitions does the data become visible:

hive> alter table ext_shaka_test_external add partition (gender='0');

OK

Time taken: 0.18 seconds

hive> alter table ext_shaka_test_external add partition (gender='1');

OK

Time taken: 1.305 seconds

hive> select * from ext_shaka_test_external;

OK

0 0

1 1

2 1

Time taken: 0.251 seconds

hive> select * from ext_shaka_test_external where gender = '0';

OK

0 0

Time taken: 0.175 seconds 
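A hedged aside that is not part of the original example: when the directory under the table location already follows Hive's key=value partition layout (gender=0, gender=1 here), the partitions can also be discovered in one step with MSCK REPAIR TABLE instead of one ALTER TABLE statement per partition:

hive> msck repair table ext_shaka_test_external;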

Hive SQL operations on HDFS are ultimately compiled into MapReduce jobs. The following describes how to "import" files that already exist on HDFS into a Hive table.

 Preconditions and requirements:

1) The HDFS files are LZO-compressed SequenceFiles

2) Sample SequenceFile record

    127.0.0.1      ae8186613401633040361974|1340163304040|0.1|hsf.qa.test|yuanfeng|test_07_01
     where the key is the host IP and the value is a single log line in the format: traceid|time|rpcid|appname|querykey|msg

3) The Hive table is an external, partitioned table

Step 1. Create the Hive table


CREATE EXTERNAL TABLE biz_eagleeye (traceId STRING, time STRING, rpcId STRING, 
appName STRING, queryKey STRING, msg STRING, kvMap STRING)
PARTITIONED BY(pt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS SEQUENCEFILE; 


The EXTERNAL and PARTITIONED keywords mark the table as an external, partitioned table. STORED AS SEQUENCEFILE tells Hive that the data source is a SequenceFile; for plain text files, use TEXTFILE instead.
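For comparison, a hedged sketch of the plain-text variant (the table name biz_eagleeye_txt is illustrative, not from the original):

CREATE EXTERNAL TABLE biz_eagleeye_txt (traceId STRING, time STRING, rpcId STRING,
appName STRING, queryKey STRING, msg STRING, kvMap STRING)
PARTITIONED BY (pt STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;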

Step 2. Load the data from HDFS


ALTER TABLE biz_eagleeye ADD PARTITION (pt='2013-02-28') LOCATION '/group/tlog/zhiyuan'; 


The LOCATION keyword supplies the HDFS path, and the partition value is given in the PARTITION clause. Note that when reading SequenceFiles, Hive by default discards the key (treating it as null) and splits only the value on the specified delimiter ('|' here). If the key must be considered, or a more complex field-splitting scheme is needed, a custom mapper and reducer can be specified:

mapred.mapper.class = org.apache.hadoop.hive.ql.exec.ExecMapper
hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
 
Step 3. Verify that the load succeeded
select * from biz_eagleeye where pt='2013-02-28' limit 1000;

 

If a new partition needs to be added every day (pointing at the previous day's data directory under the HDFS path), this can be automated with a script driven by a crontab entry.
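A minimal sketch of such a job, assuming the biz_eagleeye table above and that each day's data lands in a dated sub-directory such as /group/tlog/zhiyuan/2013-02-28 (the directory layout and script paths are assumptions):

#!/bin/bash
# add_yesterday_partition.sh -- register yesterday's HDFS directory as a new partition (sketch)
yesterday=`date -d "-1 days" +%Y-%m-%d`
hive -e "ALTER TABLE biz_eagleeye ADD PARTITION (pt='${yesterday}') LOCATION '/group/tlog/zhiyuan/${yesterday}';"

with a crontab entry along the lines of:

30 1 * * * /home/work/add_yesterday_partition.sh >> /home/work/add_yesterday_partition.log 2>&1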

3 Example: a partitioned table loaded from HDFS


hdfs dfs -cp /user/hive/warehouse/fnsearch.db/fn_product_info_source/ hdfs://10.201.194.131:8020/user/hive/warehouse/fnsearch.db/

/opt/hive/bin/hive -e "CREATE external TABLE IF NOT EXISTS fnsearch.fn_product_info_source (sm_seq string, itno string,sm_soldqty string,it_spec_seq string,sup_seq string,cp_seq string,brand_seq string) PARTITIONED BY (dt string) location '/user/hive/warehouse/fnsearch.db/fn_product_info_source';"

/opt/hive/bin/hive -e " load data  inpath '/user/hive/warehouse/fnsearch.db/fn_product_info_source/dt=main/000000_0' overwrite into table fnsearch.fn_product_info_source partition (dt='main')" 


4 Specify the field delimiter when creating a table

Hive's default field delimiter is '\001' (Ctrl-A). If no delimiter is specified when the table is created, the file to be loaded must itself be delimited by '\001'.

If the file uses some other delimiter, the load does not report an error, but every column in query results will come back as NULL.

How to create a test file delimited by '\001':

In the vi editor, press Ctrl+V followed by Ctrl+A to type a literal '\001' character from the keyboard.
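Alternatively, a hedged shell one-liner can generate such a file without vi (the file name and sample values are only illustrative):

# \001 is the octal escape for Ctrl-A, Hive's default field delimiter
printf '1\001hello\n2\001world\n' > /root/test_ctrlA.txt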

You can also specify the tab character as the delimiter when creating the table, and then produce the test file with tab-separated values (for example, exported from Excel).

For example:

create table pokes(foo INT,bar STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' STORED AS TEXTFILE; 

LOAD DATA local inpath '/root/pokes.txt' INTO TABLE pokes;

Here pokes.txt must use the tab character as its delimiter; otherwise the load does not report an error, but every column in query results will be NULL.

5 Replacing the delimiter

If the delimiter of the file to be imported does not match the table's delimiter, or if you need to replace the default field delimiter in files exported by Hive:

Although a field delimiter can be specified when a Hive table is created, files exported with insert overwrite local directory use '\001' as the field delimiter by default. The delimiter usually has to be converted to some other character, which can be done with the following command:

     sed -e 's/\x01/|/g' file 

Replace '|' with whatever delimiter you need; file is the file exported by Hive.

For example:


A table with a complex column type: here the columns are separated by '\t' and the array elements by ','.


#The data file complex.txt contains the following
 huangfengxiao   beijing,shanghai,tianjin,hangzhou
 linan   changchu,chengdu,wuhan
 
 hive> create table complex(name string,work_locations array<string>)
     > ROW FORMAT DELIMITED
     > FIELDS TERMINATED BY '\t'
     > COLLECTION ITEMS TERMINATED BY ',';
 hive> describe complex;
 OK
 name    string
 work_locations  array<string>
 hive> LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/complex.txt' OVERWRITE INTO TABLE complex;
 hive> select * from complex;                                                                
 OK
 huangfengxiao   ["beijing","shanghai","tianjin","hangzhou"]
 linan   ["changchu","chengdu","wuhan"]
 Time taken: 0.125 seconds
 hive> select name, work_locations[0] from complex;
 MapReduce Total cumulative CPU time: 790 msec
 Ended Job = job_201301211420_0012
 MapReduce Jobs Launched: 
 Job 0: Map: 1   Cumulative CPU: 0.79 sec   HDFS Read: 296 HDFS Write: 37 SUCCESS
 Total MapReduce CPU Time Spent: 790 msec
 OK
 huangfengxiao   beijing
 linan   changchu
 Time taken: 20.703 seconds


6 How to partition?

 Table class(teacher string, student string, age int)

 Mis li huangfengxiao 20

 Mis li lijie 21

 Mis li dongdong 21

 Mis li liqiang 21

 Mis li hemeng 21

 Mr xu dingding 19

 Mr xu wangqiang 19

 Mr xu lidong 19

 Mr xu hexing 19

 If we partition this class-member data by the teacher column:

 create table classmem(student string,age int) partitioned by(teacher string)

 Partition files:

 classmem_Misli.txt

  huangfengxiao 20  

  lijie 21          

  dongdong 21  

  liqiang 21          

  hemeng 21 

 classmem_MrXu.txt

  dingding 19 

  wangqiang 19 

  lidong 19         

  hexing 19   

 LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/classmem_Misli.txt' INTO TABLE classmem partition (teacher = 'Mis.li')

 LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/classmem_MrXu.txt' INTO TABLE classmem partition (teacher = 'Mr.Xu')

  #The partition column is appended as the last column by default

 hive> select * from classmem where teacher = 'Mr.Xu';

 OK

 dingding        19      NULL    Mr.Xu

 wangqiang       19      NULL    Mr.Xu

 lidong  19              NULL    Mr.Xu

 hexing  19      NULL    Mr.Xu

 Time taken: 0.196 seconds

 #Filtering on the partition column reads only that partition and is fast; if the WHERE condition is not on a partition column, the SQL is compiled into a MapReduce job and the latency is much higher.

 #That is why partitions are created on the columns most commonly used for filtering.
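 As a hedged illustration of the contrast (the second query filters on a non-partition column and is therefore compiled into a MapReduce job):

 #fast: only the teacher = 'Mr.Xu' partition is read
 hive> select * from classmem where teacher = 'Mr.Xu';
 #slower: age is not a partition column, so every partition is scanned via MapReduce
 hive> select * from classmem where age = 19;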

Why use buckets? They are more efficient and support sampling; they are mainly used for taking samples from large data sets.

 The idea behind bucketing is to slice a table (or partition): pick a bucketing column, set the number of buckets, and assign each row to a bucket by the hash of the column value modulo the number of buckets.

 For example, the data file bucket.txt contains:

 id name age

 1 huang 11

 2 li 11

 3 xu 12

 4 zhong 14

 5 hu 15

 6 liqiang 17

 7 zhonghua 19

 If we want to split this table into 3 buckets with id as the bucketing column,

 then after hashing the id column, the 3 buckets contain:

 Bucket 0 (hash(id) % 3 == 0):

 3 xu 12

 6 liqiang 17

 Bucket 1 (hash(id) % 3 == 1):

 1 huang 11

 4 zhong 14

 7 zhonghua 19

 Bucket 2 (hash(id) % 3 == 2):

 2 li 11

 5 hu 15

 The corresponding CREATE TABLE statement is:

 create table bucketmem (id int,name string,age int) CLUSTERED BY (id) SORTED BY (id ASC) INTO 3 BUCKETS
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
 LOAD DATA LOCAL INPATH '/home/hadoop/hfxdoc/bucketmem.txt' INTO TABLE bucketmem;
 select * from bucketmem tablesample(bucket 1 out of 3);
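 One hedged caveat not covered above: LOAD DATA only moves files into place and does not redistribute rows into buckets. To have Hive actually bucket the data, load the raw file into a plain staging table (bucketmem_stage here is an assumed name) and insert-select into the bucketed table; on Hive releases before 2.0, hive.enforce.bucketing must also be enabled:

 set hive.enforce.bucketing = true;   -- needed on Hive versions before 2.0
 -- bucketmem_stage is an assumed staging table with the same three columns
 insert overwrite table bucketmem select id, name, age from bucketmem_stage;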


7 Other operations (reference)

For the full reference, see the official wiki: https://cwiki.apache.org/confluence/display/Hive/Home

 1) Create a table with the same structure as an existing table (LIKE):

 Only the table structure is copied, not the data.

 create table test_like_table like test_bucket;

 2) Rename a table (RENAME TO):

 ALTER TABLE table_name RENAME TO new_table_name

 3) Add partitions (ADD PARTITION):

 ALTER TABLE table_name ADD partition_spec [ LOCATION 'location1' ] partition_spec [ LOCATION 'location2' ] ...

 4) Change a column of a table, including its name, data type, position, or comment:

 ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]

 5) Add/replace columns (ADD/REPLACE COLUMNS):

 ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

 ADD COLUMNS appends new columns after the existing columns but before the partition columns.
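 A short hedged sketch that strings a few of these together, using the test_like_table created in 1) (the new table and column names are only illustrative):

 -- rename the table created with LIKE above
 ALTER TABLE test_like_table RENAME TO test_like_renamed;
 -- append a new (illustrative) column at the end of the existing columns
 ALTER TABLE test_like_renamed ADD COLUMNS (remark string COMMENT 'illustrative new column');
 -- rename that column while keeping its type
 ALTER TABLE test_like_renamed CHANGE COLUMN remark note string COMMENT 'renamed column';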
