博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
hive的优化方式
阅读量:2395 次
发布时间:2019-05-10

本文共 4239 字,大约阅读时间需要 14 分钟。

hive相关优化方式:

Column Pruning

As name suggests -discard columns which are not needed

> select a,b from t where e<10;

>t contains 5 columns (a,b,c,d,e)

Columns c,d are discarded

Select only the relevant columns

Enabled by defaut

  > hive.optimize.cp=true

Predicate Pushdown

Move predicate closer to the table scan only.

Enabled by default:

hive.optimize.ppd=true

Predicates moved up across joins.

select * from t1 join t2 on(t1.c1=t2.c2 and t1.c1<10)

select * from t1 join t2 on(t1.c1=t2.c2) where t1.c1<10

Special needs for outer joins:

Left outer join: predicates on the left side aliases are pushed

Right outer join:predicates on the right side aliases are pushed

Full outer join:none of the predicates are pushed

Non-deterministic functions(eg.rand()) not pushed.

Use annotation:

> #UDFType(deterministic=false)

The entire expression containing non-deterministic function is not pushed up

> c1>10 and c2<rand()

Partition Pruning

Reduce list of partitions to be scanned

Works on parse tree currently- some known bugs

select * from 

     (select c1, count(1) from t group by c1) subq

where subq.prtn=100;

select * from t1 join

     (select * from t2) subq on(t1.c1=subq.c2)

where subq.prtn=100;

hive.mapred.mode=nonstrict

Strict mode, scan of a complete partitioned table fails.

Hive QL-Join

INSERT OVERWRITE TABLE pv_users

SELECT pv.pageid,u.age

FROM page_view pv

JOIN user u

ON (pv.userid=u.userid);

Rightmost table streamed - whereas inner tables data is 

kept in memory for a given key. Use largest table as the 

right most table.

hive.mapred.mode=nonstrict

In strict mode,Cartesian product not allowed

INSERT OVERWRITE TABLE pv_users

SELECT pv.pageid,u.age

FROM page_view p JOIN user u

ON (pv.userid=u.useid)

JOIN new user x on (u.userid=x.useid);

Same join key - merge into 1 map-reduce job -true for any number of tables with the same join key.

1 map-reduce job instead of 'n'

The merging happens for OUTER joins also

INSERT OVERWRITE TABLE pv_users

SELECT pv.pageid,u.age

FROM page_view p JOIN user u

    ON (pv.userid=u.userid)

     JOIN new user x on (u.age=x.age);

Different join keys - 2 map-reduce jobs

Same as:

INSERT OVERWRITE TABLE tmptable SELECT *

FROM page_view p JOIN user u

  ON(pv.useid=u.userid);

INSERT OVERWRITE TABLE pv_users

SELECT x.pageid,x.age

FROM temptable x JOIN newuser y on(x.age=y.age);

Join Optimizations

Map Joins

 > User specified small tables stored in hash tables on the mapper backed by jdbm

 > No reducer needed

INSERT INTO TABLE pv_users

SELECT /*+MAPJOIN(pv) */ pv.pageid,u.age 

FROM page_view pv JOIN user u

ON(pv.userid=u.userid);

Map Join

>>Optimization phase

>> n-way map-join if(n-1) tables are map side readable

>>Mapper reads all (n-1) table before processing the main table under consideration

>>Map-side readable tables are cached in memory and backed by JDBM persistent hash tables

Parameters

>>hive.join.emit.interval=1000

>>hive.mapjoin.size.key=10000

>>hive.mapjoin.cache.numrows=10000

Future

>>Sizes/statistics to determine join order

>>Sizes to enforce map-join

>>Better techniques for handling skew for a given key

>>Using sorted properties of the table

>>Fragmented joins

>>n-way joins for different join keys by replicating data

Hive QL - Group By

SELECT pageid, age, count(1) 

FROM pv_users

GROUP BY paged,age;

Group by Optimizations

>>Map side partial aggregations

     > Hash-based aggregates

     > Serialized key/values in hash tables

     > 90% speed improvement on Query 

         select count(1) from t;

>> Load balancing for data skew

Parameters

>> hive.map.aggr=true

>> hive.groupby.skewindata=false

>> hive.groupby.mapaggr.checkinterval=100000

>> hive.map.aggr.hash.percentmemory=0.5

>> hive.map.aggr.hash.min.reduction=0.5

Multi GroupBy

FROM pv_users

        INSERT OVERWRITE TABLE pv_gender_sum

             SELECT gender, count(DISTINCT userid),count(userid)

                   GROUP BY gender

        INSERT OVERWRITE TABLE pv_age_sum

              SELECT age, count(DISTINCT userid)

                   GROUP BY age

>>n+1 map-reduce jobs instead of 2n

>> Single scan of input table

>>Same distinct key across all groupies

>>Always user multi-groupby

Merging of small files

>>Lots of small files creates problems for downstream jobs

     > SELECT * FROM T WHERE x<10;

>> hive.merge.mapfiles=true

>> hive.merge.mapredfiles=false

>> hive.merge.size.per.task=256*1000*1000

>> Increases time for current query

转载地址:http://ovzob.baihongyu.com/

你可能感兴趣的文章
SANS FOR572 Logstash
查看>>
FreeBSD kernel NFS client local vulnerabilities
查看>>
OpenGL坐标系
查看>>
VS2008快捷键大全
查看>>
poj 2140 Herd Sums
查看>>
poj 2524 Ubiquitous Religions
查看>>
poj 1611 The Suspects
查看>>
poj 3331 The Idiot of the Year Contest!
查看>>
poj 3233 Matrix Power Series
查看>>
poj 3070 Fibonacci
查看>>
poj 1656 Counting Black
查看>>
BestCoder Round #28
查看>>
poj3299 Humidex
查看>>
poj2159 Ancient Cipher
查看>>
poj1083 Moving Tables
查看>>
poj2255 Tree Recovery
查看>>
zoj 1745 Are We There Yet?
查看>>
UVA100 The 3n + 1 problem
查看>>
hdu1754 I Hate It
查看>>
hdu 1166 敌兵布阵(求区间的和,单节点更新)
查看>>