Hive QL查询
使用HiveQL查询映射到Hive表的数据集。 Hive中的SELECT基础语法和标准SQL语法基本一致,支持WHERE、DISTINCT、GROUP BY、ORDER BY、HAVING、LIMIT、子查询等。
HiveQL查询语法如下:
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[ORDER BY col_list]
[CLUSTER BY col_list
| [DISTRIBUTE BY col_list] [SORT BY col_list]
]
[LIMIT [offset,] rows]
select...from语句
select是SQL的投影算子,from子句标识了从哪个表、视图或嵌套查询中选择记录。
例:执行简单的查询:
select * from employees;
说明:
在Hive中查询语句往往都要被解析成MapReduce的job进行计算,但是有两个查询语句是不走MapReduce的:
- 1) select * from employees;
- 2) select * from employees limit 2;
注:在MYSQL中limit是取前几条记录,但是在Hive中,limit是抽样,会随机返回对应的记录数。
另外,“SELECT *”扫描整个表/文件,不会触发MapReduce jobs,因此它比“select 列名”要运行得快。
一、数据准备
1、启动HDFS和YARN集群:
$ start-dfs.sh $ start-yarn.sh
2、上传特价检测的数据文件pricewatch.csv:
$ hdfs dfs -mkdir -p /hive_data/p $ hdfs dfs -put pricewatch.csv /hive_data/p/
启动Hive CLI:
$ hive
创建一个Hive外部表pricewatch,并加载数据文件:
-- 创建数据库
create database xueai8;
-- 查看当前有哪些数据库
show databases;
-- 使用数据库
use xueai8;
-- 创建外部表,并忽略第一行(标题行)
create external table pricewatch(
recordid string,
jclb string,
jcmc string,
bqjg decimal(10,3),
sqjg decimal(10,3),
tb decimal(10,3),
hb decimal(10,3)
)
row format delimited
fields terminated by ','
location '/hive_data/p'
TBLPROPERTIES ('serialization.null.format' = '','skip.header.line.count' = '1');
show tables;
-- 查看完整建表语句
show create table pricewatch;
二、执行简单查询
下面是对pricewatch执行简单条件查询的示例。
-- limit:限制返回记录的数量
-- 查询前10条数据:
select * from pricewatch limit 10;
select * from pricewatch limit 10,5;
-- 条件查询
-- 单条件
-- 查询出所有本期价格高于5.00元的商品
select * from pricewatch where bqjg > 5.00;
-- 多条件
select * from pricewatch where bqjg > 5.00 and bqjg < 8.00;
select * from pricewatch where bqjg <= 5.00 and bqjg >= 8.00;
-- 对查询结果排序
select * from pricewatch where bqjg > 5.00 and bqjg < 8.00 order by bqjg;
select * from pricewatch where bqjg > 5.00 and bqjg < 8.00 order by bqjg desc;
-- null值比较: is null和is not null
select * from pricewatch where jclb is not null;
select * from pricewatch where jclb is not null limit 5;
-- distinct:去重
-- 查看有哪些检测类别
select distinct jclb from pricewatch;
-- in 和 not in运算符
-- 找出检测名称为“生菜”、“芹菜”和“菠菜”的商品
select * from pricewatch where jcmc in ('生菜','芹菜','菠菜');
-- 找出检测名称不是“生菜”、“芹菜”和“菠菜”的商品
select * from pricewatch where jcmc not in ('生菜','芹菜','菠菜');
-- between...and...和not between...and...
-- 找出本期价格在2块~5块之间的商品
select * from pricewatch where bqjg between 2 and 5;
-- 找出本期价格低于2块或高于5块的商品
select * from pricewatch where bqjg not between 2 and 5;
-- 找出价格环比上涨的商品
select * from pricewatch where bqjg>sqjg;
select * from pricewatch where hb>0;
三、执行模糊查询
下面是对pricewatch执行模糊查询的示例。
通过使用like、not like和rlike关键字实现模糊匹配查询。它们之间的区别如下:
- 1) like和not like支持两个简单匹配符号 _和%,其中"_"表示任意单个字符,字符"%"表示任意数量的任意字符。
- 2) rlike支持标准的Java正则表达式符号。
-- 找出所有检测名称以'米'字结尾的商品
select * from pricewatch where jcmc like '%米';
select * from pricewatch where jcmc rlike '米$';
-- 找出所有检测名称以'泰'字开头的商品
select * from pricewatch where jcmc like '泰%';
select * from pricewatch where jcmc rlike '^泰';
-- 找出所有检测名称以'花'字开头、以'油'字结尾的商品
select * from pricewatch where jcmc like '花%油';
select * from pricewatch where jcmc rlike '^花.*油$';
select * from pricewatch where jcmc not like '花%油';
-- 找出所有检测名称以'米'或'油'字结尾的商品
select * from pricewatch where jcmc rlike '[米油]$';
-- 找出所有检测名称为"xxx菜"的商品
select * from pricewatch where jcmc rlike '^.{1}菜$';
-- 找出价格高于8元的大米
select * from pricewatch where bqjg>8 and jcmc like '%米';
select * from pricewatch where bqjg<3.5 and jcmc like '%米';
四、使用聚合函数和分组聚合查询
下面的示例在查询中使用了聚合函数。
-- 找出最高价格(最低价格、平均价格) select max(bqjg) as max_price from pricewatch; select avg(bqjg) as avg_price from pricewatch; select min(bqjg) as min_price from pricewatch; -- 统计共有多少个检测商品 select count(*) from pricewatch; select count(*) as total from pricewatch where jclb is not null; -- 统计每个检测类别有多少个检测商品 select jclb,count(*) from pricewatch group by jclb; select jclb,count(*) as total from pricewatch group by jclb order by total desc; -- 统计每种被检测商品类别的均价 select jclb,avg(bqjg) as avg_price from pricewatch group by jclb; select distinct jclb as jc from pricewatch limit 5; select jclb as `检测类别` from pricewatch limit 1; -- 统计均价高于10元的商品类别 select jclb, avg(bqjg) from pricewatch group by jclb having avg(bqjg)>10; select jclb, round(avg(bqjg),2) from pricewatch group by jclb having avg(bqjg)>10;
五、case when语句
根据原始字段的值,输出转换后的新值。常用在机器学习数据打标签上。
create table if not exists price_label as
select jcmc,
case
when bqjg>25 then 2
when bqjg<10 then 0
else 1
end as label
from pricewatch;
select * from price_label limit 10;
六、子查询
Hive QL也支持子查询。
select p.* from pricewatch p where p.bqjg in (select max(d.bqjg) from pricewatch d);