您当前的位置: 首页 >  大数据

段智华

暂无认证

  • 3浏览

    0关注

    1232博文

    0收益

  • 0浏览

    0点赞

    0打赏

    0留言

私信
关注
热门博文

大数据Spark “蘑菇云”行动第90课:Hive中Join电影店铺系统案例和性能优化、Index和Bucket案例实战

段智华 发布时间:2016-12-01 21:42:04 ,浏览量:3

大数据Spark “蘑菇云”行动第90课:Hive中Join电影店铺系统案例和性能优化、Index和Bucket案例实战 电影数据集下载 http://grouplens.org/datasets/movielens/ http://grouplens.org/datasets/movielens/1m/ http://files.grouplens.org/datasets/movielens/ml-1m-README.txt http://files.grouplens.org/datasets/movielens/ml-1m.zip 1,hive的map join 小表 -> hashtables ->广播到 map端 ->大表做一行行的匹配 2,semi join  传递表的key传过去, 三个表的数据结构 RATINGS FILE DESCRIPTION     UserID::MovieID::Rating::Timestamp USERS FILE DESCRIPTION       UserID::Gender::Age::Occupation::Zip-code MOVIES FILE DESCRIPTION      MovieID::Title::Genres hive> create database hivestudy show databases; use hivestudy; show tables; 建立用户表://不支持LONG CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zip-code String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS  TERMINATED BY '::'; LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/users.dat' INTO TABLE users PARTITION(dt='20161101'); 建立评分表: CREATE TABLE ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::'; LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/ratings.dat' INTO TABLE ratings; 建立电影表 CREATE TABLE movies(MovieID BigInt, Title String, Genres String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::'; LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/movies.dat' INTO TABLE movies; DROP TABLE users; CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS  TERMINATED BY '::'; LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/users.dat' INTO TABLE users PARTITION(dt='20161101'); CREATE TABLE ratings(UserID BigInt, MovieID BigInt, Rating Double, Timestamped String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::'; LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/ratings.dat' INTO TABLE ratings; CREATE TABLE movies(MovieID BigInt, Title String, Genres String) ROW FORMAT DELIMITED FIELDS TERMINATED BY '::'; LOAD DATA LOCAL INPATH '/root/Documents/data/moviesData/movies.dat' INTO TABLE movies; hive>select count(1) from ratings; RATINGS     UserID::MovieID::Rating::Timestamp USERS       UserID::Gender::Age::Occupation::Zip-code MOVIES      MovieID::Title::Genres   hive>  SELECT  users.UserID, users.Age, users.Gender FROM ratings JOIN users ON (ratings.UserID = users.UserID) WHERE ratins.MovieID = 2116 ; 空结果,原因是数据没有导入进来 select * from users limit 10; //null 删除表,重来, drop tables  users; 转义字符,转不了 CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS  TERMINATED BY '\:\:'; 用ascii编码: CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS  TERMINATED BY '\u3A\u3A'; CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS  TERMINATED BY '\\58'; 数据清洗换个办法:将::用\t 换表符替换掉! 现在好了  CREATE TABLE users(UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String) PARTITIONED BY (dt String) ROW FORMAT DELIMITED FIELDS  TERMINATED BY '\t'; 重新测试join RATINGS     UserID::MovieID::Rating::Timestamp USERS       UserID::Gender::Age::Occupation::Zip-code MOVIES      MovieID::Title::Genres   hive>  SELECT  users.UserID, users.Age, users.Gender FROM ratings JOIN users ON (ratings.UserID = users.UserID) WHERE ratins.MovieID = 2116 ; hive> SELECT  users.UserID, users.Age, users.Gender,movies.title,movies.gendre  FROM ratings JOIN users ON (ratings.UserID = users.UserID) JOIN movies  ON (rating.movie.id = movies .movid) WHERE ratins.MovieID = 2116 ; HIVE很诡异的一个用法,map端 速度快一点

hive>SELECT  /*+MAP JOIN(MOVIES)*/ users.UserID, users.Age, users.Gender,movies.title,movies.gendre  FROM ratings JOIN users ON (ratings.UserID = users.UserID) JOIN movies ON (rating.movie.id = movies .movid) WHERE ratins.MovieID = 2116 ;

MAPJION会把小表全部读入内存中,在map阶段直接拿另外一个表的数据和内存中表数据做匹配,由于在map是进行了join操作,省去了reduce运行的效率也会高很多 这样就不会由于数据倾斜导致某个reduce上落数据太多而失败。于是原来的sql可以通过使用hint的方式指定join时使用mapjoin。

关注
打赏
1659361485
查看更多评论
立即登录/注册

微信扫码登录

0.2548s