本课程共七个章节,课程地址:7周成为数据分析师(完结)_哔哩哔哩_bilibili
- 数据分析思维
- 业务知识
- Excel
- 数据可视化
- SQL
- 统计学
- Python
- 数据库(MySQL)
- Join
- SQL练习题
目录
第五周:SQL(P61-P76)
一、数据库(MySQL)
(一)下载与安装
(二)数据库基础理论
(三)MySQL Workbench的基本使用
(四)SQL基本语句(对单张表)
二、Join(多表)
三、SQL练习题(LeetCode)
175题:Combine two tables
176题:Second Highest Salary
184题:Department Highest Salary
四、SQL练习题(orderinfo 和 userinfo)
建表
如何在命令行里导入数据
时间操作
练习题
五、Power BI连接数据库
获取数据 - MySQL数据库 - 连接
也可以通过SQL语句来导入
抽样
一、数据库(MySQL)数据库管理软件:MySQL
(一)下载与安装1. 怎么查看自己下载的MySQL版本?
- 进入MySQL 8.0 Command Line Client
2. 单独下载MySQL Workbench(注意要与MySQL版本相对应):
MySQL :: Download MySQL Workbench (Archived Versions)
界面如下:
(二)数据库基础理论
- 数据库是数据存储的集合
- 表是数据结构化的信息
- 列存储表中的组织信息
- 行存储表中的明细记录
- 主键是表中的唯一标示,不具备业务意义
主键:
- 表的主键不做强制要求,但建议设立
- 主键值必须唯一
- 每行必须有一个主键,不可为空
- 主键值不可被修改
- 主键值被删除后不可重用
- 表A的主键,可以作为表B的字段,此时不受约束
类型:
- char(文本、富文本、可变长度)
- int
- float
- date
- 布尔值
- timestamp(精确到秒)
(三)MySQL Workbench的基本使用
建schema:
Apply即可
建表:
Apply即可
加载数据:
导入company_sql.csv
同理新建表 dataAnalyst 并加载数据,注意新建表时要将 positionAdvantage(varchar(100))和 positionLabels(varchar(250))的长度设置的大一些,否则会报错
(四)SQL基本语句(对单张表)
SQL语句的结尾一定要加上分号,表示一个命令(可能对应几条SQL语句)的结束
# 升序排列 order by companyId # 降序排列 order by companyId desc
过滤数据(where):
select * from data.dataanalyst where companyId = 4184; # 还可以 > < between...and... select * from data.dataanalyst where city = "杭州";
不等于:
- !=
- <>
- not in (... , ...)
多条件查找(and/or):
- and(比 or 优先)
- or
例:城市为上海,可以是学历为本科,或工作年限满足1-3年
select * from data.dataanalyst where city="上海" and (education="本科" or workYear="1-3年")
模糊查找/包含某个词(like):
select * from data.dataanalyst where secondType like '%开发%' # 包含开发这个词
分组(group by),类似于 Excel 里的数据透视表:
例:查询不同城市下招聘的职位数量
select city,count(positionId) from data.dataanalyst # count(1)或count(*)结果一样 group by city
例:查询不同城市下招聘的公司数量(一个公司可能在一个城市同时招很多岗位,需要去重)
select city, count(positionId), count(distinct companyId) from data.dataanalyst group by city
例:想知道每个城市下不同学历的岗位的数量
select city,education,count(1) from data.dataanalyst group by city,education
例:只需要每个城市中岗位数在100以上的数据
针对分组(group by)后的结果进行过滤,不用where,用having
having后面还可以加 and 连接多个条件,实现多重过滤
select city,count(1) from data.dataanalyst group by city having count(positionId)>=100
例:找出不同城市电子商务岗位数量在50以上的城市(哪些城市电子商务岗位招了50个以上)
select city,count(1) from data.dataanalyst where industryField like '%电子商务%' group by city having count(positionId)>=50
嵌套函数:
select city,count(1) from data.dataanalyst group by city having count(if(industryField like "%电子商务%", 1 ,null))>=50
count() 对 null 是不计数的
例:不同城市下,电子商务岗位占这个城市招聘总数的比例?
select city, count(1), count(if(industryField like '%电子商务%',industryField , null)), count(if(industryField like '%电子商务%',industryField , null))/count(1) from data.dataanalyst group by city
在上例的基础上,去掉“电子商务”岗位数量小于10的城市:
select city, count(1), count(if(industryField like '%电子商务%',industryField , null)), count(if(industryField like '%电子商务%',industryField , null))/count(1) from data.dataanalyst group by city having count(if(industryField like '%电子商务%',industryField , null)) >= 10 order by count(if(industryField like '%电子商务%',industryField , null))
把字段进行别名:
select city, count(1) as total, count(if(industryField like '%电子商务%',industryField , null)) as emarket, count(if(industryField like '%电子商务%',industryField , null))/count(1) from data.dataanalyst group by city having emarket >= 10 order by emarket
结果和上面一样
但注意,select层面不能用别名来代替,否则会报错
提取 salary 列的上下限:
left(salary,1) # 从左截取salary列的第一个字符 locate('k', salary, 3) # 从salary列的第三个位置开始查找,返回k的位置(还是相对于整个字符串来说)
提取salary的下限(从左边开始截取):
select left(salary,locate('k',salary)-1) , salary from data.dataanalyst
提取salary的上限:
- 法1(从右边开始截取):
select left(right(salary,length(salary)-locate('-',salary)),length(salary)-locate('-',salary)-1), salary from data.dataanalyst
- 法2:substr(字符串, 从哪里开始, 截取长度)
select substr(salary, locate('-',salary)+1, length(salary)-locate('-',salary)-1), salary from data.dataanalyst
子查询:
select * from ( select left(salary,locate('k',salary)-1) as bottom, substr(salary, locate('-',salary)+1, length(salary)-locate('-',salary)-1) as top, salary from data.dataanalyst ) as t
这里的 * 可以换为 bottom,top
或平均值 (bottom+top)/2
子查询的过滤功能:
select city from data.dataanalyst group by city having count(positionId)>=100
若想知道符合条件的岗位的明细,而不仅仅是城市:
select * from data.dataanalyst where city in( select city from data.dataanalyst group by city having count(positionId)>=100)
分组(case... when...then...else...),类似于 Excel 中的 VLOOKUP:
将平均薪资分为四个档次:
select case when (bottom+top)/2 <=10 then '0~10' when (bottom+top)/2 <=20 then '10~20' # 相当于between 10 and 20 when (bottom+top)/2 <=30 then '20~30' else '30+' # 顺序判断 end, salary from ( select left(salary,locate('k',salary)-1) as bottom, substr(salary, locate('-',salary)+1, length(salary)-locate('-',salary)-1) as top, salary from data.dataanalyst ) as t
一张公司表 company,一张职位表 dataanalyst(只有companyId)
例:想知道某家公司,如“唯医网”的招聘职位有哪些
select * from data.dataanalyst where companyId = ( select companyId from data.company where companyShortName = "唯医网" )
Join 可以理解为Excel里的VLOOKUP(关联)
select * from data.dataanalyst as d join data.company as c on d.companyId = c.companyId
left join:不管能不能匹配,都会返回,返回的结果比 inner join 多,只是不能匹配上的会返回空值(左表是齐全的)
select * from data.dataanalyst as d left join (select * from data.company where companySize = "150-500人") as t on d.companyId = t.companyId
left join 和 join 的区别:
join为取两张表的交集,left join为取左表的全部、右表的部分
例:统计除了公司大小为“150-500人”之外的,其他所有公司有多少个
select count(1) from data.dataanalyst as d left join (select * from data.company where companySize = "150-500人") as t on d.companyId = t.companyId where t.companyId is null
例:“150-500人”公司的岗位有多少个?占比多少?
select count(1),count(t.companyId),count(t.companyId)/count(1) from data.dataanalyst as d left join (select * from data.company where companySize = "150-500人") as t on d.companyId = t.companyId
Problems - LeetCode
175题:Combine two tables
Write an SQL query to report the first name, last name, city, and state of each person in the Person table. If the address of a personId is not present in the Address table, report null instead.
select firstName,lastName,city,state from Person p left join Address a on p.personId = a.personId
176题:Second Highest Salary
Write an SQL query to report the second highest salary from the Employee table. If there is no second highest salary, the query should report null.
select max(salary) as SecondHighestSalary from Employee where salary not in( select max(salary) from Employee)
184题:Department Highest Salary
Write an SQL query to find employees who have the highest salary in each of the departments.
Return the result table in any order.
# 自己写的 select t.name as Department, Employee.name as Employee, Employee.salary as Salary from (select * from Department as d # d:部门的id和name join (select departmentId,max(salary) as salary from Employee group by departmentId) as e # e:每个部门的id及最高工资 on d.id = e.departmentId) as t # t:四列,分别是id(部门)、name(部门)、departmentId、salary join Employee # Employee:四列,分别是id(员工)、name(员工)、salary、departmentId on Employee.salary = t.salary and Employee.departmentId = t.departmentId
要考虑并列最高薪资的情况,所以还要 join Employee表
考点:group by 和 join 的混合用法,group by返回的是唯一值
# 老师写的 select t.Name as Department e.name as Employee e.salary as Salary from( select departmentId, d.name as Name, max(salary) as Salary from Employee e join Department d on d.id = e.departmentId group by departmentId) as t join Employee e on t.departmentId = e.departmentId and e.salary = t.Salary四、SQL练习题(orderinfo 和 userinfo) 建表
用户消费相关数据:记录了一个什么样的用户、在什么时候、进行了一笔什么样的消费
Apply即可
用户明细相关数据:
Apply即可
如何在命令行里导入数据
大数据量建议采用这种方式来导入数据,如果从 MySQL Workbench 中导入效率会非常低
打开命令行(cmd),输入:
mysql -u root -p --local-infile
再输入数据库密码,登录进数据库后,输入:
show global variables like 'local_infile';
是 ON 即可,接着输入:
LOAD DATA LOCAL INFILE 'C:/Users/11842/Desktop/user_info_utf.csv' INTO TABLE data.userinfo FIELDS TERMINATED BY ',';
可以看到数据成功导入
另一张表同理:
LOAD DATA LOCAL INFILE 'C:/Users/11842/Desktop/order_info_utf.csv' INTO TABLE data.orderinfo FIELDS TERMINATED BY ',';
这个过程中踩了不少坑,参考了以下帖子:
解决ERROR 3948(42000):Loading local data is disabled; this must be enabled on both the client and_cxmda的博客-CSDN博客哪位可以指导下MySQL无法导入本地文件的问题嘛?反馈说是没有权限-大数据-CSDN问答
mysql使用load data local infile导入数据Error 3948和Error 2068_jLiu2020的博客-CSDN博客_mysql2068错误
mysql 命令式的导入csv数据_恋志传奇的博客-CSDN博客_mysql命令行导入csv
ERROR3948: Loading local data is disabled - this must be enabled on both the client and server sides_young_kp的博客-CSDN博客
时间操作# 返回当下时间,常用在距今为止多长时间的场景下 select now()
SELECT paidTime, date(paidTime), date_format(paidTime,'%Y-%m') FROM data.orderinfo
date(xxx) 和 date_format(xxx,'%Y-%m-%d') 是等价的
month(xxx) 和 date_format(xxx,'%Y-%m') 的区别在于:前者只返回月份,后者还返回年份
- date_add():如 date_add(paidTime, interval 1 day) 意为 在 paidTime 日期上加上一天
- date_sub()
- datediff():两个日期的差值
练习题
- 统计不同月份的下单人数
- 统计用户三月份的回购率和复购率
-
统计男女用户的消费频次是否有差异
-
统计多次消费的用户,第一次和最后一次消费间隔是多少?(市场生命周期)
-
统计不同年龄段,用户的消费金额是否有差异?
-
统计消费的二八法则,消费的top20%用户,贡献了多少额度
SELECT month(paidTime), count(distinct useId) FROM data.orderinfo where isPaid = '已支付' group by month(paidTime)
2. 统计用户三月份的回购率和复购率
- 复购率:在这个月份里所有消费人数中,有多少人消费了一次以上,占比(同一个月)
- 回购率:三月份购买的人数,四月份依旧购买,占比(跨月份)
复购率:
SELECT useId,count(useId) FROM data.orderinfo where isPaid = '已支付' and month(paidTime) = 3 group by useId
select count(ct),count(if(ct>1, 1, null)) from( SELECT useId,count(useId) as ct FROM data.orderinfo where isPaid = '已支付' and month(paidTime) = 3 group by useId) t
两者除一下即为复购率
回购率:
select 人数 from xxx where useId in (子查询,算出3月份的useId) and month() = 4 group by useId # 去重 select count(distinct xxx) from xxx where useId in (子查询,算出3月份的useId) and month() = 4
一种更通用的方法:
# 统计每个用户在每个月是否消费过 select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo where isPaid = '已支付' group by useId, date_format(paidTime, '%Y-%m-01')
select * from( (select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo where isPaid = '已支付' group by useId, date_format(paidTime, '%Y-%m-01')) t1 left join (select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo where isPaid = '已支付' group by useId, date_format(paidTime, '%Y-%m-01')) t2 on t1.useId = t2.useId and t1.m = date_sub(t2.m, interval 1 month)) # 要求满足t1.m = t2.m减去一个月
从结果中可以看出,若没有回购,则 t2.m 列为NULL,而 count() 是不统计空值的
select t1.m, count(t1.m), count(t2.m) from( select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo where isPaid = '已支付' group by useId, date_format(paidTime, '%Y-%m-01')) t1 left join (select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo where isPaid = '已支付' group by useId, date_format(paidTime, '%Y-%m-01')) t2 on t1.useId = t2.useId and t1.m = date_sub(t2.m, interval 1 month) group by t1.m
从而 count(t2.m)/count(t1.m) 即为回购率
3. 统计男女用户的消费频次是否有差异
先统计不同用户(带上性别)的消费频次
select * from data.orderinfo as o left join( select * from data.userinfo where sex <> '') as t on o.useId = t.userId
将 left join 改为 inner join 就可以去掉那些为 NULL 的行
select userId,sex,count(1) from data.orderinfo as o inner join( select * from data.userinfo where sex <> '') as t on o.useId = t.userId group by userId,sex
再统计不同性别的消费频次
select sex, avg(ct) from( select userId,sex,count(1) as ct from data.orderinfo as o inner join( select * from data.userinfo where sex <> '') as t on o.useId = t.userId group by userId,sex) as t2 group by sex
4. 统计多次消费的用户,第一次和最后一次消费间隔是多少?(可以估算生命周期)
先提取出多次消费的用户
select useId, max(paidTime), min(paidTime), datediff(max(paidTime),min(paidTime)) from data.orderinfo where isPaid = '已支付' group by useId having count(1) > 1
datediff() 返回的是两者间隔的天数
5. 统计不同年龄段,用户的消费金额是否有差异?
先统计出每个用户的年龄段
select userId, ceil((year(now()) - year(birth))/10) from data.userinfo where birth > '1901-00-00'
ceil() 向上取整
再统计每个用户的消费频次
select userId,age,count(userId) from data.orderinfo as o inner join( select userId, ceil((year(now()) - year(birth))/10) as age from data.userinfo where birth > '1901-00-00') as t on o.useId = t.userId and o.isPaid = '已支付' group by userId,age
最后再统计不同年龄段的消费频次
select age, avg(ct) from( select userId,age,count(userId) as ct from data.orderinfo as o inner join( select userId, ceil((year(now()) - year(birth))/10) as age from data.userinfo where birth > '1901-00-00') as t on o.useId = t.userId and o.isPaid = '已支付' group by userId,age) as t2 group by age
6. 统计消费的二八法则,消费的top20%用户,贡献了多少额度
先把用户按消费金额排序
select useId, sum(price) as total from data.orderinfo where isPaid = '已支付' group by useId order by total desc
select count(useId)*0.2, sum(total) from( select useId, sum(price) as total from data.orderinfo where isPaid = '已支付' group by useId order by total desc) as t
大约为17000个
故再加个 limit 17000 即可找出消费的 top20%用户
select count(useId), sum(total) from( select useId, sum(price) as total from data.orderinfo where isPaid = '已支付' group by useId order by total desc limit 17000) as t
MySQL :: Download MySQL Connector/NET (Archived Versions)
下载与自己MySQL版本匹配的Connector,并安装到MySQL的安装路径下
重启Power BI
参考帖子:
PowerBi连接MySQL数据库_Cerelise的博客-CSDN博客_powerbi连接mysql数据库
POWERBI连接Mysql数据库:错误:需要安装一个或多个组件_CheyenneLam的博客-CSDN博客_此连接器需要安装一个或多个其他组件才能使用
Power BI连接mysql数据库--学习篇_谨忽微慎所溺的博客-CSDN博客_powerbi连接mysql
也可以通过SQL语句来导入
select useId, price, paidTime, sex, birth from data.orderinfo as o left join data.userinfo as u on o.useId = u.userId where isPaid = '已支付'
如果想要更改SQL语句的话,可以:
建议在获取数据 - MySQL数据库 - 连接时,就通过SQL语句把我们想要连接的表连接好、想要过滤的数据过滤好,而不要在Power BI里再改SQL(出于性能的考虑)
因为前者是在服务器端进行查询,把结果返回,占用的是服务器端的性能;后者使用的是自己电脑的性能
抽样
select useId, price, paidTime, sex, birth from data.orderinfo as o left join data.userinfo as u on o.useId = u.userId where isPaid = '已支付' and right(useId,1) in (0,1,2,3,4) # 抽取useId最后一位为0-4的数据