【第五周：SQL】7周成为数据分析师

本课程共七个章节，课程地址：7周成为数据分析师（完结）_哔哩哔哩_bilibili

数据分析思维
业务知识
Excel
数据可视化
SQL
统计学
Python

第五周：SQL（P61-P76）

数据库（MySQL）
Join
SQL练习题

第五周：SQL（P61-P76）

一、数据库（MySQL）

（一）下载与安装

（二）数据库基础理论

（三）MySQL Workbench的基本使用

（四）SQL基本语句（对单张表）

二、Join（多表）

三、SQL练习题（LeetCode）

175题：Combine two tables

176题：Second Highest Salary

184题：Department Highest Salary

四、SQL练习题（orderinfo 和 userinfo）

建表

如何在命令行里导入数据

时间操作

练习题

五、Power BI连接数据库

获取数据 - MySQL数据库 - 连接

也可以通过SQL语句来导入

抽样

一、数据库（MySQL）

数据库管理软件：MySQL

（一）下载与安装

1. 怎么查看自己下载的MySQL版本？

进入MySQL 8.0 Command Line Client

2. 单独下载MySQL Workbench（注意要与MySQL版本相对应）：

MySQL :: Download MySQL Workbench (Archived Versions)

界面如下：

（二）数据库基础理论

数据库是数据存储的集合
表是数据结构化的信息
列存储表中的组织信息
行存储表中的明细记录
主键是表中的唯一标示，不具备业务意义

主键：

表的主键不做强制要求，但建议设立
主键值必须唯一
每行必须有一个主键，不可为空
主键值不可被修改
主键值被删除后不可重用
表A的主键，可以作为表B的字段，此时不受约束

类型：

char（文本、富文本、可变长度）
int
float
date
布尔值
timestamp（精确到秒）

（三）MySQL Workbench的基本使用

建schema：

Apply即可

建表：

Apply即可

加载数据：

导入company_sql.csv

同理新建表 dataAnalyst 并加载数据，注意新建表时要将 positionAdvantage（varchar(100)）和 positionLabels（varchar(250)）的长度设置的大一些，否则会报错

（四）SQL基本语句（对单张表）

SQL语句的结尾一定要加上分号，表示一个命令（可能对应几条SQL语句）的结束

# 升序排列
order by companyId

# 降序排列
order by companyId desc

过滤数据（where）：

select * from data.dataanalyst
where companyId = 4184;   # 还可以 >  <  between...and...

select * from data.dataanalyst
where city = "杭州";

不等于：

!=
<>
not in (... , ...)

多条件查找（and/or）：

and（比 or 优先）
or

例：城市为上海，可以是学历为本科，或工作年限满足1-3年

select * from data.dataanalyst
where city="上海"
and (education="本科" or workYear="1-3年")

模糊查找/包含某个词（like）：

select * from data.dataanalyst
where
secondType like '%开发%'    # 包含开发这个词

分组（group by），类似于 Excel 里的数据透视表：

例：查询不同城市下招聘的职位数量

select city,count(positionId) from data.dataanalyst   # count(1)或count(*)结果一样
group by city

例：查询不同城市下招聘的公司数量（一个公司可能在一个城市同时招很多岗位，需要去重）

select city, count(positionId), count(distinct companyId) from data.dataanalyst
group by city

例：想知道每个城市下不同学历的岗位的数量

select city,education,count(1) from data.dataanalyst
group by city,education

例：只需要每个城市中岗位数在100以上的数据

针对分组（group by）后的结果进行过滤，不用where，用having

having后面还可以加 and 连接多个条件，实现多重过滤

select city,count(1) from data.dataanalyst
group by city
having count(positionId)>=100

例：找出不同城市电子商务岗位数量在50以上的城市（哪些城市电子商务岗位招了50个以上）

select city,count(1) from data.dataanalyst
where industryField like '%电子商务%'
group by city
having count(positionId)>=50

嵌套函数：

select city,count(1) from data.dataanalyst 
group by city
having count(if(industryField like "%电子商务%", 1 ,null))>=50

count() 对 null 是不计数的

例：不同城市下，电子商务岗位占这个城市招聘总数的比例？

select 
city,
count(1),
count(if(industryField like '%电子商务%',industryField , null)),
count(if(industryField like '%电子商务%',industryField , null))/count(1)
from data.dataanalyst
group by city

在上例的基础上，去掉“电子商务”岗位数量小于10的城市：

select 
city,
count(1),
count(if(industryField like '%电子商务%',industryField , null)),
count(if(industryField like '%电子商务%',industryField , null))/count(1)
from data.dataanalyst
group by city
having count(if(industryField like '%电子商务%',industryField , null)) >= 10
order by count(if(industryField like '%电子商务%',industryField , null))

把字段进行别名：

select 
city,
count(1) as total,
count(if(industryField like '%电子商务%',industryField , null)) as emarket,
count(if(industryField like '%电子商务%',industryField , null))/count(1)
from data.dataanalyst
group by city
having emarket >= 10
order by emarket

结果和上面一样

但注意，select层面不能用别名来代替，否则会报错

提取 salary 列的上下限：

left(salary,1)   # 从左截取salary列的第一个字符

locate('k', salary, 3)    # 从salary列的第三个位置开始查找，返回k的位置（还是相对于整个字符串来说）

提取salary的下限（从左边开始截取）：

select left(salary,locate('k',salary)-1) , salary from data.dataanalyst

提取salary的上限：

法1（从右边开始截取）：

select 
left(right(salary,length(salary)-locate('-',salary)),length(salary)-locate('-',salary)-1),
salary from data.dataanalyst

法2：substr(字符串, 从哪里开始, 截取长度)

select 
substr(salary, locate('-',salary)+1, length(salary)-locate('-',salary)-1),
salary from data.dataanalyst

子查询：

select * from (
select left(salary,locate('k',salary)-1) as bottom,
substr(salary, locate('-',salary)+1, length(salary)-locate('-',salary)-1) as top,
salary from data.dataanalyst ) as t

这里的 * 可以换为 bottom,top

或平均值 (bottom+top)/2

子查询的过滤功能：

select city from data.dataanalyst
group by city having count(positionId)>=100

若想知道符合条件的岗位的明细，而不仅仅是城市：

select * from data.dataanalyst
where city in(
	select city from data.dataanalyst
	group by city having count(positionId)>=100)

分组（case... when...then...else...），类似于 Excel 中的 VLOOKUP：

将平均薪资分为四个档次：

select 
    case
        when (bottom+top)/2 <=10 then '0~10'
        when (bottom+top)/2 <=20 then '10~20'    # 相当于between 10 and 20
        when (bottom+top)/2 <=30 then '20~30'
        else '30+'    # 顺序判断
    end,
    salary from (
        select left(salary,locate('k',salary)-1) as bottom,
        substr(salary, locate('-',salary)+1, length(salary)-locate('-',salary)-1) as top,
        salary from data.dataanalyst ) as t

二、Join（多表）

一张公司表 company，一张职位表 dataanalyst（只有companyId）

例：想知道某家公司，如“唯医网”的招聘职位有哪些

select * from data.dataanalyst
where companyId = (
    select companyId from data.company
    where companyShortName = "唯医网"
)

Join 可以理解为Excel里的VLOOKUP（关联）

select * from data.dataanalyst as d
join data.company as c
on d.companyId = c.companyId

left join：不管能不能匹配，都会返回，返回的结果比 inner join 多，只是不能匹配上的会返回空值（左表是齐全的）

select * from data.dataanalyst as d
left join (select * from data.company
where companySize = "150-500人") as t
on d.companyId = t.companyId

left join 和 join 的区别：

join为取两张表的交集，left join为取左表的全部、右表的部分

例：统计除了公司大小为“150-500人”之外的，其他所有公司有多少个

select count(1) from data.dataanalyst as d
left join (select * from data.company
where companySize = "150-500人") as t
on d.companyId = t.companyId
where t.companyId is null

例：“150-500人”公司的岗位有多少个？占比多少？

select count(1),count(t.companyId),count(t.companyId)/count(1) from data.dataanalyst as d
left join (select * from data.company
    where companySize = "150-500人") as t
on d.companyId = t.companyId

三、SQL练习题（LeetCode）

Problems - LeetCode

175题：Combine two tables

Write an SQL query to report the first name, last name, city, and state of each person in the Person table. If the address of a personId is not present in the Address table, report null instead.

select firstName,lastName,city,state from Person p
left join Address a
on p.personId = a.personId

176题：Second Highest Salary

Write an SQL query to report the second highest salary from the Employee table. If there is no second highest salary, the query should report null.

select max(salary) as SecondHighestSalary from Employee
where salary not in(
    select max(salary) from Employee)

184题：Department Highest Salary

Write an SQL query to find employees who have the highest salary in each of the departments.

Return the result table in any order.

# 自己写的
select t.name as Department, Employee.name as Employee, Employee.salary as Salary
from
    (select * from Department as d   # d:部门的id和name
    join
    (select departmentId,max(salary) as salary from Employee
    group by departmentId) as e    # e：每个部门的id及最高工资
    on d.id = e.departmentId) as t    # t：四列，分别是id（部门）、name（部门）、departmentId、salary
join Employee    # Employee：四列，分别是id（员工）、name（员工）、salary、departmentId
on Employee.salary = t.salary and Employee.departmentId = t.departmentId

要考虑并列最高薪资的情况，所以还要 join Employee表

考点：group by 和 join 的混合用法，group by返回的是唯一值

# 老师写的
select
t.Name as Department
e.name as Employee
e.salary as Salary
from(
    select departmentId,
           d.name as Name,
           max(salary) as Salary from Employee e
    join Department d on d.id = e.departmentId
    group by departmentId) as t
join Employee e
on t.departmentId = e.departmentId and e.salary = t.Salary

四、SQL练习题（orderinfo 和 userinfo）建表

用户消费相关数据：记录了一个什么样的用户、在什么时候、进行了一笔什么样的消费

Apply即可

用户明细相关数据：

Apply即可

如何在命令行里导入数据

大数据量建议采用这种方式来导入数据，如果从 MySQL Workbench 中导入效率会非常低

打开命令行（cmd），输入：

mysql -u root -p --local-infile

再输入数据库密码，登录进数据库后，输入：

show global variables like 'local_infile';

是 ON 即可，接着输入：

LOAD DATA LOCAL INFILE 'C:/Users/11842/Desktop/user_info_utf.csv' INTO TABLE data.userinfo FIELDS TERMINATED BY ',';

可以看到数据成功导入

另一张表同理：

LOAD DATA LOCAL INFILE 'C:/Users/11842/Desktop/order_info_utf.csv' INTO TABLE data.orderinfo FIELDS TERMINATED BY ',';

这个过程中踩了不少坑，参考了以下帖子：

解决ERROR 3948(42000):Loading local data is disabled； this must be enabled on both the client and_cxmda的博客-CSDN博客哪位可以指导下MySQL无法导入本地文件的问题嘛？反馈说是没有权限-大数据-CSDN问答

mysql使用load data local infile导入数据Error 3948和Error 2068_jLiu2020的博客-CSDN博客_mysql2068错误

mysql 命令式的导入csv数据_恋志传奇的博客-CSDN博客_mysql命令行导入csv

ERROR3948: Loading local data is disabled - this must be enabled on both the client and server sides_young_kp的博客-CSDN博客

时间操作

# 返回当下时间，常用在距今为止多长时间的场景下
select now()

SELECT paidTime, date(paidTime), date_format(paidTime,'%Y-%m') FROM data.orderinfo

date(xxx) 和 date_format(xxx,'%Y-%m-%d') 是等价的

month(xxx) 和 date_format(xxx,'%Y-%m') 的区别在于：前者只返回月份，后者还返回年份

date_add()：如 date_add(paidTime, interval 1 day) 意为在 paidTime 日期上加上一天
date_sub()
datediff()：两个日期的差值

练习题

统计不同月份的下单人数
统计用户三月份的回购率和复购率
统计男女用户的消费频次是否有差异
统计多次消费的用户，第一次和最后一次消费间隔是多少?（市场生命周期）
统计不同年龄段，用户的消费金额是否有差异？
统计消费的二八法则，消费的top20%用户，贡献了多少额度

1. 统计不同月份的下单人数

注意，未支付的时间都是空的，如下图：

SELECT month(paidTime), count(distinct useId) FROM data.orderinfo
where isPaid = '已支付'
group by month(paidTime)

2. 统计用户三月份的回购率和复购率

复购率：在这个月份里所有消费人数中，有多少人消费了一次以上，占比（同一个月）
回购率：三月份购买的人数，四月份依旧购买，占比（跨月份）

复购率：

SELECT useId,count(useId) FROM data.orderinfo
where isPaid = '已支付'
and month(paidTime) = 3
group by useId

select count(ct),count(if(ct>1, 1, null)) from(
    SELECT useId,count(useId) as ct FROM data.orderinfo
    where isPaid = '已支付'
    and month(paidTime) = 3
    group by useId) t

两者除一下即为复购率

回购率：

select 人数 from xxx
where useId in (子查询，算出3月份的useId)
and month() = 4
group by useId   # 去重


select count(distinct xxx) from xxx
where useId in (子查询，算出3月份的useId)
and month() = 4

一种更通用的方法：

# 统计每个用户在每个月是否消费过
select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo
where isPaid = '已支付'
group by useId, date_format(paidTime, '%Y-%m-01')

select * from(
	(select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo
	where isPaid = '已支付'
	group by useId, date_format(paidTime, '%Y-%m-01')) t1
left join
	(select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo
	where isPaid = '已支付'
	group by useId, date_format(paidTime, '%Y-%m-01')) t2
on t1.useId = t2.useId and t1.m = date_sub(t2.m, interval 1 month))  # 要求满足t1.m = t2.m减去一个月

从结果中可以看出，若没有回购，则 t2.m 列为NULL，而 count() 是不统计空值的

select t1.m, count(t1.m), count(t2.m) from(
	select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo
	where isPaid = '已支付'
	group by useId, date_format(paidTime, '%Y-%m-01')) t1
left join
	(select useId, date_format(paidTime, '%Y-%m-01') as m from data.orderinfo
	where isPaid = '已支付'
	group by useId, date_format(paidTime, '%Y-%m-01')) t2
on t1.useId = t2.useId and t1.m = date_sub(t2.m, interval 1 month)
group by t1.m

从而 count(t2.m)/count(t1.m) 即为回购率

3. 统计男女用户的消费频次是否有差异

先统计不同用户（带上性别）的消费频次

select * from data.orderinfo as o
left join(
	select * from data.userinfo
	where sex <> '') as t
on o.useId = t.userId

将 left join 改为 inner join 就可以去掉那些为 NULL 的行

select userId,sex,count(1) from data.orderinfo as o
inner join(
	select * from data.userinfo
	where sex <> '') as t
on o.useId = t.userId
group by userId,sex

再统计不同性别的消费频次

select sex, avg(ct) from(
	select userId,sex,count(1) as ct from data.orderinfo as o
	inner join(
		select * from data.userinfo
		where sex <> '') as t
	on o.useId = t.userId
	group by userId,sex) as t2
group by sex

4. 统计多次消费的用户，第一次和最后一次消费间隔是多少?（可以估算生命周期）

先提取出多次消费的用户

select useId, max(paidTime), min(paidTime), datediff(max(paidTime),min(paidTime)) from data.orderinfo
where isPaid = '已支付'
group by useId having count(1) > 1

datediff() 返回的是两者间隔的天数

5. 统计不同年龄段，用户的消费金额是否有差异？

先统计出每个用户的年龄段

select userId, ceil((year(now()) - year(birth))/10) from data.userinfo
where birth > '1901-00-00'

ceil() 向上取整

再统计每个用户的消费频次

select userId,age,count(userId) from data.orderinfo as o
inner join(
	select userId, ceil((year(now()) - year(birth))/10) as age from data.userinfo
	where birth > '1901-00-00') as t
on o.useId = t.userId and o.isPaid = '已支付'
group by userId,age

最后再统计不同年龄段的消费频次

select age, avg(ct) from(
	select userId,age,count(userId) as ct from data.orderinfo as o
	inner join(
		select userId, ceil((year(now()) - year(birth))/10) as age from data.userinfo
		where birth > '1901-00-00') as t
	on o.useId = t.userId and o.isPaid = '已支付'
	group by userId,age) as t2
group by age

6. 统计消费的二八法则，消费的top20%用户，贡献了多少额度

先把用户按消费金额排序

select useId, sum(price) as total from data.orderinfo
where isPaid = '已支付'
group by useId
order by total desc

select count(useId)*0.2, sum(total) from(
    select useId, sum(price) as total from data.orderinfo
    where isPaid = '已支付'
    group by useId
    order by total desc) as t

大约为17000个

故再加个 limit 17000 即可找出消费的 top20%用户

select count(useId), sum(total) from(
	select useId, sum(price) as total from data.orderinfo
	where isPaid = '已支付'
	group by useId
	order by total desc
	limit 17000) as t

五、Power BI连接数据库获取数据 - MySQL数据库 - 连接

MySQL :: Download MySQL Connector/NET (Archived Versions)

下载与自己MySQL版本匹配的Connector，并安装到MySQL的安装路径下

重启Power BI

参考帖子：

PowerBi连接MySQL数据库_Cerelise的博客-CSDN博客_powerbi连接mysql数据库

POWERBI连接Mysql数据库：错误：需要安装一个或多个组件_CheyenneLam的博客-CSDN博客_此连接器需要安装一个或多个其他组件才能使用

Power BI连接mysql数据库--学习篇_谨忽微慎所溺的博客-CSDN博客_powerbi连接mysql

也可以通过SQL语句来导入

select useId, price, paidTime, sex, birth from data.orderinfo as o 
left join data.userinfo as u
on o.useId = u.userId
where isPaid = '已支付'

如果想要更改SQL语句的话，可以：

建议在获取数据 - MySQL数据库 - 连接时，就通过SQL语句把我们想要连接的表连接好、想要过滤的数据过滤好，而不要在Power BI里再改SQL（出于性能的考虑）

因为前者是在服务器端进行查询，把结果返回，占用的是服务器端的性能；后者使用的是自己电脑的性能

抽样

select useId, price, paidTime, sex, birth from data.orderinfo as o 
left join data.userinfo as u
on o.useId = u.userId
where isPaid = '已支付'
and right(useId,1) in (0,1,2,3,4)   # 抽取useId最后一位为0-4的数据

【第五周：SQL】7周成为数据分析师

最近更新

热门博客

[ 申请 ]友情链接：