Pandas-04 (Missing Data, Grouping, Merging and Joining, Concatenation)

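1. Missing data

1.1 Detecting missing data with isnull() and notnull()

1.2 Filling missing values with fillna()

1.3 Dropping rows that contain NaN

1.4 Replacing missing or arbitrary values with replace()

2. Grouping

2.1 Grouping with groupby()

2.2 Selecting a group with get_group()

3. Merging and joining with merge()

3.1 Example

3.2 Merging with merge()

3.3 Merge modes

4. Concatenation with concat()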

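1. Missing data

Because data comes in many shapes and forms, pandas aims to be flexible in how it handles missing data. NaN is the default missing-value marker for reasons of computational speed and convenience, but we need to be able to detect it easily in data of different types: floating point, integer, boolean, and general object. In many cases, however, Python's None will also appear, and we want to treat it as "missing", "not available", or "NA" as well.

1.1 Detecting missing data with isnull() and notnull()

To make detecting missing values easier (and consistent across different array dtypes), pandas provides the isnull() and notnull() functions, which are also available as methods on Series and DataFrame objects: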
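As a small illustration of the point above, the sketch below checks that both np.nan and Python's None are reported as missing, whether the function form or the method form is used. The Series contents are made-up sample values, not data from this article.

```python
import numpy as np
import pandas as pd

# Made-up sample values: a float Series with NaN and an object Series with None
floats = pd.Series([1.5, np.nan, 3.0])
objects = pd.Series(["a", None, "c"], dtype=object)

# The function form and the method form produce the same kind of boolean mask
float_mask = pd.isnull(floats)
object_mask = objects.isnull()

print(float_mask.tolist())   # [False, True, False]
print(object_mask.tolist())  # [False, True, False]

# notnull() is the element-wise opposite of isnull()
print(objects.notnull().tolist())  # [True, False, True]
```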

1. isnull()

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5,3),index=['a','c','e','f','h'],columns=['one','two','three'])
df = df.reindex(['a','b','c','d','e','f','h'])

df


# Output:
         one	    two	      three
a	-0.864969	0.299120	-0.936382
b	     NaN	     NaN	     NaN
c	1.573142	-2.359139	0.118325
d	     NaN	     NaN	     NaN
e	1.070140	-0.392129	-0.647714
f	-0.886120	-0.926900	1.170801
h	0.725739	1.182897	-0.899262

# Select the rows where column 'one' is missing
df[df['one'].isnull()]


# Output:
	one	 two	three
b	NaN	 NaN	NaN
d	NaN  NaN	NaN

2. notnull()

# Check which entries are not missing
df['one'].notnull()


# Output:
a     True
b    False
c     True
d    False
e     True
f     True
h     True
Name: one, dtype: bool

Select the rows that are not missing:

df[df['one'].notnull()]

# Output:
          one	    two	     three
a	-0.864969	0.299120	-0.936382
c	1.573142	-2.359139	0.118325
e	1.070140	-0.392129	-0.647714
f	-0.886120	-0.926900	1.170801
h	0.725739	1.182897	-0.899262
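A common follow-up to the masks above is counting how many values are missing. The sketch below uses a small fixed DataFrame (hypothetical values, chosen so the counts are reproducible) rather than the random data from the example.

```python
import numpy as np
import pandas as pd

# Fixed sample data (hypothetical) so the counts below are reproducible
sample = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0, np.nan], "two": [0.5, np.nan, 2.5, 4.0]},
    index=["a", "b", "c", "d"],
)

# True counts as 1, so summing the boolean mask counts missing cells per column
missing_per_column = sample.isnull().sum()
print(missing_per_column["one"], missing_per_column["two"])  # 2 1

# Keep only the rows in which every column is present
complete_rows = sample[sample.notnull().all(axis=1)]
print(complete_rows.index.tolist())  # ['a', 'c']
```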

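1.2 Filling missing values with fillna()

fillna() can "fill in" NA values with non-NA data in several ways.

1. Replacing NA with a scalar value

Fill the missing entries of a given column with a chosen value (here "column" and "content" are placeholders for a real column name and fill value):

df["column"].fillna("content")

2. Filling gaps forward or backward

Pass method="pad" to propagate the last valid value forward, or method="bfill" to propagate the next valid value backward. In pandas 2.1 and later the method argument of fillna() is deprecated; df.ffill() and df.bfill() do the same job.

df.fillna(method="pad")

3. Limiting the amount filled

To fill only a certain number of consecutive missing points, use the limit keyword:

df.fillna(method="pad", limit=1)

4. Example:

# The df data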
       one	       two	       three
a	-0.864969	0.299120	-0.936382
b	NaN	NaN	NaN
c	1.573142	-2.359139	0.118325
d	NaN	NaN	NaN
e	1.070140	-0.392129	-0.647714
f	-0.886120	-0.926900	1.170801
h	0.725739	1.182897	-0.899262

# Fill with any value we choose, here the mean of each column
df.fillna(df.mean())

# Output:
one	two	three
a	-0.864969	0.299120	-0.936382
b	0.323586	-0.439230	-0.238846
c	1.573142	-2.359139	0.118325
d	0.323586	-0.439230	-0.238846
e	1.070140	-0.392129	-0.647714
f	-0.886120	-0.926900	1.170801
h	0.725739	1.182897	-0.899262


# Fill each gap with the preceding value
df.fillna(method='pad')

# Output:
one	two	three
a	-0.864969	0.299120	-0.936382
b	-0.864969	0.299120	-0.936382
c	1.573142	-2.359139	0.118325
d	1.573142	-2.359139	0.118325
e	1.070140	-0.392129	-0.647714
f	-0.886120	-0.926900	1.170801
h	0.725739	1.182897	-0.899262


# Fill each gap with the following value
df.fillna(method='backfill')

# Output:
one	two	three
a	-0.864969	0.299120	-0.936382
b	1.573142	-2.359139	0.118325
c	1.573142	-2.359139	0.118325
d	1.070140	-0.392129	-0.647714
e	1.070140	-0.392129	-0.647714
f	-0.886120	-0.926900	1.170801
h	0.725739	1.182897	-0.899262
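The fill strategies in this section can be compared side by side on a tiny deterministic series. This is only a sketch: the values are invented, and it uses the ffill() and bfill() methods, which behave like method="pad" and method="bfill" and also run on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a run of two consecutive gaps
s = pd.Series([1.0, np.nan, np.nan, 4.0], index=list("abcd"))

filled_scalar = s.fillna(0)        # replace every NaN with a scalar
filled_forward = s.ffill()         # same effect as method="pad"
filled_backward = s.bfill()        # same effect as method="bfill"
filled_limited = s.ffill(limit=1)  # fill at most one consecutive NaN

print(filled_forward.tolist())   # [1.0, 1.0, 1.0, 4.0]
print(filled_backward.tolist())  # [1.0, 4.0, 4.0, 4.0]
print(filled_limited.tolist())   # [1.0, 1.0, nan, 4.0]
```

With limit=1 the second NaN of the gap is left untouched, which is the behavior described for the limit keyword.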

1.3 Dropping rows that contain NaN

df.dropna()  # drop every row that contains a NaN

Example:

# Drop every row that contains a NaN
df.dropna()

# Output:
one	two	three
a	-0.864969	0.299120	-0.936382
c	1.573142	-2.359139	0.118325
e	1.070140	-0.392129	-0.647714
f	-0.886120	-0.926900	1.170801
h	0.725739	1.182897	-0.899262
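dropna() also accepts a few parameters that control how strict the dropping is. The sketch below, on an invented three-row frame, shows how, subset, and thresh; it is meant as an illustration of those options, not as part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row "b" is entirely missing, row "c" is partly missing
frame = pd.DataFrame(
    {"one": [1.0, np.nan, 3.0], "two": [4.0, np.nan, np.nan]},
    index=["a", "b", "c"],
)

any_missing = frame.dropna()                   # default how="any": keeps only "a"
all_missing = frame.dropna(how="all")          # drops only the fully empty row "b"
by_column = frame.dropna(subset=["one"])       # looks at column "one" only
drop_columns = frame.dropna(axis=1, thresh=2)  # keep columns with >= 2 non-NA values

print(any_missing.index.tolist())     # ['a']
print(all_missing.index.tolist())     # ['a', 'c']
print(by_column.index.tolist())       # ['a', 'c']
print(drop_columns.columns.tolist())  # ['one']
```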

1.4 Replacing missing or arbitrary values with replace()

replace({np.nan: replacement_value})

Example:

df.replace({np.nan:10})

# Output:
       one	       two	      three
a	-0.864969	0.299120	-0.936382
b	10.000000	10.000000	10.000000
c	1.573142	-2.359139	0.118325
d	10.000000	10.000000	10.000000
e	1.070140	-0.392129	-0.647714
f	-0.886120	-0.926900	1.170801
h	0.725739	1.182897	-0.899262


df['four']=pd.Series([1,2,3,4,5,6,7],index=['a','b','c','d','e','f','h'])
df
# Output:
        one	       two	       three	four
a	-0.864969	0.299120	-0.936382	 1
b	     NaN	      NaN	      NaN	 2
c	1.573142	-2.359139	0.118325	 3
d	     NaN	      NaN	      NaN	 4
e	1.070140	-0.392129	-0.647714	 5
f	-0.886120	-0.926900	1.170801	 6
h	0.725739	1.182897	-0.899262	 7

# Replace NaN with 10 and 5 with 1000
df.replace({np.nan:10,5:1000})

# Output:
         one	     two	   three	four
a	-0.864969	0.299120	-0.936382	1
b	10.000000	10.000000	10.000000	2
c	1.573142	-2.359139	0.118325	3
d	10.000000	10.000000	10.000000	4
e	1.070140	-0.392129	-0.647714	1000
f	-0.886120	-0.926900	1.170801	6
h	0.725739	1.182897	-0.899262	7
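A flat dictionary applies to every column. When a replacement should touch one column only, replace() also accepts a nested dictionary of the form {column: {old: new}}. The sketch below contrasts the two on invented data; the column names "score" and "level" are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the value 5 appears in both columns
data = pd.DataFrame({"score": [5.0, np.nan, 7.0], "level": [5, 2, 3]})

# Flat dict: replaces matching values in every column
everywhere = data.replace({np.nan: 10, 5: 1000})

# Nested dict {column: {old: new}}: replaces only inside the named column
level_only = data.replace({"level": {5: 1000}})

print(everywhere["score"].tolist())  # [1000.0, 10.0, 7.0]
print(everywhere["level"].tolist())  # [1000, 2, 3]
print(level_only["score"].tolist())  # [5.0, nan, 7.0]
print(level_only["level"].tolist())  # [1000, 2, 3]
```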

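2. Grouping

pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.

By "group by" we mean a process involving one or more of the following steps:

Splitting the data into groups based on some criteria.
Applying a function to each group independently.
Combining the results into a data structure.

Of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into groups and then do something with those groups.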
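To make the three steps concrete, the sketch below carries them out by hand on a tiny invented table and then shows that groupby() arrives at the same result in one call. The column names "team" and "points" are hypothetical.

```python
import pandas as pd

# Hypothetical data: two teams with two scores each
df_demo = pd.DataFrame({"team": ["x", "y", "x", "y"], "points": [1, 2, 3, 4]})

# 1. Split: one sub-frame per distinct value of "team"
pieces = {key: df_demo[df_demo["team"] == key] for key in df_demo["team"].unique()}

# 2. Apply: compute a function on each piece independently
sums = {key: piece["points"].sum() for key, piece in pieces.items()}

# 3. Combine: gather the per-group results into one Series
manual = pd.Series(sums).sort_index()

# groupby performs the same three steps in one call
automatic = df_demo.groupby("team")["points"].sum()

print(manual.tolist())     # [4, 6]
print(automatic.tolist())  # [4, 6]
```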

2.1 Grouping with groupby()

Example:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user':['小明',"小黑",'小黄','小李'],
    'gender':['男','女','女','男'],
    'score':[99,95,70,88]
})

df

# Output:
	user	gender	score
0	小明	 男	      99
1	小黑	 女	      95
2	小黄	 女	      70
3	小李	 男	      88

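1. Creating the groups

# The grouping object
df.groupby('gender')

# Output: a DataFrameGroupBy object; nothing is computed until it is iterated or aggregated

2. Inspecting the groups

df.groupby('gender').groups  # mapping from group name to row labels

# Output: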
{'女': Int64Index([1, 2], dtype='int64'),
 '男': Int64Index([0, 3], dtype='int64')}

3. Iterating over the groups

# Loop over the groups: each item is a (name, sub-DataFrame) pair
grouped = df.groupby('gender')
for name,group in grouped:
    print(name)
    print(group)

# Output:
 女
     user gender  score
1   小黑      女     95
2   小黄      女     70
 男
     user gender  score
0   小明      男     99
3   小李      男     88

2.2 Selecting a group with get_group()

1. Basic use of get_group()

# Select a group
grouped.get_group('男')

# Output:
	user	gender	score
0	小明	  男	   99
3	小李	  男	   88


# Select a group, then aggregate one of its columns
grouped.get_group('男')['score'].agg('mean')
grouped.get_group('女')['score'].agg('max')

# Output:
93.5
95

# Number of entries in each column of the group
grouped.get_group('女').agg(np.size)

# Output:
user      2
gender    2
score     2
dtype: int64

2. Aggregating by group

df['star'] = pd.Series([5,7,4,3])
df

# Output:
	user	gender	score	star
0	小明	  男	  99	 5
1	小黑	  女	  95	 7
2	小黄	  女	  70	 4
3	小李	  男	  88	 3


grouped = df.groupby('gender')
# Mean score and total number of stars for each gender
grouped[['score','star']].agg({'score':'mean','star':'sum'})

# Output:
	    score	star
gender		
女    	82.5	11
男	    93.5	8


# Keep only the groups whose mean score is above 90
df.groupby('gender').filter(lambda x:x['score'].mean()>90)

# Output:
	user	gender	score	star
0	小明	   男	  99	5
3	小李	  男	  88	3
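Two related tools are worth knowing alongside agg() with a dictionary and filter(): named aggregation, which lets you choose the output column names, and transform(), which returns one value per original row. The sketch below reuses the numbers from the table above but swaps in ASCII labels ("u1", "M", "F") as an assumption to keep it portable; the output column names are made up.

```python
import pandas as pd

# Same numbers as the article's table, with hypothetical ASCII labels
scores = pd.DataFrame({
    "user": ["u1", "u2", "u3", "u4"],
    "gender": ["M", "F", "F", "M"],
    "score": [99, 95, 70, 88],
    "star": [5, 7, 4, 3],
})

# Named aggregation: output_column=(input_column, function)
summary = scores.groupby("gender").agg(
    mean_score=("score", "mean"),
    total_star=("star", "sum"),
    members=("user", "count"),
)
print(summary)

# transform returns a result aligned with the original rows
scores["group_mean"] = scores.groupby("gender")["score"].transform("mean")
print(scores["group_mean"].tolist())  # [93.5, 82.5, 82.5, 93.5]
```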

3. Merging and joining with merge()

In pandas, merge() combines two DataFrames on one or more key columns. The basic syntax is:

pd.merge(df1, df2, on='key_column')

3.1 Example

import pandas as pd
import numpy as np

yuwen = pd.DataFrame({
    'id':[1,2,3,4,5,7],
    'name':["小明","小敏","小红","小黑","小王",'老陈'],
    'yuwenScore':[98,77,45,87,66,99]
    
})

shuxue = pd.DataFrame({
    'id':[1,2,3,4,5,6],
    'name':["小明","小敏","小红","小黑","小王","老李"],
    'shuxueScore':[79,56,88,92,68,88]
    
})

# The two DataFrames, yuwen (Chinese scores) and shuxue (math scores):
	id	name	yuwenScore
0	1	小明	98
1	2	小敏	77
2	3	小红	45
3	4	小黑	87
4	5	小王	66
5	7	老陈	99


	id	name	shuxueScore
0	1	小明	79
1	2	小敏	56
2	3	小红	88
3	4	小黑	92
4	5	小王	68
5	6	老李	88

3.2 Merging with merge()

pd.merge(yuwen,shuxue,on='id')  # merge on the key column 'id'

# Output:
   id	name_x	yuwenScore	name_y	shuxueScore
0	1	小明	98	        小明	79
1	2	小敏	77	        小敏	56
2	3	小红	45	        小红	88
3	4	小黑	87	        小黑	92
4	5	小王	66	        小王	68


pd.merge(yuwen,shuxue,on=['id','name'])  # merge on both 'id' and 'name'
# Output:
	id	name	yuwenScore	shuxueScore
0	1	小明	    98	       79
1	2	小敏	    77	       56
2	3	小红	    45     	   88
3	4	小黑	    87	       92
4	5	小王	    66	       68
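The name_x and name_y columns in the first result come from merge()'s suffixes parameter, whose default is ("_x", "_y"): any non-key column present in both frames gets a suffix. The sketch below, on two small invented tables, shows the default and a custom choice; the suffix strings are arbitrary.

```python
import pandas as pd

# Hypothetical tables that share the columns "id" and "name"
left = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "chinese": [98, 77]})
right = pd.DataFrame({"id": [1, 2], "name": ["a", "b"], "math": [79, 56]})

# Joining on "id" only: the shared non-key column "name" gets the default
# suffixes "_x" (left) and "_y" (right)
default = pd.merge(left, right, on="id")
print(default.columns.tolist())
# ['id', 'name_x', 'chinese', 'name_y', 'math']

# The suffixes parameter replaces the defaults with more readable labels
labelled = pd.merge(left, right, on="id", suffixes=("_chinese", "_math"))
print(labelled.columns.tolist())
# ['id', 'name_chinese', 'chinese', 'name_math', 'math']
```

Merging on both 'id' and 'name', as in the second call above, avoids the duplicated column entirely.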

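3.3 Merge modes

merge() provides the how parameter to set the merge mode: inner uses the intersection of the keys, left uses only the keys from the left frame, right uses only the keys from the right frame, and outer uses the union of the keys.

Example:

# inner is the default merge mode
pd.merge(yuwen,shuxue,on=['id','name'],how='inner')
# Output: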
    id	name	yuwenScore	shuxueScore
0	1	小明	98	        79
1	2	小敏	77	        56
2	3	小红	45      	88
3	4	小黑	87      	92
4	5	小王	66       	68


pd.merge(yuwen,shuxue,on=['id','name'],how='right')
# Output:
	id	name	yuwenScore	shuxueScore
0	1	小明	98.0	    79
1	2	小敏	77.0	    56
2	3	小红	45.0	    88
3	4	小黑	87.0	    92
4	5	小王	66.0	    68
5	6	老李	NaN	        88

pd.merge(yuwen,shuxue,on=['id','name'],how='outer')
# Output:
    id	name	yuwenScore	shuxueScore
0	1	小明	98.0	    79.0
1	2	小敏	77.0	    56.0
2	3	小红	45.0	    88.0
3	4	小黑	87.0	    92.0
4	5	小王	66.0	    68.0
5	7	老陈	99.0	    NaN
6	6	老李	NaN	        88.0

pd.merge(yuwen,shuxue,on=['id','name'],how='left')
# Output:
	id	name	yuwenScore	shuxueScore
0	1	小明	98	        79.0
1	2	小敏	77	        56.0
2	3	小红	45	        88.0
3	4	小黑	87	        92.0
4	5	小王	66	        68.0
5	7	老陈	99	        NaN
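When comparing the four modes it can help to see where each row came from. merge() has an indicator parameter for this; the sketch below uses two tiny invented tables and builds a lookup by id so that the check does not depend on the row order of the result.

```python
import pandas as pd

# Hypothetical tables: id 7 exists only on the left, id 6 only on the right
left = pd.DataFrame({"id": [1, 7], "chinese": [98, 99]})
right = pd.DataFrame({"id": [1, 6], "math": [79, 88]})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
merged = pd.merge(left, right, on="id", how="outer", indicator=True)
print(merged)

# Map each id to its origin so the result does not depend on row order
origin = dict(zip(merged["id"], merged["_merge"].astype(str)))
print(origin[1], origin[7], origin[6])  # both left_only right_only

# Row counts for the four modes on this data
row_counts = {
    how: len(pd.merge(left, right, on="id", how=how))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)  # {'inner': 1, 'left': 2, 'right': 2, 'outer': 3}
```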

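4. Concatenation with concat()

The concat() function (in the main pandas namespace) does all of the heavy lifting of concatenating along an axis, while applying optional set logic (union or intersection) to the indexes, if any, on the other axes.

Basic syntax: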

pd.concat(
    objs,
    axis=0,
    join="outer",
    ignore_index=False,
    keys=None,
    levels=None,
    names=None,
    verify_integrity=False,
    copy=True,
)

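objs: a sequence or mapping of Series or DataFrame objects. If a dict is passed, the sorted keys will be used as the keys argument, unless keys is passed explicitly, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None, in which case a ValueError will be raised.
axis: {0, 1, ...}, default 0. The axis to concatenate along.
join: {'inner', 'outer'}, default 'outer'. How to handle the indexes on the other axes: outer for union, inner for intersection.
ignore_index: boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, ..., n - 1. This is useful when concatenating objects whose concatenation axis carries no meaningful index information. Note that the index values on the other axes are still respected in the join.
keys: sequence, default None. Build a hierarchical index using the passed keys as the outermost level. If multiple levels are passed, it should contain tuples.
levels: list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
names: list, default None. Names for the levels in the resulting hierarchical index.
verify_integrity: boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
copy: boolean, default True. If False, do not copy data unnecessarily.

Example: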

import numpy as np
import pandas as pd

one = pd.DataFrame({
    'name':["alex",'xm','xh','lc','ll'],
    'subject':['python','java','go','js','html'],
    'socre':[88,79,68,96,66]
})
two = pd.DataFrame({
    'name':["xc",'xm','xh','lc','ll'],
    'subject':['php','java','go','js','html'],
    'socre':[89,79,68,96,66]
})
 
# The two DataFrames, one and two:
	name	subject	socre
0	alex	python	88
1	xm	    java	79
2	xh    	go	    68
3	lc	    js	    96
4	ll   	html	66


    name	subject	socre
0	xc	      php	89
1	xm	     java	79
2	xh	      go	68
3	lc	      js	96
4	ll	     html	66

Concatenating while ignoring the original index values:

pd.concat([one,two],ignore_index=True)

# Output:
    name	subject	socre
0	alex	python	88
1	xm	    java	79
2	xh	      go	68
3	lc	      js	96
4	ll	     html	66
5	xc	     php	89
6	xm	     java	79
7	xh	      go	68
8	lc	      js	96
9	ll	     html	66

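Concatenating along the columns:

# axis=1 places the frames side by side; ignore_index=True renumbers the columns
pd.concat([one,two],ignore_index=True,axis=1)

# Output: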
       0	  1	    2	  3	    4	  5
0	alex	python	88	  xc	php	  89
1	xm	    java	79	  xm	java  79
2	xh	    go	    68	  xh	go	  68
3	lc	    js	    96	  lc	js	  96
4	ll	    html	66	  ll	html  66
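The keys, join, and verify_integrity parameters from the list above are not exercised by the examples so far. The sketch below tries each of them on two small invented frames that share only a "name" column; the key labels "first" and "second" are arbitrary.

```python
import pandas as pd

# Hypothetical frames that share only the "name" column
first = pd.DataFrame({"name": ["a", "b"], "score": [88, 79]})
second = pd.DataFrame({"name": ["c", "d"], "age": [20, 21]})

# keys labels each input and builds a two-level index on the result
labelled = pd.concat([first, second], keys=["first", "second"])
print(labelled.index.nlevels)                   # 2
print(labelled.loc["second"]["name"].tolist())  # ['c', 'd']

# join="outer" (default) keeps the union of columns, join="inner" the intersection
outer_columns = sorted(labelled.columns)
inner_columns = pd.concat([first, second], join="inner").columns.tolist()
print(outer_columns)  # ['age', 'name', 'score']
print(inner_columns)  # ['name']

# verify_integrity=True raises ValueError when the combined index has duplicates
try:
    pd.concat([first, second], verify_integrity=True)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True
print(duplicate_detected)  # True
```

Both inputs carry the default index 0, 1, which is why the integrity check fails here; passing ignore_index=True or keys avoids the clash.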
