6. 合并数据¶
对数据集合的合是基本操作之一,也是我们处理大量数据的核心操作,本章主要研究数据的合并操作。
# 准备数据
import numpy as np
import pandas as pd
def make_df(cols, ind):
'''生成一个简单的DataFrame数据'''
data = {c:[str(c) + str(i) for i in ind] for c in cols}
return pd.DataFrame(data, ind)
# 测试函数
df = make_df("ABC", range(5))
print(df)
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
4 A4 B4 C4
6.1. pandas.concat¶
通过pandas的concat能实现简单的数据合并。
6.1.1. pandas.concat实现简单的Series和DataFrame的合并¶
# 简单的合并
s1 = pd.Series(list("ABC"), index =[1,2,3])
s2 = pd.Series(list("DEF"), index =[4,5,6])
s = pd.concat([s1, s2])
print("合并后: \n", s)
合并后:
1 A
2 B
3 C
4 D
5 E
6 F
dtype: object
# DF合并
df1 = make_df("ABC", [1,2,3])
df2 = make_df("DEF", [4,5,6])
print("df1 = \n", df1)
print("\n df2 = \n", df2)
df1 =
A B C
1 A1 B1 C1
2 A2 B2 C2
3 A3 B3 C3
df2 =
D E F
4 D4 E4 F4
5 D5 E5 F5
6 D6 E6 F6
# 对两个df进行合并
df3 = pd.concat([df1, df2])
print("\n df3 = \n", df3)
df3 =
A B C D E F
1 A1 B1 C1 NaN NaN NaN
2 A2 B2 C2 NaN NaN NaN
3 A3 B3 C3 NaN NaN NaN
4 NaN NaN NaN D4 E4 F4
5 NaN NaN NaN D5 E5 F5
6 NaN NaN NaN D6 E6 F6
/Users/augs/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
# 合并的时候同样可以指定axis
df4 = pd.concat([df1, df2], axis=1)
print("\n df4 = \n", df4)
df4 =
A B C D E F
1 A1 B1 C1 NaN NaN NaN
2 A2 B2 C2 NaN NaN NaN
3 A3 B3 C3 NaN NaN NaN
4 NaN NaN NaN D4 E4 F4
5 NaN NaN NaN D5 E5 F5
6 NaN NaN NaN D6 E6 F6
6.1.2. 重复索引的处理¶
重复索引在合并过程中是默认保留的,DataFrame也允许重复索引值的出现,但会造成混淆,我们对重复索引的处理主要包括:
- 捕捉异常, 需要设置参数 verify_integrity
- 忽略重复值,需要设置参数 ignore_index
- 增加多级索引, 需要设置参数 keys
# 准备数据
df1 = make_df("AB", [0,1])
df2 = make_df("AB", [3,4])
df2.index = df1.index
print("df1 = \n", df1)
print("\n df2 = \n", df2)
df1 =
A B
0 A0 B0
1 A1 B1
df2 =
A B
0 A3 B3
1 A4 B4
## 合并df1, df2
## 合并后后出现重复索引,DataFrame允许出现重复索引
df3 = pd.concat([df1, df2])
print("\n df3 = \n", df3)
df3 =
A B
0 A0 B0
1 A1 B1
0 A3 B3
1 A4 B4
## 如果我们不允许出现重复索引,则需要设置verify_integrity
try:
pd.concat([df1, df2], verify_integrity=True)
except ValueError as e:
print("ValueError: ", e)
ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')
## 还可以选择忽略,此时自动生成心的索引,需要设置 ignore_index
df5 = pd.concat([df1, df2], ignore_index=True)
print("df5 = \n", df5)
df5 =
A B
0 A0 B0
1 A1 B1
2 A3 B3
3 A4 B4
## 还可以通过设置key增加多级索引
df6 = pd.concat([df1, df2], keys=["A", "B"])
print("df6 = \n", df6)
df6 =
A B
A 0 A0 B0
1 A1 B1
B 0 A3 B3
1 A4 B4
6.1.3. 类似join的合并¶
如果合并的数据没有相同的列名,即需要合并的数据往外没有相同的列名,此时如何处理合并后的结果需要我们特殊处理。
pd.concat给我们提供了一些选项来解决这个问题。
通过设置参数,我们可以实现:
- 交集合并,join=’inner’
- 并集合并,join=’outer’
- 自定义列名, join_axis
## 准备数据
df1 = make_df("ABC", [1,2])
df2 = make_df("BCD", [3,4])
print("df1 = \n", df1)
print("df2 = \n", df2)
df1 =
A B C
1 A1 B1 C1
2 A2 B2 C2
df2 =
B C D
3 B3 C3 D3
4 B4 C4 D4
# 默认采用的是并集合并,即join=outer
df3 = pd.concat([df1, df2])
print("df3 =\n", df3)
df3 =
A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4
/Users/augs/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
This is separate from the ipykernel package so we can avoid doing imports until
# 可以采用交集合并
# 合并后只出现两个列索引的交集部分
df4 = pd.concat([df1, df2], join='inner')
print("df4 =\n", df4)
df4 =
B C
1 B1 C1
2 B2 C2
3 B3 C3
4 B4 C4
##自定义列名
df5 = pd.concat([df1, df2], join_axes=[df1.columns])
print("df5 = \n", df5)
df5 =
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
6.1.4. pandas.append¶
append方式使用简单,但是它并不是直接更新原有对象的值,而是合并后创建一个新对象,每次合并都要重新创建索引和数据缓存,相对效率比较低,如果大数据量 操作,推荐使用concat。
## append
df3 = df1.append(df2)
print("df3 = \n", df3)
df3 =
A B C D
1 A1 B1 C1 NaN
2 A2 B2 C2 NaN
3 NaN B3 C3 D3
4 NaN B4 C4 D4
/Users/augs/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:6211: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
sort=sort)
6.2. pandas.merge合并¶
本节内容类似于关系代数中的连接和合并,有关系代数或者SQL基础的同学相对会简单些。
常见的关系连接包括一对一(1:1), 多对一(N:1)和多对多(N:N), 我们使用pandas.merge可以完成相应的操作。
## 准备数据
df1 = pd.DataFrame({"name":list("ABCD"), "group":["I", "II", "III", "II"]})
df2 = pd.DataFrame({"name":list("ABCD"), "score":[61,78,74, 98]})
df3 = pd.DataFrame({"group":["I", "II", "III"], "leader":["Alice", "Bob", "Cindy"]})
# 此数据结构意味着每个组需要掌握的专业技能,一组需要有多个技能用重复值表示
df4 = pd.DataFrame({"group":["I", "I", "II","II", "II", "III", "III"],
"skills":["Linux", "Python", "Java", "Math", "English", "C++", "PHP"]})
print("df1 = \n", df1)
print("\n df2 = \n", df2)
print("\n df3 = \n", df3)
print("\n df4 = \n", df4)
df1 =
name group
0 A I
1 B II
2 C III
3 D II
df2 =
name score
0 A 61
1 B 78
2 C 74
3 D 98
df3 =
group leader
0 I Alice
1 II Bob
2 III Cindy
df4 =
group skills
0 I Linux
1 I Python
2 II Java
3 II Math
4 II English
5 III C++
6 III PHP
6.2.1. 一对一的连接¶
我们连接df1和df2的时候,每个都有索引name,通过name可以完成连接,并会自动进行合并。此时如果列的顺序不一致,并不会影响结果,pandas会自动处理。
## 一对一连接
df5 = pd.merge(df1, df2)
print("df5 = \n", df5)
df5 =
name group score
0 A I 61
1 B II 78
2 C III 74
3 D II 98
6.2.2. 多对一的连接¶
此类操作可能有重复值,此时获得的结果会自动保留重复值。
df6 = pd.merge(df1, df3)
print("df6 = \n", df6)
df6 =
name group leader
0 A I Alice
1 B II Bob
2 D II Bob
3 C III Cindy
6.2.3. 多对多的连接¶
此时因为有重复值,会自动按照最多的可能性对结果进行扩充,重复值都会得到保留。
df7 = pd.merge(df1, df4)
print("df7 = \n", df7)
df7 =
name group skills
0 A I Linux
1 A I Python
2 B II Java
3 B II Math
4 B II English
5 D II Java
6 D II Math
7 D II English
8 C III C++
9 C III PHP
6.2.4. on参数的使用¶
两个数据集合并,以上案例都是使用默认的共同有的列作为合并的依据,我们还可以指定在那一列上进行合并,这就是参数on的作用。
# 使用on参数指定合并的列
df8 = pd.merge(df1, df2, on="name")
print("df8 = \n", df8)
df8 =
name group score
0 A I 61
1 B II 78
2 C III 74
3 D II 98
6.2.5. left_on和right_on参数的使用¶
如果需要合并的数据集的列不是一个名字,就需要使用left_on和right_on来指定需要合并的两个列的名字。
例如以下两个数据集,一个列是name,一个列是my_names,我们需要这两个列进行合并,则需要进行如下面案例的操作。 在这种 情况下得到的数据会有一列冗余数据,此时可以使用drop方法将不需要的列扔掉。
## 准备数据
df1 = pd.DataFrame({"name":list("ABCD"), "group":["I", "II", "III", "II"]})
df2 = pd.DataFrame({"my_names":list("ABCD"), "score":[61,78,74, 98]})
df3 = pd.merge(df1, df2, left_on="name", right_on="my_names")
print("df3 = \n", df3)
print("\n df3.drop = \n", df3.drop("my_names", axis=1))
df3 =
name group my_names score
0 A I A 61
1 B II B 78
2 C III C 74
3 D II D 98
df3.drop =
name group score
0 A I 61
1 B II 78
2 C III 74
3 D II 98
6.2.6. left_index和right_index¶
我们以前的合并都是要求两个数据源有相同的列,如果没有相同的列可以指定,两个数据的合并还可以用键来实现合并,此时通过设置left_index/right_index来确定是否使用键作为合并的依据。
# 使用准备好的数据
print("df1 = \n", df1)
print("\n df2 = \n", df2)
df4 = pd.merge(df1, df2, left_index=True, right_index=True)
print("\n df4 = \n", df4)
df1 =
name group
0 A I
1 B II
2 C III
3 D II
df2 =
my_names score
0 A 61
1 B 78
2 C 74
3 D 98
df4 =
name group my_names score
0 A I A 61
1 B II B 78
2 C III C 74
3 D II D 98
6.2.7. how参数的使用¶
当两个数据进行合并的时候,如果把两个数据当做集合看待,则合并的方式按照关系数学,有:
- 内连接:how=’inner’, 此时结果只保留交集
- 外连接: how=’outer’, 此时结果保留的是两个数据集的并集
- 左连接: how=’left’, 此时结果保留左侧全部内容,有连接的右侧内容也会保留
- 右连接: how=’right’, 此时结果保留右侧全部内容,有链接的左侧内容也会保留
具体含义可以参考关系数学相关概念。
## 准备数据
d1 = {"name":list("ABCDE"), "score":[65,45,56,78,85]}
d2 = {"name":list("CDEFG"), "height":[176,156,187,191,173]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
print("df1 = \n", df1)
print("\n df2 = \n", df2)
df1 =
name score
0 A 65
1 B 45
2 C 56
3 D 78
4 E 85
df2 =
name height
0 C 176
1 D 156
2 E 187
3 F 191
4 G 173
# 外连接
df3 = pd.merge(df1, df2, how="outer")
print("merge.outer = \n", df3)
merge.outer =
name score height
0 A 65.0 NaN
1 B 45.0 NaN
2 C 56.0 176.0
3 D 78.0 156.0
4 E 85.0 187.0
5 F NaN 191.0
6 G NaN 173.0
# 内连接
df3 = pd.merge(df1, df2, how="inner")
print("merge.inner = \n", df3)
merge.inner =
name score height
0 C 56 176
1 D 78 156
2 E 85 187
# 左连接
df3 = pd.merge(df1, df2, how="left")
print("merge.left = \n", df3)
merge.left =
name score height
0 A 65 NaN
1 B 45 NaN
2 C 56 176.0
3 D 78 156.0
4 E 85 187.0
# 右连接
df3 = pd.merge(df1, df2, how="right")
print("merge.right = \n", df3)
merge.right =
name score height
0 C 56.0 176
1 D 78.0 156
2 E 85.0 187
3 F NaN 191
4 G NaN 173
6.2.8. suffixes参数¶
当两个数据源有重复的列名的时候,重复列名字又不作为连接操作的数据,merge函数会自动为重复列明添加_x
和_y
作为区分,但是,我们
可以通过suffixes参数设置我们需要的后缀用以区分重复。
# 准备数据
d1 = {"name":list("ABCDE"), "score":[65,45,56,78,85]}
d2 = {"name":list("CDEFG"), "score":[176,156,187,191,173]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
print("df1 = \n", df1)
print("\n df2 = \n", df2)
df1 =
name score
0 A 65
1 B 45
2 C 56
3 D 78
4 E 85
df2 =
name score
0 C 176
1 D 156
2 E 187
3 F 191
4 G 173
# 默认重复会自动加上后缀加以区分
df3 = pd.merge(df1, df2, on="name")
print("默认后缀:df3 = \n", df3)
默认后缀:df3 =
name score_x score_y
0 C 56 176
1 D 78 156
2 E 85 187
# 通过suffixes指定后缀
df3 = pd.merge(df1, df2, on="name", suffixes=['01', '02'])
print("指定后缀:df3 = \n", df3)
指定后缀:df3 =
name score01 score02
0 C 56 176
1 D 78 156
2 E 85 187