# 合并数据 对数据集合的合是基本操作之一,也是我们处理大量数据的核心操作,本章主要研究数据的合并操作。 ```python # 准备数据 import numpy as np import pandas as pd def make_df(cols, ind): '''生成一个简单的DataFrame数据''' data = {c:[str(c) + str(i) for i in ind] for c in cols} return pd.DataFrame(data, ind) # 测试函数 df = make_df("ABC", range(5)) print(df) ``` A B C 0 A0 B0 C0 1 A1 B1 C1 2 A2 B2 C2 3 A3 B3 C3 4 A4 B4 C4 ## pandas.concat 通过pandas的concat能实现简单的数据合并。 ### pandas.concat实现简单的Series和DataFrame的合并 ```python # 简单的合并 s1 = pd.Series(list("ABC"), index =[1,2,3]) s2 = pd.Series(list("DEF"), index =[4,5,6]) s = pd.concat([s1, s2]) print("合并后: \n", s) ``` 合并后: 1 A 2 B 3 C 4 D 5 E 6 F dtype: object ```python # DF合并 df1 = make_df("ABC", [1,2,3]) df2 = make_df("DEF", [4,5,6]) print("df1 = \n", df1) print("\n df2 = \n", df2) ``` df1 = A B C 1 A1 B1 C1 2 A2 B2 C2 3 A3 B3 C3 df2 = D E F 4 D4 E4 F4 5 D5 E5 F5 6 D6 E6 F6 ```python # 对两个df进行合并 df3 = pd.concat([df1, df2]) print("\n df3 = \n", df3) ``` df3 = A B C D E F 1 A1 B1 C1 NaN NaN NaN 2 A2 B2 C2 NaN NaN NaN 3 A3 B3 C3 NaN NaN NaN 4 NaN NaN NaN D4 E4 F4 5 NaN NaN NaN D5 E5 F5 6 NaN NaN NaN D6 E6 F6 /Users/augs/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'. To retain the current behavior and silence the warning, pass 'sort=True'. ```python # 合并的时候同样可以指定axis df4 = pd.concat([df1, df2], axis=1) print("\n df4 = \n", df4) ``` df4 = A B C D E F 1 A1 B1 C1 NaN NaN NaN 2 A2 B2 C2 NaN NaN NaN 3 A3 B3 C3 NaN NaN NaN 4 NaN NaN NaN D4 E4 F4 5 NaN NaN NaN D5 E5 F5 6 NaN NaN NaN D6 E6 F6 ### 重复索引的处理 重复索引在合并过程中是默认保留的,DataFrame也允许重复索引值的出现,但会造成混淆,我们对重复索引的处理主要包括: - 捕捉异常, 需要设置参数 verify_integrity - 忽略重复值,需要设置参数 ignore_index - 增加多级索引, 需要设置参数 keys ```python # 准备数据 df1 = make_df("AB", [0,1]) df2 = make_df("AB", [3,4]) df2.index = df1.index print("df1 = \n", df1) print("\n df2 = \n", df2) ``` df1 = A B 0 A0 B0 1 A1 B1 df2 = A B 0 A3 B3 1 A4 B4 ```python ## 合并df1, df2 ## 合并后后出现重复索引,DataFrame允许出现重复索引 df3 = pd.concat([df1, df2]) print("\n df3 = \n", df3) ``` df3 = A B 0 A0 B0 1 A1 B1 0 A3 B3 1 A4 B4 ```python ## 如果我们不允许出现重复索引,则需要设置verify_integrity try: pd.concat([df1, df2], verify_integrity=True) except ValueError as e: print("ValueError: ", e) ``` ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64') ```python ## 还可以选择忽略,此时自动生成心的索引,需要设置 ignore_index df5 = pd.concat([df1, df2], ignore_index=True) print("df5 = \n", df5) ``` df5 = A B 0 A0 B0 1 A1 B1 2 A3 B3 3 A4 B4 ```python ## 还可以通过设置key增加多级索引 df6 = pd.concat([df1, df2], keys=["A", "B"]) print("df6 = \n", df6) ``` df6 = A B A 0 A0 B0 1 A1 B1 B 0 A3 B3 1 A4 B4 ### 类似join的合并 如果合并的数据没有相同的列名,即需要合并的数据往外没有相同的列名,此时如何处理合并后的结果需要我们特殊处理。 pd.concat给我们提供了一些选项来解决这个问题。 通过设置参数,我们可以实现: - 交集合并,join='inner' - 并集合并,join='outer' - 自定义列名, join_axis ```python ## 准备数据 df1 = make_df("ABC", [1,2]) df2 = make_df("BCD", [3,4]) print("df1 = \n", df1) print("df2 = \n", df2) ``` df1 = A B C 1 A1 B1 C1 2 A2 B2 C2 df2 = B C D 3 B3 C3 D3 4 B4 C4 D4 ```python # 默认采用的是并集合并,即join=outer df3 = pd.concat([df1, df2]) print("df3 =\n", df3) ``` df3 = A B C D 1 A1 B1 C1 NaN 2 A2 B2 C2 NaN 3 NaN B3 C3 D3 4 NaN B4 C4 D4 /Users/augs/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'. To retain the current behavior and silence the warning, pass 'sort=True'. This is separate from the ipykernel package so we can avoid doing imports until ```python # 可以采用交集合并 # 合并后只出现两个列索引的交集部分 df4 = pd.concat([df1, df2], join='inner') print("df4 =\n", df4) ``` df4 = B C 1 B1 C1 2 B2 C2 3 B3 C3 4 B4 C4 ```python ##自定义列名 df5 = pd.concat([df1, df2], join_axes=[df1.columns]) print("df5 = \n", df5) ``` df5 = A B C 1 A1 B1 C1 2 A2 B2 C2 3 NaN B3 C3 4 NaN B4 C4 ### pandas.append append方式使用简单,但是它并不是直接更新原有对象的值,而是合并后创建一个新对象,每次合并都要重新创建索引和数据缓存,相对效率比较低,如果大数据量 操作,推荐使用concat。 ```python ## append df3 = df1.append(df2) print("df3 = \n", df3) ``` df3 = A B C D 1 A1 B1 C1 NaN 2 A2 B2 C2 NaN 3 NaN B3 C3 D3 4 NaN B4 C4 D4 /Users/augs/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:6211: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version of pandas will change to not sort by default. To accept the future behavior, pass 'sort=False'. To retain the current behavior and silence the warning, pass 'sort=True'. sort=sort) ## pandas.merge合并 本节内容类似于关系代数中的连接和合并,有关系代数或者SQL基础的同学相对会简单些。 常见的关系连接包括一对一(1:1), 多对一(N:1)和多对多(N:N), 我们使用pandas.merge可以完成相应的操作。 ```python ## 准备数据 df1 = pd.DataFrame({"name":list("ABCD"), "group":["I", "II", "III", "II"]}) df2 = pd.DataFrame({"name":list("ABCD"), "score":[61,78,74, 98]}) df3 = pd.DataFrame({"group":["I", "II", "III"], "leader":["Alice", "Bob", "Cindy"]}) # 此数据结构意味着每个组需要掌握的专业技能,一组需要有多个技能用重复值表示 df4 = pd.DataFrame({"group":["I", "I", "II","II", "II", "III", "III"], "skills":["Linux", "Python", "Java", "Math", "English", "C++", "PHP"]}) print("df1 = \n", df1) print("\n df2 = \n", df2) print("\n df3 = \n", df3) print("\n df4 = \n", df4) ``` df1 = name group 0 A I 1 B II 2 C III 3 D II df2 = name score 0 A 61 1 B 78 2 C 74 3 D 98 df3 = group leader 0 I Alice 1 II Bob 2 III Cindy df4 = group skills 0 I Linux 1 I Python 2 II Java 3 II Math 4 II English 5 III C++ 6 III PHP ### 一对一的连接 我们连接df1和df2的时候,每个都有索引name,通过name可以完成连接,并会自动进行合并。此时如果列的顺序不一致,并不会影响结果,pandas会自动处理。 ```python ## 一对一连接 df5 = pd.merge(df1, df2) print("df5 = \n", df5) ``` df5 = name group score 0 A I 61 1 B II 78 2 C III 74 3 D II 98 ### 多对一的连接 此类操作可能有重复值,此时获得的结果会自动保留重复值。 ```python df6 = pd.merge(df1, df3) print("df6 = \n", df6) ``` df6 = name group leader 0 A I Alice 1 B II Bob 2 D II Bob 3 C III Cindy ### 多对多的连接 此时因为有重复值,会自动按照最多的可能性对结果进行扩充,重复值都会得到保留。 ```python df7 = pd.merge(df1, df4) print("df7 = \n", df7) ``` df7 = name group skills 0 A I Linux 1 A I Python 2 B II Java 3 B II Math 4 B II English 5 D II Java 6 D II Math 7 D II English 8 C III C++ 9 C III PHP ### on参数的使用 两个数据集合并,以上案例都是使用默认的共同有的列作为合并的依据,我们还可以指定在那一列上进行合并,这就是参数on的作用。 ```python # 使用on参数指定合并的列 df8 = pd.merge(df1, df2, on="name") print("df8 = \n", df8) ``` df8 = name group score 0 A I 61 1 B II 78 2 C III 74 3 D II 98 ### left_on和right_on参数的使用 如果需要合并的数据集的列不是一个名字,就需要使用left_on和right_on来指定需要合并的两个列的名字。 例如以下两个数据集,一个列是name,一个列是my_names,我们需要这两个列进行合并,则需要进行如下面案例的操作。 在这种 情况下得到的数据会有一列冗余数据,此时可以使用drop方法将不需要的列扔掉。 ```python ## 准备数据 df1 = pd.DataFrame({"name":list("ABCD"), "group":["I", "II", "III", "II"]}) df2 = pd.DataFrame({"my_names":list("ABCD"), "score":[61,78,74, 98]}) df3 = pd.merge(df1, df2, left_on="name", right_on="my_names") print("df3 = \n", df3) print("\n df3.drop = \n", df3.drop("my_names", axis=1)) ``` df3 = name group my_names score 0 A I A 61 1 B II B 78 2 C III C 74 3 D II D 98 df3.drop = name group score 0 A I 61 1 B II 78 2 C III 74 3 D II 98 ### left_index和right_index 我们以前的合并都是要求两个数据源有相同的列,如果没有相同的列可以指定,两个数据的合并还可以用键来实现合并,此时通过设置left_index/right_index来确定是否使用键作为合并的依据。 ```python # 使用准备好的数据 print("df1 = \n", df1) print("\n df2 = \n", df2) df4 = pd.merge(df1, df2, left_index=True, right_index=True) print("\n df4 = \n", df4) ``` df1 = name group 0 A I 1 B II 2 C III 3 D II df2 = my_names score 0 A 61 1 B 78 2 C 74 3 D 98 df4 = name group my_names score 0 A I A 61 1 B II B 78 2 C III C 74 3 D II D 98 ### how参数的使用 当两个数据进行合并的时候,如果把两个数据当做集合看待,则合并的方式按照关系数学,有: - 内连接:how='inner', 此时结果只保留交集 - 外连接: how='outer', 此时结果保留的是两个数据集的并集 - 左连接: how='left', 此时结果保留左侧全部内容,有连接的右侧内容也会保留 - 右连接: how='right', 此时结果保留右侧全部内容,有链接的左侧内容也会保留 具体含义可以参考关系数学相关概念。 ```python ## 准备数据 d1 = {"name":list("ABCDE"), "score":[65,45,56,78,85]} d2 = {"name":list("CDEFG"), "height":[176,156,187,191,173]} df1 = pd.DataFrame(d1) df2 = pd.DataFrame(d2) print("df1 = \n", df1) print("\n df2 = \n", df2) ``` df1 = name score 0 A 65 1 B 45 2 C 56 3 D 78 4 E 85 df2 = name height 0 C 176 1 D 156 2 E 187 3 F 191 4 G 173 ```python # 外连接 df3 = pd.merge(df1, df2, how="outer") print("merge.outer = \n", df3) ``` merge.outer = name score height 0 A 65.0 NaN 1 B 45.0 NaN 2 C 56.0 176.0 3 D 78.0 156.0 4 E 85.0 187.0 5 F NaN 191.0 6 G NaN 173.0 ```python # 内连接 df3 = pd.merge(df1, df2, how="inner") print("merge.inner = \n", df3) ``` merge.inner = name score height 0 C 56 176 1 D 78 156 2 E 85 187 ```python # 左连接 df3 = pd.merge(df1, df2, how="left") print("merge.left = \n", df3) ``` merge.left = name score height 0 A 65 NaN 1 B 45 NaN 2 C 56 176.0 3 D 78 156.0 4 E 85 187.0 ```python # 右连接 df3 = pd.merge(df1, df2, how="right") print("merge.right = \n", df3) ``` merge.right = name score height 0 C 56.0 176 1 D 78.0 156 2 E 85.0 187 3 F NaN 191 4 G NaN 173 ### suffixes参数 当两个数据源有重复的列名的时候,重复列名字又不作为连接操作的数据,merge函数会自动为重复列明添加`_x`和`_y`作为区分,但是,我们 可以通过suffixes参数设置我们需要的后缀用以区分重复。 ```python # 准备数据 d1 = {"name":list("ABCDE"), "score":[65,45,56,78,85]} d2 = {"name":list("CDEFG"), "score":[176,156,187,191,173]} df1 = pd.DataFrame(d1) df2 = pd.DataFrame(d2) print("df1 = \n", df1) print("\n df2 = \n", df2) ``` df1 = name score 0 A 65 1 B 45 2 C 56 3 D 78 4 E 85 df2 = name score 0 C 176 1 D 156 2 E 187 3 F 191 4 G 173 ```python # 默认重复会自动加上后缀加以区分 df3 = pd.merge(df1, df2, on="name") print("默认后缀:df3 = \n", df3) ``` 默认后缀:df3 = name score_x score_y 0 C 56 176 1 D 78 156 2 E 85 187 ```python # 通过suffixes指定后缀 df3 = pd.merge(df1, df2, on="name", suffixes=['01', '02']) print("指定后缀:df3 = \n", df3) ``` 指定后缀:df3 = name score01 score02 0 C 56 176 1 D 78 156 2 E 85 187