Basic Operations in Pandas

🔖 pandas

🔖 machine learning

Author

Guangyao Zhao

Published

Dec 2, 2022

import numpy as np
import pandas as pd

data = np.random.randint(low=1, high=10, size=(3, 5))
df = pd.DataFrame(data, columns=list("abcde"), index=list("ABC"))
df

	a	b	c	d	e
A	2	7	2	9	5
B	5	7	9	2	5
C	4	4	3	2	8

1 Index operations

1.1 `set_index()`

Assign a new index directly:

new_index = list("abc")
df.index = new_index
df

	a	b	c	d	e
a	2	7	2	9	5
b	5	7	9	2	5
c	4	4	3	2	8

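Use one of the existing columns as the new index. Once a column becomes the index, it no longer appears among the columns:

# Use a column as the index
df.set_index("a")  # or df.set_index("a", inplace=True), where inplace=True modifies the original df instead of returning a new object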

	b	c	d	e
a
2	7	2	9	5
5	7	9	2	5
4	4	3	2	8

To keep the column after it becomes the index, pass drop=False:

df.set_index("b", drop=False)

	a	b	c	d	e
b
7	2	7	2	9	5
7	5	7	9	2	5
4	4	4	3	2	8

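The original column and the original index can both be kept. Note that drop applies to the column, while append applies to the index:

df.set_index("a", drop=False, append=True)  # append=True adds the new level to the existing index instead of replacing it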

		a	b	c	d	e
	a
a	2	2	7	2	9	5
b	5	5	7	9	2	5
c	4	4	4	3	2	8
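The effect of drop and append can be checked on a small hand-written frame. This is a minimal sketch: the frame `demo` and its values are made up for illustration and are not part of the original data.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [2, 5, 4], "b": [7, 7, 4]}, index=list("xyz"))

moved = demo.set_index("a")                             # column a becomes the index and is removed
kept = demo.set_index("a", drop=False)                  # column a is also kept as a column
stacked = demo.set_index("a", drop=False, append=True)  # old index kept, a added as a second level

print(list(moved.columns))    # ['b']
print(list(kept.columns))     # ['a', 'b']
print(stacked.index.nlevels)  # 2
```

None of these calls changes `demo` itself, because inplace is left at its default of False.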

1.2 `reset_index()`

Reset the index:

df.reset_index()  # reset the index; the old index is moved into a column

	index	a	b	c	d	e
0	a	2	7	2	9	5
1	b	5	7	9	2	5
2	c	4	4	3	2	8

To discard the old index instead:

df.reset_index(drop=True)  # reset the index; the old index is dropped rather than kept as a column

	a	b	c	d	e
0	2	7	2	9	5
1	5	7	9	2	5
2	4	4	3	2	8

1.3 `rename()`

Use mapper to change index labels or column names:

df.rename(mapper=str.upper, axis=1)  # built-in function

	A	B	C	D	E
a	2	7	2	9	5
b	5	7	9	2	5
c	4	4	3	2	8

df.rename(mapper=lambda x: "pre_" + x, axis=1)  # custom anonymous function

	pre_a	pre_b	pre_c	pre_d	pre_e
a	2	7	2	9	5
b	5	7	9	2	5
c	4	4	3	2	8
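Besides a function, rename() also accepts a dict when only some labels need to change. The sketch below uses a made-up frame `demo`; the new names are arbitrary.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])

# Rename selected columns and index labels; labels missing from the dicts stay unchanged
renamed = demo.rename(columns={"a": "alpha"}, index={"x": "row_x"})

print(list(renamed.columns))  # ['alpha', 'b']
print(list(renamed.index))    # ['row_x', 'y']
```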

1.4 Other

Data type:

df.index.dtype

dtype('O')

Sorting:

df.index.sort_values(ascending=False)

Index(['c', 'b', 'a'], dtype='object')

Applying a function:

df.index.map(lambda x: "pre_" + x)

Index(['pre_a', 'pre_b', 'pre_c'], dtype='object')

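2 Information about the data

This section covers the basic and statistical information of a DataFrame. It is useful for checking that the data was read correctly and roughly matches the source: row labels, column names, missing values, the dtype of each column, and so on.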

data = np.random.randint(low=1, high=10, size=(6, 5))
df = pd.DataFrame(data, columns=list("abcde"), index=list("ABCDEF"))
df

	a	b	c	d	e
A	4	3	7	7	1
B	6	2	9	4	4
C	5	7	8	9	6
D	7	1	2	3	4
E	2	5	4	9	8
F	9	2	7	2	4

2.1 `head(), tail(), sample()`

The first rows:

df.head(3)  # defaults to the first 5 rows

	a	b	c	d	e
A	4	3	7	7	1
B	6	2	9	4	4
C	5	7	8	9	6

The last rows:

df.tail(4)  # defaults to the last 5 rows

	a	b	c	d	e
C	5	7	8	9	6
D	7	1	2	3	4
E	2	5	4	9	8
F	9	2	7	2	4

Random sampling:

df.sample(3)  # defaults to 1 row; the number of rows can be specified

	a	b	c	d	e
B	6	2	9	4	4
A	4	3	7	7	1
F	9	2	7	2	4

2.2 `shape`

As in numpy, pandas exposes the shape of the data:

df.shape

(6, 5)

2.3 `info()`

Pandas has a handy method that prints the general information of a DataFrame:

Object type
Index summary
Number of rows and columns
Dtype of each column

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6 entries, A to F
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   a       6 non-null      int64
 1   b       6 non-null      int64
 2   c       6 non-null      int64
 3   d       6 non-null      int64
 4   e       6 non-null      int64
dtypes: int64(5)
memory usage: 288.0+ bytes

3 Statistics

3.1 `describe()`

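Pandas can produce summary statistics for a whole DataFrame at once:

Count of non-null values
Mean
Standard deviation
Minimum
Quartiles (25%, 50%, 75%)
Maximum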

df.describe()

	a	b	c	d	e
count	6.000000	6.000000	6.000000	6.000000	6.000000
mean	5.500000	3.333333	6.166667	5.666667	4.500000
std	2.428992	2.250926	2.639444	3.076795	2.345208
min	2.000000	1.000000	2.000000	2.000000	1.000000
25%	4.250000	2.000000	4.750000	3.250000	4.000000
50%	5.500000	2.500000	7.000000	5.500000	4.000000
75%	6.750000	4.500000	7.750000	8.500000	5.500000
max	9.000000	7.000000	9.000000	9.000000	8.000000

3.2 `corr()`

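One point worth stressing: pandas is column-oriented, with each column treated as a feature. Unlike numpy, which by default treats the array as a whole, pandas computes most statistics per column by default. corr(), for example, returns the pairwise correlation between columns: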

df.corr()

	a	b	c	d	e
a	1.000000	-0.585279	0.109184	-0.883119	-0.368648
b	-0.585279	1.000000	0.258085	0.885598	0.568301
c	0.109184	0.258085	1.000000	0.082091	-0.242325
d	-0.883119	0.885598	0.082091	1.000000	0.443476
e	-0.368648	0.568301	-0.242325	0.443476	1.000000

Other common statistical functions such as df.max(), df.min(), df.std() and df.mean() work the same way.

3.3 `idxmax()`

There are other useful summaries, such as the index label of each column's maximum:

df.idxmax()  # df.idxmin() works the same way

a    F
b    C
c    B
d    C
e    E
dtype: object

3.4 `nunique()`

Another handy method counts the number of distinct values in each column:

df.nunique()

a    6
b    5
c    5
d    5
e    4
dtype: int64
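nunique() returns only the count of distinct values; to see the values themselves, unique() can be called on a single column. A minimal sketch with a made-up frame `demo`:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [1, 1, 2], "b": [3, 4, 5]})

counts = demo.nunique()      # number of distinct values per column
values = demo["a"].unique()  # the distinct values of one column, in order of appearance

print(counts.to_dict())  # {'a': 2, 'b': 3}
print(values.tolist())   # [1, 2]
```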

4 Non-statistical calculations

4.1 `all(), any()`

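To check whether every value in a column exceeds some threshold, the calculation takes two steps:

Compare each value with the threshold, which gives True or False
If every value in a column is True, the result for that column is True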

df > 3

	a	b	c	d	e
A	True	False	True	True	False
B	True	False	True	True	True
C	True	True	True	True	True
D	True	False	False	False	True
E	False	True	True	True	True
F	True	False	True	False	True

(df > 3).all()

a    False
b    False
c    False
d    False
e    False
dtype: bool

any() is used the same way as all(), except that it returns True as soon as at least one value is True.
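Since every column in the run above happens to give False, the difference between all() and any() is easier to see on hand-picked values. The frame `demo` below is a made-up example.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [4, 6, 5], "b": [3, 2, 7], "c": [1, 2, 3]})

mask = demo > 3  # element-wise comparison, gives a boolean frame

print(mask.all().tolist())        # [True, False, False]: only column a is entirely above 3
print(mask.any().tolist())        # [True, True, False]: column b has at least one value above 3
print(mask.any(axis=1).tolist())  # [True, True, True]: the same check, row by row
```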

4.2 `round()`

data = np.random.randn(6, 5)
df = pd.DataFrame(data, columns=list("abcde"), index=list("ABCDEF"))
df

	a	b	c	d	e
A	1.403203	-0.843999	-0.725820	1.035822	-0.205633
B	-0.852134	0.563946	1.369935	0.321659	1.487192
C	0.473048	0.927135	1.017546	0.057711	-1.454395
D	0.443630	-1.281710	1.386452	-0.459800	0.497552
E	-0.366010	-1.663643	-1.238280	-1.286800	-0.898251
F	-0.171442	1.410416	-0.834794	-1.046389	-1.085320

Round the whole DataFrame to two decimal places:

df.round(2)

	a	b	c	d	e
A	1.40	-0.84	-0.73	1.04	-0.21
B	-0.85	0.56	1.37	0.32	1.49
C	0.47	0.93	1.02	0.06	-1.45
D	0.44	-1.28	1.39	-0.46	0.50
E	-0.37	-1.66	-1.24	-1.29	-0.90
F	-0.17	1.41	-0.83	-1.05	-1.09

Specify the number of decimal places per column:

df.round({"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})

	a	b	c	d	e
A	1.4	-0.84	-0.726	1.0358	-0.20563
B	-0.9	0.56	1.370	0.3217	1.48719
C	0.5	0.93	1.018	0.0577	-1.45440
D	0.4	-1.28	1.386	-0.4598	0.49755
E	-0.4	-1.66	-1.238	-1.2868	-0.89825
F	-0.2	1.41	-0.835	-1.0464	-1.08532

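4.3 Arithmetic

df.add(other)  # addition
df.sub(other)  # subtraction
df.mul(other)  # multiplication
df.div(other)  # division
df.mod(other)  # modulo
df.pow(other)  # exponentiation
df.dot(df2)  # matrix multiplication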
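One reason to use these methods instead of the plain operators is the fill_value parameter, which controls what happens when the two operands do not share the same labels. The sketch below uses made-up frames `left`, `right` and `square`.

```python
import pandas as pd

# Hypothetical toy frames, used only for illustration
left = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["x", "y"])
right = pd.DataFrame({"a": [10, 20]}, index=["x", "y"])

plain = left + right                    # column b has no partner in right, so it becomes NaN
filled = left.add(right, fill_value=0)  # the missing partner is treated as 0

square = pd.DataFrame([[1, 2], [3, 4]])
product = square.dot(square)            # matrix product, not element-wise multiplication

print(plain["b"].isna().all())   # True
print(filled["b"].tolist())      # [3.0, 4.0]
print(product.values.tolist())   # [[7, 10], [15, 22]]
```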

4.4 `value_counts()`

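This method is most often used on a Series and is handy for inspecting the distribution of a column (DataFrame.value_counts() also exists since pandas 1.1):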

df["a"].value_counts(normalize=True)

-0.171442    0.166667
-0.852134    0.166667
-0.366010    0.166667
 0.443630    0.166667
 0.473048    0.166667
 1.403203    0.166667
Name: a, dtype: float64

5 Position-based calculations

5.1 `diff()`

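Pandas provides difference calculations: the difference between the current row and the previous one, or between the current row and the next one:

df.diff(1)  # current row minus the previous row; the first row has no previous row, so it is NaN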

	a	b	c	d	e
A	NaN	NaN	NaN	NaN	NaN
B	-2.255337	1.407945	2.095755	-0.714162	1.692826
C	1.325182	0.363189	-0.352389	-0.263948	-2.941588
D	-0.029418	-2.208844	0.368906	-0.517511	1.951947
E	-0.809640	-0.381933	-2.624732	-0.827001	-1.395803
F	0.194568	3.074059	0.403485	0.240411	-0.187069

df.diff(-1)  # current row minus the next row; the last row has no next row, so it is NaN

	a	b	c	d	e
A	2.255337	-1.407945	-2.095755	0.714162	-1.692826
B	-1.325182	-0.363189	0.352389	0.263948	2.941588
C	0.029418	2.208844	-0.368906	0.517511	-1.951947
D	0.809640	0.381933	2.624732	0.827001	1.395803
E	-0.194568	-3.074059	-0.403485	-0.240411	0.187069
F	NaN	NaN	NaN	NaN	NaN

5.2 `shift()`

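Shift the data without doing any calculation. Note that the index and columns stay unchanged:

df.shift(periods=1, axis=0, fill_value=0)  # shifting leaves empty cells, which can be filled with a chosen value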

	a	b	c	d	e
A	0.000000	0.000000	0.000000	0.000000	0.000000
B	1.403203	-0.843999	-0.725820	1.035822	-0.205633
C	-0.852134	0.563946	1.369935	0.321659	1.487192
D	0.473048	0.927135	1.017546	0.057711	-1.454395
E	0.443630	-1.281710	1.386452	-0.459800	0.497552
F	-0.366010	-1.663643	-1.238280	-1.286800	-0.898251
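diff() and shift() are closely related: a difference is the original data minus a shifted copy of itself. The sketch below checks this on a made-up one-column frame `demo`.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [1.0, 4.0, 9.0, 16.0]})

backward = demo.diff(1)   # row i minus row i-1
forward = demo.diff(-1)   # row i minus row i+1

# The same results written with shift()
assert backward.equals(demo - demo.shift(1))
assert forward.equals(demo - demo.shift(-1))

print(backward["a"].tolist())  # [nan, 3.0, 5.0, 7.0]
print(forward["a"].tolist())   # [-3.0, -5.0, -7.0, nan]
```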

5.3 `rank()`

This method replaces each value with its rank:

df.rank(axis=1)  # rank within each row

	a	b	c	d	e
A	5.0	1.0	2.0	4.0	3.0
B	1.0	3.0	4.0	2.0	5.0
C	3.0	4.0	5.0	2.0	1.0
D	3.0	1.0	5.0	2.0	4.0
E	5.0	1.0	3.0	2.0	4.0
F	4.0	5.0	3.0	2.0	1.0

The ranks above are absolute; they can also be expressed as percentile ranks:

df.rank(axis=1, pct=True)

	a	b	c	d	e
A	1.0	0.2	0.4	0.8	0.6
B	0.2	0.6	0.8	0.4	1.0
C	0.6	0.8	1.0	0.4	0.2
D	0.6	0.2	1.0	0.4	0.8
E	1.0	0.2	0.6	0.4	0.8
F	0.8	1.0	0.6	0.4	0.2
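The random floats above contain no ties, so it may help to see how rank() treats equal values. By default tied values share the average of their ranks, and the method parameter changes that. A minimal sketch with a made-up Series `scores`:

```python
import pandas as pd

# Hypothetical scores with one tie, used only for illustration
scores = pd.Series([90, 85, 90, 70], index=list("wxyz"))

average_rank = scores.rank()                                  # ties share the average rank
min_rank = scores.rank(method="min")                          # ties share the lowest rank
competition = scores.rank(ascending=False, method="min")      # highest score gets rank 1

print(average_rank.tolist())  # [3.5, 2.0, 3.5, 1.0]
print(min_rank.tolist())      # [3.0, 2.0, 3.0, 1.0]
print(competition.tolist())   # [1.0, 3.0, 1.0, 4.0]
```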

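6 Selecting data

Selecting data is one of the most important parts of pandas. There are several ways to do it:

df.a: select a column by attribute access (works only when the name is a valid Python identifier that does not clash with a DataFrame attribute); recommended
df['a'], df[0]: the former is recommended, the latter is not
df.loc['a', 'c'], df.loc[0, 10]: select by axis labels; recommended
df.iloc[0, 5], df.iloc[0: 3, 1: 2]: select by integer position; recommended

Tip

.loc[] and .iloc[] differ only by an i, which stands for integer: .iloc[] selects by integer position, while .loc[] selects by label.

As the examples above show, pandas also supports slicing.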
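The difference between label-based and position-based selection is easiest to see when the index labels are themselves integers. The sketch below uses a made-up frame `demo` whose labels are deliberately in reverse order.

```python
import pandas as pd

# Hypothetical toy frame with integer labels in reverse order
demo = pd.DataFrame({"a": [10, 20, 30], "b": [1, 2, 3]}, index=[2, 1, 0])

by_label = demo.loc[0, "a"]      # the row whose label is 0, i.e. the last row
by_position = demo.iloc[0, 0]    # the first row by position

label_slice = demo.loc[2:1]      # label slices include both endpoints
position_slice = demo.iloc[0:1]  # position slices exclude the end, as in plain Python

print(by_label, by_position)                  # 30 10
print(len(label_slice), len(position_slice))  # 2 1
```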

6.1 `loc()`

All rows of one column:

df.loc[:, "a"]

A    1.403203
B   -0.852134
C    0.473048
D    0.443630
E   -0.366010
F   -0.171442
Name: a, dtype: float64

Slicing (with .loc[], both endpoints are included):

df.loc["A":"D", "a":"c"]

	a	b	c
A	1.403203	-0.843999	-0.725820
B	-0.852134	0.563946	1.369935
C	0.473048	0.927135	1.017546
D	0.443630	-1.281710	1.386452

Some rows of several columns:

df.loc["A":"D", ["a", "c"]]

	a	c
A	1.403203	-0.725820
B	-0.852134	1.369935
C	0.473048	1.017546
D	0.443630	1.386452

6.2 `iloc()`

All rows of the 1st column:

df.iloc[:, 0]

A    1.403203
B   -0.852134
C    0.473048
D    0.443630
E   -0.366010
F   -0.171442
Name: a, dtype: float64

All rows of the 1st and 4th columns:

df.iloc[:, [0, 3]]

	a	d
A	1.403203	1.035822
B	-0.852134	0.321659
C	0.473048	0.057711
D	0.443630	-0.459800
E	-0.366010	-1.286800
F	-0.171442	-1.046389

Rows 1 and 3 combined with columns 2 and 3 (every combination of the listed rows and columns):

df.iloc[[0, 2], [1, 2]]

	b	c
A	-0.843999	-0.725820
C	0.927135	1.017546

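7 Advanced operations

7.1 Complex queries

Querying is one of the most important features of pandas. The basic selection in Sec. 6 is often not enough, so more complex logical conditions are needed. Unlike Sec. 6, two forms are supported here:

df[]: when filtering rows only, not columns
df.loc[]: when filtering rows and columns at the same time

Tip

When combining several conditions, wrap each individual condition in ().
Whichever method is used, filtering works by marking rows that meet the condition as True and the others as False.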
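Both points of the tip can be seen in a short sketch. The frame `demo` is made up; the parentheses matter because & binds more tightly than the comparison operators in Python.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [1, -1, 2], "b": [0.5, 0.2, 0.9]})

# Each comparison is wrapped in () before being combined with &
mask = (demo["a"] > 0) & (demo["b"] < 0.8)
selected = demo[mask]  # only the rows marked True are kept

print(mask.tolist())         # [True, False, False]
print(list(selected.index))  # [0]
```

Written without the parentheses, Python would evaluate `0 & demo["b"]` first, which is not the intended condition.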

7.1.1 `df[]`

This form can only filter rows.

df[(df["a"] > 0) & (df["b"] < 0.8)]

	a	b	c	d	e
A	1.403203	-0.843999	-0.725820	1.035822	-0.205633
D	0.443630	-1.281710	1.386452	-0.459800	0.497552

df[~(df["a"] > 0)]  # negation, i.e. df["a"] <= 0

	a	b	c	d	e
B	-0.852134	0.563946	1.369935	0.321659	1.487192
E	-0.366010	-1.663643	-1.238280	-1.286800	-0.898251
F	-0.171442	1.410416	-0.834794	-1.046389	-1.085320

df[df["a"] > df["b"]]

	a	b	c	d	e
A	1.403203	-0.843999	-0.725820	1.035822	-0.205633
D	0.443630	-1.281710	1.386452	-0.459800	0.497552
E	-0.366010	-1.663643	-1.238280	-1.286800	-0.898251

7.1.2 `df.loc()`

This form can filter rows and select specific columns at the same time.

df.loc[df["a"] > 0.1, "b":]  # rows where a > 0.1, showing only column b and the columns after it

	b	c	d	e
A	-0.843999	-0.725820	1.035822	-0.205633
C	0.927135	1.017546	0.057711	-1.454395
D	-1.281710	1.386452	-0.459800	0.497552

df.loc[(df["a"] > 0.1) | (df["b"] < 0.3), ["a", "b"]]

	a	b
A	1.403203	-0.843999
C	0.473048	0.927135
D	0.443630	-1.281710
E	-0.366010	-1.663643

df[(df.loc[:, ["a", "b"]] > 0.1).all(axis=1)]  # rows where both a and b are greater than 0.1

	a	b	c	d	e
C	0.473048	0.927135	1.017546	0.057711	-1.454395

7.1.3 `query()`

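When the logic becomes very complex, the methods above get messy. Pandas also provides an SQL-like query syntax that evaluates an expression over the columns and returns the rows that satisfy it. This style is highly recommended.

query = "a > b > 0.0"  # rows where a > b and both columns are greater than 0.0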
df.query(query)

	a	b	c	d	e

query = "(a > -0.1) & (b < 0.3) & (c >= a + b)"
df.query(query)

	a	b	c	d	e
D	0.443630	-1.281710	1.386452	-0.459800	0.497552

a_mean = df["a"].mean()
query = "b >= @a_mean + 0.1"  # local variables can be referenced with @
df.query(query)

	a	b	c	d	e
B	-0.852134	0.563946	1.369935	0.321659	1.487192
C	0.473048	0.927135	1.017546	0.057711	-1.454395
F	-0.171442	1.410416	-0.834794	-1.046389	-1.085320
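A query string is another way of writing a boolean mask. The sketch below, on a made-up frame `demo` with a hypothetical local variable `threshold`, shows that both forms select the same rows.

```python
import pandas as pd

# Hypothetical toy frame and threshold, used only for illustration
demo = pd.DataFrame({"a": [0.5, -0.2, 0.9], "b": [0.1, 0.4, 1.2]})
threshold = 0.3

via_query = demo.query("a > b and b < @threshold")
via_mask = demo[(demo["a"] > demo["b"]) & (demo["b"] < threshold)]

assert via_query.equals(via_mask)
print(list(via_query.index))  # [0]
```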

7.2 `filter()`

This method filters by row labels or column names and supports substring matching and regular expressions:

df.filter(items=["a", "b"])  # select specific columns

	a	b
A	1.403203	-0.843999
B	-0.852134	0.563946
C	0.473048	0.927135
D	0.443630	-1.281710
E	-0.366010	-1.663643
F	-0.171442	1.410416

df.filter(regex="a", axis=1)  # columns whose name contains a

	a
A	1.403203
B	-0.852134
C	0.473048
D	0.443630
E	-0.366010
F	-0.171442

df.filter(regex="a$", axis=1)  # columns whose name ends with a

	a
A	1.403203
B	-0.852134
C	0.473048
D	0.443630
E	-0.366010
F	-0.171442

Warning

filter() only filters on index labels and column names, not on the data itself.
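With single-letter column names the patterns above all match the same column, so a frame with longer names shows the options more clearly. The sketch below uses a made-up frame `demo`; like= does plain substring matching, and axis=0 filters row labels.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame(
    {"price_usd": [1, 2], "price_eur": [3, 4], "qty": [5, 6]},
    index=["row_a", "row_b"],
)

by_substring = demo.filter(like="price")      # column names containing "price"
by_regex = demo.filter(regex="^q", axis=1)    # column names starting with "q"
by_row_label = demo.filter(like="_b", axis=0) # row labels containing "_b"

print(list(by_substring.columns))  # ['price_usd', 'price_eur']
print(list(by_regex.columns))      # ['qty']
print(list(by_row_label.index))    # ['row_b']
```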

7.3 Data type conversion

7.3.1 `convert_dtypes()`

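Before starting any analysis, make sure the data types are understood. They are usually integers, floats, strings, or datetimes.

df.convert_dtypes()  # data after dtype inference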

	a	b	c	d	e
A	1.403203	-0.843999	-0.72582	1.035822	-0.205633
B	-0.852134	0.563946	1.369935	0.321659	1.487192
C	0.473048	0.927135	1.017546	0.057711	-1.454395
D	0.44363	-1.28171	1.386452	-0.4598	0.497552
E	-0.36601	-1.663643	-1.23828	-1.2868	-0.898251
F	-0.171442	1.410416	-0.834794	-1.046389	-1.08532

df.convert_dtypes().dtypes  # dtypes after inference

a    Float64
b    Float64
c    Float64
d    Float64
e    Float64
dtype: object

7.3.2 `to_xxx(), astype()`

Force a data type conversion:

df.iloc[0, 0] = np.nan  # set this element to a missing value
pd.to_numeric(df["a"], errors="coerce").fillna(0)  # values that cannot be parsed become NaN, which is then replaced with 0

A    0.000000
B   -0.852134
C    0.473048
D    0.443630
E   -0.366010
F   -0.171442
Name: a, dtype: float64

df.iloc[0, 0] = -1
df.astype("float32").dtypes

a    float32
b    float32
c    float32
d    float32
e    float32
dtype: object

df = df.astype("int64")  # convert float to int (the fractional part is truncated)
df

	a	b	c	d	e
A	-1	0	0	1	0
B	0	0	1	0	1
C	0	0	1	0	-1
D	0	-1	1	0	0
E	0	-1	-1	-1	0
F	0	1	0	-1	-1

df["a"] = df["a"].astype("float64")  # convert column a alone back to float
df.dtypes

a    float64
b      int64
c      int64
d      int64
e      int64
dtype: object
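Casting floats with astype("int64") drops the fractional part rather than rounding, which is why values such as 0.93 became 0 above. If rounding is wanted, round() can be applied first. A minimal sketch with a made-up Series `s`:

```python
import pandas as pd

# Hypothetical values, used only for illustration
s = pd.Series([1.7, -1.7, 0.4])

truncated = s.astype("int64")        # fractional part dropped, toward zero
rounded = s.round().astype("int64")  # rounded to the nearest integer first

print(truncated.tolist())  # [1, -1, 0]
print(rounded.tolist())    # [2, -2, 0]
```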

7.4 Sorting

7.4.1 `sort_index()`

Sort by row labels:

df.sort_index(ascending=False, axis=0)

	a	b	c	d	e
F	0.0	1	0	-1	-1
E	0.0	-1	-1	-1	0
D	0.0	-1	1	0	0
C	0.0	0	1	0	-1
B	0.0	0	1	0	1
A	-1.0	0	0	1	0

Sort by column labels:

df.sort_index(ascending=False, axis=1)

	e	d	c	b	a
A	0	1	0	0	-1.0
B	1	0	1	0	0.0
C	-1	0	1	0	0.0
D	0	0	1	-1	0.0
E	0	-1	-1	-1	0.0
F	-1	-1	0	1	0.0

Reset the index to 0..(n-1) after sorting, similar to df.reset_index(drop=True):

df.sort_index(ignore_index=True, axis=0)

	a	b	c	d	e
0	-1.0	0	0	1	0
1	0.0	0	1	0	1
2	0.0	0	1	0	-1
3	0.0	-1	1	0	0
4	0.0	-1	-1	-1	0
5	0.0	1	0	-1	-1

Warning

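Note that when the operation targets the row labels or column names rather than the data itself, both can be called indexes: the row index and the column index. Sorting the column index, however, is rarely useful in practice.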

7.4.2 `sort_values()`

df.sort_values(by="a", ascending=True, ignore_index=True)  # ignore_index=True resets the index to 0..(n-1)

	a	b	c	d	e
0	-1.0	0	0	1	0
1	0.0	0	1	0	1
2	0.0	0	1	0	-1
3	0.0	-1	1	0	0
4	0.0	-1	-1	-1	0
5	0.0	1	0	-1	-1

Tip

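Why have an ignore_index option? After sorting by values, the index order looks scrambled. If the index carries no information you need, this option resets it.

Sorting by several columns is also possible: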

df.sort_values(by=["a", "b"])

	a	b	c	d	e
A	-1.0	0	0	1	0
D	0.0	-1	1	0	0
E	0.0	-1	-1	-1	0
B	0.0	0	1	0	1
C	0.0	0	1	0	-1
F	0.0	1	0	-1	-1

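Going further, sort by the index together with a column. For example, to order a class by name and score, where the name is the index:

df1 = pd.DataFrame({"score": [99, 99, 97, 86, 86, 86, 90, 97]}, index=list("qisdndod"))
df1.index.name = "name"  # name the index so it can be referenced in the next step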
df1.sort_values(by=[df1.index.name, "score"])

	score
name
d	86
d	86
d	97
i	99
n	86
o	90
q	99
s	97

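7.5 Advanced filtering

7.5.1 `where()`

Values that satisfy the condition are kept; the others are replaced with a new value. In this example column a now holds only -1 and 0, so no row satisfies a >= 0.05 and every value is replaced: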

df.where(cond=df["a"] >= 0.05, other="one")

	a	b	c	d	e
A	one	one	one	one	one
B	one	one	one	one	one
C	one	one	one	one	one
D	one	one	one	one	one
E	one	one	one	one	one
F	one	one	one	one	one

7.5.2 `mask()`

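This method is the opposite of where(): values that satisfy the condition are replaced and the others are kept. Since no row satisfies the condition here, the DataFrame is unchanged: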

df.mask(cond=df["a"] >= 0.05, other="one")

	a	b	c	d	e
A	-1.0	0	0	1	0
B	0.0	0	1	0	1
C	0.0	0	1	0	-1
D	0.0	-1	1	0	0
E	0.0	-1	-1	-1	0
F	0.0	1	0	-1	-1
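Because the condition above matches nothing, a frame with mixed signs may show the contrast between where() and mask() more clearly. The sketch below uses a made-up frame `demo` and an element-wise condition.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [-1.0, 0.5, 2.0], "b": [3.0, -4.0, 5.0]})

kept = demo.where(demo > 0, other=0)     # keep values > 0, replace the rest with 0
replaced = demo.mask(demo > 0, other=0)  # replace values > 0 with 0, keep the rest

print(kept["a"].tolist(), kept["b"].tolist())          # [0.0, 0.5, 2.0] [3.0, 0.0, 5.0]
print(replaced["a"].tolist(), replaced["b"].tolist())  # [-1.0, 0.0, 0.0] [0.0, -4.0, 0.0]
```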

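7.6 Iterating over data

In principle, both numpy and pandas are built around vectorized operations, but iteration is sometimes unavoidable.

The conventional approach: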

for i, n, s in zip(df.index, df["a"], df["b"]):
    print(i, n, s)

A -1.0 0
B 0.0 0
C 0.0 0
D 0.0 -1
E 0.0 -1
F 0.0 1

7.6.1 `df.iterrows()`

Returns an iterator that yields (index, Series) pairs, one for each row of the DataFrame.

for index, row in df.iterrows():
    print(index, row["a"])

A -1.0
B 0.0
C 0.0
D 0.0
E 0.0
F 0.0

7.6.2 `df.itertuples()`

A higher-level interface: each row is returned as a namedtuple whose fields are the column names:

for row in df.itertuples(index=False):
    print(row)

Pandas(a=-1.0, b=0, c=0, d=1, e=0)
Pandas(a=0.0, b=0, c=1, d=0, e=1)
Pandas(a=0.0, b=0, c=1, d=0, e=-1)
Pandas(a=0.0, b=-1, c=1, d=0, e=0)
Pandas(a=0.0, b=-1, c=-1, d=-1, e=0)
Pandas(a=0.0, b=1, c=0, d=-1, e=-1)
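The namedtuple fields can be read as attributes, and with the default index=True the row label is available as Index. The sketch below uses a made-up frame `demo` and compares the loop with the vectorized form of the same calculation.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [1, 2], "b": [10, 20]}, index=["x", "y"])

totals = {}
for row in demo.itertuples():          # index=True by default, exposed as row.Index
    totals[row.Index] = row.a + row.b  # columns are read as attributes

vectorized = (demo["a"] + demo["b"]).to_dict()  # same result without a Python loop

print(totals)      # {'x': 11, 'y': 22}
print(vectorized)  # {'x': 11, 'y': 22}
```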

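7.7 Applying functions

Functions exist to make code reusable and to keep the overall structure clear. The main methods for applying functions are:

apply(): applied along rows or columns; extra arguments can be passed. Recommended
map(): applied to each element of a column; the function takes a single argument. Recommended
applymap(): applied to each element of the whole DataFrame; the function takes a single argument. Recommended
pipe(): applied to the DataFrame as a whole. Not particularly recommended

Tip

Again, data processing in pandas is mostly column-oriented; row-wise processing is used less often.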

7.7.1 `apply()`

It can be applied to a single column:

df["a"].apply(lambda x: x * 2)

A   -2.0
B    0.0
C    0.0
D    0.0
E    0.0
F    0.0
Name: a, dtype: float64

Or to all columns:

df.apply(lambda x: x * 2)

	a	b	c	d	e
A	-2.0	0	0	2	0
B	0.0	0	2	0	2
C	0.0	0	2	0	-2
D	0.0	-2	2	0	0
E	0.0	-2	-2	-2	0
F	0.0	2	0	-2	-2

Custom function:

# Drop the highest and the lowest value, average the rest, then add a bias term
def my_means(s, bias):
    trimmed_sum = s.sum() - s.max() - s.min()  # sum without the maximum and the minimum
    res = trimmed_sum / (s.count() - 2)
    return res + bias


df.apply(my_means, args=[0.01], axis=0)

a    0.01
b   -0.24
c    0.51
d   -0.24
e   -0.24
dtype: float64

Select only the numeric columns before applying the function:

df.select_dtypes(include="number").apply(my_means, args=[0.00], axis=0)

a    0.00
b   -0.25
c    0.50
d   -0.25
e   -0.25
dtype: float64
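The axis argument decides whether the function receives each column or each row as a Series. The sketch below uses a made-up frame `demo` and a hypothetical helper `span` that takes one extra argument through args.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def span(s, scale):
    """Range (max minus min) of a Series, multiplied by scale."""
    return (s.max() - s.min()) * scale

per_column = demo.apply(span, args=(2,), axis=0)  # one result per column
per_row = demo.apply(span, args=(2,), axis=1)     # one result per row

print(per_column.tolist())  # [4, 40]
print(per_row.tolist())     # [18, 36, 54]
```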

7.7.2 `map()`

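Unlike apply(), map() works on a single column (a Series) and cannot be applied to all columns at once. Another limitation is that the function passed to map() receives only one argument: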

def func_map(x):
    return x * 2


df["a"].map(func_map)

A   -2.0
B    0.0
C    0.0
D    0.0
E    0.0
F    0.0
Name: a, dtype: float64

7.7.3 `applymap()`

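Unlike the previous methods, this one applies a function to every element of the DataFrame (from pandas 2.1 on, applymap() is deprecated in favor of DataFrame.map()):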

def func_applymap(x):
    return x * 2 + 1


df.applymap(func_applymap)

	a	b	c	d	e
A	-1.0	1	1	3	1
B	1.0	1	3	1	3
C	1.0	1	3	1	-1
D	1.0	-1	3	1	1
E	1.0	-1	-1	-1	1
F	1.0	3	1	-1	-1
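For code that has to run on both older and newer pandas versions, one option is to pick whichever element-wise method is available. This is a sketch under the assumption that DataFrame.map is present from pandas 2.1 and absent before; the frame `demo` and the helper `double_plus_one` are made up.

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
demo = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

def double_plus_one(x):
    return x * 2 + 1

# Prefer DataFrame.map when it exists, otherwise fall back to applymap
elementwise = demo.map if hasattr(demo, "map") else demo.applymap
result = elementwise(double_plus_one)

print(result.values.tolist())  # [[3, 7], [5, 9]]
```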
