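A DataFrame is a table. A Series is a one-dimensional array, while a DataFrame is a two-dimensional one, comparable to an Excel spreadsheet. A DataFrame can also be viewed as a collection of Series that share one index.
Creating a DataFrame
A DataFrame can be constructed from a dictionary.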
import numpy as np
import pandas as pd

data = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Hangzhou', 'Chongqing'],
'year': [2016,2017,2016,2017,2016,2016],
'population': [2100, 2300, 1000, 700, 500, 500]}
pd.DataFrame(data)
city population year
0 Beijing 2100 2016
1 Shanghai 2300 2017
2 Guangzhou 1000 2016
3 Shenzhen 700 2017
4 Hangzhou 500 2016
5 Chongqing 500 2016
The names and order of the columns can be specified.
pd.DataFrame(data, columns=["year", "city", "population"])
year city population
0 2016 Beijing 2100
1 2017 Shanghai 2300
2 2016 Guangzhou 1000
3 2017 Shenzhen 700
4 2016 Hangzhou 500
5 2016 Chongqing 500
frame = pd.DataFrame(data, columns=["year", "city", "population", "debt"],
index=["one", "two", "three", "four", "five", "six"])
print(frame)
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 NaN
six 2016 Chongqing 500 NaN
A DataFrame can also be built from several Series.
df = pd.DataFrame({"apts": apts, "cars": cars})
df
apts cars
Beijing 55000.0 300000.0
Chongqing NaN 150000.0
Guangzhou 30000.0 250000.0
Hangzhou 20000.0 NaN
Shanghai 60000.0 350000.0
Shenzhen 50000.0 300000.0
Suzhou 43000.0 NaN
Tianjian NaN 200000.0
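The definitions of apts and cars are not shown in this post. The following is a minimal sketch of what they might look like, with values and labels copied from the printed table; the exact original definitions are an assumption.

```python
import pandas as pd

# Hypothetical inputs: values and labels are taken from the table printed above.
apts = pd.Series(
    [55000, 60000, 50000, 20000, 30000, 43000],
    index=["Beijing", "Shanghai", "Shenzhen", "Hangzhou", "Guangzhou", "Suzhou"],
)
cars = pd.Series(
    [300000, 350000, 300000, 200000, 250000, 150000],
    index=["Beijing", "Shanghai", "Shenzhen", "Tianjian", "Guangzhou", "Chongqing"],
)

# The constructor aligns the two Series on the union of their labels;
# a label missing from one Series becomes NaN in that column.
df = pd.DataFrame({"apts": apts, "cars": cars})
print(df)
```

Because one column now contains NaN, pandas stores it as float, which is why the values print as 55000.0 rather than 55000.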
A DataFrame can also be built from a list of dicts.
data = [{"July": 999999, "Han": 50000, "Chu": 1000}, {"July": 90000, "Han": 8000, "Chu": 200}]
pd.DataFrame(data)
Chu Han July
0 1000 50000 999999
1 200 8000 90000
data = [{"July": 999999, "Han": 50000, "Chu": 1000}, {"July": 90000, "Han": 8000, "Chu": 200}]
pd.DataFrame(data, index=["salary", "bonus"])
Chu Han July
salary 1000 50000 999999
bonus 200 8000 90000
df["living_expense"] = df["apts"] * 100 + df["cars"]
df
apts cars living_expense
Beijing 55000.0 300000.0 5800000.0
Chongqing NaN 150000.0 NaN
Guangzhou 30000.0 250000.0 3250000.0
Hangzhou 20000.0 NaN NaN
Shanghai 60000.0 350000.0 6350000.0
Shenzhen 50000.0 300000.0 5300000.0
Suzhou 43000.0 NaN NaN
Tianjian NaN 200000.0 NaN
type(frame["city"])
pandas.core.series.Series
frame.year
one 2016
two 2017
three 2016
four 2017
five 2016
six 2016
Name: year, dtype: int64
The loc accessor selects rows by label.
type(frame.loc["three"])
pandas.core.series.Series
frame.loc["three", "city"]
'Guangzhou'
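Plain bracket indexing such as frame["city"], by contrast, selects columns rather than rows by default.
The iloc accessor selects rows and columns by integer position, so a pandas DataFrame can be handled like a NumPy ndarray.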
frame.iloc[0:3]
year city population debt
one 2016 Beijing 2100 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
frame.iloc[0:3, 1:3]
city population
one Beijing 2100
two Shanghai 2300
three Guangzhou 1000
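To put the three selection styles side by side, here is a small sketch on a reduced, hypothetical version of the frame (three rows, two columns). It is an illustration only, not the frame used in the rest of this post.

```python
import pandas as pd

# Reduced stand-in for the frame above (assumed data, for illustration only).
frame = pd.DataFrame(
    {"year": [2016, 2017, 2016], "city": ["Beijing", "Shanghai", "Guangzhou"]},
    index=["one", "two", "three"],
)

by_label = frame.loc["one":"two"]   # label slice: both endpoints are included
by_position = frame.iloc[0:2]       # positional slice: the end position is excluded
column = frame["city"]              # plain brackets with a name select a column

print(by_label.equals(by_position))
print(type(column).__name__)
```

The point to remember is that a label slice with loc includes its end label, while a positional slice with iloc follows the usual Python half-open convention.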
Assigning values to DataFrame elements
frame.loc["one", "population"] = 2200
frame
year city population debt
one 2016 Beijing 2200 NaN
two 2017 Shanghai 2300 NaN
three 2016 Guangzhou 1000 NaN
four 2017 Shenzhen 700 NaN
five 2016 Hangzhou 500 NaN
six 2016 Chongqing 500 NaN
An entire column can be assigned at once.
frame["debt"] = 10000000000
frame
year city population debt
one 2016 Beijing 2200 10000000000
two 2017 Shanghai 2300 10000000000
three 2016 Guangzhou 1000 10000000000
four 2017 Shenzhen 700 10000000000
five 2016 Hangzhou 500 10000000000
six 2016 Chongqing 500 10000000000
frame.loc["six"] = np.nan
frame
year city population debt
one 2016.0 Beijing 2200.0 1.000000e+10
two 2017.0 Shanghai 2300.0 1.000000e+10
three 2016.0 Guangzhou 1000.0 1.000000e+10
four 2017.0 Shenzhen 700.0 1.000000e+10
five 2016.0 Hangzhou 500.0 1.000000e+10
six NaN NaN NaN NaN
frame.columns
Index(['year', 'city', 'population', 'debt'], dtype='object')
frame.index
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')
for name in frame.columns:
print(name)
year
city
population
debt
np.arange(6)
array([0, 1, 2, 3, 4, 5])
frame.debt = np.arange(6) * 10000000
frame
year city population debt
one 2016.0 Beijing 2200.0 0
two 2017.0 Shanghai 2300.0 10000000
three 2016.0 Guangzhou 1000.0 20000000
four 2017.0 Shenzhen 700.0 30000000
five 2016.0 Hangzhou 500.0 40000000
six NaN NaN NaN 50000000
A Series can also be used to specify which index labels to modify and the corresponding values; labels that are not specified default to NaN.
val = pd.Series([100, 200, 300], index=['two', 'three', 'four'])
val * 10000
two 1000000
three 2000000
four 3000000
dtype: int64
frame["debt"] = val * 10000
frame
year city population debt
one 2016.0 Beijing 2200.0 NaN
two 2017.0 Shanghai 2300.0 1000000.0
three 2016.0 Guangzhou 1000.0 2000000.0
four 2017.0 Shenzhen 700.0 3000000.0
five 2016.0 Hangzhou 500.0 NaN
six NaN NaN NaN NaN
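The alignment rule behind this assignment can be seen in isolation with a small sketch. The frame and the labels below ("ten" in particular) are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"population": [2200, 2300, 1000]}, index=["one", "two", "three"])

# Only "two" matches a row label of the frame.
# "ten" does not exist in the frame, so its value is silently discarded.
val = pd.Series([100, 999], index=["two", "ten"])
frame["debt"] = val

print(frame)
```

Assignment from a Series matches on labels, not on position, so the length of the Series does not have to equal the number of rows.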
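To see which columns a DataFrame has, use its columns attribute, as in frame.columns above.
The row labels are held in the index attribute.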
Like a NumPy 2D array, a DataFrame can be transposed.
frame.T
one two three four five six
year 2016 2017 2016 2017 2016 NaN
city Beijing Shanghai Guangzhou Shenzhen Hangzhou NaN
population 2200 2300 1000 700 500 NaN
debt NaN 1e+06 2e+06 3e+06 NaN NaN
The order of the index can be specified when initializing from a nested dict.
pop = {'Beijing': {2016: 2100, 2017:2200},
'Shanghai': {2015:2400, 2016:2500, 2017:2600}}
pd.DataFrame(pop, index=[2016, 2015, 2017])
Beijing Shanghai
2016 2100.0 2500
2015 NaN 2400
2017 2200.0 2600
We can also give the index and the columns a name.
frame.index.name = "number"
frame.columns.name = "columns"
frame
columns year city population debt
number
one 2016.0 Beijing 2200.0 NaN
two 2017.0 Shanghai 2300.0 1000000.0
three 2016.0 Guangzhou 1000.0 2000000.0
four 2017.0 Shenzhen 700.0 3000000.0
five 2016.0 Hangzhou 500.0 NaN
six NaN NaN NaN NaN
type(df.values)
numpy.ndarray
df.to_numpy()  # as_matrix() was removed in pandas 1.0
array([[ 55000., 300000., 5800000.],
[ nan, 150000., nan],
[ 30000., 250000., 3250000.],
[ 20000., nan, nan],
[ 60000., 350000., 6350000.],
[ 50000., 300000., 5300000.],
[ 43000., nan, nan],
[ nan, 200000., nan]])
Index
index object
obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index
index
Index(['a', 'b', 'c'], dtype='object')
index[1:]
Index(['b', 'c'], dtype='object')
The values of an Index cannot be modified.
# index[1] = 'd'  # raises TypeError
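The following sketch shows the failure explicitly and one possible way to obtain a modified copy. Chaining delete and insert is just one option chosen for illustration.

```python
import pandas as pd

index = pd.Index(["a", "b", "c"])

try:
    index[1] = "d"      # item assignment is not supported on an Index
    mutated = True
except TypeError as exc:
    mutated = False
    print("TypeError:", exc)

# To change labels, build a new Index instead of modifying the old one.
new_index = index.delete(1).insert(1, "d")
print(list(new_index))
```

Immutability is what makes it safe for several Series and DataFrames to share the same Index object.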
index = pd.Index(np.arange(3))
index
Int64Index([0, 1, 2], dtype='int64')
obj2 = pd.Series([2,5,7], index=index)
obj2
0 2
1 5
2 7
dtype: int64
obj2.index is index
True
obj2.index is np.arange(3)
False
obj2.index == np.arange(3)
array([ True, True, True], dtype=bool)
pop
frame3 = pd.DataFrame(pop)
frame3
Beijing Shanghai
2015 NaN 2400
2016 2100.0 2500
2017 2200.0 2600
print("Shanghai" in frame3.columns)
True
2015 in frame3.index
True
Indexing and slicing by index
obj["b"]
1
Integer positions can still be used; iloc makes the positional intent explicit.
obj.iloc[[1, 2]]
b 1
c 2
dtype: int64
obj[obj<1]
a 0
dtype: int64
Next, slicing a Series.
obj["b":"c"]
b 1
c 2
dtype: int64
obj["b":]
b 1
c 2
dtype: int64
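Label slices behave differently from ordinary Python slices in that the end label is included. A short sketch comparing the two, using the same obj as above:

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])

label_slice = obj["a":"b"]       # label-based: "b" is included
position_slice = obj.iloc[0:1]   # position-based: position 1 is excluded

print(list(label_slice.index))
print(list(position_slice.index))
```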
Indexing a DataFrame works much like indexing a Series.
df
apts cars living_expense
Beijing 55000.0 300000.0 5800000.0
Chongqing NaN 150000.0 NaN
Guangzhou 30000.0 250000.0 3250000.0
Hangzhou 20000.0 NaN NaN
Shanghai 60000.0 350000.0 6350000.0
Shenzhen 50000.0 300000.0 5300000.0
Suzhou 43000.0 NaN NaN
Tianjian NaN 200000.0 NaN
df[["cars", "apts"]]
cars apts
Beijing 300000.0 55000.0
Chongqing 150000.0 NaN
Guangzhou 250000.0 30000.0
Hangzhou NaN 20000.0
Shanghai 350000.0 60000.0
Shenzhen 300000.0 50000.0
Suzhou NaN 43000.0
Tianjian 200000.0 NaN
df[:2]
apts cars living_expense
Beijing 55000.0 300000.0 5800000.0
Chongqing NaN 150000.0 NaN
df
apts cars living_expense
Beijing 55000.0 300000.0 5800000.0
Chongqing NaN 150000.0 NaN
Guangzhou 30000.0 250000.0 3250000.0
Hangzhou 20000.0 NaN NaN
Shanghai 60000.0 350000.0 6350000.0
Shenzhen 50000.0 300000.0 5300000.0
Suzhou 43000.0 NaN NaN
Tianjian NaN 200000.0 NaN
df.loc["Chongqing":"Hangzhou", ["apts", "living_expense"]]
apts living_expense
Chongqing NaN NaN
Guangzhou 30000.0 3250000.0
Hangzhou 20000.0 NaN
df.iloc[1:3, 2:3]
living_expense
Chongqing NaN
Guangzhou 3250000.0
A DataFrame also supports selection by condition.
df.apts > 50000
Beijing True
Chongqing False
Guangzhou False
Hangzhou False
Shanghai True
Shenzhen False
Suzhou False
Tianjian False
Name: apts, dtype: bool
df[df.apts > 50000]
apts cars living_expense
Beijing 55000.0 300000.0 5800000.0
Shanghai 60000.0 350000.0 6350000.0
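Several conditions can be combined into one mask. The sketch below uses a reduced, hypothetical subset of the table above; the second threshold on cars is made up for illustration.

```python
import numpy as np
import pandas as pd

# Reduced stand-in for df (assumed data, four cities only).
df = pd.DataFrame(
    {"apts": [55000, np.nan, 30000, 60000], "cars": [300000, 150000, 250000, 350000]},
    index=["Beijing", "Chongqing", "Guangzhou", "Shanghai"],
)

# Combine masks with & (and) or | (or); each comparison needs its own parentheses.
mask = (df["apts"] > 50000) & (df["cars"] < 350000)

# A comparison against NaN evaluates to False, so Chongqing is not selected.
selected = df[mask]
print(selected)
```

Python's `and` and `or` keywords do not work element-wise here, which is why the bitwise operators are used.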
reindex
reindex rearranges a Series or DataFrame according to a new index order.
obj = pd.Series([4.5, 2.6, -1.8, 9.4], index=["d", "b", "a", "c"])
obj
d 4.5
b 2.6
a -1.8
c 9.4
dtype: float64
obj.reindex(["a", "b", "c", "d", "e"])
a -1.8
b 2.6
c 9.4
d 4.5
e NaN
dtype: float64
If the new index is longer than the original one, we can specify how the resulting NaN values are filled.
obj.reindex(["a", "b", "c", "d", "e"], fill_value=obj.mean())
a -1.800
b 2.600
c 9.400
d 4.500
e 3.675
dtype: float64
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0,2,4])
obj3
0 blue
2 purple
4 yellow
dtype: object
obj3.reindex(range(6), fill_value="red")
0 blue
1 red
2 purple
3 red
4 yellow
5 red
dtype: object
obj3.reindex(range(6), method="ffill") # forward fill
0 blue
1 blue
2 purple
3 purple
4 yellow
5 yellow
dtype: object
obj3.reindex(range(6), method="bfill") # backward fill
0 blue
1 purple
2 purple
3 yellow
4 yellow
5 NaN
dtype: object
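The three fill strategies can be compared directly on a numeric Series. This is a sketch with made-up values; note that method="ffill" and method="bfill" require the original index to be sorted (monotonic).

```python
import pandas as pd

# Hypothetical numeric data with gaps at positions 1, 3 and 5.
obj3 = pd.Series([1.0, 2.0, 3.0], index=[0, 2, 4])

filled_const = obj3.reindex(range(6), fill_value=0.0)   # constant for new labels
filled_fwd = obj3.reindex(range(6), method="ffill")     # copy the previous value
filled_bwd = obj3.reindex(range(6), method="bfill")     # copy the next value

print(filled_const.tolist())
print(filled_fwd.tolist())
print(filled_bwd.tolist())
```

With bfill the last label has no later value to copy from, so it stays NaN, matching the output shown above.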
Since a Series can be reindexed, a DataFrame can be reindexed in the same way.
frame2 = frame.reindex(["one", "three", "four", "eight"])
frame2
columns year city population debt
number
one 2016.0 Beijing 2200.0 NaN
three 2016.0 Guangzhou 1000.0 2000000.0
four 2017.0 Shenzhen 700.0 3000000.0
eight NaN NaN NaN NaN
frame.reindex(columns=["city", "year", "population"])
columns city year population
number
one Beijing 2016.0 2200.0
two Shanghai 2017.0 2300.0
three Guangzhou 2016.0 1000.0
four Shenzhen 2017.0 700.0
five Hangzhou 2016.0 500.0
six NaN NaN NaN
reindex also accepts a columns argument, so the columns can be reordered or selected at the same time, as in the call above.
Next, drop removes index entries from a Series or DataFrame. Note that drop does not work in place: it returns a new object, and the original object is left unchanged.
obj4 = obj3.drop(4)
obj4
0 blue
2 purple
dtype: object
obj3.drop([2,4])
0 blue
dtype: object
frame.drop(["two", "four"])
columns year city population debt
number
one 2016.0 Beijing 2200.0 NaN
three 2016.0 Guangzhou 1000.0 2000000.0
five 2016.0 Hangzhou 500.0 NaN
six NaN NaN NaN NaN
frame.drop(["debt", "year"], axis=1)
columns city population
number
one Beijing 2200.0
two Shanghai 2300.0
three Guangzhou 1000.0
four Shenzhen 700.0
five Hangzhou 500.0
six NaN NaN
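A quick way to confirm that drop leaves the original untouched is to compare shapes before and after. The two-row frame below is hypothetical.

```python
import pandas as pd

frame = pd.DataFrame(
    {"city": ["Beijing", "Shanghai"], "debt": [0, 1]}, index=["one", "two"]
)

without_row = frame.drop("two")              # removes a row label
without_col = frame.drop(columns=["debt"])   # same effect as drop(["debt"], axis=1)

# The original frame still has both rows and both columns.
print(frame.shape, without_row.shape, without_col.shape)
```

To keep the result, assign it back to a name, as done with obj4 above.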
Hierarchical index
Hierarchical indexing on a Series
data = pd.Series(np.random.randn(10), index=
[['a','a','a','b','b','c','c','c','d','d'],
[1,2,3,1,2,1,2,3,1,2]])
print(data)
data.index
MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
labels=[[0, 0, 0, 1, 1, 2, 2, 2, 3, 3], [0, 1, 2, 0, 1, 0, 1, 2, 0, 1]])
data.b
1 -0.708589
2 -0.501196
dtype: float64
data["b":"c"]
b 1 -0.708589
2 -0.501196
c 1 0.875227
2 -0.807143
3 1.672848
dtype: float64
data[2:5]
a 3 2.362866
b 1 -0.708589
2 -0.501196
dtype: float64
unstack and stack let us switch between a hierarchically indexed Series and a DataFrame.
type(data.unstack())
pandas.core.frame.DataFrame
data.unstack().stack()
a 1 -0.346074
2 -0.602760
3 2.362866
b 1 -0.708589
2 -0.501196
c 1 0.875227
2 -0.807143
3 1.672848
d 1 0.113669
2 0.400427
dtype: float64
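Since the random values above are not reproducible, here is a deterministic sketch of the same round trip. It assumes a small Series in which every outer label has every inner label, so no NaN values appear in the wide form.

```python
import pandas as pd

# Hypothetical two-level Series with a complete set of (outer, inner) pairs.
s = pd.Series(
    [10, 20, 30, 40],
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
)

wide = s.unstack()         # the inner level (1, 2) becomes the columns
roundtrip = wide.stack()   # the columns go back to being the inner index level

print(wide)
print(roundtrip)
```

In the example above, unstack introduces NaN for combinations such as ("b", 3) that do not exist in the Series, since every row of the resulting DataFrame needs a value in every column.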
Hierarchical indexing on a DataFrame
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
index = [['a','a','b','b'], [1,2,1,2]],
columns = [['Beijing', 'Beijing', 'Shanghai'], ['apts', 'cars', 'apts']])
print(frame)
Beijing Shanghai
apts cars apts
a 1 0 1 2
2 3 4 5
b 1 6 7 8
2 9 10 11
frame.loc["a", 1]["Beijing"]["apts"]
0
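The chained lookup above can also be written as a single loc call with one tuple per axis, which avoids creating intermediate objects. This sketch rebuilds the same frame; treat the tuple form as one possible style rather than the only correct one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Beijing", "Beijing", "Shanghai"], ["apts", "cars", "apts"]],
)

# One tuple for the row labels, one tuple for the column labels.
value = frame.loc[("a", 1), ("Beijing", "apts")]

# Selecting only the outer column level returns the sub-frame under it.
beijing = frame["Beijing"]

print(value)
print(beijing)
```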