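pandas is a Python library built specifically for data analysis.
## Introduction to pandas
A Python library for data analysis
Built on NumPy (operations on ndarray)
Feels like doing Excel, SQL, or R work in Python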
## Contents
Series
DataFrame
Index
File I/O
## Data structure: Series
### Constructing and initializing a Series
import pandas as pd
import numpy as np
A Series is a one-dimensional data structure. Below are a few ways to initialize one.
s = pd.Series([7, "Beijing", 2.17, 3.1415926, "Happy Birthday"])
s
0 7
1 Beijing
2 2.17
3 3.14159
4 Happy Birthday
dtype: object
l = [7, "Beijing", 2.17, 3.1415926, "Happy Birthday"]
l
[7, 'Beijing', 2.17, 3.1415926, 'Happy Birthday']
s[1:3]
1 Beijing
2 2.17
dtype: object
By default pandas uses 0 to n-1 as the index of a Series, but we can also specify the index ourselves. The index can be thought of as the keys of a dict.
s = pd.Series([7, "Beijing", 2.17, 3.1415926, "Happy Birthday"],
index=["A", "B", "C", "D", "E"])
s
A 7
B Beijing
C 2.17
D 3.14159
E Happy Birthday
dtype: object
s["D"]
3.1415926
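To make the dict analogy concrete, here is a minimal sketch (the variable names are illustrative and not part of the original walkthrough). The labels live in `.index`, the data in `.values`, and `to_dict()` converts the Series into a plain dict.

```python
import pandas as pd

# Illustrative data; the names below are assumptions for this sketch only.
s_demo = pd.Series([7, "Beijing", 2.17], index=["A", "B", "C"])

labels = list(s_demo.index)    # the "keys"
values = list(s_demo.values)   # the "values"
as_dict = s_demo.to_dict()     # convert to a plain dict

print(labels)
print(as_dict["B"] == s_demo["B"])
```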
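A Series can also be constructed from a dictionary, since a Series is essentially a set of key-value pairs. The output below lists the keys alphabetically, which is what older pandas versions did; pandas 0.23 and later on Python 3.6+ keep the dict's insertion order.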
cities = {"Beijing": 55000, "Shanghai": 60000, "Shenzhen": 50000, "Hangzhou": 20000, "Guangzhou": 30000, "Suzhou": None}
apts = pd.Series(cities, name="price")
apts
Beijing 55000.0
Guangzhou 30000.0
Hangzhou 20000.0
Shanghai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: price, dtype: float64
type(apts)
pandas.core.series.Series
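Two details of the dict constructor are easy to miss, so here is a small sketch with made-up data (`prices`, `from_dict`, and `picked` are names introduced only for this example). A `None` value is stored as NaN, which is why the integer prices become `float64`, and passing `index=` selects and orders the keys.

```python
import pandas as pd

prices = {"Beijing": 55000, "Shanghai": 60000, "Suzhou": None}

# None is stored as NaN, which forces the integer prices to float64.
from_dict = pd.Series(prices, name="price")

# index= selects and orders keys; a key missing from the dict becomes NaN.
picked = pd.Series(prices, index=["Shanghai", "Chongqing"])

print(from_dict.dtype)
print(picked)
```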
A Series can also be built from a NumPy ndarray:
list("abcde")
['a', 'b', 'c', 'd', 'e']
s = pd.Series(np.random.randn(5), index=list("abcde"))
s
a 1.434526
b -1.841912
c -1.343372
d -1.065630
e -1.978959
dtype: float64
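The numbers above change on every run because they are random. A reproducible variant, assuming NumPy's `default_rng` generator is available (NumPy 1.17+), might look like this. It also shows that the Series keeps the array's dtype and can return the data as an ndarray.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)   # seeded so reruns give the same numbers
arr = rng.standard_normal(5)

s_arr = pd.Series(arr, index=list("abcde"))

# The Series keeps the ndarray's dtype and length.
print(s_arr.dtype, len(s_arr))
# to_numpy() returns the data as an ndarray again.
print(np.array_equal(s_arr.to_numpy(), arr))
```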
### Selecting data
A Series can be treated much like a list: pick elements by position with .iloc, or slice it.
apts.iloc[[4, 3, 2]]
Shenzhen 50000.0
Shanghai 60000.0
Hangzhou 20000.0
Name: price, dtype: float64
apts[3:]
Shanghai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: price, dtype: float64
apts[:-2]
Beijing 55000.0
Guangzhou 30000.0
Hangzhou 20000.0
Shanghai 60000.0
Name: price, dtype: float64
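Newer pandas versions discourage passing integer positions to plain `[]` on a label-indexed Series, so the explicit positional accessor `.iloc` is the safer way to write list-style selection. A minimal sketch with illustrative data:

```python
import pandas as pd

prices = pd.Series(
    [55000.0, 30000.0, 20000.0, 60000.0],
    index=["Beijing", "Guangzhou", "Hangzhou", "Shanghai"],
)

by_position = prices.iloc[[3, 0]]   # rows at positions 3 and 0, in that order
tail = prices.iloc[2:]              # from position 2 to the end
head = prices.iloc[:-1]             # everything except the last row

print(by_position)
```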
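Why does the sum below contain two NaN values? Adding two lists with + simply concatenates them, but adding two Series aligns them by index label. "Beijing" appears only in the first slice and "Suzhou" only in the second (where it is already NaN), so both come out as NaN.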
a = [1,2,3,4,5]
b = [2,3,4,5,6]
a+b
[1, 2, 3, 4, 5, 2, 3, 4, 5, 6]
apts[:-1]
Beijing 55000.0
Guangzhou 30000.0
Hangzhou 20000.0
Shanghai 60000.0
Shenzhen 50000.0
Name: price, dtype: float64
apts[1:]
Guangzhou 30000.0
Hangzhou 20000.0
Shanghai 60000.0
Shenzhen 50000.0
Suzhou NaN
Name: price, dtype: float64
apts[:-1] + apts[1:]
Beijing NaN
Guangzhou 60000.0
Hangzhou 40000.0
Shanghai 120000.0
Shenzhen 100000.0
Suzhou NaN
Name: price, dtype: float64
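The same alignment behavior is easier to see on a smaller example. This is only a sketch with made-up labels: the result's index is the union of both indexes, and any label present on one side only yields NaN.

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
right = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

total = left + right   # aligned on labels, not on positions
print(total)
# "a" and "d" each appear on one side only, so they come out as NaN.
```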
A Series also behaves like a dict: the index defined earlier is used to select data.
apts["Guangzhou"]
30000.0
apts[["Hangzhou", "Beijing", "Shenzhen"]]
Hangzhou 20000.0
Beijing 55000.0
Shenzhen 50000.0
Name: price, dtype: float64
"Shanghai" in apts
True
"Chongqing" in apts
False
A safer way to read a value by key is get, which returns a default when the key is missing:
print(apts.get("Chongqing", 0))
0
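The difference between `get` and bracket access can be sketched as follows (illustrative data; `fallback` and `raised` are names used only here):

```python
import pandas as pd

prices = pd.Series({"Beijing": 55000, "Shanghai": 60000})

# .get returns the default instead of raising.
fallback = prices.get("Chongqing", 0)

# Bracket access raises KeyError for a missing label.
try:
    prices["Chongqing"]
    raised = False
except KeyError:
    raised = True

print(fallback, raised)
```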
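Plain bracket access such as apts["Chongqing"], by contrast, raises a KeyError when the key does not exist. Brackets also accept a boolean condition, which selects the matching elements (boolean indexing):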
apts[apts < 50000]
Guangzhou 30000.0
Hangzhou 20000.0
Name: price, dtype: float64
apts[apts > apts.median()]
Beijing 55000.0
Shanghai 60000.0
Name: price, dtype: float64
Here is a closer look at how this boolean indexing works: the comparison produces a boolean Series, which is then used as a mask.
less_than_50000 = apts < 50000
less_than_50000
Beijing False
Guangzhou True
Hangzhou True
Shanghai False
Shenzhen False
Suzhou False
Name: price, dtype: bool
apts[less_than_50000]
Guangzhou 30000.0
Hangzhou 20000.0
Name: price, dtype: float64
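Masks can also be combined. The sketch below uses illustrative data. Note that pandas uses `&`, `|`, and `~` instead of `and`, `or`, and `not`, and each condition needs its own parentheses because of operator precedence.

```python
import pandas as pd

prices = pd.Series(
    {"Beijing": 55000, "Guangzhou": 30000, "Hangzhou": 20000, "Shanghai": 60000}
)

# & (and), | (or), ~ (not); parenthesize every condition.
mid_range = prices[(prices >= 30000) & (prices < 60000)]
extremes = prices[(prices < 30000) | (prices >= 60000)]
not_cheap = prices[~(prices < 30000)]

print(mid_range)
```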
### Assigning Series elements
Elements of a Series can be assigned to.
apts["Shenzhen"] = 80000
apts
Beijing 55000.0
Guangzhou 30000.0
Hangzhou 20000.0
Shanghai 60000.0
Shenzhen 80000.0
Suzhou NaN
Name: price, dtype: float64
apts[apts < 50000] = 40000
apts
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 40000.0
Shanghai 60000.0
Shenzhen 80000.0
Suzhou NaN
Name: price, dtype: float64
As the last example shows, the boolean indexing covered earlier also works in assignments.
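Masked assignment modifies the Series in place. If the original values should be kept, `Series.mask` returns a modified copy instead. A minimal sketch with illustrative data:

```python
import pandas as pd

prices = pd.Series({"Beijing": 55000, "Guangzhou": 30000, "Hangzhou": 20000})

# In place: overwrite the rows selected by the boolean mask.
floored = prices.copy()
floored.loc[floored < 50000] = 40000

# Non-mutating: mask() returns a new Series and leaves `prices` untouched.
floored_copy = prices.mask(prices < 50000, 40000)

print(floored_copy)
```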
### Arithmetic
Next, some basic arithmetic operations.
apts / 2
Beijing 27500.0
Guangzhou 20000.0
Hangzhou 20000.0
Shanghai 30000.0
Shenzhen 40000.0
Suzhou NaN
Name: price, dtype: float64
apts * 2
Beijing 110000.0
Guangzhou 80000.0
Hangzhou 80000.0
Shanghai 120000.0
Shenzhen 160000.0
Suzhou NaN
Name: price, dtype: float64
apts + 10000
Beijing 65000.0
Guangzhou 50000.0
Hangzhou 50000.0
Shanghai 70000.0
Shenzhen 90000.0
Suzhou NaN
Name: price, dtype: float64
apts ** 2
Beijing 3.025000e+09
Guangzhou 1.600000e+09
Hangzhou 1.600000e+09
Shanghai 3.600000e+09
Shenzhen 6.400000e+09
Suzhou NaN
Name: price, dtype: float64
np.square(apts)
Beijing 3.025000e+09
Guangzhou 1.600000e+09
Hangzhou 1.600000e+09
Shanghai 3.600000e+09
Shenzhen 6.400000e+09
Suzhou NaN
Name: price, dtype: float64
NumPy operations can be applied to pandas objects, as np.square shows above.
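Other NumPy universal functions behave the same way. In this sketch (illustrative data), the result is still a Series with the original index, and NaN stays NaN.

```python
import numpy as np
import pandas as pd

prices = pd.Series({"Beijing": 10000.0, "Shanghai": 40000.0, "Suzhou": np.nan})

roots = np.sqrt(prices)     # element-wise, index preserved
logs = np.log10(prices)     # NaN stays NaN

print(type(roots).__name__)
print(prices.mean())        # pandas reductions skip NaN by default
```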
Let's define another Series and add the two together.
cars = pd.Series({"Beijing": 300000, "Shanghai": 350000, "Shenzhen": 300000,
"Tianjian": 200000, "Guangzhou": 250000, "Chongqing": 150000})
cars.astype(str)
Beijing 300000
Chongqing 150000
Guangzhou 250000
Shanghai 350000
Shenzhen 300000
Tianjian 200000
dtype: object
apts
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 40000.0
Shanghai 60000.0
Shenzhen 80000.0
Suzhou NaN
Name: price, dtype: float64
cars + apts*100
Beijing 5800000.0
Chongqing NaN
Guangzhou 3250000.0
Hangzhou NaN
Shanghai 6350000.0
Shenzhen 5300000.0
Suzhou NaN
Tianjian NaN
dtype: float64
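The NaN rows appear because those cities exist in only one of the two Series. If a one-sided label should count as zero rather than produce NaN, `Series.add` accepts a `fill_value`. Whether zero is a sensible stand-in depends on the data, so treat this as a sketch with made-up values:

```python
import pandas as pd

cars_demo = pd.Series({"Beijing": 300000, "Chongqing": 150000})
apts_demo = pd.Series({"Beijing": 55000.0, "Hangzhou": 40000.0})

# Plain +: NaN wherever a label exists on one side only.
plain = cars_demo + apts_demo * 100

# add(..., fill_value=0): the missing side is treated as 0.
filled = cars_demo.add(apts_demo * 100, fill_value=0)

print(filled)
```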
### Missing data
"Hangzhou" in cars
False
apts.notnull()
Beijing True
Guangzhou True
Hangzhou True
Shanghai True
Shenzhen True
Suzhou False
Name: price, dtype: bool
apts.isnull()
Beijing False
Guangzhou False
Hangzhou False
Shanghai False
Shenzhen False
Suzhou True
Name: price, dtype: bool
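Missing values can be filled in, for example with the mean of the remaining values (mean skips NaN):
apts[apts.isnull()] = apts.mean()
apts
Beijing 55000.0
Guangzhou 40000.0
Hangzhou 40000.0
Shanghai 60000.0
Shenzhen 80000.0
Suzhou 55000.0
Name: price, dtype: float64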
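pandas also has dedicated methods for missing values. A minimal sketch with illustrative data: `fillna` returns a filled copy and `dropna` discards the missing rows, and neither modifies the original Series.

```python
import numpy as np
import pandas as pd

prices = pd.Series({"Beijing": 55000.0, "Shanghai": 60000.0, "Suzhou": np.nan})

# mean() skips NaN, so the fill value is (55000 + 60000) / 2.
filled = prices.fillna(prices.mean())

# Alternatively, drop the rows with missing values.
dropped = prices.dropna()

print(filled)
print(dropped)
```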