Pandas is designed for working with tabular or heterogeneous data. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python.
Often it will be desirable to create a Series with an index identifying each data point with a label.
# When creating from a list, the index length must match the data length
>>> obj2 = pd.Series([4, 7, -5, 3],
...                  index=['d', 'b', 'a', 'c'])
>>> obj2
d 4
b 7
a -5
c 3
dtype: int64
>>> obj2.index
Index(['d', 'b', 'a', 'c'], dtype='object')
# Select a single value
>>> obj2['a']
-5
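# Select a contiguous range by label (both endpoints are included)
>>> obj2['b':'c']
# Select several labels
>>> obj2[['c', 'a', 'd']]
# Modify a single value
>>> obj2['a'] = 6
# Modify a contiguous range
>>> obj2['b':'c'] = -5
# Modify several labels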
>>> obj2[['c', 'a', 'd']] = 2
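Label slices behave differently from ordinary Python slices, which is easy to miss in the calls above. The sketch below uses a fresh Series (the name obj is only illustrative) to contrast a label slice with a positional slice through iloc.

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Label slice: both endpoints ('b' and 'c') are included.
by_label = obj['b':'c']

# Positional slice through iloc: the stop position is excluded.
by_position = obj.iloc[1:3]

print(list(by_label.index))     # ['b', 'a', 'c']
print(list(by_position.index))  # ['b', 'a']
```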
Filtering with a boolean array:
# Select values that satisfy a condition
>>> obj2[obj2 > 0]
d 2
a 2
c 2
dtype: int64
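It may help to see the boolean array on its own. In this minimal sketch (obj and mask are illustrative names, with values matching the state of obj2 at this point), the comparison produces a boolean Series that is then used as the selector.

```python
import pandas as pd

obj = pd.Series([2, -5, 2, 2], index=['d', 'b', 'a', 'c'])

mask = obj > 0        # boolean Series sharing obj's index
positive = obj[mask]  # keeps only the entries where mask is True

print(mask.tolist())         # [True, False, True, True]
print(list(positive.index))  # ['d', 'a', 'c']
```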
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.
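A short sketch of the dict analogy, using an illustrative Series named obj: membership tests look at the index labels, and to_dict converts the mapping into a plain Python dict.

```python
import pandas as pd

obj = pd.Series([2, -5, 2, 2], index=['d', 'b', 'a', 'c'])

has_b = 'b' in obj      # membership checks the index labels, like dict keys
has_z = 'z' in obj
as_dict = obj.to_dict() # label -> value mapping as a plain dict

print(has_b, has_z)     # True False
print(list(as_dict))    # ['d', 'b', 'a', 'c']
```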
# Overwrite the index in place; the new index must have the same length
>>> obj2.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
>>> obj2
Bob 2
Steve -5
Jeff 2
Ryan 2
dtype: int64
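The length requirement can be checked directly. The sketch below (obj and rejected are illustrative names) assigns an index that is one label short and catches the ValueError pandas raises in that case.

```python
import pandas as pd

obj = pd.Series([2, -5, 2, 2], index=['d', 'b', 'a', 'c'])

rejected = False
try:
    # Three labels for four values: the lengths do not match.
    obj.index = ['Bob', 'Steve', 'Jeff']
except ValueError as exc:
    rejected = True
    print('rejected:', exc)

print(list(obj.index))  # the original labels are still in place
```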
Should you have data contained in a Python dict, you can create a Series from it by passing the dict:
>>> sdata = {'Ohio': 35000, 'Texas': 71000,
...          'Oregon': 16000, 'Utah': 5000}
>>> obj3 = pd.Series(sdata)
>>> obj3
Ohio 35000
Texas 71000
Oregon 16000
Utah 5000
dtype: int64
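When you pass only a dict, the index of the resulting Series follows the dict's key insertion order in current pandas versions, as the output above shows (older versions sorted the keys). You can override this by passing the dict keys in the order you want them to appear in the resulting Series:
# When creating from a dict you can specify the index; labels with no entry in the dict are filled with NaN (conform), and dict keys not listed in the index are left out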
>>> states = ['California', 'Ohio', 'Oregon', 'Texas']
>>> obj4 = pd.Series(sdata, index=states)
>>> obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
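The effect of conforming to an explicit index can be inspected with isna. This sketch rebuilds obj4 from the same data; the variable missing is an illustrative name.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)

missing = obj4.isna()    # True where the label had no entry in sdata

print(missing.tolist())  # [True, False, False, False]
print('Utah' in obj4)    # False: 'Utah' is not in states, so it is left out
print(obj4.dtype)        # float64, because NaN forces a float dtype
```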
# Add an entry; this modifies the Series itself
>>> obj4['NY'] = 4000
>>> obj4
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
NY 4000.0
dtype: float64
# drop returns a new Series; obj4 itself is not modified
>>> obj4.drop('NY')
California NaN
Ohio 35000.0
Oregon 16000.0
Texas 71000.0
dtype: float64
# del obj4['NY'] removes the entry from obj4 itself
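The difference between the two removal styles can be seen side by side. A minimal sketch with an illustrative two-entry Series:

```python
import pandas as pd

obj = pd.Series([35000.0, 4000.0], index=['Ohio', 'NY'])

dropped = obj.drop('NY')        # returns a new Series
in_dropped = 'NY' in dropped
in_obj_after_drop = 'NY' in obj # obj still holds the entry

del obj['NY']                   # removes the entry from obj itself
in_obj_after_del = 'NY' in obj

print(in_dropped, in_obj_after_drop, in_obj_after_del)  # False True False
```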
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).
>>> data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
...         'year': [2000, 2001, 2002, 2001, 2002, 2003],
...         'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
>>> frame = pd.DataFrame(data)
>>> frame
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
5 Nevada 2003 3.2
>>> frame.values
array([['Ohio', 2000, 1.5],
['Ohio', 2001, 1.7],
['Ohio', 2002, 3.6],
['Nevada', 2001, 2.4],
['Nevada', 2002, 2.9],
['Nevada', 2003, 3.2]],
dtype=object)
>>> frame.columns
Index(['state', 'year', 'pop'], dtype='object')
>>> frame.index
RangeIndex(start=0, stop=6, step=1)
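The examples that follow use a second frame, frame2, whose construction does not appear in these notes. A construction consistent with the outputs shown later (columns year, state, pop, debt; index labels one through six; debt entirely missing) might look like the sketch below. Treat it as a reconstruction under those assumptions, not necessarily the original code.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}

# columns fixes the column order; 'debt' is not a key of data,
# so that column is created and filled with missing values.
frame2 = pd.DataFrame(
    data,
    columns=['year', 'state', 'pop', 'debt'],
    index=['one', 'two', 'three', 'four', 'five', 'six'],
)

print(frame2.columns.tolist())      # ['year', 'state', 'pop', 'debt']
print(frame2['debt'].isna().all())  # True
```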
For large DataFrames, the head method selects only the first five rows:
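No example of head survives in these notes, so here is a minimal sketch using the same data as frame. head() returns the first five rows by default and accepts an optional row count.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

first_five = frame.head()   # default: first five rows
first_two = frame.head(2)   # an explicit count is also accepted

print(len(first_five))            # 5
print(first_two.index.tolist())   # [0, 1]
```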
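A column in a DataFrame can be retrieved as a Series by dict-like notation (attribute access such as frame2.state also works when the column name is a valid Python identifier):
# Select a single column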
>>> frame2['state']
one Ohio
two Ohio
three Ohio
four Nevada
five Nevada
six Nevada
Name: state, dtype: object
Rows can also be retrieved by label with the special loc attribute, or by integer position with iloc:
# Select a single row by label
>>> frame2.loc['three']
year 2002
state Ohio
pop 3.6
debt NaN
Name: three, dtype: object
# Select a contiguous range of columns by label
>>> frame2.loc[:,'state':'debt']
state pop debt
one Ohio 1.5 NaN
two Ohio 1.7 NaN
three Ohio 3.6 NaN
four Nevada 2.4 NaN
five Nevada 2.9 NaN
six Nevada 3.2 NaN
# Select specific columns by label
>>> frame2.loc[:,['state','pop']]
state pop
one Ohio 1.5
two Ohio 1.7
three Ohio 3.6
four Nevada 2.4
five Nevada 2.9
six Nevada 3.2
# Select a contiguous range of rows by label
>>> frame2.loc['two':'four',:]
year state pop debt
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
# Select specific rows by label
>>> frame2.loc[['two','four'],:]
year state pop debt
two 2001 Ohio 1.7 NaN
four 2001 Nevada 2.4 NaN
# Select rows and columns together by label
>>> frame2.loc[['two','four'],'state']
two Ohio
four Nevada
Name: state, dtype: object
# Select a contiguous range of columns by position
>>> frame2.iloc[:,1:3]
state pop
one Ohio 1.5
two Ohio 1.7
three Ohio 3.6
four Nevada 2.4
five Nevada 2.9
six Nevada 3.2
# Select specific columns by position
>>> frame2.iloc[:,[1,2]]
state pop
one Ohio 1.5
two Ohio 1.7
three Ohio 3.6
four Nevada 2.4
five Nevada 2.9
six Nevada 3.2
# Select a contiguous range of rows by position
>>> frame2.iloc[1:4,:]
year state pop debt
two 2001 Ohio 1.7 NaN
three 2002 Ohio 3.6 NaN
four 2001 Nevada 2.4 NaN
# Select specific rows by position
>>> frame2.iloc[[1,3],:]
year state pop debt
two 2001 Ohio 1.7 NaN
four 2001 Nevada 2.4 NaN
# Select rows and columns together by position
>>> frame2.iloc[[1,3],1:3]
state pop
two Ohio 1.7
four Nevada 2.4
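The loc and iloc calls above come in pairs that select the same cells. The sketch below checks that correspondence on a frame built like frame2 (the construction is an assumption based on the outputs shown); note that the label slice 'two':'four' includes its endpoint, while the positional slice 1:4 stops before position 4.

```python
import pandas as pd

frame2 = pd.DataFrame(
    {'year': [2000, 2001, 2002, 2001, 2002, 2003],
     'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]},
    index=['one', 'two', 'three', 'four', 'five', 'six'],
)

# Explicit lists: labels on the left, the matching positions on the right.
same_cells = frame2.loc[['two', 'four'], ['state', 'pop']].equals(
    frame2.iloc[[1, 3], [1, 2]])

# Slices: the label slice includes 'four'; the positional slice excludes 4.
same_rows = frame2.loc['two':'four'].equals(frame2.iloc[1:4])

print(same_cells, same_rows)  # True True
```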
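Columns and rows can be modified by assignment. A scalar is broadcast to every cell; a sequence must match the length of the column or row: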
>>> frame2['debt'] = 16.5
>>> frame2
year state pop debt
one 2000 Ohio 1.5 16.5
two 2001 Ohio 1.7 16.5
three 2002 Ohio 3.6 16.5
four 2001 Nevada 2.4 16.5
five 2002 Nevada 2.9 16.5
six 2003 Nevada 3.2 16.5
>>> frame2['debt'] = range(6)
>>> frame2
year state pop debt
one 2000 Ohio 1.5 0
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
six 2003 Nevada 3.2 5
>>> f = frame2.copy()
>>> f.loc['one']=6
>>> f
year state pop debt
one 6 6 6.0 6
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
six 2003 Nevada 3.2 5
>>> f.loc['one']=range(4)
>>> f
year state pop debt
one 0 1 2.0 3
two 2001 Ohio 1.7 1
three 2002 Ohio 3.6 2
four 2001 Nevada 2.4 3
five 2002 Nevada 2.9 4
six 2003 Nevada 3.2 5
When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:
>>> val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])