Python Programming

Lecture 9 Pandas Basics

9.1 Pandas (1)

Series and Dataframe
  • Pandas is designed for working with tabular or heterogeneous data. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python.


import pandas as pd

create, check (select), modify, add, delete (创建,查看,修改,添加,删除)


>>> obj = pd.Series([4, 7, -5, 3]) 
>>> obj
0    4
1    7
2   -5
3    3
dtype: int64
# like dictionary

>>> obj.values
array([ 4,  7, -5,  3], dtype=int64)
# numpy, like a list

>>> obj.index # like range(4)
RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point with a label.


# 以列表创建时,index长度必须匹配
>>> obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
>>> obj2
d    4
b    7
a   -5
c    3
dtype: int64

>>> obj2.index
>>> Index(['d', 'b', 'a', 'c'], dtype='object')

#查看单个数据
>>> obj2['a']
-5
#查看连续数据
>>> obj2['b':'c']

#选择几行查看
>>> obj2[['c', 'a', 'd']]


#修改单个数据
>>> obj2['a'] = 6

#修改连续数据
>>> obj2['b':'c'] = -5

#选择几行修改
>>> obj2[['c', 'a', 'd']] = 2

Filtering with boolean array


#根据条件筛选查看
>>> obj2[obj2 > 0]
d    2
a    2
c    2
dtype: int64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.


#暴力修改自身index,长度必须匹配
>>> obj2.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
>>> obj2
Bob      2
Steve   -5
Jeff     2
Ryan     2
dtype: int64

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:


#第二种创建方式,以字典创建,可以指定index,多出来的以NaN填充(conform)
>>> sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
>>> states = ['California', 'Ohio', 'Oregon', 'Texas']
>>> obj4 = pd.Series(sdata, index=states)
>>> obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

# 添加,改变自身
>>> obj4['NY'] = 4000
>>> obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
NY             4000.0
dtype: float64

# 删除,不改变自身
>>> obj4.drop('NY')
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
 
# del obj4['NY'] 改变自身
DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).


data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

>>> frame
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2

>>> frame.values
array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2]], 
       dtype=object)

>>> frame.columns
Index(['state', 'year', 'pop'], dtype='object')

>>> frame.index
RangeIndex(start=0, stop=6, step=1)

For large DataFrames, the head method selects only the first five rows:


>>> frame.head() #.head(100)
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9

>>> pd.DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

If you pass a column (not index, index must match!) that isn't contained in the dict, it will appear with missing values in the result:


# 以字典+列表创建时,可以指定columns,多出来的以NaN填充(conform)
# 以字典+列表创建时,index必须匹配
>>> frame2 = pd.DataFrame(data, columns=['year','state','pop','debt'],
....: index=['one','two','three','four','five','six'])

>>> frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

>>> frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

>>> frame2.index
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')

# 暴力修改自身columns和index,长度必须匹配
# frame2.columns = [...]
# frame2.index = [...]

A column in a DataFrame can be retrieved as a Series either by dict-like notation:


#查看单列
>>> frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

Rows can also be retrieved by position or name with the special loc attribute:


#查看单行
>>> frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

#查看连续多列
>>> frame2.loc[:,'state':'debt']
        state  pop debt
one      Ohio  1.5  NaN
two      Ohio  1.7  NaN
three    Ohio  3.6  NaN
four   Nevada  2.4  NaN
five   Nevada  2.9  NaN
six    Nevada  3.2  NaN

#查看特定几列
>>> frame2.loc[:,['state','pop']]
        state  pop
one      Ohio  1.5
two      Ohio  1.7
three    Ohio  3.6
four   Nevada  2.4
five   Nevada  2.9
six    Nevada  3.2

#查看连续多行
>>> frame2.loc['two':'four',:]
       year   state  pop debt
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN

#查看特定几行
>>> frame2.loc[['two','four'],:]
      year   state  pop debt
two   2001    Ohio  1.7  NaN
four  2001  Nevada  2.4  NaN

#查看行列
>>> frame2.loc[['two','four'],'state']

two       Ohio
four    Nevada
Name: state, dtype: object

#查看连续多列
>>> frame2.iloc[:,1:3]
        state  pop
one      Ohio  1.5
two      Ohio  1.7
three    Ohio  3.6
four   Nevada  2.4
five   Nevada  2.9
six    Nevada  3.2

#查看特定几列
>>> frame2.iloc[:,[1,2]]
        state  pop
one      Ohio  1.5
two      Ohio  1.7
three    Ohio  3.6
four   Nevada  2.4
five   Nevada  2.9
six    Nevada  3.2

#查看连续多行
>>> frame2.iloc[1:4,:]
       year   state  pop debt
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN

#查看特定几行
>>> frame2.iloc[[1,3],:]
      year   state  pop debt
two   2001    Ohio  1.7  NaN
four  2001  Nevada  2.4  NaN

#查看行列
>>> frame2.iloc[[1,3],1:3]

       state  pop
two     Ohio  1.7
four  Nevada  2.4

Modify by assignment.


>>> frame2['debt'] = 16.5
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5

>>> frame2['debt'] = range(6)
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5

>>> f = frame2.copy()
>>> f.loc['one']=6
>>> f
       year   state  pop  debt
one       6       6  6.0     6
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5

>>> f.loc['one']=range(4)
>>> f

       year   state  pop  debt
one       0       1  2.0     3
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5

When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:


>>> val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
>>> frame2['debt'] = val
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN

#添加列
>>> frame2['eastern'] = (frame2.state == 'Ohio')
>>> frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False

#添加单行,must match, otherwise error!
>>> frame2.loc['seven']=[2004, 'Nevada', 3.7, -2,False]
>>> frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
seven  2004  Nevada  3.7  -2.0    False
# 添加多行会用到 concat()

#删除单列 (不会改变原来的表格,要赋值创建新表格)
>>> frame2.drop('eastern',axis=1)

#删除多列
>>> frame2.drop(['year','state'], axis=1)
>>> frame2.drop(frame2.columns[[0,3]], axis=1)

#删除单行
>>> frame2.drop('one',axis=0)

#删除多行
>>> frame2.drop(['one','two'], axis=0)
>>> frame2.drop(frame2.index[[0,3]], axis=0)

9.2 Pandas (2)

Forming, Reindex, Arithmetics

Another way of forming a dataframe


>>> pop = {'Nevada': {2001: 2.4, 2002: 2.9},
....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

>>> frame3 = pd.DataFrame(pop)
>>> frame3
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5

The keys in the inner dicts are combined and sorted to form the index in the result. This isn't true if an explicit index is specified:


#以字典+字典创建时,可以指定index,多出来的以NaN填充(conform)
#以字典+字典创建时,可以指定columns,多出来的以NaN填充(conform)
>>> pd.DataFrame(pop, index=[2001, 2002, 2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

>>> frame3.index.name = 'year' 
>>> frame3.columns.name = 'state'
>>> frame3
state  Nevada  Ohio
year               
2001      2.4   1.7
2002      2.9   3.6
2003      NaN   NaN 

Reindexing (conform)


>>> frame4 = frame3.reindex([2001,2003,2005],fill_value=0)
>>> frame4
state  Nevada  Ohio
year               
2001      2.4   1.7
2003      NaN   NaN
2005      0.0   0.0

>>> frame4.reindex(range(2000,2005), method="ffill")
state  Nevada  Ohio
year               
2000      NaN   NaN
2001      2.4   1.7
2002      2.4   1.7
2003      NaN   NaN
2004      NaN   NaN

>>> frame5 = frame4.reindex(columns=['Nevada', 'Washington'])  
>>> frame5
state  Nevada  Washington
year                     
2001      2.4         NaN
2003      NaN         NaN
2005      0.0         NaN                        

Summary

创建

  • Series
    • 以列表创建,index (match)
    • 以字典创建,index (conform)
  • DataFrame
    • 以字典+列表创建,columns (conform), index (match)
    • 以字典+字典创建,columns, index (conform)

修改行和列的编号

  • df.index = [...] 暴力修改 (match)
  • df.columns = [...] 暴力修改 (match)
  • .reindex([...]) 指定index (conform)
  • .reindex(columns=[...]) 指定columns (conform)

Operations between Series


>>> obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

>>> obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

>>> obj3.add(obj4) # obj3 + obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

# add, radd (+), sub, rsub (-)
# div, rdiv (/), floordiv, rfloordiv (//)
# mul, rmul (*), pow, rpow (**)

>>> obj3.add(obj4,fill_value=0)
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah            5000.0
dtype: float64


# fill_value is for missing items
# not for missing data!

Operations between DataFrames


>>> frame3
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5

>>> frame4
      Nevada  Ohio
2001     2.4   1.7
2003     0.0   0.0
2005     0.0   0.0

>>> frame3 + frame4
      Nevada  Ohio
2000     NaN   NaN
2001     4.8   3.4
2002     NaN   NaN
2003     NaN   NaN
2005     NaN   NaN

>>> frame3.add(frame4,fill_value=0)
      Nevada  Ohio
2000     NaN   1.5
2001     4.8   3.4
2002     2.9   3.6
2003     0.0   0.0
2005     0.0   0.0

Operations between DataFrame and Series


>>> frame6 = pd.DataFrame([[2.4,1.7],[2.9,3.6],[2.1,2.5]],
...: columns = ['Nevada', 'Ohio'], index = [2001,2002,2003])
>>> frame6 
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     2.1   2.5

>>> series_row = frame6.iloc[0]
>>> series_row
Nevada    2.4
Ohio      1.7
Name: 2001, dtype: float64

>>> frame6 - series_row
      Nevada  Ohio
2001     0.0   0.0
2002     0.5   1.9
2003    -0.3   0.8

>>> series_col = frame6['Ohio']
>>> series_col
2001    1.7
2002    3.6
2003    2.5
Name: Ohio, dtype: float64

>>> frame6.sub(series_col, axis=0)
# arithmetic methods, axis=0 match index
      Nevada  Ohio
2001     0.7   0.0
2002    -0.7   0.0
2003    -0.4   0.0

Summary

  • Pandas
    • Reading: Python for Data Analysis, Chapter 5