Python Programming

Lecture 14 Data Processing and Analysis

14.1 Pandas (1)

Series and Dataframe

Pandas is designed for working with tabular or heterogeneous data. It contains data structures and data manipulation tools designed to make data cleaning and analysis fast and easy in Python.


import pandas as pd

create, check (select), modify, add, delete (创建，查看，修改，添加，删除)


>>> obj = pd.Series([4, 7, -5, 3]) 
>>> obj
0    4
1    7
2   -5
3    3
dtype: int64
# like dictionary


>>> obj.values
array([ 4,  7, -5,  3], dtype=int64)
# numpy, like a list

>>> obj.index # like range(4)
RangeIndex(start=0, stop=4, step=1)

Often it will be desirable to create a Series with an index identifying each data point with a label.


# 以列表创建时，index长度必须匹配
>>> obj2 = pd.Series([4, 7, -5, 3], 
...: index=['d', 'b', 'a', 'c'])
>>> obj2
d    4
b    7
a   -5
c    3
dtype: int64

>>> obj2.index
>>> Index(['d', 'b', 'a', 'c'], dtype='object')


#查看单个数据
>>> obj2['a']
-5
#查看连续数据
>>> obj2['b':'c']

#选择几行查看
>>> obj2[['c', 'a', 'd']]


#修改单个数据
>>> obj2['a'] = 6

#修改连续数据
>>> obj2['b':'c'] = -5

#选择几行修改
>>> obj2[['c', 'a', 'd']] = 2

Filtering with boolean array


#根据条件筛选查看
>>> obj2[obj2 > 0]
d    2
a    2
c    2
dtype: int64

Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values.


#暴力修改自身index，长度必须匹配
>>> obj2.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
>>> obj2
Bob      2
Steve   -5
Jeff     2
Ryan     2
dtype: int64

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:


>>> sdata = {'Ohio': 35000, 'Texas': 71000, 
...:        'Oregon': 16000, 'Utah': 5000}
>>> obj3 = pd.Series(sdata)
>>> obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

When you are only passing a dict, the index in the resulting Series will have the dict's keys in sorted order. You can override this by passing the dict keys in the order you want them to appear in the resulting Series:


#以字典创建时，可以指定index，多出来的以NaN填充（conform）
>>> states = ['California', 'Ohio', 'Oregon', 'Texas']
>>> obj4 = pd.Series(sdata, index=states)
>>> obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64


# 添加，改变自身
>>> obj4['NY'] = 4000
>>> obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
NY             4000.0
dtype: float64


# 删除，不改变自身
>>> obj4.drop('NY')
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64
 
# del obj4['NY'] 改变自身

DataFrame

A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).


data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002, 2003],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)


>>> frame
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


>>> frame.values
array([['Ohio', 2000, 1.5],
       ['Ohio', 2001, 1.7],
       ['Ohio', 2002, 3.6],
       ['Nevada', 2001, 2.4],
       ['Nevada', 2002, 2.9],
       ['Nevada', 2003, 3.2]], 
       dtype=object)


>>> frame.columns
Index(['state', 'year', 'pop'], dtype='object')

>>> frame.index
RangeIndex(start=0, stop=6, step=1)

For large DataFrames, the head method selects only the first five rows:


>>> frame.head() #.head(100)
    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


>>> pd.DataFrame(data, columns=['year', 'state', 'pop'])
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2

If you pass a column (not index, index must match!) that isn't contained in the dict, it will appear with missing values in the result:


# 以字典+列表创建时，可以指定columns，多出来的以NaN填充（conform）
# 以字典+列表创建时，index必须匹配
>>> frame2 = pd.DataFrame(data, columns=['year','state','pop','debt'],
....: index=['one','two','three','four','five','six'])


>>> frame2
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
six    2003  Nevada  3.2  NaN

>>> frame2.columns
Index(['year', 'state', 'pop', 'debt'], dtype='object')

>>> frame2.index
Index(['one', 'two', 'three', 'four', 'five', 'six'], dtype='object')


# 暴力修改自身columns和index，长度必须匹配
# frame2.columns = [...]
# frame2.index = [...]

A column in a DataFrame can be retrieved as a Series either by dict-like notation:


#查看单列
>>> frame2['state']
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
six      Nevada
Name: state, dtype: object

Rows can also be retrieved by position or name with the special loc attribute:


#查看单行
>>> frame2.loc['three']
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object


#查看连续多列
>>> frame2.loc[:,'state':'debt']
        state  pop debt
one      Ohio  1.5  NaN
two      Ohio  1.7  NaN
three    Ohio  3.6  NaN
four   Nevada  2.4  NaN
five   Nevada  2.9  NaN
six    Nevada  3.2  NaN

#查看特定几列
>>> frame2.loc[:,['state','pop']]
        state  pop
one      Ohio  1.5
two      Ohio  1.7
three    Ohio  3.6
four   Nevada  2.4
five   Nevada  2.9
six    Nevada  3.2


#查看连续多行
>>> frame2.loc['two':'four',:]
       year   state  pop debt
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN

#查看特定几行
>>> frame2.loc[['two','four'],:]
      year   state  pop debt
two   2001    Ohio  1.7  NaN
four  2001  Nevada  2.4  NaN

#查看行列
>>> frame2.loc[['two','four'],'state']

two       Ohio
four    Nevada
Name: state, dtype: object


#查看连续多列
>>> frame2.iloc[:,1:3]
        state  pop
one      Ohio  1.5
two      Ohio  1.7
three    Ohio  3.6
four   Nevada  2.4
five   Nevada  2.9
six    Nevada  3.2

#查看特定几列
>>> frame2.iloc[:,[1,2]]
        state  pop
one      Ohio  1.5
two      Ohio  1.7
three    Ohio  3.6
four   Nevada  2.4
five   Nevada  2.9
six    Nevada  3.2


#查看连续多行
>>> frame2.iloc[1:4,:]
       year   state  pop debt
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN

#查看特定几行
>>> frame2.iloc[[1,3],:]
      year   state  pop debt
two   2001    Ohio  1.7  NaN
four  2001  Nevada  2.4  NaN

#查看行列
>>> frame2.iloc[[1,3],1:3]

       state  pop
two     Ohio  1.7
four  Nevada  2.4

Modify by assignment.


>>> frame2['debt'] = 16.5
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
six    2003  Nevada  3.2  16.5


>>> frame2['debt'] = range(6)
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5


>>> f = frame2.copy()
>>> f.loc['one']=6
>>> f
       year   state  pop  debt
one       6       6  6.0     6
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5


>>> f.loc['one']=range(4)
>>> f

       year   state  pop  debt
one       0       1  2.0     3
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4
six    2003  Nevada  3.2     5

When you are assigning lists or arrays to a column, the value's length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame's index, inserting missing values in any holes:


>>> val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])
>>> frame2['debt'] = val
>>> frame2
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


#添加列
>>> frame2['eastern'] = (frame2.state == 'Ohio')
>>> frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False


#添加行，must match, otherwise error!
>>> frame2.loc['seven']=[2004, 'Nevada', 3.7, -2,False]
>>> frame2
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
seven  2004  Nevada  3.7  -2.0    False


>>> frame2.append({'year':2004,'state':'Nevada','pop':
...: 3.7,'debt':-3,'eastern':False},ignore_index=True)
>>> frame2
   year   state  pop  debt  eastern
0  2000    Ohio  1.5   NaN     True
1  2001    Ohio  1.7  -1.2     True
2  2002    Ohio  3.6   NaN     True
3  2001  Nevada  2.4  -1.5    False
4  2002  Nevada  2.9  -1.7    False
5  2003  Nevada  3.2   NaN    False
6  2004  Nevada  3.7  -2.0    False
7  2004  Nevada  3.7  -3.0    False
# ignore_index: no duplicate index when appending several times
# or concat()


#删除单列 (不会改变原来的表格)
>>> frame2.drop('eastern',axis=1)

#删除多列
>>> frame2.drop(['year','state'], axis=1)
>>> frame2.drop(frame2.columns[[0,3]], axis=1)

#删除连续列
>>> frame2.drop(frame2.columns[list(range(2))],axis=1)


#删除单行
>>> frame2.drop('one',axis=0)

#删除多行
>>> frame2.drop(['one','two'], axis=0)
>>> frame2.drop(frame2.index[[0,3]], axis=0)

#删除连续行
>>> frame2.drop(frame2.index[list(range(2))],axis=0)

14.2 Pandas (2)

Reindex, Arithmetics, Function

Other ways of forming a dataframe

1. By dictionaries nested in a dictionary


>>> pop = {'Nevada': {2001: 2.4, 2002: 2.9},
....: 'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}


>>> frame3 = pd.DataFrame(pop)
>>> frame3
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5

The keys in the inner dicts are combined and sorted to form the index in the result. This isn't true if an explicit index is specified:


#以字典+字典创建时，可以指定index，多出来的以NaN填充（conform）
#以字典+字典创建时，可以指定columns，多出来的以NaN填充（conform）
>>> pd.DataFrame(pop, index=[2001, 2002, 2003])
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN

2. By lists nested in a list


#以列表+列表创建时，index和columns都必须长度匹配
>>> frame3 = pd.DataFrame([[2.4,1.7],[2.9,3.6],[]],
...: columns = ['Nevada', 'Ohio'], index = [2001,2002,2003])
>>> frame3 
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     NaN   NaN


>>> frame3.index.name = 'year' 
>>> frame3.columns.name = 'state'
>>> frame3
state  Nevada  Ohio
year               
2001      2.4   1.7
2002      2.9   3.6
2003      NaN   NaN

# 转置transpose
>>> frame3.T
year    2001  2002  2003
state                   
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   NaN

Reindexing (conform)


>>> frame4 = frame3.reindex([2001,2003,2005],fill_value=0)
>>> frame4
state  Nevada  Ohio
year               
2001      2.4   1.7
2003      NaN   NaN
2005      0.0   0.0

>>> frame4.reindex(range(2000,2005), method="ffill")
state  Nevada  Ohio
year               
2000      NaN   NaN
2001      2.4   1.7
2002      2.4   1.7
2003      NaN   NaN
2004      NaN   NaN


>>> frame5 = frame4.reindex(columns=['Nevada', 'Washington'])  
>>> frame5
state  Nevada  Washington
year                     
2001      2.4         NaN
2003      NaN         NaN
2005      0.0         NaN

Operations between Series


>>> obj3
Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64


>>> obj4
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

It automatically aligns by index label in arithmetic operations:


>>> obj3 + obj4
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

Arithmetic Methods


>>> obj3.add(obj4)
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64

# add, radd (+), sub, rsub (-)
# div, rdiv (/), floordiv, rfloordiv (//)
# mul, rmul (*), pow, rpow (**)


>>> obj3.add(obj4,fill_value=0)
California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah            5000.0
dtype: float64


# fill_value is for missing items
# not for missing data!

Operations between DataFrames


>>> frame3
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


>>> frame4
      Nevada  Ohio
2001     2.4   1.7
2003     0.0   0.0
2005     0.0   0.0


>>> frame3 + frame4
      Nevada  Ohio
2000     NaN   NaN
2001     4.8   3.4
2002     NaN   NaN
2003     NaN   NaN
2005     NaN   NaN


>>> frame3.add(frame4,fill_value=0)
      Nevada  Ohio
2000     NaN   1.5
2001     4.8   3.4
2002     2.9   3.6
2003     0.0   0.0
2005     0.0   0.0

Operations between DataFrame and Series


>>> frame6 = pd.DataFrame([[2.4,1.7],[2.9,3.6],[2.1,2.5]],
...: columns = ['Nevada', 'Ohio'], index = [2001,2002,2003])
>>> frame6 
      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2003     2.1   2.5


>>> series_row = frame6.iloc[0]
>>> series_row
Nevada    2.4
Ohio      1.7
Name: 2001, dtype: float64


>>> frame6 - series_row
      Nevada  Ohio
2001     0.0   0.0
2002     0.5   1.9
2003    -0.3   0.8


>>> series_col = frame6['Ohio']
>>> series_col
2001    1.7
2002    3.6
2003    2.5
Name: Ohio, dtype: float64


>>> frame6.sub(series_col, axis=0)
# arithmetic methods, axis=0 match index
      Nevada  Ohio
2001     0.7   0.0
2002    -0.7   0.0
2003    -0.4   0.0

Create

Series

以列表创建，index (match)
以字典创建，index (conform)

DataFrame

以字典+列表创建，columns (conform), index (match)
以字典+字典创建，columns, index (conform)
以列表+列表创建，columns, index (match)

Modify

df.index = [...] 暴力修改 (match)
df.columns = [...] 暴力修改 (match)
.reindex([...]) 指定index (conform)
.reindex(columns=[...]) 指定columns (conform)

Summary

Pandas

Reading: Python for Data Analysis, Chapter 5