Pandas is an open-source library for data analysis in Python. Since we will also use NumPy alongside pandas, we first need to import the modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
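As a quick optional check (not part of the original notebook), we can print the versions of the libraries that are installed:
print('pandas', pd.__version__)
print('numpy', np.__version__)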
Pandas has three main data structures: Series, DataFrame, and Panel; this tutorial covers the first two.
A Series is one-dimensional data, similar to an array. A Series has two main attributes: index and values.
val = [x for x in range(1, 30, 3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx = [x for x in range(0, 10)] # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
s = pd.Series(val, index=idx)
display(s)
type(s)
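As a minimal sketch of the two attributes mentioned above, they can be inspected directly:
s.index   # the index labels 0 to 9
s.values  # the underlying NumPy array of values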
The index and the values are not always numbers; they can also be strings.
s = pd.Series([15,-3, 9, 6], index=['z','q','c','f'])
s
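A Series can also be built directly from a dict; a small illustrative sketch, where the dict keys become the index labels:
pd.Series({'z': 15, 'q': -3, 'c': 9, 'f': 6})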
The sort_values() method sorts the Series by its values.
s.sort_values()
s.sort_values(ascending=False)
To sort by index, use the sort_index() method.
s.sort_index()
Relabeling the index simply means assigning new index values.
val=[x for x in range(1,30,3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx2=[x for x in range(1,11)] # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
s=pd.Series(val,index=idx2)
s
The reindex() method changes the order of the index labels, deletes some of them, or adds new ones. For a new label, pandas inserts NaN as the corresponding value.
val=[x for x in range(1,30,3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx=[x for x in range(0,10)] # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
idx2=[x for x in range(1,11)] # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
s=pd.Series(val,index=idx)
s1=s.reindex(idx2)
s1
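reindex() can also simply reorder existing labels; a quick sketch:
s.reindex([9, 3, 0])  # values 28, 10, 1 in the new label order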
A value of a Series is accessed by specifying its index label; the same notation also updates it.
s[4]=12
s
We can filter the series. s1 comes from s with values higher than 10. s2 comes from s with values less than or equal to 16.
s1=s[s>10]
s1
s2=s[s<=16]
s2
s2.index
s2.values
s
s.drop(s2.index) # remove the part of the series whose index labels appear in s2
When we add two Series, the addition only happens for items with common index labels. All other index labels from the two Series are kept, but their values become NaN.
s3=s1+s2
s3
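If the NaN values are not wanted in the first place, the add() method with fill_value treats a label that is missing on one side as 0; a small sketch:
s1.add(s2, fill_value=0)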
To remove NaN from the data, use the dropna() method.
s3.dropna()
Another way to remove NaN from the data is the notnull() method, which produces a Boolean value for each label.
s3.notnull()
Then we use these Boolean values as a mask on the Series.
s3[s3.notnull()]
To remove data based on its index labels, we use the drop() method. This operation is useful to exclude certain data from the analysis.
First, we recreate the Series.
val=[x for x in range(1,30,3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx2=[x for x in range(1,11)] # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
s=pd.Series(val,index=idx2)
s
Then, we drop indexes 7 to 10.
arr=range(7,11)
s=s.drop(arr)
s
A DataFrame is two-dimensional data, similar to a matrix or a database table. A DataFrame has three main attributes: index, values, and columns.
A Series can be transformed into a DataFrame.
val=[x for x in range(1,30,3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx=[x for x in range(0,10)] # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
s=pd.Series(val,index=idx)
df=s.to_frame(name='val')
display(df)
type(df)
We can create a DataFrame from a NumPy array. Here df starts as a plain NumPy array; it becomes a DataFrame when passed to pd.DataFrame below.
df=np.arange(1,100,4).reshape(5,5)
df
To get the number of rows and the number of columns, use the shape attribute.
df.shape
print('Number of rows:',df.shape[0])
print('Number of columns:',df.shape[1])
df1 = pd.DataFrame(df, columns=['Strength','Elasticity','Diameter','Elongation','Depth'])
df1
The column header names
df1.columns
The number of columns in the data frame
len(df1.columns)
df1.index
Number of rows
len(df1.index)
Adding column headers does not change the number of rows and columns.
print('Number of rows:',df1.shape[0])
print('Number of columns:',df1.shape[1])
The values inside the DataFrame
df1.values
Reindex (relabel the index) to 1 through 5 (instead of 0 through 4)
df1.index=np.arange(1,6)
df1
df1.index
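Both the index and the column labels can also be set in one step when constructing the DataFrame; a small sketch:
pd.DataFrame(df, index=np.arange(1,6), columns=['Strength','Elasticity','Diameter','Elongation','Depth'])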
We can also create a DataFrame from a dictionary.
dict1={'color': ['red', 'green', 'yellow'],
'fruit': ['apple','banana','cherry'],
'profile': ['round', 'long', 'round'],
'taste': ['crunchy','sweet','sour']}
df2=pd.DataFrame(dict1,index=np.arange(1,4))
df2
To select one column, you can use df['ColumnName'], df.ColumnName, or df.iloc[:, colNum]. The result is a Series.
s=df2.color
type(s)
df2['fruit']
df2.iloc[:,0]
To select one particular cell within the data frame, you use df.ColumnName[index] or df['ColumnName'][index].
df2.profile[3]
df2['taste'][2]
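An equivalent label-based way to reach a single cell is the .loc accessor; a minimal sketch:
df2.loc[3, 'profile']  # same cell as df2.profile[3]
df2.loc[2, 'taste']    # same cell as df2['taste'][2]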
To select one particular row within the data frame, use df.loc[index] or df.loc[index, :] (use df.iloc[rowNum, :] to select by row position instead of label).
df2.loc[2]
s=df2.loc[2,:]
display(s)
type(s)
Row selection can also use colon notation df[start:end:step]. The start is inclusive (0 if omitted), the end is exclusive (runs to the last row if omitted), and the step defaults to 1 if omitted.
df2[0:2]
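The step part of the slice is not shown above; a quick sketch that takes every other row:
df2[0:3:2]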
To select multiple rows, provide a list of index labels as the argument to loc[[...]].
df2.loc[[1,3]]
Updating a value uses the same label-based selection; assigning through .loc applies the change to the DataFrame itself.
print(df2['taste'][2])
df2.loc[2,'taste']='sour'
print(df2['taste'][2])
It is possible to assign the same value to a whole column. Suppose we create a new column and give every row an equal initial value.
df2['demand']=20
df2
We can also assign all the values in one column at once by assigning a list whose length equals the number of rows.
df2['demand']=[15,10,30]
df2
print(df2.demand>10)
np.sum(df2.demand>10) # count how many are True
df2[df2.demand>10]
df2[df2.taste=='sour']
The membership method isin(list_of_values) returns a Boolean mask that is True wherever a cell value appears in the list; it is useful for masking.
mask1=df2.isin(['round','profile'])
mask1
df2[mask1]
mask2=df2.isin(['sour','taste'])
mask2
mask3=mask1.values | mask2.values
mask3
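The combined mask can be applied just like mask1. As a small sketch, the two Boolean DataFrames can also be OR-ed directly so the result stays a DataFrame mask:
df2[mask1 | mask2]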
To delete (clear) a certain cell, set its value to NaN.
df2.loc[1,'demand']=np.nan
df2
To delete a column, use del df['ColumnName'].
del df2['demand']
df2
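drop() can also remove a column without del; a small sketch (it returns a new DataFrame and leaves df2 unchanged):
df2.drop('fruit', axis=1)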
To delete a row, use the drop(index) method.
df2.drop(3)
df2.drop([1,2])
Note that drop() returns a new DataFrame and does not modify df2 itself (unless inplace=True is passed), whereas del removes the column from df2 permanently.
df2
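To make the removal of a row stick, reassign the result or pass inplace=True; a minimal sketch (df2_small is just an illustrative name):
df2_small = df2.drop(3)   # new DataFrame without row 3; df2 itself is unchanged
# df2 = df2.drop(3)       # or: df2.drop(3, inplace=True)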
Pandas can merge two DataFrames in the style of SQL joins. Use on=['key1', 'key2'] if the keys we want to join have the same names in both frames. If the names are not the same, use left_on=['key1', 'key2'] and right_on=['Key1', 'Key2'].
| Merge method | SQL Join Name | Description |
|---------------|-------------------|-------------------------------------------|
| left | LEFT OUTER JOIN | Use keys from left frame only |
| right | RIGHT OUTER JOIN | Use keys from right frame only |
| outer | FULL OUTER JOIN | Use union of keys from both frames |
| inner | INNER JOIN | Use intersection of keys from both frames |
Notice that the right table has a repeated key (K3 appears twice, so the key is not unique).
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5'],
'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5']})
right = pd.DataFrame({'key': ['K0', 'K4', 'K3', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on='key') # the same as inner
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, on='key',how='inner')
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, on='key',how='right')
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, on='key',how='left')
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, on='key',how='outer')
print(left,'\n')
print(right,'\n')
print(result)
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
'key2': ['k0', 'k1', 'k2', 'k0', 'k1', 'k2'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5'],
'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5']})
right = pd.DataFrame({'key1': ['K0', 'K4', 'K3', 'K4'],
'key2': ['k0', 'k1', 'k2', 'k1'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on=['key1', 'key2']) # the same as inner
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, how='inner', on=['key1', 'key2'])
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, how='right', on=['key1', 'key2'])
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, how='left', on=['key1', 'key2'])
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
print(left,'\n')
print(right,'\n')
print(result)
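Earlier we noted that left_on and right_on handle key columns whose names differ between the two frames. A minimal sketch, where right2 and the capitalized names Key1/Key2 are illustrative only:
right2 = right.rename(columns={'key1': 'Key1', 'key2': 'Key2'})
result = pd.merge(left, right2, how='inner',
                  left_on=['key1', 'key2'], right_on=['Key1', 'Key2'])
print(result)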
dc = pd.DataFrame({'A' : [1, 2, 3, -4],'B' : [4, 5, -2, 1],'C' : [2, -4, 1, 6]})
plt.plot(dc)
plt.legend(dc.columns)
dcsummary = pd.DataFrame([dc.mean(), dc.sum()],index=['Mean','Total'])
plt.table(cellText=dcsummary.values,colWidths = [0.25]*len(dc.columns),
rowLabels=dcsummary.index,
colLabels=dcsummary.columns,
cellLoc = 'center', rowLoc = 'center',
loc='top')
fig = plt.gcf()
plt.show()
print(dc.to_string())
HTML(dc.to_html())
display(dc)
dc.style
The following code is adapted from https://pandas.pydata.org/pandas-docs/stable/style.html
def color_negative_red(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for negative
    values, black otherwise.
    """
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color
s = dc.style.applymap(color_negative_red)
s
def highlight_max(data, color='yellow'):
    '''
    highlight the maximum in a Series or DataFrame
    '''
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else:  # from .apply(axis=None)
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''),
                            index=data.index, columns=data.columns)
dc.style.apply(highlight_max)
We can chain the styling commands. To highlight the max of each row, use axis=1.
dc.style.\
applymap(color_negative_red).\
apply(highlight_max,color='lightgreen', axis=1)
To highlight the max of all values, use axis=None.
dc.style.apply(highlight_max, color='darkorange', axis=None)
last update: August 2017
Cite this tutorial as Teknomo.K. (2017) Learning Pandas (http://people.revoledu.com/kardi/tutorial/Python/)
See Also: Python for Data Science
Visit www.Revoledu.com for more tutorials in Data Science
Copyright © 2017 Kardi Teknomo
Permission is granted to share this notebook as long as the copyright notice is intact.