Pandas is an open-source library for data analysis in Python. Since we will also use NumPy alongside pandas, we first need to import the modules.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, HTML
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
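As a quick optional check (not part of the original notebook), we can print the versions of the libraries that are installed:
print('pandas', pd.__version__)
print('numpy', np.__version__)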
Pandas has three main data structures: Series, DataFrame, and Panel; this tutorial covers the first two.
A Series is one-dimensional data, similar to an array. A Series has two main attributes: index and values.
val = [x for x in range(1, 30, 3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx = [x for x in range(0, 10)] # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
s = pd.Series(val, index=idx)
display(s)
type(s)
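As a minimal sketch of the two attributes mentioned above, they can be inspected directly:
s.index   # the index labels 0 to 9
s.values  # the underlying NumPy array of values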
The index and the values are not always numbers; they can also be strings.
s = pd.Series([15,-3, 9, 6], index=['z','q','c','f'])
s
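A Series can also be built directly from a dict; a small illustrative sketch, where the dict keys become the index labels:
pd.Series({'z': 15, 'q': -3, 'c': 9, 'f': 6})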
The sort_values() method sorts the Series by its values.
s.sort_values()
s.sort_values(ascending=False)
To sort by index, use the sort_index() method.
s.sort_index()
Relabeling the index simply means assigning new index values.
val=[x for x in range(1,30,3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx2=[x for x in range(1,11)] # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
s=pd.Series(val,index=idx2)
s
The reindex() method changes the order of the index labels, deletes some of them, or adds new ones. For a new label, pandas inserts NaN as the corresponding value.
val=[x for x in range(1,30,3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx=[x for x in range(0,10)] # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
idx2=[x for x in range(1,11)] # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
s=pd.Series(val,index=idx)
s1=s.reindex(idx2)
s1
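reindex() can also simply reorder existing labels; a quick sketch:
s.reindex([9, 3, 0])  # values 28, 10, 1 in the new label order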
A value of a Series is accessed by specifying its index label; the same notation also updates it.
s[4]=12
s
We can filter the series. s1 comes from s with values higher than 10. s2 comes from s with values less than or equal to 16.
s1=s[s>10]
s1
s2=s[s<=16]
s2
s2.index
s2.values
s
s.drop(s2.index) # remove the part of the series whose index labels appear in s2
When we add two Series, the addition only happens for items with common index labels. All other index labels from the two Series are kept, but their values become NaN.
s3=s1+s2
s3
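If the NaN values are not wanted in the first place, the add() method with fill_value treats a label that is missing on one side as 0; a small sketch:
s1.add(s2, fill_value=0)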
To remove NaN from the data, use the dropna() method.
s3.dropna()
Another way to remove NaN from the data is the notnull() method, which produces a Boolean value for each label.
s3.notnull()
Then we use these Boolean values as a mask on the Series.
s3[s3.notnull()]
To remove data based on its index labels, we use the drop() method. This operation is useful to exclude certain data from the analysis.
First, we recreate the Series.
val=[x for x in range(1,30,3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx2=[x for x in range(1,11)] # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
s=pd.Series(val,index=idx2)
s
Then, we drop indexes 7 to 10.
arr=range(7,11)
s=s.drop(arr)
s
A DataFrame is two-dimensional data, similar to a matrix or a database table. A DataFrame has three main attributes: index, values, and columns.
A Series can be transformed into a DataFrame.
val=[x for x in range(1,30,3)] # [1, 4, 7, 10, 13, 16, 19, 22, 25, 28]
idx=[x for x in range(0,10)] # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
s=pd.Series(val,index=idx)
df=s.to_frame(name='val')
display(df)
type(df)
We can create a DataFrame from a NumPy array. Here df starts as a plain NumPy array; it becomes a DataFrame when passed to pd.DataFrame below.
df=np.arange(1,100,4).reshape(5,5)
df
To get the number of rows and the number of columns, use the shape attribute.
df.shape
print('Number of rows:',df.shape[0])
print('Number of columns:',df.shape[1])
df1 = pd.DataFrame(df, columns=['Strength','Elasticity','Diameter','Elongation','Depth'])
df1
The column header names
df1.columns
The number of columns in the data frame
len(df1.columns)
df1.index
Number of rows
len(df1.index)
Adding column headers does not change the number of rows and columns.
print('Number of rows:',df1.shape[0])
print('Number of columns:',df1.shape[1])
The values inside the DataFrame
df1.values
Reindex (relabel the index) to 1 through 5 (instead of 0 through 4)
df1.index=np.arange(1,6)
df1
df1.index
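Both the index and the column labels can also be set in one step when constructing the DataFrame; a small sketch:
pd.DataFrame(df, index=np.arange(1,6), columns=['Strength','Elasticity','Diameter','Elongation','Depth'])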
We can also create a DataFrame from a dictionary.
dict1={'color': ['red', 'green', 'yellow'],
'fruit': ['apple','banana','cherry'],
'profile': ['round', 'long', 'round'],
'taste': ['crunchy','sweet','sour']}
df2=pd.DataFrame(dict1,index=np.arange(1,4))
df2
To select one column, you can use df['ColumnName'], df.ColumnName, or df.iloc[:, colNum]. The result is a Series.
s=df2.color
type(s)
df2['fruit']
df2.iloc[:,0]
To select one particular cell within the data frame, you use df.ColumnName[index] or df['ColumnName'][index].
df2.profile[3]
df2['taste'][2]
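An equivalent label-based way to reach a single cell is the .loc accessor; a minimal sketch:
df2.loc[3, 'profile']  # same cell as df2.profile[3]
df2.loc[2, 'taste']    # same cell as df2['taste'][2]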
To select one particular row within the data frame, use df.loc[index] or df.loc[index, :] (use df.iloc[rowNum, :] to select by row position instead of label).
df2.loc[2]
s=df2.loc[2,:]
display(s)
type(s)
Row selection can also use colon notation df[start:end:step]. The start is inclusive (0 if omitted), the end is exclusive (runs to the last row if omitted), and the step defaults to 1 if omitted.
df2[0:2]
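The step part of the slice is not shown above; a quick sketch that takes every other row:
df2[0:3:2]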
To select multiple rows, provide a list of index labels as the argument to loc[[...]].
df2.loc[[1,3]]
Updating a value uses the same label-based selection; assigning through .loc applies the change to the DataFrame itself.
print(df2['taste'][2])
df2.loc[2,'taste']='sour'
print(df2['taste'][2])
It is possible to assign the same value to a whole column. Suppose we create a new column and give every row an equal initial value.
df2['demand']=20
df2
We can also assign all the values in one column at once by assigning a list whose length equals the number of rows.
df2['demand']=[15,10,30]
df2
print(df2.demand>10)
np.sum(df2.demand>10) # count how many are True
df2[df2.demand>10]
df2[df2.taste=='sour']
The membership method isin(list_of_values) returns a Boolean mask that is True wherever a cell value appears in the list; it is useful for masking.
mask1=df2.isin(['round','profile'])
mask1
df2[mask1]
mask2=df2.isin(['sour','taste'])
mask2
mask3=mask1.values | mask2.values
mask3
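The combined mask can be applied just like mask1. As a small sketch, the two Boolean DataFrames can also be OR-ed directly so the result stays a DataFrame mask:
df2[mask1 | mask2]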
To delete (clear) a certain cell, set its value to NaN.
df2.loc[1,'demand']=np.nan
df2
To delete a column, use del df['ColumnName'].
del df2['demand']
df2
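drop() can also remove a column without del; a small sketch (it returns a new DataFrame and leaves df2 unchanged):
df2.drop('fruit', axis=1)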
To delete a row, use the drop(index) method.
df2.drop(3)
df2.drop([1,2])
Note that drop() returns a new DataFrame and does not modify df2 itself (unless inplace=True is passed), whereas del removes the column from df2 permanently.
df2
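To make the removal of a row stick, reassign the result or pass inplace=True; a minimal sketch (df2_small is just an illustrative name):
df2_small = df2.drop(3)   # new DataFrame without row 3; df2 itself is unchanged
# df2 = df2.drop(3)       # or: df2.drop(3, inplace=True)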
Pandas can merge two DataFrames in the style of SQL joins. Use on=['key1', 'key2'] if the keys we want to join have the same names in both frames. If the names are not the same, use left_on=['key1', 'key2'] and right_on=['Key1', 'Key2'].
| Merge method | SQL Join Name | Description |
|---------------|-------------------|-------------------------------------------|
| left | LEFT OUTER JOIN | Use keys from left frame only |
| right | RIGHT OUTER JOIN | Use keys from right frame only |
| outer | FULL OUTER JOIN | Use union of keys from both frames |
| inner | INNER JOIN | Use intersection of keys from both frames |
Notice that the right table has a repeated key (K3 appears twice, so the key is not unique).
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5'],
'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5']})
right = pd.DataFrame({'key': ['K0', 'K4', 'K3', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on='key') # the same as inner
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, on='key',how='inner')
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, on='key',how='right')
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, on='key',how='left')
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, on='key',how='outer')
print(left,'\n')
print(right,'\n')
print(result)
left = pd.DataFrame({'key1': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
'key2': ['k0', 'k1', 'k2', 'k0', 'k1', 'k2'],
'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5'],
'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B5']})
right = pd.DataFrame({'key1': ['K0', 'K4', 'K3', 'K4'],
'key2': ['k0', 'k1', 'k2', 'k1'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on=['key1', 'key2']) # the same as inner
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, how='inner', on=['key1', 'key2'])
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, how='right', on=['key1', 'key2'])
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, how='left', on=['key1', 'key2'])
print(left,'\n')
print(right,'\n')
print(result)
result = pd.merge(left, right, how='outer', on=['key1', 'key2'])
print(left,'\n')
print(right,'\n')
print(result)
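Earlier we noted that left_on and right_on handle key columns whose names differ between the two frames. A minimal sketch, where right2 and the capitalized names Key1/Key2 are illustrative only:
right2 = right.rename(columns={'key1': 'Key1', 'key2': 'Key2'})
result = pd.merge(left, right2, how='inner',
                  left_on=['key1', 'key2'], right_on=['Key1', 'Key2'])
print(result)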
dc = pd.DataFrame({'A' : [1, 2, 3, -4],'B' : [4, 5, -2, 1],'C' : [2, -4, 1, 6]})
plt.plot(dc)
plt.legend(dc.columns)
dcsummary = pd.DataFrame([dc.mean(), dc.sum()],index=['Mean','Total'])
plt.table(cellText=dcsummary.values,colWidths = [0.25]*len(dc.columns),
rowLabels=dcsummary.index,
colLabels=dcsummary.columns,
cellLoc = 'center', rowLoc = 'center',
loc='top')
fig = plt.gcf()
plt.show()
print(dc.to_string())
HTML(dc.to_html())
display(dc)
dc.style
The following code is adapted from https://pandas.pydata.org/pandas-docs/stable/style.html
def color_negative_red(val):
    """
    Takes a scalar and returns a string with
    the css property `'color: red'` for negative
    values, black otherwise.
    """
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color
s = dc.style.applymap(color_negative_red)
s
def highlight_max(data, color='yellow'):
    '''
    highlight the maximum in a Series or DataFrame
    '''
    attr = 'background-color: {}'.format(color)
    if data.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_max = data == data.max()
        return [attr if v else '' for v in is_max]
    else:  # from .apply(axis=None)
        is_max = data == data.max().max()
        return pd.DataFrame(np.where(is_max, attr, ''),
                            index=data.index, columns=data.columns)
dc.style.apply(highlight_max)
We can chain the styling commands. To highlight the max of each row, use axis=1.
dc.style.\
applymap(color_negative_red).\
apply(highlight_max,color='lightgreen', axis=1)
To highlight the max of all values, use axis=None.
dc.style.apply(highlight_max, color='darkorange', axis=None)
last update: August 2017
Cite this tutorial as Teknomo.K. (2017) Learning Pandas (http://people.revoledu.com/kardi/tutorial/Python/)
See Also: Python for Data Science
Visit www.Revoledu.com for more tutorials in Data Science
Copyright © 2017 Kardi Teknomo
Permission is granted to share this notebook as long as the copyright notice is intact.