In this tutorial you will learn how to do a simple data analysis in Python using the NumPy and Pandas libraries.
You will learn how to clean your data, derive new variables from the existing data, and perform simple analyses of one and two variables.
First, we import the pandas and numpy libraries.
import pandas as pd
import numpy as np
Suppose we have one day of sales data from a small fruit store. The data consists of only 10 rows and four columns. Each row represents one fruit transaction, and each column is a variable.
We read the data from a CSV file. Notice that the empty cells in the data are automatically converted into NaN (Not a Number) to indicate missing values.
fileData='SampleData1.csv'
# Read the data file for analysis
data = pd.read_csv(fileData, low_memory=False)
data
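Before cleaning, it is often useful to count how many values are missing in each column. This small check is not part of the original notebook, but it uses only standard pandas:
# Count the missing values (NaN) in each column
data.isnull().sum()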
We have two lookup tables: one for the Object field and another for the Color field.
fileLUTobject='LUTobjects.csv'
LUTobject=pd.read_csv(fileLUTobject, low_memory=False)
LUTobject
fileLUTcolor='LUTcolors.csv'
LUTcolor=pd.read_csv(fileLUTcolor, low_memory=False)
LUTcolor
Here is our data-cleaning plan: we will convert the coded data values into the actual strings based on the lookup tables.
# Replace the coded Object values with the actual names from the lookup table
idx=LUTobject.ID.values
val=LUTobject.Object.values
data.Object.replace(idx,val,inplace=True)
data
# Replace the coded Color values with the actual names from the lookup table
idx=LUTcolor.ID.values
val=LUTcolor.Color.values
data.Color.replace(idx,val,inplace=True)
data
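As a side note, the same conversion can be done with Series.map() and a dictionary built from each lookup table. This is a sketch of an equivalent alternative (the dictionary names are only for illustration; run it instead of, not in addition to, the replace() calls above):
# Alternative: map the coded values through dictionaries built from the lookup tables
objectMap = dict(zip(LUTobject.ID, LUTobject.Object))
colorMap = dict(zip(LUTcolor.ID, LUTcolor.Color))
data['Object'] = data['Object'].map(objectMap)
data['Color'] = data['Color'].map(colorMap)
data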
We will create new fields (new variables) based on certain rules and on the values that already exist in the data.
First, suppose we derive a new variable, Price, based on the following rule, which is stored in a lookup table.
fileLUTprice='LUTprice.csv'
LUTprice=pd.read_csv(fileLUTprice, low_memory=False)
LUTprice
We rename the column 'Fruit' to 'Object' to match the field name in the data. Renaming the column makes it possible to merge the tables on it.
LUTprice=LUTprice.rename(columns={'Fruit': 'Object'})
LUTprice
Pandas has a powerful merge operation that is similar to a join in a database. You can perform an inner join (the default), right join, left join, or outer join by specifying the how parameter.
table=pd.merge(data,LUTprice,on=['Object','Color'],how='outer')
table
table.sort_values(by=['ID','Object'])
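For comparison, an inner join would keep only the rows that have a match in both tables, while a left join would keep every transaction even when its Object and Color have no price. A minimal sketch of these alternatives (the variable names are only for illustration and are not used further in this tutorial):
# Inner join: keep only rows present in both tables
innerTable = pd.merge(data, LUTprice, on=['Object','Color'], how='inner')
# Left join: keep every transaction, with NaN price when there is no match
leftTable = pd.merge(data, LUTprice, on=['Object','Color'], how='left')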
Suppose the profit is set to be 10% of the price, so the cost is 90% of the price. The Count column represents the demand for the fruit. Thus, we can derive several more variables:
table['Revenue']=table.Count*table.Price
table['Cost']=0.9*table.Price
table['Profit']=table.Count*(table.Price-table.Cost)
table
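Since Cost is 90% of Price, the Profit works out to Count × 0.1 × Price, which is exactly 10% of the Revenue. A quick sanity check (an extra line, not in the original notebook):
# Profit should equal 10% of Revenue on every row (NaN values compared as equal)
np.allclose(table.Profit, 0.1*table.Revenue, equal_nan=True)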
We can compute the total revenue of the day (ignoring the NaN values):
np.nansum(table.Revenue)
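Note that pandas itself skips NaN by default when summing, so the same total can be obtained directly from the Series:
# Series.sum() skips NaN by default (skipna=True)
table.Revenue.sum()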
The owner of the shop counts his cash for the day and realizes that his revenue is actually $4004. How many units did customer number 7 buy? This value was missing (i.e. NaN) because he forgot to write it down.
(4004-3674)/10
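The same correction can be computed from the table itself: subtract the recorded revenue from the actual cash and divide by the unit price of the missing transaction (10, the divisor used above). A sketch with illustrative variable names:
# Derive the missing Count from the gap between actual and recorded revenue
actualRevenue = 4004
missingCount = (actualRevenue - np.nansum(table.Revenue)) / 10  # 10 = unit price of the missing item
missingCount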
Now we correct our data by replacing the missing values.
table.Color.replace(np.nan,'Red',inplace=True)
table.Count.replace(np.nan,33.0,inplace=True)
table
The Revenue and Profit columns still contain NaN, so we need to recompute them to get a clean data table that is ready for analysis.
table['Revenue']=table.Count*table.Price
table['Cost']=0.9*table.Price
table['Profit']=table.Count*(table.Price-table.Cost)
table
Our data table is now complete. The total profit of the day can now be computed:
np.sum(table.Profit)
For each variable, we will compute statistics, draw the distribution, and plot a boxplot to identify outliers. Note that we exclude observations that contain NaN from the analysis, not from the data.
The value_counts() method produces a Series containing the frequency count of each value.
Distribution of the Object (fruit) variable
h=table.Object.value_counts()
h
h.plot(kind='bar',
       title='Fruits',
       figsize=(2,2),
       color="pink",       # Plot color
       xlabel='Fruit',     # Axis labels (supported in pandas 1.1+)
       ylabel='Frequency'
      );
Distribution of Color
h=table.Color.value_counts()
h.sort_values()
The crosstab() method produces a DataFrame object. It can be used to get the normalized distribution; the normalization can be by 'index', 'columns', or 'all'.
g=pd.crosstab(table.Color,columns="Count",normalize='all',colnames=['Variable'])
g
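For comparison, normalize='index' makes each row sum to 1 and normalize='columns' makes each column sum to 1. A short extra illustration (not in the original notebook) using two of our variables:
# Cross tabulation of Object against Color, normalized so that each row sums to 1
pd.crosstab(table.Object, table.Color, normalize='index')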
h.plot(kind='bar',
       title='Color',
       figsize=(2,2),
       color="red",        # Plot color
       xlabel='Color',     # Axis labels (supported in pandas 1.1+)
       ylabel='Frequency'
      );
Distribution of Count
h=table.Count.value_counts()
h.sort_index()
table.hist(column='Count', # Column to plot
figsize=(2,2), # Plot size
color="green", # Plot color
bins=50,
range=(0,100)); # Limit x-axis range
Distribution of Price
h=table.Price.value_counts()
h.sort_index()
A boxplot is useful for checking whether your data contain outliers.
table.boxplot(column="Price",return_type='axes');
The describe() method produces summary statistics of one variable.
table.Revenue.describe()
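The same method applied to the whole DataFrame gives the statistics of every numeric column at once:
# Summary statistics of all numeric columns
table.describe()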
We will analyze pairs of variables of interest using cross tabulation and pivot tables.
Suppose you want to know the total revenue for each type of fruit.
g=table.Revenue.groupby(table.Object)
g.sum()
The cross tabulation shows the distribution over two variables. It can be normalized by columns such that the total in each column is 1.
g=pd.crosstab(table.Revenue, table.Object,normalize='columns')
g
Next, we want to see the detailed hierarchical grouping based on Object and Color.
This may sound unreasonable, but suppose you want to know whether the profit is affected by the color of the fruit.
g=table.Profit.groupby(table.Color)
g.mean()
g=pd.crosstab(table.Profit, table.Color,normalize='all')
g
g=table.Revenue.groupby([table.Object,table.Color])
g.sum()
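The hierarchically grouped result is a Series with a two-level index. Its unstack() method reshapes it into a table with Object as rows and Color as columns, which is essentially the layout the pivot table in the next step produces. A small extra step, not in the original notebook:
# Reshape the hierarchical (Object, Color) index into a two-way table
g.sum().unstack()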
A more sophisticated way to analyze two or more variables is to use a pivot table.
p=pd.pivot_table(table, values='Revenue',
columns=['Color'],
index=['Object'],
aggfunc=np.sum,
dropna=True)
p
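pivot_table() also accepts margins=True, which appends an 'All' row and column containing the totals. A variation of the same call, shown only as an illustration (the variable name p2 is hypothetical):
# Same pivot table with grand totals added in an 'All' row and column
p2=pd.pivot_table(table, values='Revenue',
                  columns=['Color'],
                  index=['Object'],
                  aggfunc=np.sum,
                  margins=True)
p2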
Check: Python for Data Science
Visit www.Revoledu.com for more tutorials in Data Science
Copyright © 2017 Kardi Teknomo
Permission is granted to share this notebook as long as the copyright notice is intact.