Simple Data Analysis using Python

In this tutorial you will learn how to do a simple data analysis using Python with libraries of NumPy and Pandas.

• NumPy is a package for scientific computing with Python.
• Pandas is an open source library for data analysis in Python.

You will learn how to clean your data, deriving new variables from the existing data, and simple data analysis for one and two variables.

First, we need to import our libraries of pandas and numpy.

In [142]:
import pandas as pd
import numpy as np


# The Data¶

Suppose we have the data of a day sales from a small fruit store. The data consists of only 10 rows and four columns. Each row represents one transaction of a fruit. The columns are the variables.

• ID is the transaction ID. The same transaction ID means the same customer purchased more than one items.
• Object represents the code of the fruit
• Color represent the code of the type of the fruit
• Count represent the demand (the amount of items that was sold in a transaction)

We read the data from a CSV file. Notice that the empty cells in the data are automatically converted into NaN (not a number) to indicate the missing values.

In [2]:
fileData='SampleData1.csv'
# Read the data file for analysis
data

Out[2]:
ID Object Color Count
0 1 1 1.0 50.0
1 2 1 2.0 25.0
2 2 2 3.0 30.0
3 3 2 1.0 45.0
4 4 2 3.0 12.0
5 4 3 2.0 NaN
6 5 3 2.0 100.0
7 6 3 NaN 45.0
8 7 1 2.0 18.0
9 7 2 1.0 12.0
10 7 3 2.0 5.0

We have two look up tables. One for Object field and another one for Color field.

In [3]:
fileLUTobject='LUTobjects.csv'
LUTobject

Out[3]:
ID Object
0 0 NaN
1 1 Apple
2 2 Banana
3 3 Cherry
In [4]:
fileLUTcolor='LUTcolors.csv'
LUTcolor

Out[4]:
ID Color
0 0 NaN
1 1 Green
2 2 Red
3 3 Yellow

# Data Cleaning¶

Here is our plan in data cleaning: we will convert the coded data values into the actual string based on look up table.

In [5]:
idx=LUTobject.ID.values
val=LUTobject.Object.values
data.Object.replace(idx,val,inplace=True)
data

Out[5]:
ID Object Color Count
0 1 Apple 1.0 50.0
1 2 Apple 2.0 25.0
2 2 Banana 3.0 30.0
3 3 Banana 1.0 45.0
4 4 Banana 3.0 12.0
5 4 Cherry 2.0 NaN
6 5 Cherry 2.0 100.0
7 6 Cherry NaN 45.0
8 7 Apple 2.0 18.0
9 7 Banana 1.0 12.0
10 7 Cherry 2.0 5.0
In [6]:
idx=LUTcolor.ID.values
val=LUTcolor.Color.values
data.Color.replace(idx,val,inplace=True)
data

Out[6]:
ID Object Color Count
0 1 Apple Green 50.0
1 2 Apple Red 25.0
2 2 Banana Yellow 30.0
3 3 Banana Green 45.0
4 4 Banana Yellow 12.0
5 4 Cherry Red NaN
6 5 Cherry Red 100.0
7 6 Cherry NaN 45.0
8 7 Apple Red 18.0
9 7 Banana Green 12.0
10 7 Cherry Red 5.0

# Feature Engineering: Deriving New Variables¶

We will create new fields or new variables based on certain rule and the values that exist in the data.

First, suppose we will derive a new variable price based on the following rule that put into the look up table.

In [7]:
fileLUTprice='LUTprice.csv'
LUTprice

Out[7]:
Fruit Color Price
0 Apple Green 15
1 Apple Red 20
2 Banana Green 7
3 Banana Yellow 5
4 Cherry Red 10
5 Cherry NaN 9

We change the name of column from 'Fruit' to 'Object' to match the field name in the data. The change of name is useful for merging the tables.

In [8]:
LUTprice=LUTprice.rename(columns={'Fruit': 'Object'})
LUTprice

Out[8]:
Object Color Price
0 Apple Green 15
1 Apple Red 20
2 Banana Green 7
3 Banana Yellow 5
4 Cherry Red 10
5 Cherry NaN 9

Pandas has powerful merge operation that is similar to join in the database. You can have inner join (default), right join, left join, or outer join by specifying the how.

In [9]:
table=pd.merge(data,LUTprice,on=['Object','Color'],how='outer')
table

Out[9]:
ID Object Color Count Price
0 1 Apple Green 50.0 15
1 2 Apple Red 25.0 20
2 7 Apple Red 18.0 20
3 2 Banana Yellow 30.0 5
4 4 Banana Yellow 12.0 5
5 3 Banana Green 45.0 7
6 7 Banana Green 12.0 7
7 4 Cherry Red NaN 10
8 5 Cherry Red 100.0 10
9 7 Cherry Red 5.0 10
10 6 Cherry NaN 45.0 9
In [10]:
table.sort_values(by=['ID','Object'])

Out[10]:
ID Object Color Count Price
0 1 Apple Green 50.0 15
1 2 Apple Red 25.0 20
3 2 Banana Yellow 30.0 5
5 3 Banana Green 45.0 7
4 4 Banana Yellow 12.0 5
7 4 Cherry Red NaN 10
8 5 Cherry Red 100.0 10
10 6 Cherry NaN 45.0 9
2 7 Apple Red 18.0 20
6 7 Banana Green 12.0 7
9 7 Cherry Red 5.0 10

Suppose the profit was set to be 10% of the price. The Count represents the demand for the fruit. Thus, we can derive several more variables:

1. Revenue = Count * Price
2. Cost = 90% * Price
3. Profit = Count * (Price-Cost)
In [11]:
table['Revenue']=table.Count*table.Price
table['Cost']=0.9*table.Price
table['Profit']=table.Count*(table.Price-table.Cost)
table

Out[11]:
ID Object Color Count Price Revenue Cost Profit
0 1 Apple Green 50.0 15 750.0 13.5 75.0
1 2 Apple Red 25.0 20 500.0 18.0 50.0
2 7 Apple Red 18.0 20 360.0 18.0 36.0
3 2 Banana Yellow 30.0 5 150.0 4.5 15.0
4 4 Banana Yellow 12.0 5 60.0 4.5 6.0
5 3 Banana Green 45.0 7 315.0 6.3 31.5
6 7 Banana Green 12.0 7 84.0 6.3 8.4
7 4 Cherry Red NaN 10 NaN 9.0 NaN
8 5 Cherry Red 100.0 10 1000.0 9.0 100.0
9 7 Cherry Red 5.0 10 50.0 9.0 5.0
10 6 Cherry NaN 45.0 9 405.0 8.1 40.5

We can compute the total revenue of the day (by ignoring the NaN)

In [12]:
np.nansum(table.Revenue)

Out[12]:
3674.0

The owner of the shop compute his cash for the day and he now realize that his revenue is actually \$4004. How many unit is the customer number 7 bought? This was missing value (i.e. NaN) because he forgot to write down.

In [13]:
(4004-3674)/10

Out[13]:
33.0

Now we corrected our data to replace the missing values.

In [14]:
table.Color.replace(np.nan,'Red',inplace=True)
table.Count.replace(np.nan,33.0,inplace=True)
table

Out[14]:
ID Object Color Count Price Revenue Cost Profit
0 1 Apple Green 50.0 15 750.0 13.5 75.0
1 2 Apple Red 25.0 20 500.0 18.0 50.0
2 7 Apple Red 18.0 20 360.0 18.0 36.0
3 2 Banana Yellow 30.0 5 150.0 4.5 15.0
4 4 Banana Yellow 12.0 5 60.0 4.5 6.0
5 3 Banana Green 45.0 7 315.0 6.3 31.5
6 7 Banana Green 12.0 7 84.0 6.3 8.4
7 4 Cherry Red 33.0 10 NaN 9.0 NaN
8 5 Cherry Red 100.0 10 1000.0 9.0 100.0
9 7 Cherry Red 5.0 10 50.0 9.0 5.0
10 6 Cherry Red 45.0 9 405.0 8.1 40.5

The Revenue and Profit still contain NaN, thus we need to recompute again to get our clean data table that is ready for the analysis.

In [15]:
table['Revenue']=table.Count*table.Price
table['Cost']=0.9*table.Price
table['Profit']=table.Count*(table.Price-table.Cost)
table

Out[15]:
ID Object Color Count Price Revenue Cost Profit
0 1 Apple Green 50.0 15 750.0 13.5 75.0
1 2 Apple Red 25.0 20 500.0 18.0 50.0
2 7 Apple Red 18.0 20 360.0 18.0 36.0
3 2 Banana Yellow 30.0 5 150.0 4.5 15.0
4 4 Banana Yellow 12.0 5 60.0 4.5 6.0
5 3 Banana Green 45.0 7 315.0 6.3 31.5
6 7 Banana Green 12.0 7 84.0 6.3 8.4
7 4 Cherry Red 33.0 10 330.0 9.0 33.0
8 5 Cherry Red 100.0 10 1000.0 9.0 100.0
9 7 Cherry Red 5.0 10 50.0 9.0 5.0
10 6 Cherry Red 45.0 9 405.0 8.1 40.5

Our data table is now completed. The total profit of the day can now be computed

In [16]:
np.sum(table.Profit)

Out[16]:
400.4

# Analysis of Each Variable¶

For each variable, we will compute the statistics and draw the distribution, plot the boxplot to identify the outlier. Note that we need to exclude the observation that contain NaN from the analysis, not from the data.

## Distribution¶

Method value_counts() produces Series data structure of the frequency count of each value.

Distribution of the fruits Object

In [72]:
h=table.Object.value_counts()
h

Out[72]:
Cherry    4
Banana    4
Apple     3
Name: Object, dtype: int64
In [71]:
h.plot(kind='bar',
title='Fruits',
figsize=(2,2),
color="pink",     # Plot color
x='Fruit',
y='Frequency'
);


Distributon of Color

In [120]:
h=table.Color.value_counts()
h.sort_values()

Out[120]:
Yellow    2
Green     3
Red       6
Name: Color, dtype: int64

## Normalized Distribution¶

Method crosstab() produces Data Frame object. It can be used to get the normalized distribution. the normalization can either be index, columns or all.

In [119]:
g=pd.crosstab(table.Color,columns="Count",normalize='all',colnames=['Variable'])
g

Out[119]:
Variable Count
Color
Green 0.272727
Red 0.545455
Yellow 0.181818
In [121]:
h.plot(kind='bar',
title='Color',
figsize=(2,2),
color="red",     # Plot color
x='color',
y='frequency'
);


Distribution of Count

In [87]:
h=table.Count.value_counts()
h.sort_index()

Out[87]:
5.0      1
12.0     2
18.0     1
25.0     1
30.0     1
33.0     1
45.0     2
50.0     1
100.0    1
Name: Count, dtype: int64
In [89]:
table.hist(column='Count',         # Column to plot
figsize=(2,2),    # Plot size
color="green",     # Plot color
bins=50,
range=(0,100));   # Limit x-axis range



Distribution of Price

In [86]:
h=table.Price.value_counts()
h.sort_index()

Out[86]:
5     2
7     2
9     1
10    3
15    1
20    2
Name: Price, dtype: int64

## BoxPlot¶

Boxplot is useful to check if your data contain outliers.

In [80]:
table.boxplot(column="Price",return_type='axes');


## Univariate Statistics¶

Method describe() produces the statistics of one variable.

In [91]:
table.Revenue.describe()

Out[91]:
count      11.000000
mean      364.000000
std       298.739017
min        50.000000
25%       117.000000
50%       330.000000
75%       452.500000
max      1000.000000
Name: Revenue, dtype: float64

# Analysis of Two Variables¶

We will analyze each two variables of interest using cross tabulation and pivot table.

## GroupBy¶

Suppose you want to know the total revenue based on each type of fruit.

In [95]:
g=table.Revenue.groupby(table.Object)
g.sum()

Out[95]:
Object
Apple     1610.0
Banana     609.0
Cherry    1785.0
Name: Revenue, dtype: float64

## Cross Tabulation¶

The cross tabulation shows the distribution over two variables. It can be normalized by columns such that the total in each column is 1.

In [123]:
g=pd.crosstab(table.Revenue, table.Object,normalize='columns')
g

Out[123]:
Object Apple Banana Cherry
Revenue
50.0 0.000000 0.00 0.25
60.0 0.000000 0.25 0.00
84.0 0.000000 0.25 0.00
150.0 0.000000 0.25 0.00
315.0 0.000000 0.25 0.00
330.0 0.000000 0.00 0.25
360.0 0.333333 0.00 0.00
405.0 0.000000 0.00 0.25
500.0 0.333333 0.00 0.00
750.0 0.333333 0.00 0.00
1000.0 0.000000 0.00 0.25

Next, we want to see the detail hierarchical grouping based on Object and the Color.

This may sound unreasonable but you want to know if the profit is affected by the color of the fruit.

In [24]:
g=table.Profit.groupby(table.Color)
g.mean()

Out[24]:
Color
Green     38.300000
Red       44.083333
Yellow    10.500000
Name: Profit, dtype: float64
In [100]:
g=pd.crosstab(table.Profit, table.Color,normalize='all')
g

Out[100]:
Color Green Red Yellow
Profit
5.0 0.000000 0.090909 0.000000
6.0 0.000000 0.000000 0.090909
8.4 0.090909 0.000000 0.000000
15.0 0.000000 0.000000 0.090909
31.5 0.090909 0.000000 0.000000
33.0 0.000000 0.090909 0.000000
36.0 0.000000 0.090909 0.000000
40.5 0.000000 0.090909 0.000000
50.0 0.000000 0.090909 0.000000
75.0 0.090909 0.000000 0.000000
100.0 0.000000 0.090909 0.000000
In [23]:
g=table.Revenue.groupby([table.Object,table.Color])
g.sum()

Out[23]:
Object  Color
Apple   Green      750.0
Red        860.0
Banana  Green      399.0
Yellow     210.0
Cherry  Red       1785.0
Name: Revenue, dtype: float64

## Pivot Table¶

More sophisticated data analysis of two or more variables is to use Pivot Table.

In [140]:
p=pd.pivot_table(table, values='Revenue',
columns=['Color'],
index=['Object'],
aggfunc=np.sum,
dropna=True)
p

Out[140]:
Color Green Red Yellow
Object
Apple 750.0 860.0 NaN
Banana 399.0 NaN 210.0
Cherry NaN 1785.0 NaN

# References¶

• Seabold, Skipper, and Josef Perktold. “Statsmodels: Econometric and statistical modeling with python.” Proceedings of the 9th Python in Science Conference. 2010.

Check: Python for Data Science

Visit www.Revoledu.com for more tutorials in Data Science