Suppose we have one day of sales data from a small fruit store. The data consists of only 11 rows and four columns. Each row represents one transaction of one fruit. The columns are the variables.

ID is the transaction ID. Rows that share a transaction ID mean that the same customer purchased more than one item.
Object is the code of the fruit.
Color is the code of the color (type) of the fruit.
Count is the demand (the number of items sold in the transaction).

We read the data from a CSV file. Notice that the empty cells in the data are automatically converted into NaN (not a number) to indicate the missing values.

import numpy as np
import pandas as pd

fileData='SampleData1.csv'
# Read the data file for analysis
data = pd.read_csv(fileData, low_memory=False)
data
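To see the NaN conversion in isolation, the following minimal sketch reads three rows from an in-memory string instead of SampleData1.csv. The rows are copied from the raw data of this tutorial; the names csv_text and demo are used only for illustration.

```python
import io

import pandas as pd

# Three rows in CSV form; the empty fields stand for missing values.
csv_text = """ID,Object,Color,Count
4,3,2,
5,3,2,100
6,3,,45
"""

demo = pd.read_csv(io.StringIO(csv_text))

# Empty cells are parsed as NaN, which turns the affected columns into floats.
print(demo)
print(demo.isna().sum())   # number of missing values per column
```

This also explains why Color and Count are displayed as 1.0, 50.0 and so on: an integer column cannot hold NaN, so pandas stores it as float.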

We have two lookup tables: one for the Object field and another for the Color field.

fileLUTobject='LUTobjects.csv'
LUTobject=pd.read_csv(fileLUTobject, low_memory=False)
LUTobject

fileLUTcolor='LUTcolors.csv'
LUTcolor=pd.read_csv(fileLUTcolor, low_memory=False)
LUTcolor

Data Cleaning

Here is our plan for data cleaning: we will convert the coded data values into the actual strings based on the lookup tables.

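# Replace each Object code with its name from the lookup table
idx=LUTobject.ID.values
val=LUTobject.Object.values
data['Object']=data['Object'].replace(idx,val)
data

# Replace each Color code with its name; NaN stays NaN
idx=LUTcolor.ID.values
val=LUTcolor.Color.values
data['Color']=data['Color'].replace(idx,val)
data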
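An alternative way to decode a column is to build a dictionary from the lookup table and apply it with map(). The sketch below is self-contained and uses a hypothetical lookup table; its code/name pairs are inferred from the tables shown in this tutorial, not read from LUTcolors.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table with the same layout as LUTcolors.csv
lut = pd.DataFrame({"ID": [1, 2, 3], "Color": ["Green", "Red", "Yellow"]})

# Coded column with one missing value
codes = pd.Series([1.0, 2.0, 3.0, np.nan, 2.0], name="Color")

# Build a code -> name dictionary and map it over the column.
code_to_name = dict(zip(lut["ID"], lut["Color"]))
names = codes.map(code_to_name)
print(names.tolist())
```

One difference worth knowing: map() turns every value that is not in the dictionary into NaN, while replace() leaves unmatched values unchanged.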

Feature Engineering: Deriving New Variables

We will create new fields (variables) based on certain rules and on the values that exist in the data.

First, suppose we derive a new variable, Price, based on the rule stored in the following lookup table.

fileLUTprice='LUTprice.csv'
LUTprice=pd.read_csv(fileLUTprice, low_memory=False)
LUTprice

We rename the column 'Fruit' to 'Object' to match the field name in the data. Using the same name in both tables lets us merge them on that column.

LUTprice=LUTprice.rename(columns={'Fruit': 'Object'})
LUTprice

Pandas has a powerful merge operation that is similar to a join in a database. You can perform an inner join (the default), a right join, a left join, or an outer join by specifying the how argument.

table=pd.merge(data,LUTprice,on=['Object','Color'],how='outer')
table

table.sort_values(by=['ID','Object'])
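To compare the join types, here is a minimal sketch on two hypothetical miniature tables (sales and prices are illustration names, and their values are chosen only for this example). The indicator=True option adds a _merge column that shows where each row came from. Note that pandas matches NaN keys with each other, unlike a SQL join.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of the sales data and the price table
sales = pd.DataFrame({
    "Object": ["Apple", "Banana", "Cherry"],
    "Color": ["Green", "Yellow", np.nan],
    "Count": [50, 30, 45],
})
prices = pd.DataFrame({
    "Object": ["Apple", "Cherry", "Cherry"],
    "Color": ["Green", "Red", np.nan],
    "Price": [15, 10, 9],
})

keys = ["Object", "Color"]
inner = pd.merge(sales, prices, on=keys, how="inner")   # only matching keys
left = pd.merge(sales, prices, on=keys, how="left")     # every row of sales
outer = pd.merge(sales, prices, on=keys, how="outer",
                 indicator=True)                        # rows of both tables

print(len(inner), len(left), len(outer))
print(outer)
```

In the outer result, the Banana row has no price (left_only) and the red Cherry price has no sale (right_only), which is how NaN values can enter the merged table.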

Suppose the profit is set to 10% of the price, so the cost is 90% of the price. The Count represents the demand for the fruit. Thus, we can derive several more variables:

Revenue = Count * Price
Cost = 90% * Price
Profit = Count * (Price-Cost)

table['Revenue']=table.Count*table.Price
table['Cost']=0.9*table.Price
table['Profit']=table.Count*(table.Price-table.Cost)
table

We can compute the total revenue of the day (ignoring the NaN):

np.nansum(table.Revenue)

3674.0
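The difference between NaN-aware and plain sums can be checked on a small example. This is a sketch with made-up values, not the tutorial's table.

```python
import numpy as np
import pandas as pd

revenue = pd.Series([750.0, 500.0, np.nan, 1000.0])

total_nansum = np.nansum(revenue)            # NaN is treated as zero
total_pandas = revenue.sum()                 # pandas skips NaN by default
total_strict = revenue.sum(skipna=False)     # NaN propagates
total_numpy = np.sum(revenue.to_numpy())     # plain NumPy also propagates NaN

print(total_nansum, total_pandas, total_strict, total_numpy)
```

So table.Revenue.sum() would give the same result as np.nansum(table.Revenue), because the pandas sum skips missing values unless skipna=False is passed.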

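The owner of the shop counts his cash for the day and realizes that his revenue is actually $4004. One Count value is missing (NaN) because he forgot to write it down: the red cherries in transaction 4. How many units did that customer buy? The gap in revenue is 4004 - 3674 = 330, and red cherries sell for 10 each (100 of them produced a revenue of 1000), so we divide the gap by 10.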

(4004-3674)/10

33.0
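The same arithmetic can be written out step by step. In this sketch the revenue values are copied from the Revenue cross tabulation in this tutorial, and the unit price of 10 is inferred from the row where 100 red cherries produced a revenue of 1000; the variable names are for illustration only.

```python
# Revenue of the ten transactions whose Count is known
known_revenue = [750, 500, 150, 315, 60, 1000, 405, 360, 84, 50]

cash_counted = 4004      # revenue the owner actually counted
price_red_cherry = 10    # inferred unit price of the row with the missing Count

missing_revenue = cash_counted - sum(known_revenue)
missing_count = missing_revenue / price_red_cherry

print(sum(known_revenue), missing_revenue, missing_count)
```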

Now we correct the data by replacing the missing values.

table['Color']=table['Color'].fillna('Red')   # fill the missing color with 'Red'
table['Count']=table['Count'].fillna(33.0)    # fill the missing count derived above
table
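Several columns can also be filled in one call by passing a dictionary to fillna(). A minimal sketch on a hypothetical two-row frame:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "Object": ["Cherry", "Cherry"],
    "Color": ["Red", np.nan],
    "Count": [np.nan, 45.0],
})

# One fill value per column; columns not named in the dict are left untouched.
filled = demo.fillna({"Color": "Red", "Count": 33.0})

print(filled)
print(filled.isna().sum().sum())   # total number of missing values left
```

fillna() returns a new DataFrame, so the result has to be assigned to a name, as done here.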

Revenue and Profit still contain NaN, so we need to recompute them to get a clean data table that is ready for analysis.

table['Revenue']=table.Count*table.Price
table['Cost']=0.9*table.Price
table['Profit']=table.Count*(table.Price-table.Cost)
table

Our data table is now complete. The total profit of the day can now be computed:

np.sum(table.Profit)

400.4

Analysis of Each Variable

For each variable, we will compute the statistics, draw the distribution, and plot a boxplot to identify outliers. Note that observations containing NaN need to be excluded from the analysis, not removed from the data.

Distribution

The value_counts() method produces a Series containing the frequency count of each value.

Distribution of the fruits (Object)

h=table.Object.value_counts()
h

Cherry    4
Banana    4
Apple     3
Name: Object, dtype: int64

h.plot(kind='bar',
       title='Fruits',
       figsize=(2,2),
       color="pink",         # Plot color
       xlabel='Fruit',       # x-axis label
       ylabel='Frequency'    # y-axis label
      );

Distribution of Color

h=table.Color.value_counts()
h.sort_values()

Yellow    2
Green     3
Red       6
Name: Color, dtype: int64

Normalized Distribution

The crosstab() function produces a DataFrame. It can be used to get the normalized distribution. The normalize argument can be 'index', 'columns', or 'all'.

g=pd.crosstab(table.Color,columns="Count",normalize='all',colnames=['Variable'])
g
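For a single variable, value_counts(normalize=True) gives the same proportions as the cross tabulation above. The sketch below rebuilds the Color column of the cleaned table by hand (6 Red, 3 Green, 2 Yellow), so it runs without the CSV files.

```python
import pandas as pd

# Colors of the 11 cleaned transactions
color = pd.Series(["Green", "Red", "Yellow", "Green", "Yellow", "Red",
                   "Red", "Red", "Red", "Green", "Red"], name="Color")

counts = color.value_counts()                 # absolute frequencies
share = color.value_counts(normalize=True)    # fractions that sum to 1

print(counts)
print(share.round(3))
```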

h.plot(kind='bar',
       title='Color',
       figsize=(2,2),
       color="red",          # Plot color
       xlabel='Color',       # x-axis label
       ylabel='Frequency'    # y-axis label
      );

Distribution of Count

h=table.Count.value_counts()
h.sort_index()

5.0      1
12.0     2
18.0     1
25.0     1
30.0     1
33.0     1
45.0     2
50.0     1
100.0    1
Name: Count, dtype: int64

table.hist(column='Count',         # Column to plot
    figsize=(2,2),    # Plot size
    color="green",     # Plot color
    bins=50,
    range=(0,100));   # Limit x-axis range

Distribution of Price

h=table.Price.value_counts()
h.sort_index()

5     2
7     2
9     1
10    3
15    1
20    2
Name: Price, dtype: int64

BoxPlot

A boxplot is useful for checking whether your data contain outliers.

table.boxplot(column="Price",return_type='axes');
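The points a boxplot draws beyond its whiskers follow, by default, the 1.5 x IQR rule, which can also be applied numerically. As an illustration, the sketch below applies the rule to the Revenue values of the cleaned table (copied by hand from this tutorial) rather than to Price, because Revenue has a wider spread.

```python
import pandas as pd

# Revenue of the 11 cleaned transactions
revenue = pd.Series([750, 500, 150, 315, 60, 330, 1000, 405, 360, 84, 50],
                    dtype=float)

q1, q3 = revenue.quantile([0.25, 0.75])   # first and third quartiles
iqr = q3 - q1                             # interquartile range
lower = q1 - 1.5 * iqr                    # lower fence
upper = q3 + 1.5 * iqr                    # upper fence

# Values outside the fences are the candidates a boxplot marks as outliers.
outliers = revenue[(revenue < lower) | (revenue > upper)]
print(q1, q3, lower, upper)
print(outliers.tolist())
```

Whether a flagged value is really an error or just a large sale is a judgment call; the rule only points at candidates.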

Univariate Statistics

The describe() method produces summary statistics of one variable.

table.Revenue.describe()

count      11.000000
mean      364.000000
std       298.739017
min        50.000000
25%       117.000000
50%       330.000000
75%       452.500000
max      1000.000000
Name: Revenue, dtype: float64

Analysis of Two Variables

We will analyze pairs of variables of interest using groupby, cross tabulation, and pivot tables.

GroupBy

Suppose you want to know the total revenue for each type of fruit.

g=table.Revenue.groupby(table.Object)
g.sum()

Object
Apple     1610.0
Banana     609.0
Cherry    1785.0
Name: Revenue, dtype: float64
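A group can be summarized with several statistics at once by using agg(). The following sketch rebuilds the Object and Revenue columns of the cleaned table by hand so that it runs on its own; sales and summary are illustration names.

```python
import pandas as pd

# Object and Revenue of the 11 cleaned transactions
sales = pd.DataFrame({
    "Object": ["Apple", "Apple", "Banana", "Banana", "Banana", "Cherry",
               "Cherry", "Cherry", "Apple", "Banana", "Cherry"],
    "Revenue": [750.0, 500.0, 150.0, 315.0, 60.0, 330.0,
                1000.0, 405.0, 360.0, 84.0, 50.0],
})

# Total, average, and number of transactions per fruit
summary = sales.groupby("Object")["Revenue"].agg(["sum", "mean", "count"])
print(summary)
```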

Cross Tabulation

The cross tabulation shows the distribution over two variables. It can be normalized by columns such that the total in each column is 1.

g=pd.crosstab(table.Revenue, table.Object,normalize='columns')
g

This may sound unreasonable, but suppose you want to know whether the profit is affected by the color of the fruit.

After that, we look at the detailed hierarchical grouping based on Object and Color.

g=table.Profit.groupby(table.Color)
g.mean()

Color
Green     38.300000
Red       44.083333
Yellow    10.500000
Name: Profit, dtype: float64

g=pd.crosstab(table.Profit, table.Color,normalize='all')
g

g=table.Revenue.groupby([table.Object,table.Color])
g.sum()

Object  Color 
Apple   Green      750.0
        Red        860.0
Banana  Green      399.0
        Yellow     210.0
Cherry  Red       1785.0
Name: Revenue, dtype: float64
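The hierarchical result can be turned into a wide table with unstack(), which moves one index level into the columns. A self-contained sketch, with the three columns of the cleaned table rebuilt by hand:

```python
import pandas as pd

# Object, Color, and Revenue of the 11 cleaned transactions
sales = pd.DataFrame({
    "Object": ["Apple", "Apple", "Banana", "Banana", "Banana", "Cherry",
               "Cherry", "Cherry", "Apple", "Banana", "Cherry"],
    "Color": ["Green", "Red", "Yellow", "Green", "Yellow", "Red",
              "Red", "Red", "Red", "Green", "Red"],
    "Revenue": [750.0, 500.0, 150.0, 315.0, 60.0, 330.0,
                1000.0, 405.0, 360.0, 84.0, 50.0],
})

by_pair = sales.groupby(["Object", "Color"])["Revenue"].sum()
wide = by_pair.unstack("Color")   # colors become columns; absent pairs are NaN

print(by_pair)
print(wide)
```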

Pivot Table

A pivot table allows more sophisticated analysis of two or more variables.

p=pd.pivot_table(table, values='Revenue', 
                 columns=['Color'], 
                 index=['Object'],
                 aggfunc='sum',   # string name, as recommended by recent pandas versions
                 dropna=True)
p
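Two options of pivot_table() are often useful here: fill_value replaces the NaN of absent combinations, and margins=True appends row and column totals labeled All. A minimal sketch on the same hand-built columns of the cleaned table:

```python
import pandas as pd

# Object, Color, and Revenue of the 11 cleaned transactions
sales = pd.DataFrame({
    "Object": ["Apple", "Apple", "Banana", "Banana", "Banana", "Cherry",
               "Cherry", "Cherry", "Apple", "Banana", "Cherry"],
    "Color": ["Green", "Red", "Yellow", "Green", "Yellow", "Red",
              "Red", "Red", "Red", "Green", "Red"],
    "Revenue": [750.0, 500.0, 150.0, 315.0, 60.0, 330.0,
                1000.0, 405.0, 360.0, 84.0, 50.0],
})

p_total = pd.pivot_table(sales, values="Revenue",
                         index="Object", columns="Color",
                         aggfunc="sum",
                         fill_value=0,    # show 0 instead of NaN
                         margins=True)    # add the 'All' row and column
print(p_total)
```

The All/All cell is the total revenue of the day, which gives a quick consistency check against the cash counted by the owner.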


Permission is granted to share this notebook as long as the copyright notice is intact.

Output of data: raw data as read from SampleData1.csv

	ID	Object	Color	Count
0	1	1	1.0	50.0
1	2	1	2.0	25.0
2	2	2	3.0	30.0
3	3	2	1.0	45.0
4	4	2	3.0	12.0
5	4	3	2.0	NaN
6	5	3	2.0	100.0
7	6	3	NaN	45.0
8	7	1	2.0	18.0
9	7	2	1.0	12.0
10	7	3	2.0	5.0

Output of data after decoding Object

	ID	Object	Color	Count
0	1	Apple	1.0	50.0
1	2	Apple	2.0	25.0
2	2	Banana	3.0	30.0
3	3	Banana	1.0	45.0
4	4	Banana	3.0	12.0
5	4	Cherry	2.0	NaN
6	5	Cherry	2.0	100.0
7	6	Cherry	NaN	45.0
8	7	Apple	2.0	18.0
9	7	Banana	1.0	12.0
10	7	Cherry	2.0	5.0

Output of data after decoding Color

	ID	Object	Color	Count
0	1	Apple	Green	50.0
1	2	Apple	Red	25.0
2	2	Banana	Yellow	30.0
3	3	Banana	Green	45.0
4	4	Banana	Yellow	12.0
5	4	Cherry	Red	NaN
6	5	Cherry	Red	100.0
7	6	Cherry	NaN	45.0
8	7	Apple	Red	18.0
9	7	Banana	Green	12.0
10	7	Cherry	Red	5.0

Cross tabulation of Revenue by Object, normalized by columns

Object	Apple	Banana	Cherry
Revenue
50.0	0.000000	0.00	0.25
60.0	0.000000	0.25	0.00
84.0	0.000000	0.25	0.00
150.0	0.000000	0.25	0.00
315.0	0.000000	0.25	0.00
330.0	0.000000	0.00	0.25
360.0	0.333333	0.00	0.00
405.0	0.000000	0.00	0.25
500.0	0.333333	0.00	0.00
750.0	0.333333	0.00	0.00
1000.0	0.000000	0.00	0.25

Cross tabulation of Profit by Color, normalized over all cells

Color	Green	Red	Yellow
Profit
5.0	0.000000	0.090909	0.000000
6.0	0.000000	0.000000	0.090909
8.4	0.090909	0.000000	0.000000
15.0	0.000000	0.000000	0.090909
31.5	0.090909	0.000000	0.000000
33.0	0.000000	0.090909	0.000000
36.0	0.000000	0.090909	0.000000
40.5	0.000000	0.090909	0.000000
50.0	0.000000	0.090909	0.000000
75.0	0.090909	0.000000	0.000000
100.0	0.000000	0.090909	0.000000

Pivot table of total Revenue by Object and Color

Color	Green	Red	Yellow
Object
Apple	750.0	860.0	NaN
Banana	399.0	NaN	210.0
Cherry	NaN	1785.0	NaN