
# Simple statistical analysis of CSV data with NumPy - NumPy tutorial 8

By Applied Electronics - Friday, May 27, 2016
Statistics is an important branch of mathematics. It lets you summarize the data you have collected and uncover the information embedded in it. NumPy is a Python package optimized for working with arrays. It provides a number of statistical functions, such as mean, median, standard deviation, variance and correlation.

If you need to do statistical analysis on your data, you can use NumPy for it. Before you do, you need the data itself, and you need to know how to import data into NumPy and export it again. We showed how to do this in the previous two blog posts.

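Suppose you have some data in a CSV file called mydata.csv and want to perform various statistical analyses on it. The examples below call NumPy functions by their bare names, which assumes they have already been imported into the session (for example with from numpy import *). First we import the data as follows: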

[In] sale, cost = loadtxt(r'D:\mydata.csv', skiprows = 1, delimiter = ",", usecols = (1,2), unpack = True)
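For readers who prefer an explicit `import numpy as np` over bare function names, the same call can be sketched as below. The layout of mydata.csv is not shown in this post, so the header and the three rows here are only a hypothetical stand-in, fed to `loadtxt` through an in-memory `io.StringIO` object instead of a file on disk.

```python
import io
import numpy as np

# Hypothetical stand-in for mydata.csv: one header line, then item, sale, cost.
csv_text = """item,sale,cost
a,5.4,9050
b,6.5,3400
c,23.2,2300
"""

sale, cost = np.loadtxt(
    io.StringIO(csv_text),  # loadtxt accepts a file name or a file-like object
    skiprows=1,             # skip the header line
    delimiter=",",          # values are comma separated
    usecols=(1, 2),         # read only the second and third columns
    unpack=True,            # return one array per column
)

print(sale)
print(cost)
```

Because only columns 1 and 2 are read, the text in the first column never has to be converted to a number.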

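The two columns are now stored in the NumPy arrays (ndarray objects) sale and cost. In the call above, skiprows = 1 skips the first line of the file (typically a header row), usecols = (1,2) selects the second and third columns, and unpack = True returns each column as a separate array. We can view the arrays simply by typing their names: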

[In] sale
[Out]
array([  5.4,   6.5,  23.2,  64.2,   0.2,   7.3,  84.3,   5.2,   9.5,
         8.4,  56.2,  65.3,   6.2,   5.3,  67.3,   9.7,   7.5,   2.3,
         5. ,  10.3])

[In] cost
[Out]
array([ 9050.,  3400.,  2300.,  6030.,  5030.,  9030.,  2040.,  1020.,
        7030.,  5023.,  1003.,  4060.,  3090.,  6540.,  8234.,  2349.,
        9843.,  4394.,  8924.,  8524.])

## Calculate the number of items

We can use the len( ) function to count the number of items in each column.

[In] len(cost)
[Out] 20

[In] len(sale)
[Out] 20

## Mean

The mean( ) function calculates the mean of sale and cost:

[In] sale_mean = mean(sale)

[In] sale_mean
[Out] 22.465000000000003

[In] cost_mean = mean(cost)

[In] cost_mean
[Out] 5345.6999999999998

Other kinds of mean can also be calculated, such as the volume weighted average price (VWAP) and the time weighted average price (TWAP).

Here we show the VWAP calculation. It uses the average( ) function with the sale values as weights: each cost is multiplied by its sale value, and the sum of these products is divided by the sum of the sale values.

[In] vwap = average(cost, weights= sale)

[In] vwap
[Out] 4525.3087024259949
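The weighted average can be checked by hand. The following is a minimal sketch that uses the sale and cost values printed earlier and assumes an explicit `import numpy as np`; the name `manual_vwap` is introduced here only for illustration.

```python
import numpy as np

sale = np.array([5.4, 6.5, 23.2, 64.2, 0.2, 7.3, 84.3, 5.2, 9.5, 8.4,
                 56.2, 65.3, 6.2, 5.3, 67.3, 9.7, 7.5, 2.3, 5.0, 10.3])
cost = np.array([9050., 3400., 2300., 6030., 5030., 9030., 2040., 1020.,
                 7030., 5023., 1003., 4060., 3090., 6540., 8234., 2349.,
                 9843., 4394., 8924., 8524.])

# Weighted average as computed by NumPy
vwap = np.average(cost, weights=sale)

# The same quantity written out: sum(cost * weight) / sum(weight)
manual_vwap = np.sum(cost * sale) / np.sum(sale)

print(vwap, manual_vwap)
```

Both values agree, at roughly 4525.31. This is lower than the plain mean of cost, because several of the largest sale values sit next to comparatively low costs.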

## Maximum and Minimum Value

The maximum and minimum values can be found using the max( ) and min( ) functions.

Maximum and minimum values of cost:

[In] cost_max = max(cost)

[In] cost_max
[Out] 9843.0

[In] cost_min = min(cost)

[In] cost_min
[Out] 1003.0

## Range

The range can be calculated using the ptp( ) function; ptp stands for peak to peak. It is the difference between the maximum and minimum values. For example, the range (peak to peak value) of the cost column is:

[In] cost_ptp = ptp(cost)

[In] cost_ptp
[Out] 8840.0
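As a quick check, the peak to peak value can be compared with the difference written out directly. This sketch assumes an explicit `import numpy as np`; the names `cost_range` and `manual_range` are this example's own. Storing the result under a name other than `ptp` keeps the function itself available for later calls.

```python
import numpy as np

cost = np.array([9050., 3400., 2300., 6030., 5030., 9030., 2040., 1020.,
                 7030., 5023., 1003., 4060., 3090., 6540., 8234., 2349.,
                 9843., 4394., 8924., 8524.])

# Peak to peak value computed by NumPy
cost_range = np.ptp(cost)

# The same value as maximum minus minimum
manual_range = cost.max() - cost.min()

print(cost_range, manual_range)
```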

## Median

The median( ) function calculates the median.

[In] cost_median = median(cost)

[In] cost_median
[Out] 5026.5
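The result is not one of the values in cost, because the array has an even number of items and the median is then the average of the two middle values after sorting. A small sketch of that calculation, assuming an explicit `import numpy as np` and handling only this even-length case:

```python
import numpy as np

cost = np.array([9050., 3400., 2300., 6030., 5030., 9030., 2040., 1020.,
                 7030., 5023., 1003., 4060., 3090., 6540., 8234., 2349.,
                 9843., 4394., 8924., 8524.])

ordered = np.sort(cost)
n = len(ordered)

# n is even (20 items), so average the two middle values: 5023 and 5030
manual_median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(manual_median, np.median(cost))
```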

## Standard Deviation

The std( ) function calculates the standard deviation. By default it divides by the number of items N, which gives the population standard deviation.

[In] cost_std = std(cost)

[In] cost_std
[Out] 2845.1059927531696

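## Variance

The var( ) function calculates the variance, which is the square of the standard deviation. Like std( ), it divides by the number of items N by default.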

[In] cost_var = var(cost)

[In] cost_var
[Out] 8094628.1099999994
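The relationship between the two measures can be sketched as follows, assuming an explicit `import numpy as np`. Both functions accept a `ddof` argument; passing `ddof=1` divides by N - 1 instead of N, which gives the sample variance. The name `sample_var` is introduced here only for illustration.

```python
import numpy as np

cost = np.array([9050., 3400., 2300., 6030., 5030., 9030., 2040., 1020.,
                 7030., 5023., 1003., 4060., 3090., 6540., 8234., 2349.,
                 9843., 4394., 8924., 8524.])

cost_std = np.std(cost)            # divides by N (ddof=0 is the default)
cost_var = np.var(cost)            # divides by N as well
sample_var = np.var(cost, ddof=1)  # divides by N - 1

n = len(cost)

# The variance is the square of the standard deviation
print(np.isclose(cost_var, cost_std ** 2))

# The sample variance is the population variance scaled by N / (N - 1)
print(np.isclose(sample_var, cost_var * n / (n - 1)))
```

Both checks print True. For the cost data the sample variance comes out near 8.52e+06, slightly larger than the value that var(cost) returns.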

## Covariance

The covariance between two datasets can be calculated using the cov( ) function. This is illustrated with the sale and cost columns.

[In] covar = cov(sale, cost)

[In] covar
[Out]
array([[  7.52131868e+02,  -1.94000953e+04],
       [ -1.94000953e+04,   8.52066117e+06]])
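The output is a 2 x 2 covariance matrix. The diagonal holds the variance of sale and the variance of cost, and the two off-diagonal entries both hold the covariance between them. cov( ) divides by N - 1 by default, so the cost variance on the diagonal is slightly larger than the result of var(cost), which divides by N. The sketch below, assuming an explicit `import numpy as np`, unpacks the matrix and derives the correlation coefficient from it; the variable names are this example's own.

```python
import numpy as np

sale = np.array([5.4, 6.5, 23.2, 64.2, 0.2, 7.3, 84.3, 5.2, 9.5, 8.4,
                 56.2, 65.3, 6.2, 5.3, 67.3, 9.7, 7.5, 2.3, 5.0, 10.3])
cost = np.array([9050., 3400., 2300., 6030., 5030., 9030., 2040., 1020.,
                 7030., 5023., 1003., 4060., 3090., 6540., 8234., 2349.,
                 9843., 4394., 8924., 8524.])

covar = np.cov(sale, cost)

sale_var = covar[0, 0]       # variance of sale (divides by N - 1)
cost_var = covar[1, 1]       # variance of cost (divides by N - 1)
sale_cost_cov = covar[0, 1]  # covariance; covar[1, 0] holds the same value

# Correlation coefficient: covariance scaled by both standard deviations
r = sale_cost_cov / np.sqrt(sale_var * cost_var)

print(r, np.corrcoef(sale, cost)[0, 1])
```

The hand-computed value matches the one from `np.corrcoef`. It is negative, like the covariance, which indicates that higher sale values tend to go with lower costs in this data.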