Note: if you receive a utf-8 decode error, set encoding = 'latin1' in pd.read_csv().
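A minimal sketch of that fallback, in case it helps. The helper name read_csv_with_fallback and the toy file are made up for illustration; only pd.read_csv and its encoding argument come from the note above.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path):
    """Hypothetical helper: try UTF-8 first, then fall back to latin1."""
    try:
        return pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        # latin1 maps every possible byte to a character, so decoding cannot fail
        return pd.read_csv(path, encoding="latin1")


# Toy file (made up for this sketch) containing a byte that is not valid UTF-8
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.csv")
    with open(path, "wb") as f:
        f.write("Description,UnitPrice\nCAFÉ SIGN,2.55\n".encode("latin1"))
    df = read_csv_with_fallback(path)
```

Keep in mind that latin1 never fails to decode, so if the file is actually in some other encoding you will get wrong characters rather than an error.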
This section might seem a bit tedious to go through. But I think of it as a simulation of the problems one might encounter when dealing with data and other people. Besides, there is a prize at the end (i.e. Section 8).
(But feel free to jump right ahead into Section 8 if you want; it doesn't require that you finish this section.)
UnitPrice
online_rt for CustomerIDs 12346.0 and 12347.0.
To reiterate the question that we were dealing with:
"Create a scatterplot with the Quantity per UnitPrice by CustomerID for the top 3 Countries"
The question is open to a set of different interpretations. We need to disambiguate.
We could do a single plot by looking at all the data from the top 3 countries. Or we could do one plot per country. To keep things consistent with the rest of the exercise, let's stick to the latter option. So that's settled.
But "top 3 countries" with respect to what? Two answers suggest themselves: Total sales volume (i.e. total quantity sold) or total sales (i.e. revenue). This exercise goes for sales volume, so let's stick to that.
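To see that the two readings can really disagree, here is a small sketch. The DataFrame is a toy stand-in for online_rt with made-up rows; only the column names are taken from the exercise.

```python
import pandas as pd

# Toy stand-in for online_rt; the rows are made up for illustration
online_rt = pd.DataFrame({
    "Country": ["UK", "UK", "France", "Germany", "Spain", "France"],
    "Quantity": [10, 5, 7, 3, 1, 2],
    "UnitPrice": [1.0, 1.0, 2.0, 10.0, 50.0, 2.0],
})

# Reading 1: top 3 by sales volume (total quantity sold)
volume = online_rt.groupby("Country")["Quantity"].sum()
top3_by_volume = volume.nlargest(3).index.tolist()

# Reading 2: top 3 by total sales (revenue = Quantity * UnitPrice)
revenue = (online_rt["Quantity"] * online_rt["UnitPrice"]).groupby(online_rt["Country"]).sum()
top3_by_revenue = revenue.nlargest(3).index.tolist()

# We go with sales volume, so keep only the rows from those countries
top3_data = online_rt[online_rt["Country"].isin(top3_by_volume)]
```

With these toy numbers, the volume ranking is UK, France, Germany, while the revenue ranking is Spain, Germany, France. So the choice matters.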
Now that we have the top 3 countries, we can focus on the rest of the problem:
"Quantity per UnitPrice by CustomerID".
We need to unpack that.
The "by CustomerID" part is easy. That means we're going to be plotting one dot per CustomerID on our plot. In other words, we're going to be grouping by CustomerID.
"Quantity per UnitPrice" is trickier. Here's what we know:
One axis will represent a Quantity assigned to a given customer. This is easy; we can just plot the total Quantity for each customer.
The other axis will represent a UnitPrice assigned to a given customer. Remember that a single customer can have any number of orders with different prices, so summing up prices isn't very helpful. Besides, it's not quite clear what we mean when we say "unit price per customer"; it sounds like the price of the customer! A reasonable alternative is to assign each customer the average amount they have paid per item. So let's settle the question that way.
Revenue: calculate the revenue (Quantity * UnitPrice) from each sale.
We will use this later to figure out an average price per customer.
Group by CustomerID and Country and find out the average price (AvgPrice) each customer spends per unit.
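A minimal sketch of this step, using a toy stand-in for online_rt with made-up rows. The name per_customer is my own; Revenue and AvgPrice are the column names used in this section.

```python
import pandas as pd

# Toy stand-in for online_rt; the rows are made up for illustration
online_rt = pd.DataFrame({
    "CustomerID": [1.0, 1.0, 2.0, 2.0],
    "Country": ["France", "France", "Germany", "Germany"],
    "Quantity": [2, 8, 1, 3],
    "UnitPrice": [5.0, 1.0, 4.0, 2.0],
})

# Revenue from each sale
online_rt["Revenue"] = online_rt["Quantity"] * online_rt["UnitPrice"]

# Total quantity and total revenue per customer (and country)
per_customer = online_rt.groupby(
    ["CustomerID", "Country"], as_index=False
)[["Quantity", "Revenue"]].sum()

# Average amount paid per unit: total revenue divided by total quantity
per_customer["AvgPrice"] = per_customer["Revenue"] / per_customer["Quantity"]
```

Note that this is a quantity-weighted average. For the first toy customer it gives 18 / 10 = 1.8, whereas a plain mean of the two unit prices would give 3.0, which overstates what the customer typically pays per item.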
We aren't much better off than when we started. The data are still extremely scattered and don't seem very informative.
But we shouldn't despair! There are two things to realize:
So: we should plot the data regardless of Country and hopefully see a less scattered graph.
CustomerID on a single graph
Did Step 7 give us any insights about the data? Sure! As average price increases, the quantity ordered decreases. But that's hardly surprising. It would be surprising if that weren't the case!
Nevertheless, the drop in quantity is so drastic that it makes me wonder how our revenue changes with respect to item price. It would not be that surprising if it didn't change much. But it would be interesting to know whether most of our revenue comes from expensive or inexpensive items, and what that relation looks like.
That is what we are going to do now.
Group UnitPrice by intervals of 1 for prices [0,50), and sum Quantity and Revenue.
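One way to do this binning is with pd.cut; here is a sketch on a toy stand-in for online_rt with made-up rows. I'm assuming half-open bins [0,1), [1,2), ..., [49,50) to match the [0,50) range above, and the names price_bins and by_price are my own.

```python
import numpy as np
import pandas as pd

# Toy stand-in for online_rt; the rows are made up for illustration
online_rt = pd.DataFrame({
    "Quantity": [10, 4, 6, 1, 2],
    "UnitPrice": [0.5, 0.9, 1.25, 49.99, 75.0],
})
online_rt["Revenue"] = online_rt["Quantity"] * online_rt["UnitPrice"]

# Bin edges 0, 1, ..., 50; right=False makes each bin closed on the left: [0,1), [1,2), ...
price_bins = np.arange(0, 51, 1)
binned = pd.cut(online_rt["UnitPrice"], bins=price_bins, right=False)

# Prices outside [0,50) fall in no bin (NaN) and are dropped by groupby;
# observed=False keeps the empty bins too, so we get one row per interval
by_price = online_rt.groupby(binned, observed=False)[["Quantity", "Revenue"]].sum()
```

Here the row with UnitPrice 75.0 is left out, and by_price has 50 rows, most of them zero for this toy data.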
x-axis needs values.
y-axis isn't that easy to read; show in terms of millions.
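Both fixes can be sketched with matplotlib as below. The revenue numbers are made up, and the names revenue_per_price and millions are my own. The Agg backend line is only there so the sketch runs without a display; in a notebook you can drop it.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook

import matplotlib.pyplot as plt
import pandas as pd
from matplotlib.ticker import FuncFormatter

# Made-up revenue per price interval, for illustration only
revenue_per_price = pd.Series(
    [1_200_000.0, 2_500_000.0, 800_000.0],
    index=["[0, 1)", "[1, 2)", "[2, 3)"],
)


def millions(value, pos):
    """Format a tick value in millions, e.g. 2500000 -> '2.5M'."""
    return f"{value / 1e6:.1f}M"


fig, ax = plt.subplots()
positions = list(range(len(revenue_per_price)))
ax.plot(positions, revenue_per_price.values)

# x-axis: one labelled tick per price interval
ax.set_xticks(positions)
ax.set_xticklabels(revenue_per_price.index)
ax.set_xlabel("Unit price interval")

# y-axis: show revenue in millions instead of raw numbers
ax.yaxis.set_major_formatter(FuncFormatter(millions))
ax.set_ylabel("Revenue (millions)")
```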