Predicting Google Analytics Merchandise Store Sales: Analysis of Google Analytics Data.
Google Analytics is a digital analytics service offered by Google that allows website owners to track and analyse their users’ behaviour. Whereas the free version comes with limited features, the paid version, namely Google Analytics 360, can be linked to Google BigQuery to access granular clickstream data utilising SQL.
Google open-sourced real-world data from its own Google Merchandise Store. This allows us to query data in a customised format from BigQuery and conduct individual analyses based on a typical ecommerce tracking dataset including:
- Traffic source data: information about where website users came across the store. This includes touch points like organic traffic, paid search traffic, social traffic, etc.
- Transactional data: information about transactions and revenue which occurred on the Google Merchandise Store website.
- And much more…
The whole sample dataset consists of one year of data. For this particular analysis, three months (May to July) and only a fraction of all possible features were used to answer the following questions:
- Which proportion of the overall users is actually buying and what is their average spend?
- How are users and revenue distributed geographically?
- Can the shop’s revenue be predicted based on historical data?
Which proportion of overall users is actually buying and what is their average spend?
During the period of analysis, 159,584 users account for a total of 203,317 sessions. These sessions result in 3,248 transactions with a total revenue of 433,574 USD from the Google merchandise sold in the store.
These numbers already tell us that the conversion rate is low: only 1.77% of all users actually buy. That sounds like a small share, but it is in fact quite common for ecommerce shops.
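To make the two figures concrete, here is a minimal sketch of how a user-level conversion rate and average spend could be derived from session rows. The records and the `buyer_summary` helper are invented for illustration and are not taken from the actual analysis.

```python
from statistics import mean, median

# Hypothetical session records: (user_id, revenue_usd). Toy data, not the real dataset.
sessions = [
    ("u1", 0.0), ("u1", 25.0),
    ("u2", 0.0),
    ("u3", 0.0), ("u3", 0.0),
    ("u4", 120.0),
    ("u5", 0.0),
]

def buyer_summary(sessions):
    """Return (conversion_rate, mean_spend, median_spend) at user level."""
    # Sum revenue per user so that a user with several sessions counts once.
    revenue_per_user = {}
    for user_id, revenue in sessions:
        revenue_per_user[user_id] = revenue_per_user.get(user_id, 0.0) + revenue
    # A buyer is any user with positive total revenue.
    buyer_spend = [r for r in revenue_per_user.values() if r > 0]
    conversion_rate = len(buyer_spend) / len(revenue_per_user)
    return conversion_rate, mean(buyer_spend), median(buyer_spend)

rate, avg_spend, med_spend = buyer_summary(sessions)
print(f"{rate:.1%} of users bought; mean {avg_spend:.2f} USD, median {med_spend:.2f} USD")
```

Counting users rather than sessions or transactions matters here: one user can have several sessions and several transactions, so the three possible denominators give different rates.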
Having a closer look at the revenue distribution of buyers, we can clearly see that the majority spends only a small amount of money in the store. The histogram shows that the bucket with the lowest spend is by far the largest, resulting in a right-skewed distribution with a long tail towards high spend. Accordingly, the mean revenue per buyer is about 150 USD whereas the median is close to 50 USD.
The gap between these two values indicates that our dataset contains outliers, which pull the mean upwards in particular. Analysing the boxplot helps us identify one particular user who decided to spend about 41,810 USD in 17 sessions. Identifying what differentiates this “big spender” from regular users might well be worth a closer look in upcoming analyses.
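As a rough illustration of why a single big spender separates mean and median, the sketch below applies the common 1.5 × IQR rule that boxplots typically use to flag outliers. The spend values are invented, and it is an assumption that the boxplot in this analysis follows the same convention.

```python
from statistics import mean, median, quantiles

# Hypothetical spend per buyer in USD. Toy data with one extreme value.
spend = [20, 35, 45, 50, 55, 70, 90, 4000]

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the sample
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

outliers = iqr_outliers(spend)
print("mean:", mean(spend), "median:", median(spend), "outliers:", outliers)
```

The one extreme value lifts the mean far above the median, while the median barely moves, which is the same pattern as in the revenue distribution described above.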
How is revenue distributed geographically for the Google Merchandise Store?
The choropleth map (for an interactive version of the map click here) above lets us quickly identify one country that accounts for almost all of the revenue achieved in the period of analysis. Users from the US are the main revenue driver for the Merchandise Store, with a total of about 415,000 USD. Canada is in second place with a total of about 6,900 USD. This huge gap shows how important the US market is for the Google Merchandise Store, since none of the remaining countries contributes significantly.
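The numbers behind such a map boil down to a sum of revenue per country. A minimal sketch, assuming each session row carries a country and a revenue value (the rows and the function name are hypothetical):

```python
from collections import defaultdict

# Hypothetical (country, revenue_usd) rows. Toy data, not the real dataset.
rows = [
    ("United States", 100.0),
    ("Canada", 10.0),
    ("United States", 50.0),
    ("Germany", 0.0),
    ("Canada", 5.0),
]

def revenue_share_by_country(rows):
    """Return [(country, total_revenue, share_of_total)] sorted by revenue, descending."""
    totals = defaultdict(float)
    for country, revenue in rows:
        totals[country] += revenue
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    # Guard against division by zero when no revenue was recorded at all.
    return [(c, t, t / grand_total if grand_total else 0.0) for c, t in ranked]

ranking = revenue_share_by_country(rows)
for country, total, share in ranking:
    print(f"{country}: {total:.2f} USD ({share:.1%})")
```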
Can the shop’s revenue be predicted based on historical data?
To predict the shop’s daily revenue from the session-level dataset, the previous analysis and some creativity are used to come up with meaningful features for modelling. The features used to tackle this regression problem are:
- Total Pageviews
- Share of Desktop Sessions
- Share of Sessions originating from the US
- Total Sessions
- Average Time On Site
- Share of Bouncing Sessions
- Share of New Visitors
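The features listed above are all daily aggregates of session-level rows. The following sketch shows one way they could be computed; the field names (`pageviews`, `device`, `bounce`, and so on) are illustrative assumptions and do not mirror the actual BigQuery export schema.

```python
from collections import defaultdict

# Hypothetical session rows. Field names and values are invented for illustration.
sessions = [
    {"date": "2017-05-01", "pageviews": 5, "device": "desktop", "country": "United States",
     "time_on_site": 120, "bounce": False, "new_visit": True},
    {"date": "2017-05-01", "pageviews": 1, "device": "mobile", "country": "Canada",
     "time_on_site": 0, "bounce": True, "new_visit": True},
    {"date": "2017-05-02", "pageviews": 3, "device": "desktop", "country": "United States",
     "time_on_site": 60, "bounce": False, "new_visit": False},
]

def daily_features(sessions):
    """Aggregate session rows into one feature dict per day."""
    by_day = defaultdict(list)
    for s in sessions:
        by_day[s["date"]].append(s)

    features = {}
    for day, rows in sorted(by_day.items()):
        n = len(rows)
        features[day] = {
            "total_pageviews": sum(r["pageviews"] for r in rows),
            "share_desktop": sum(r["device"] == "desktop" for r in rows) / n,
            "share_us": sum(r["country"] == "United States" for r in rows) / n,
            "total_sessions": n,
            "avg_time_on_site": sum(r["time_on_site"] for r in rows) / n,
            "share_bounces": sum(r["bounce"] for r in rows) / n,
            "share_new_visitors": sum(r["new_visit"] for r in rows) / n,
        }
    return features

features = daily_features(sessions)
print(features["2017-05-01"])
```

Each day then becomes one row of the training data, with the day's total revenue as the target.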
Two different algorithms, namely Multiple Linear Regression (MLR) and Random Forest Regressor (RF), are tested with the data and compared based on their coefficient of determination (R²). This metric tells us how well each model fits the data. The MLR performs considerably better and should be used for future prediction purposes.
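For readers unfamiliar with the metric, the coefficient of determination compares a model's squared errors with those of a baseline that always predicts the mean. A small sketch with made-up daily revenue values (none of these numbers come from the analysis):

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical daily revenue in USD and two sets of predictions.
actual = [1000, 2000, 3000, 4000]
close_predictions = [1100, 1900, 3200, 3800]
mean_only_predictions = [2500, 2500, 2500, 2500]

r2_close = r_squared(actual, close_predictions)
r2_mean = r_squared(actual, mean_only_predictions)
print(r2_close, r2_mean)
```

A value near 1 means the predictions explain most of the variance in revenue, while 0 means the model is no better than predicting the average every day.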
Nevertheless, the feature importances of the Random Forest help us better understand buyers’ behaviour. Feature importance tells us how much a feature contributes to decreasing the impurity of the resulting subgroups when the trees of a Random Forest are grown. For example, a high feature importance for Total Pageviews implies that the information this feature carries contributes strongly to good predictions.
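To give an intuition for "decreasing the impurity", the sketch below computes the variance reduction of a single split, which is the usual impurity measure for regression trees. It is a simplified illustration with invented numbers, not the Random Forest implementation used in the analysis, which would average such decreases over many splits and trees.

```python
def variance(values):
    """Population variance, used here as the impurity of a group of target values."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def impurity_decrease(feature, target, threshold):
    """Variance of the parent group minus the size-weighted variance of the two children."""
    left = [t for f, t in zip(feature, target) if f <= threshold]
    right = [t for f, t in zip(feature, target) if f > threshold]
    if not left or not right:
        return 0.0  # the split separates nothing
    n = len(target)
    weighted_children = len(left) / n * variance(left) + len(right) / n * variance(right)
    return variance(target) - weighted_children

# Hypothetical daily values. Toy data for illustration only.
revenue = [10, 20, 110, 120]
total_pageviews = [100, 200, 300, 400]    # separates low and high revenue days well
share_new_visitors = [0.5, 0.9, 0.5, 0.9] # barely related to revenue in this toy data

gain_pageviews = impurity_decrease(total_pageviews, revenue, threshold=250)
gain_new_visitors = impurity_decrease(share_new_visitors, revenue, threshold=0.7)
print(gain_pageviews, gain_new_visitors)
```

In this toy example, splitting on pageviews yields far more homogeneous subgroups than splitting on the share of new visitors, so pageviews would receive the higher importance.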
To answer the question posed at the beginning of this section, let us have a closer look at the following graph showing the predicted revenue per day.
While we see that both models are able to capture the trend of the data, it is also obvious that, especially at peak times, e.g. 07/05/2017, predictions are way off from actual revenue. The models built should therefore be used for trend analysis rather than for actual forecasts. To improve the models, more or even better features are needed. Identifying these from the plethora of Google Analytics tracking data might be an interesting task for further blog posts.
To take a closer look at the actual analysis, visit my GitHub repo here.