Google Public Dataset provides an obfuscated dataset of its Merchandise Store. Some of the questions we can ask are:
- Where are the buyers from?
- What kind of platform or technology are the buyers using?
- Can we build a classification model to distinguish between a potential buyer and a window shopper?
As we dive into the data, we will find that most sessions come from the Chrome browser, the Windows operating system, and the US.
Later, we will evaluate different classifiers to identify potential buyers.
The store (shop.googlemerchandisestore.com) sells Google-branded merchandise. If you navigate to the store from the ‘Google Analytics Sample’ information page, the URL is extended with traffic-source parameters such as utm_source and utm_medium.
I believe these correspond to fields in the dataset of interest.
As you browse the store, you will see different kinds of products (jackets, pens, water bottles, etc.), all of them, of course, Google-branded. So it is quite plausible that the buyers are a narrow niche. I can think of a few, e.g. employees and purchasers for internal events. I generally buy such branded items when I’m attending an event like Google I/O.
As mentioned above, the dataset is from Google Public Dataset and contains 12 months of obfuscated records from 2016 to 2017. For this project, I ran a few BigQuery queries in a Jupyter notebook and downloaded the data of interest as CSV files.
Each row in the dataset corresponds to an Analytics 360 session. Some of the fields we will use, with explanations from the schema, are:
- transactions: Total number of ecommerce transactions within the session.
- fullVisitorId: The unique visitor ID (also known as client ID).
- date: The date of the session in YYYYMMDD format.
- timeOnSite: Total time of the session expressed in seconds.
- visits: The number of sessions (for convenience). This value is 1 for sessions with interaction events. The value is null if there are no interaction events in the session.
- medium: The medium of the traffic source. Could be “organic”, “cpc”, “referral”, or the value of the utm_medium URL parameter.
- sessionQualityDim: An estimate of how close a particular session was to transacting, ranging from 1 to 100, calculated for each session. A value closer to 1 indicates a low session quality, or far from transacting, while a value closer to 100 indicates a high session quality, or very close to transacting. A value of 0 indicates that Session Quality is not calculated for the selected time range.
The graph below shows the sum of the number of transactions grouped by year and month. Dec 2016 has the highest number of transactions, perhaps due to the Christmas holidays.
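A minimal sketch of that monthly aggregation with pandas, assuming the sessions are loaded into a data frame with the `date` and `transactions` fields described above. The sample rows are made up for illustration, and treating a null transaction count as zero is an assumption.

```python
import pandas as pd

# Hypothetical sample rows; the real data comes from the downloaded CSV files.
sessions = pd.DataFrame({
    "date": ["20161201", "20161215", "20170105", "20170120"],
    "transactions": [1, 2, None, 1],
})

# Parse the YYYYMMDD strings and derive a year-month period for grouping.
sessions["date"] = pd.to_datetime(sessions["date"], format="%Y%m%d")
sessions["year_month"] = sessions["date"].dt.to_period("M")

# Assumption: a null transaction count means the session had no purchase.
sessions["transactions"] = sessions["transactions"].fillna(0)

# Sum of transactions per year and month, in chronological order.
monthly = sessions.groupby("year_month")["transactions"].sum()
print(monthly)
```

Calling `monthly.plot(kind="bar")` on the result would give a bar chart similar in spirit to the one described here.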
What is the proportion of visitors making transactions?
Total visitors = 714167
Visitors with transactions sum greater than zero = 10022
Proportion of visitors making a transaction = 10022 / 714167 ≈ 0.014 (about 1.4%)
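The figure can be checked directly from the two counts above. This is only the arithmetic, not the query that produced the counts:

```python
# Counts taken from the text above.
total_visitors = 714167
visitors_with_transactions = 10022

# Share of visitors with at least one transaction.
proportion = visitors_with_transactions / total_visitors
print(f"{proportion:.4f} ({proportion:.2%})")
```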
Where are the buyers from?
The graphs below show the continents, countries, cities, regions, and metro areas, sorted by the highest total e-commerce transactions.
As you can see, for this dataset the top continent is the Americas and the top country is the US. Some of the graphs show ‘not available in dataset’; perhaps those values are obfuscated.
The average transactions across these features are below:
The two groups of graphs above show different rankings. Anguilla is an outlier, though, as only one session originated from that location.
However, let’s compare Atwater with New York. The ratio of transacting sessions to total sessions is higher for Atwater than for New York, although Atwater’s absolute numbers are very small.
Atwater, total sessions = 8
sessions with total transactions > 0 = 2
proportion = 0.25 (25%)
New York, total sessions = 26371
sessions with total transactions > 0 = 1507
proportion ≈ 0.0571 (about 5.7%)
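The comparison can be laid out as a small table. This sketch uses only the counts quoted above; the column names are my own and not from the dataset schema.

```python
import pandas as pd

# Session counts quoted in the text; column names are illustrative.
city_stats = pd.DataFrame(
    {"sessions": [8, 26371], "sessions_with_transactions": [2, 1507]},
    index=["Atwater", "New York"],
)

# Share of sessions that included at least one transaction.
city_stats["conversion_rate"] = (
    city_stats["sessions_with_transactions"] / city_stats["sessions"]
)
print(city_stats)
```

With only 8 sessions, a single extra purchase would move Atwater's rate by 12.5 percentage points, which is why the ranking by average is much noisier than the ranking by total.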
What are the browser platforms used by visitors?
There are about 54 unique browsers with sessions in these 12 months, but only a few generate substantial session traffic.
The graph for the top 10 browsers is:
Most of the traffic is from Chrome.
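Counting distinct values and ranking them by session count is a one-liner in pandas. A minimal sketch, assuming a `browser` column with one row per session (the sample values are invented); the same pattern would apply to operating systems, sources, and mediums.

```python
import pandas as pd

# Hypothetical sample; the real column has about 54 distinct browsers.
sessions = pd.DataFrame(
    {"browser": ["Chrome", "Safari", "Chrome", "Firefox", "Chrome", "Safari"]}
)

# Number of distinct browsers seen across all sessions.
n_browsers = sessions["browser"].nunique()

# Session counts per browser, largest first, limited to the top 10.
top_browsers = sessions["browser"].value_counts().head(10)
print(n_browsers)
print(top_browsers)
```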
What are the operating systems that visitors are using?
There are about twenty unique operating systems among all the sessions. Windows has the highest session count, followed by Mac.
What are the traffic sources?
There are about 274 unique sources, including google, baidu, (direct), phandroid.com…
The graph for the top 50 is as follows:
What are the mediums of the traffic source?
Medium is the general category of the source traffic, e.g. ‘cpc (cost per click)’, ‘direct’, ‘referral’.
Some of the mediums for Google Analytics are defined as follows:
- organic: unpaid search
- cpc: cost per click, i.e. paid search
- cpm: cost per thousand impressions
- referral: a web referral
- none: direct traffic has a medium of ‘none’
Most of the sessions were direct traffic, followed by referral and organic.
Classifying visitors as buyers
Each row in the data corresponds to a session, but we are interested in classifying visitors. To achieve this, the sessions are grouped by fullVisitorId and aggregated on selected features to create a new data frame with one row per visitor.
The features selected are timeOnSite, visits, medium, country, browser, transactions, and sessionQualityDim.
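A minimal sketch of this grouping step, assuming pandas and invented sample rows. The aggregation chosen for each feature (sum, first, max) is an assumption made for illustration; the aggregations actually used are not stated here.

```python
import pandas as pd

# Hypothetical session rows; visitor "a" has two sessions.
sessions = pd.DataFrame({
    "fullVisitorId": ["a", "a", "b", "c"],
    "timeOnSite": [120.0, 300.0, None, 45.0],
    "visits": [1, 1, 1, 1],
    "medium": ["organic", "referral", "(none)", "cpc"],
    "country": ["United States", "United States", "India", "Canada"],
    "browser": ["Chrome", "Chrome", "Safari", "Firefox"],
    "transactions": [None, 2, None, None],
    "sessionQualityDim": [10, 80, 1, 5],
})

# One row per visitor. The aggregation per column is an assumption:
# totals are summed, categoricals take the first value, quality takes the max.
visitors = sessions.groupby("fullVisitorId").agg(
    timeOnSite=("timeOnSite", "sum"),
    visits=("visits", "sum"),
    medium=("medium", "first"),
    country=("country", "first"),
    browser=("browser", "first"),
    transactions=("transactions", "sum"),
    sessionQualityDim=("sessionQualityDim", "max"),
).reset_index()
print(visitors)
```

Note that `sum` skips nulls, so a visitor with no recorded transactions ends up with a total of 0.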
The categorical features country and browser have a large number of values. To prevent the feature set from exploding, the low-frequency values with respect to transactions are collapsed into ‘Other’. These are then further transformed into dummy variables along with the medium.
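The collapsing and encoding steps could look roughly like this. The helper `collapse_rare` and its `top_n` cutoff are hypothetical; the rule actually used to decide which values count as low-frequency is not specified here.

```python
import pandas as pd

def collapse_rare(values: pd.Series, transactions: pd.Series, top_n: int) -> pd.Series:
    """Keep the top_n categories by total transactions; relabel the rest 'Other'."""
    totals = transactions.groupby(values).sum()
    keep = totals.nlargest(top_n).index
    return values.where(values.isin(keep), "Other")

# Hypothetical visitor-level rows.
visitors = pd.DataFrame({
    "country": ["United States", "United States", "Canada", "Anguilla", "India"],
    "browser": ["Chrome", "Chrome", "Safari", "Firefox", "Opera"],
    "medium": ["organic", "referral", "(none)", "cpc", "organic"],
    "transactions": [3, 1, 1, 0, 0],
})

# Collapse the long tail of countries and browsers (top_n=2 is arbitrary).
visitors["country"] = collapse_rare(visitors["country"], visitors["transactions"], top_n=2)
visitors["browser"] = collapse_rare(visitors["browser"], visitors["transactions"], top_n=2)

# One-hot encode the categorical columns, including medium.
features = pd.get_dummies(visitors, columns=["country", "browser", "medium"])
print(features.columns.tolist())
```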
The transactions column is then binarized by clipping values greater than one to 1, so that it marks whether a visitor bought anything. The data frame is normalized afterward.
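A minimal sketch of both steps, assuming pandas. The text does not say which normalization was used, so min-max scaling to the range [0, 1] is an assumption here, and the sample values are invented.

```python
import pandas as pd

# Hypothetical visitor-level rows.
visitors = pd.DataFrame({
    "timeOnSite": [0.0, 420.0, 45.0, 900.0],
    "visits": [1, 2, 1, 5],
    "transactions": [0, 2, 0, 7],
})

# Binarize the target: any count above one becomes 1.
visitors["transactions"] = visitors["transactions"].clip(upper=1)

# Assumed min-max scaling of the numeric features.
# A constant column would divide by zero and produce NaN.
for col in ["timeOnSite", "visits"]:
    col_min = visitors[col].min()
    col_max = visitors[col].max()
    visitors[col] = (visitors[col] - col_min) / (col_max - col_min)
print(visitors)
```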
As stated earlier, the dataset is unbalanced: only a very small proportion of visitors are buyers. Therefore, undersampling is used so that the models can fit properly.
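One simple way to undersample is to draw a random subset of non-buyers the same size as the buyer group. This sketch assumes random undersampling to a 1:1 ratio; the exact method and ratio are not stated here, and the data is synthetic.

```python
import pandas as pd

# Synthetic visitor table: 2 buyers and 18 non-buyers.
visitors = pd.DataFrame({
    "timeOnSite": range(20),
    "transactions": [1, 1] + [0] * 18,
})

buyers = visitors[visitors["transactions"] == 1]
non_buyers = visitors[visitors["transactions"] == 0]

# Assumption: keep as many randomly chosen non-buyers as there are buyers.
non_buyers_sample = non_buyers.sample(n=len(buyers), random_state=42)

# Combine and shuffle so the classes are interleaved.
balanced = pd.concat([buyers, non_buyers_sample]).sample(frac=1, random_state=42)
print(balanced["transactions"].value_counts())
```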
The graph below shows the scores of different models in classifying the visitor data:
Except for GaussianNB, the models score close to each other.
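A comparison like this can be sketched with scikit-learn's cross-validation helper. Only GaussianNB and AdaBoostClassifier are named in this article; the other two models, the synthetic data from `make_classification`, and the use of accuracy as the score are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the balanced visitor features and buyer labels.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Candidate models; LogisticRegression and RandomForestClassifier are
# illustrative additions, not necessarily the ones behind the graph.
models = {
    "GaussianNB": GaussianNB(),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```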
The graph on the left is the confusion matrix for AdaBoostClassifier.
For this model, I believe higher recall is more important, because we would not want to waste the opportunity to focus marketing efforts on potential buyers. Misidentifying a window shopper as a buyer would be less harmful.
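To make the trade-off concrete, here is how recall and precision fall out of the four confusion-matrix cells. The labels below are invented for illustration and are not the AdaBoostClassifier results.

```python
# Hypothetical labels: 1 = buyer, 0 = window shopper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # buyers correctly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # buyers missed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # window shoppers flagged as buyers
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # window shoppers correctly ignored

# Recall: share of actual buyers the model catches.
recall = tp / (tp + fn)
# Precision: share of flagged visitors who are actual buyers.
precision = tp / (tp + fp)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

A missed buyer (a false negative) lowers recall, while a window shopper flagged as a buyer (a false positive) lowers precision. Preferring recall means accepting more of the second kind of error to reduce the first.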