Clickstream Clustering

Through clickstream analysis, we can answer critical questions like

  1. Who buyers are?
  2. What they are trying to accomplish?
  3. What goals drive their behaviour?
  4. How they think?
  5. How they buy?
  6. What are the underlying patterns of their thinking and buying?
  7. Why they make buying decisions?
  8. What are the major behavioural categories?
  9. Which behavior is more prevalent?
  10. What’s the relationship between different types of behavior?

Challenge 2: Segmenting Buyer Persona

Solution: Unsupervised Clickstream Clustering We will build an unsupervised system to capture dominating user behaviours from clickstream data (traces of users’ click events) and visualize the detected behaviours in an intuitive manner. Our model will identify “clusters” of similar users by partitioning a similarity graph (nodes are users; edges are weighted by clickstream similarity). The partitioning process leverages iterative feature pruning to capture the natural hierarchy within user clusters and produce intuitive features for visualizing and understanding captured user behaviours. Additionally, these clusters are transformed into a text that describes in natural language the main characteristics of the buyer. Both Cluster Analysis (Divisive Hierarchical Clustering) and RFM Analysis would be leveraged for segmenting the buyer persona.

Solution 1: Pattern Analysis

Association Rules

Association rules are an important class of regularities in data. Its objective is to find all co-occurrence relationships, called associations, among data items.

Apriori Algorithm 

The Apriori algorithm works in two steps:

Step 1: Generate all frequent itemsets: A frequent itemset is an itemset that has transaction support above minsup.

Step 2: Generate all confident association rules from the frequent itemsets: A confident association rule is a rule with confidence above minconf.

The Apriori algorithm relies on the Apriori or downward closure property to efficiently generate all frequent itemsets. Downward closure property: If an itemset has minimum support, then every non-empty subset of this itemset also has minimum support.

MSApriori Algorithm

The key operation in MSApriori algorithm is the sorting of the items in I in ascending order of their Minimum item Support (MIS) values. This order is fixed and used in all subsequent operations of the algorithm. The items in each itemset follow this order.

Sequential Patterns

Association rule mining does not consider the order of transactions. However, in many applications such orderings are significant.

There exist several sequential pattern mining algorithms. Some of the classic algorithms for sequential pattern mining are PrefixSpan, Spade, SPAM, and GSP. However, in the recent decade, several novel and more efficient algorithms have been proposed such as CM-SPADE and CM-SPAM (2014), FCloSM and FGenSM (2017), to name a few.  Besides, numerous algorithms have been proposed for extensions of the problem of sequential pattern mining such as finding the sequential patterns that generate the most profit (high utility sequential pattern mining). A few extensions are discussed below for your ready reference.

  • Discovering thetop-k sequential rules.   The idea is to discover the k most frequent rules in a dataset having at least a confidence no less than minconf. For example, a user may specify that he wants to find the top 1000 rules having a confidence of at least 75 %. Some algorithms for this task are TopSeqRules and
  • Discovering sequential rules with a window size constraint. This algorithm let the user find rules of the form X -> Y where X and Y must be close to each other with respect to time. For example, a user may want to find rules appearing whithin three consecutive itemsets in sequences. This is interesting for example for analyzing sequence of web clicks. An algorithm for this task is TRuleGrowth.
  • Discovering high-utility sequential rules. Another extension is to discover rules where items may be annotated with quantities in sequences and each item may have a unit profit. For example, we may have a sequence where a customer bought three breads, then two apples and two bottle of milk and these items may have some unit profit of 1$, 2$ and 1.50$. The goal of high-utility sequential rule miningis to find rules that generate a high profit and have a high confidence (high-utility rules).

The right algorithm/s for association rules and sequential pattern mining would be selected after a thorough understanding of the business and data.

Solution 2: Segmenting Buyer Persona

Clickstream Clustering

At a high level, we will map users into a similarity graph where nodes are users and edges are weighed on the similarity between user clickstreams. We partition the similarity graph to identify clusters of users with similar clickstream activities.

Formatting User Clickstream

For each user, we will gather all her click events to form a single clickstream. It is a sequence of discrete events sorted by the order of arrival, capturing both different event types in the stream and the magnitude of time gaps between them.

Clickstream Similarity Graph

The clustering algorithm would be based on similarity graph, where each node is a user and edges are weighted on similarity between two users’ clickstreams. The behavioural clusters would be identified by partitioning the similarity graph.

Iterative Feature Pruning

To capture the hierarchical structure, we will recursively partition newly generated clusters, while pruning the feature set used to measure clickstream similarity. Intuitively, by identifying and pruning dominating features in higher-level clusters, we will allow the secondary features to manifest and discover more fine-grained sub-clusters. Also, the pruned features are indicative of why this cluster is formed, which can help service providers to understand the behavioural model.

Cluster Visualization

We will visualize clusters to examine and understand user behavioural clusters generated by our algorithm.

RFM Analysis

RFM (Recency, Frequency, Monetary) groups customers based on their transaction history – how recently, how often and how much did they buy. RFM helps divide customers into various categories or clusters to identify customers for future personalization services.