Data Warehousing and Database: Association Analysis

Every weekend when I go to Kroger for groceries, these are the must-have items you will see in my shopping cart.

1) Milk

2) Bread

3) Egg

4) Spinach

5) Carrot

6) Beans

7) Onion

8) Tomatoes

9) Oranges

10) Grapes

11) Organic bananas

12) Meat

13) Fish

I am pretty sure the majority of people buy much the same things. So, if we can group things, how are we going to group them? Can we keep similar things in the same aisle to make the customer's life easier? Wait! Outside my normal buying routine, I would like to buy something different. It is winter and I want a lip balm. I searched the aisle for lip balm and found Burt's Bees organic lip balm. Great! Suddenly, I saw an organic moisturizer in the aisle to the right and an eye liner in the aisle to the left. Cool. I picked up both products along with my lip balm and put them in my basket.

With every purchase, a customer builds up a pattern of which products are bought together. Using these patterns, it is easy to understand customers' shopping behavior, optimize product layout, and cross-sell the right product to the right customer. This is market basket analysis.

Association analysis discovers associations between objects within a given business context: how frequently and how strongly things happen together. The goal is to recommend another item when a customer picks one.

Terminologies in Association Analysis

1) Rules describe the patterns found in your data. A rule is a statement such as "If it sounds like a cat and looks like a cat, it is (probably) a cat."

When a customer buys a mobile phone, he or she also buys a case for it 80% of the time. This behavior is found in 30% of all purchases.

Here the mobile phone is the leading item and the case is the depending item. 80% is the confidence and 30% is the support.

2) A predicate is a simple condition (such as "sounds like a cat") that describes the value of one of the attributes of the objects being analyzed, or the presence (or absence) of a product in a shopping basket.

E.g.:
mobile phone = Existing (in the shopping basket)
case for mobile phone = Existing (in the shopping basket)

3) A predicate that participates in a rule is called an item. Consequently, a set of such predicates is called an itemset. A rule can therefore be described as a pair of itemsets: one on the left-hand side (the condition) and one on the right-hand side (the conclusion). For example, the eye liner from the left aisle could be the condition and the moisturizer from the right aisle the conclusion. So what about a bundle of lip balm, moisturizer and eye liner?
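To make the item / itemset / rule vocabulary concrete, here is a minimal sketch in Python. The `Rule` class, its method names and the basket contents are my own illustrative assumptions; they are not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    lhs: frozenset  # condition itemset (left-hand side)
    rhs: frozenset  # conclusion itemset (right-hand side)

    def applies_to(self, basket):
        """True when every condition item is present in the basket."""
        return self.lhs <= set(basket)

    def holds_in(self, basket):
        """True when condition and conclusion items are all present."""
        return (self.lhs | self.rhs) <= set(basket)


# Hypothetical rule and basket built from the lip balm story.
rule = Rule(frozenset({"lip balm"}), frozenset({"moisturizer", "eye liner"}))
basket = {"lip balm", "moisturizer", "eye liner", "milk"}

print(rule.applies_to(basket), rule.holds_in(basket))
```

Each string plays the role of a predicate of the form "item = Existing", so checking a rule against a basket is just a subset test.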

A recommendation engine should be able to recommend items that a customer is likely to purchase, based on previous purchases by the same customer.

Source: Amazon
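A rule-based recommender along these lines can be sketched in a few lines of Python. This is only a toy under stated assumptions: the `RULES` list and the `recommend` function are hypothetical, and apart from the 80% phone-case figure used earlier, the confidence values are made up for illustration.

```python
# (condition itemset, recommended item, confidence)
RULES = [
    (frozenset({"mobile phone"}), "case for mobile phone", 0.80),
    (frozenset({"lip balm"}), "moisturizer", 0.50),               # made-up value
    (frozenset({"lip balm", "moisturizer"}), "eye liner", 0.30),  # made-up value
]


def recommend(basket, rules=RULES, top_n=3):
    """Suggest items whose rule condition is already satisfied by the basket."""
    basket = set(basket)
    best = {}
    for lhs, item, conf in rules:
        # Fire the rule only if its condition is met and the item is not yet bought.
        if lhs <= basket and item not in basket:
            best[item] = max(conf, best.get(item, 0.0))
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [item for item, _ in ranked[:top_n]]


print(recommend({"mobile phone", "lip balm"}))
```

The suggestions are ordered by confidence, so the phone case comes before the moisturizer for this basket.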

NB: When building the mining model, association analysis needs no case-level columns. You do need to add a nested table, and the measure group dimension is added as the field to predict. There is no need to add further nested table columns. The percentage of data held out for testing is 0. The nested table (which describes the purchases for each transaction) must be both input and predictable.

3) Confidence - It measures how dependent one item is on another.

Eg:

On Thanksgiving Day, 500,000 transactions took place on Amazon. Of those, 15,000 included an Echo Dot and 25,000 included a TP-Link smart plug. 10,000 transactions contained both an Echo Dot and a TP-Link smart plug.

Confidence of Echo Dot -> TP-Link smart plug = items sold together / Echo Dot transactions = 10,000 / 15,000 = 0.6667, or 66.67%

Confidence of TP-Link smart plug -> Echo Dot = items sold together / TP-Link smart plug transactions = 10,000 / 25,000 = 0.40, or 40%

"When people buy an Echo Dot, they also buy a TP-Link smart plug 66.67% of the time."
"When people buy a TP-Link smart plug, they also buy an Echo Dot 40% of the time."
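The two confidence figures can be checked with a short Python sketch. The function and variable names are my own; the counts are the ones from the Thanksgiving example.

```python
ECHO = 15_000   # transactions containing an Echo Dot
PLUG = 25_000   # transactions containing a TP-Link smart plug
BOTH = 10_000   # transactions containing both


def confidence(both, antecedent):
    """Confidence of A -> B: share of A's transactions that also contain B."""
    return both / antecedent


conf_echo_to_plug = confidence(BOTH, ECHO)
conf_plug_to_echo = confidence(BOTH, PLUG)

print(f"{conf_echo_to_plug:.2%}", f"{conf_plug_to_echo:.2%}")
```

Note that confidence is directional: swapping the leading and depending items changes the denominator and therefore the result.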

4) Support - It measures how often items occur together: support = items sold together / total transactions = 10,000 / 500,000 = 2%

An Echo Dot and a TP-Link smart plug are bought together in 2% of all transactions.

Support is used to measure the popularity of an itemset. The support of an itemset {A, B} is the total number of transactions that contain both A and B, and is defined as follows:
Support({A, B}) = NumberOfTransactions(A, B)
Dividing this count by the total number of transactions gives the percentage form used in the 2% example.
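Here is a small Python sketch of support, both as a raw count and as a fraction of all transactions. The five toy transactions and the function names are assumptions for illustration only.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t))


def support_fraction(itemset, transactions):
    """Support expressed as a share of all transactions."""
    return support_count(itemset, transactions) / len(transactions)


# Toy data, not real sales figures.
transactions = [
    {"echo dot", "smart plug"},
    {"echo dot"},
    {"smart plug", "milk"},
    {"echo dot", "smart plug", "bread"},
    {"milk", "bread"},
]

pair = {"echo dot", "smart plug"}
print(support_count(pair, transactions), support_fraction(pair, transactions))
```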

Minimum Support is a threshold parameter you can specify before processing
an association model. It means that you are interested in only the itemsets and
rules that represent at least minimum support of the data set. The parameter
Minimum Support is used to restrict the itemset, but not rules. It represents the number of cases for the frequency threshold of an itemset.
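To see what a minimum support threshold does, the sketch below enumerates itemsets by brute force and keeps only those that reach the threshold. It is a simplified illustration, not the Microsoft Association Rules algorithm; `frequent_itemsets`, `max_size` and the toy transactions are my own assumptions, and the threshold is a number of cases as described above.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in at least min_support cases."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= min_support:
                result[frozenset(candidate)] = count
    return result


# Toy data for illustration.
toy = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread"},
]

frequent = frequent_itemsets(toy, min_support=2)
for itemset, count in frequent.items():
    print(sorted(itemset), count)
```

Raising `min_support` shrinks the list quickly, which is why setting it too low produces a very long list of itemsets.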

5) Lift - The lift ratio is the ratio of confidence to expected confidence. Expected confidence is the confidence you would see if A had no influence on B, which is simply the frequency of B across all transactions. Lift tells us how much better a rule is at predicting the result than just assuming the result in the first place.
lift(A -> B) = confidence(A -> B) / (NumberOfTransactions(B) / TotalTransactions)
It is a measure of the strength of an effect. In the example, the confidence of Echo Dot -> TP-Link smart plug is 66.67% and the TP-Link smart plug appears in 25,000 / 500,000 = 5% of all transactions, so lift = 66.67 / 5 = 13.33.

When a customer buys an Echo Dot, he or she is 13.33 times as likely to buy a TP-Link smart plug as a customer picked at random.

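If lift is well above 1, the rule captures a strong association and acting on it has a higher business impact.

If lift is close to 1 or below it, the items are bought together no more often than chance would suggest, and the business impact is lower.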
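Using the standard definition, lift(A -> B) = confidence(A -> B) / P(B), the Thanksgiving numbers can be checked with a short Python sketch. The function and variable names are my own.

```python
TOTAL = 500_000  # all transactions
ECHO = 15_000    # transactions containing an Echo Dot
PLUG = 25_000    # transactions containing a TP-Link smart plug
BOTH = 10_000    # transactions containing both


def lift(both, a, b, total):
    """Lift of A -> B: confidence divided by the overall frequency of B."""
    confidence = both / a
    expected_confidence = b / total
    return confidence / expected_confidence


echo_to_plug = lift(BOTH, ECHO, PLUG, TOTAL)
plug_to_echo = lift(BOTH, PLUG, ECHO, TOTAL)

print(round(echo_to_plug, 2), round(plug_to_echo, 2))
```

Unlike confidence, lift is symmetric: both directions give about 13.33 here.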

6) Itemsets- one or more items sold together. An itemset is a set of items. Each item is an attribute value. In the case of market basket analysis, an itemset would contain a set of products such as cake, Pepsi, and milk. Each itemset has a size, which is the number of items contained in the itemset. The size of itemset {Cake, Pepsi, Milk} is 3. The popularity threshold for an itemset is defined using support.

7) Importance - It is the Microsoft version of lift. The higher the importance, the more valuable the rule. Importance is also called the interesting score (or the lift in some literature).

Importance can be used to measure itemsets and rules.
The importance of an itemset is defined using the following formula:
Importance ({A,B}) = Probability (A, B)/(Probability (A)* Probability (B))

If importance = 1, A and B are independent items. It means that the purchase of product A and the purchase of product B are two independent events. If importance < 1, A and B are negatively correlated, which means that if a customer buys A, it is unlikely he or she will also buy B. If importance > 1, A and B are positively correlated, which means that if a customer buys A, it is very likely he or she also buys B.

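For rules, importance is measured on a logarithmic scale instead: Importance(A => B) = log(Probability(B | A) / Probability(B | not A)). On that scale, an importance of 0 means that there is no association between A and B. A positive importance score means that the probability of B goes up when A is true. A negative importance score means that the probability of B goes down when A is true.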
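Both readings of importance can be tried on the Thanksgiving numbers. The itemset score follows the formula given above. The rule score is a log ratio, log(P(B | A) / P(B | not A)), which is the form that produces a zero-centered scale; the base-10 logarithm and the function names are my assumptions, so treat this as a sketch rather than the exact Microsoft computation.

```python
import math

TOTAL, ECHO, PLUG, BOTH = 500_000, 15_000, 25_000, 10_000


def itemset_importance(both, a, b, total):
    """P(A, B) / (P(A) * P(B)); equals 1 when A and B are independent."""
    return (both / total) / ((a / total) * (b / total))


def rule_importance(both, a, b, total):
    """log10 of P(B | A) / P(B | not A); equals 0 when A does not affect B."""
    p_b_given_a = both / a
    p_b_given_not_a = (b - both) / (total - a)
    return math.log10(p_b_given_a / p_b_given_not_a)


score_itemset = itemset_importance(BOTH, ECHO, PLUG, TOTAL)
score_rule = rule_importance(BOTH, ECHO, PLUG, TOTAL)

print(round(score_itemset, 2), round(score_rule, 2))
```

The itemset score is compared against 1 and the rule score against 0, so the two numbers should not be read on the same scale.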

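8) Probability - It is the Microsoft term for confidence. The probability of a rule A => B is the support of {A, B} divided by the support of {A}. If the probability is 1, B appears in every transaction that contains A.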

Interpreting the model

After the association model is processed, you can browse the contents of the model using the Association Rules viewer. This viewer contains three tabs:

Itemsets, Rules, and Dependency Network.

The Itemsets tab displays the frequent itemsets discovered by the Microsoft Association Rules algorithm. The main part of the screen is a grid that shows the list of frequent itemsets and their supports and sizes. If Minimum Support is set too low, this list can be quite long. The Itemsets view includes drop-down lists that enable you to filter these itemsets based on support and itemset size. You can also use the Filter Itemset drop-down option to filter the itemsets. For example, you could select only the itemsets that contain Product=Echodot.

The Rules tab displays the qualified association rules. The main part of the tab is the rule grid. It displays all qualified rules, their probabilities, and their importance scores. The importance score is designed to measure the usefulness of a rule. The higher the importance score, the better the quality of the rule. Similar to the Itemsets tab, the Rules tab contains some drop-down lists and text fields for filtering rules. For example, you can select all rules that contain Product=Echodot.

The third tab of the Association Rules viewer is the Dependency Network view. Each node represents an item and each edge represents a pairwise association rule. The slider is associated with the importance score. By default, the view displays up to 60 nodes.

CONCLUSION

A very efficient rule should satisfy the following conditions:

The itemset should exceed the minimum support, which is set according to the business need.

It should exceed the minimum confidence.

It should have a high lift ratio (confidence / expected confidence) or a high importance.

A probability of 1 means the rule holds in every case.

Use either confidence and support, or importance, to interpret results. Never mix support with importance, or confidence with importance.
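The conditions above can be applied mechanically. The sketch below scores every pairwise rule in a handful of toy transactions and keeps only the rules that clear all three thresholds. The data, the threshold values and the `pairwise_rules` function are illustrative assumptions, and support is expressed here as a fraction of transactions.

```python
from itertools import permutations


def pairwise_rules(transactions, min_support, min_confidence, min_lift):
    """Return (A, B, support, confidence, lift) for rules A -> B passing all thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    kept = []
    for a, b in permutations(items, 2):
        both = count({a, b})
        if both == 0:
            continue
        support = both / n
        confidence = both / count({a})
        lift = confidence / (count({b}) / n)
        if support >= min_support and confidence >= min_confidence and lift >= min_lift:
            kept.append((a, b, support, confidence, lift))
    return kept


# Toy data for illustration.
toy = [
    {"flour", "milk"},
    {"flour", "milk", "eggs"},
    {"milk"},
    {"eggs"},
    {"flour", "milk"},
]

good_rules = pairwise_rules(toy, min_support=0.4, min_confidence=0.8, min_lift=1.0)
for a, b, s, c, l in good_rules:
    print(f"{a} -> {b}: support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

With these thresholds only flour -> milk survives; milk -> flour has the same support and lift but falls short on confidence.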

A marketer would look for rules with a high lift ratio, high confidence and good support. For example,

IF flour, root vegetables and whipped/sour cream are purchased THEN whole milk is also purchased. This rule has 100% confidence, which suggests it is a very promising rule for the business, provided its support and lift are also adequate.

A high probability combined with a high importance marks the strongest recommendations.
Pick itemsets with high support to find bundles that have a high chance of being sold together.

Association analysis and clustering are data mining tasks that do not have a target (dependent) variable. Affinity analysis is another term for association analysis and is typically used for market basket analysis (MBA), although association analysis can be used in other areas of study. MBA is essentially analyzing which items tend to be purchased together, that is, which items have an affinity with other items. Clustering algorithms, likewise having no target variable, attempt to put records into groups based on the records' attributes. The critical concept is similarity: records within a cluster are very similar to each other and not similar to those in another cluster.
Note: because these data mining tasks do not have a target variable, their corresponding models cannot be used for prediction in the usual supervised sense. They are therefore often exploratory in nature, and their results can be used downstream in predictive models.

Saturday, December 16, 2017
