Data Warehousing and Database: Data Mining Clustering

The Microsoft Clustering algorithm builds clusters of entities as it processes the training data set. Once the clusters are created, the algorithm analyzes the makeup of each cluster. It looks at the values of each attribute for the entities in the cluster.

Clustering

is used to identify natural groupings in data

has no predefined groups, but learns by observation

has no target or outcome to predict

discovers a useful summary of the data from a collection of unclassified objects and a distance metric

inputs can be attributes such as age, sex, status, income, marital status, number of cars owned, etc.

k-means, or iterative distance-based clustering, is an effective algorithm for extracting clusters from a training set. k is the total number of clusters to find. The core of k-means is finding the center of each cluster.
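The two steps that k-means repeats, assigning each point to its nearest center and then moving each center to the mean of its points, can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: the function name `kmeans`, its parameters, the random choice of starting centers, and the toy data are all made up for the example and are not taken from any particular product or library.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random (one common choice).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centers are stable
        labels = new_labels
        # Update step: move each center to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Hypothetical toy data: two visibly separate groups of (x, y) points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)
```

On data this cleanly separated, the first three points end up sharing one label and the last three the other. On real data the result can depend on the starting centers, so it is common to run the algorithm several times with different seeds.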

Examples

Behavioral segmentation:

Segment by purchase history in an online website

Segment by activities on a website, or platform

Define personas based on interests of an individual

Find whether a person is a homeowner or not

Inventory categorization:

Group inventory by sales activity
Group inventory by manufacturing metrics

Sorting sensor measurements:

Group images based on activity
Separate audio based on sound
Monitor health based on physical activity

Example

I would like to start several home-delivery Indian organic cuisine shops in Ypsilanti, United States. I want to analyze the possible challenges before starting the new business. Which areas do I need to analyze?
1) The areas from which Indian organic cuisines are ordered frequently.
2) How many Indian organic cuisine shops have to be opened to cover delivery in this specific area (depending on the number of customers).
3) The locations for the Indian organic cuisine shops within these areas (to keep a reasonable distance between each shop and its delivery points).
4) The locations where I can buy organic products to make a variety of cuisines.

Shop_ID	Cuisines	Customers
1	8	81
2	10	70
3	6	63
4	4	45
5	15	76
6	3	49

The table shows the number of cuisines and the number of customers for each shop. Using k-means, I am going to calculate the ranges of cuisines and customers.

Find the optimal centers, then cluster the shops together again by distance to those centers.
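To make that step concrete, here is one round of "assign to the nearest center, then recompute the centers" on the shop table above. It is a sketch under assumptions: the choice of the first three shops as starting centers is arbitrary, and distances are taken on the raw values, so the customer counts (which span a wider range than the cuisine counts) dominate. The grouping therefore depends on those choices and is not claimed to reproduce the ranges reported in this article.

```python
import math

# Shop_ID -> (cuisines, customers), copied from the table above.
shops = {1: (8, 81), 2: (10, 70), 3: (6, 63), 4: (4, 45), 5: (15, 76), 6: (3, 49)}

# Assumption: start with the first three shops as centers (any three would do).
centers = [shops[1], shops[2], shops[3]]

# Assign every shop to its nearest center (Euclidean distance on raw values).
groups = {j: [] for j in range(len(centers))}
for shop_id, point in shops.items():
    nearest = min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))
    groups[nearest].append(shop_id)

# Recompute each center as the mean of the shops assigned to it.
new_centers = []
for j, members in groups.items():
    pts = [shops[s] for s in members]
    new_centers.append(tuple(sum(c) / len(pts) for c in zip(*pts)))

print(groups)
print(new_centers)
```

Repeating these two steps until the groups stop changing gives the final clusters for that starting point. Rescaling both columns to a comparable range before measuring distance is a common way to keep one attribute from dominating.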

I found 3 clusters; their ranges are given in the table below.

Cluster	Cuisines	Customers
C1	0-5	45-50
C2	6-8	63-83
C3	10-15	70-75

I would go with 6-8 cuisines, as that gives me a good number of customers (63-83). It is easy to make 6-8 cuisines well. Because C2 has a good number of customers, there won't be much food wastage, and I believe I can make a profit.

Always use the smallest number of clusters whose results you can easily interpret.