Data mining (knowledge discovery in databases, KDD): the use of mathematical algorithms to sift through detailed data and identify patterns, correlations, and clusters within it. It is the act of excavating data so that patterns can be extracted, that is, the process of analyzing data to find hidden patterns using automated methods. Data mining is often referred to by other names, such as machine learning, knowledge discovery in databases (KDD), or predictive analytics. Strictly speaking, data mining is limited to the discovery of patterns, whereas predictive analytics applies those patterns to new data to impute (or predict) unknown values.
Another definition of data mining:
"Data mining is the process of discovering meaningful new correlations, patterns and trends by 'mining' large amounts of stored data using pattern recognition technologies, as well as statistical and mathematical techniques." (Ashby, Simms (1998))
Some facts about data mining
1) Data mining works much like a topographic map. It reveals connections within the data that may not be apparent to a human observer.
2) The more data we have to extrapolate from, the better our chance of making a correct prediction.
3) We can use the patterns found in our data to predict what will happen next. Data mining may find patterns in the way our clients use our services. Based on these patterns, we can predict which clients may need additional services in the future.
4) Where data warehousing provides an enterprise with a memory, data mining provides it with intelligence and the power of prediction. The data warehouse is the usual source for data mining because it holds clean, verified data. Data mining helps you understand your data and the correlations in it that apply to your line of business.
5) There are several data mining add-ins for Excel. Standard Excel allows direct interaction with the raw data, but it shows only facts, not hidden relationships or patterns. Excel has data mining add-ins such as Table Analysis Tools and Data Mining Client.
So, what does data mining do, and why do you need it?
An enormous amount of data is generated every day and stored in various databases. Much of it comes from business software, such as financial applications, enterprise resource planning (ERP) systems, customer relationship management (CRM) systems, and logs from web servers or even from the database servers hosting the data. So what are companies doing with this huge volume of data? This is where data mining comes in. Data mining extracts knowledge from this volume of data, which is what makes the data useful. Data mining techniques can be applied in virtually all business applications to answer many types of business questions.
Typical business problems for Data Mining
1) Recommendation generation - Eg: Amazon
2) Anomaly detection - Eg: credit card companies
3) Churn analysis - Eg: Comcast vs AT&T: which customers are likely to switch from Comcast to AT&T?
4) Risk management - Eg: the chance that an individual repays a loan granted by a loan officer
5) Customer segmentation - determines the behavioral and descriptive profiles of your customers. These profiles are then used to provide personalized marketing programs and strategies that are appropriate for each group.
6) Targeted ads - Eg: ads displayed on websites based on an individual's interests
7) Forecasting - Eg: chance of rain next week
Data mining tasks
Determining the data mining task is the most important step. Next, select an appropriate algorithm for that task. Then build the data mining model: create a mining structure by selecting data from the sources and apply the algorithm to that data. This step is called training. After training, validate the model, interpret the results, and deploy them according to the business decision, the type of issue, the type of task, and so on.
Descriptive: Clustering, Association
Predictive: Classification, Prediction, Forecasting
Descriptive: No initial knowledge is needed; the algorithms do all the work. Descriptive mining tells you what there is to discover, but its results cannot be used directly to make decisions.
Clustering
Used to group similar data together into a set of cohesive clusters. It is one of the most popular analysis techniques.
Examples:
Buying recommendations for customers on the Amazon website
Demographics of social media
Finding the characteristics of a person based on social media behavior: https://www.technologyreview.com/s/427744/psychologists-use-social-networking-behavior-to-predict-personality-type/
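To make the grouping idea concrete, here is a minimal sketch of k-means on one-dimensional data. It is an illustration only, not the algorithm any particular product uses; the function name `kmeans_1d`, the starting centers, and the spend figures are all assumptions made up for this example.

```python
def kmeans_1d(values, centers, iterations=10):
    """Assign each value to its nearest center, then move each center to its group's mean."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for v in values:
            # Index of the closest current center.
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters


# Hypothetical yearly spend per customer.
yearly_spend = [12, 15, 14, 90, 95, 100]
centers, clusters = kmeans_1d(yearly_spend, centers=[0, 50])
print(clusters)
```

With these numbers the low spenders and the high spenders end up in separate groups, which is the kind of segmentation the examples above rely on.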
Association analysis
Uncovers hidden patterns, correlations, or other structures among a set of items or objects.
Eg: cross-selling / market basket analysis (placing the beer aisle near the diaper aisle in a supermarket; items frequently bought together)
Other examples: Multiple items could be grouped together in a single sales transaction. Multiple services could be provided to a single family unit. Multiple classes could be taken by a student.
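The market basket idea can be sketched in a few lines: count how often pairs of items appear in the same transaction, and measure how often one item accompanies another. This is a simplified illustration, not a full association rules implementation; the helper names `pair_support` and `confidence` and the baskets are invented for the example.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Count how many transactions contain each unordered pair of items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def confidence(transactions, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in transactions if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


# Hypothetical sales transactions.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "diapers"],
    ["beer", "chips"],
]
pairs = pair_support(baskets)
print(pairs.most_common(2))
print(confidence(baskets, "diapers", "beer"))
```

Pairs with a high count are candidates for "frequently bought together"; the confidence value says how reliable a rule such as "diapers implies beer" is in this data.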
Predictive: The ability to classify, forecast, or predict things based on historical data.
Classification is used
to predict the value for a discrete attribute, meaning an attribute
that has one of a set number of distinct values. Regression, on the other hand, is used
to predict a continuous value. Like classification, regression also looks at relationships between the value being predicted and other continuous values available in the data.
Classification
The act of assigning a category to a given entity we examine.
Example:
All 5 fruits on the left side are apples, so we decide that the unknown fruit on the right side is also an apple.
Another example:
What percentage of women will continue in their dream job after marriage?
Criteria: supportive husband, supportive family, good health, passion for the job => continues job: Yes
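The fruit example can be sketched as a nearest-neighbour vote: the unknown item gets the category that most of its closest known examples carry. Nearest-neighbour voting is used here only because it is easy to read; it is not one of the algorithms listed later in this post, and the features (weight, diameter) and values are assumptions.

```python
from collections import Counter


def classify(unknown, labelled, k=3):
    """Label `unknown` with the majority label among its k nearest labelled examples."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(labelled, key=lambda item: squared_distance(item[0], unknown))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features: (weight in grams, diameter in cm).
known_fruits = [
    ((150, 7.0), "apple"),
    ((160, 7.5), "apple"),
    ((155, 7.2), "apple"),
    ((120, 3.5), "banana"),
    ((118, 3.4), "banana"),
]
label = classify((152, 7.1), known_fruits)
print(label)
```

Because the three closest known fruits are all apples, the unknown fruit is assigned to the apple category.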
Regression
A predictive technique that estimates the value of a given variable (where values are continuous) in terms of other variables.
Example
Predict the sales amount of winter clothes in November based on climate, Thanksgiving sales, and gender.
If the weather is cold and snowy and the Thanksgiving sales fall in the third week of November, women will buy women's winter clothes and men will buy men's winter clothes, so sales will rise in November.
Algorithm selection table
Task | Algorithms to use
Predicting a discrete attribute. For example, predict whether a targeted customer will buy a product (Amazon’s “customers who bought this item also bought”). | Decision Trees Algorithm, Naive Bayes Algorithm, Clustering Algorithm, Neural Network Algorithm
Predicting a continuous attribute. For example, forecast sales for the next 3 consecutive years. | Decision Trees Algorithm, Time Series Algorithm
Predicting a sequence. For example, analyze the web traffic of a particular Web site. | Sequence Clustering Algorithm
Finding groups of common items in transactions. For example, use market basket analysis to suggest additional products to a customer for purchase (Amazon’s “Frequently bought together”). | Association Algorithm, Decision Trees Algorithm
Finding groups of similar items. For example, segment demographic data into groups to better understand the relationships between attributes. | Clustering Algorithm, Sequence Clustering Algorithm
Data Mining Algorithms
1) Microsoft Decision Trees - it creates a tree structure during its training process.
Example
The main purpose of the Microsoft Decision Trees algorithm is classification. In the example above, a person's income depends on age as well as on status (student or not). If age is under 30, income is low; if age is 31-40, income is high. If the person is a student, the tree checks age further to determine income, since some students earn income by working outside study hours. A more detailed representation is given below.
2) Microsoft Linear Regression: A specialized implementation of the Microsoft Decision Trees algorithm. This algorithm is used to model a linear relationship between two numeric variables. If we know the value of one variable, called the independent variable, we can predict the value of the other variable, called the dependent variable.
Microsoft Linear Regression algorithm can only be used for Regression.
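The relationship between an independent and a dependent variable can be sketched with ordinary least squares. This shows the general technique, not Microsoft's implementation; `fit_line` and the advertising and sales figures are assumptions for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising spend (independent) and sales (dependent).
ad_spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]
slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 5 + intercept  # predict the dependent value for a new input
print(slope, intercept, predicted_sales)
```

Once the line is fitted, knowing the independent variable is enough to estimate the dependent one, which is exactly the use described above.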
3) Microsoft Naïve Bayes: The Naïve Bayes algorithm can only be used for classification. It clearly shows the differences in a particular variable for various data elements. It looks at each attribute of the entity in question and determines how that attribute, on its own, affects the attribute we are looking to predict. It does not consider combinations of attributes. (The Microsoft Naive Bayes algorithm is a classification algorithm that is quick to build and works well for predictive modeling. The algorithm supports only discrete or discretized attributes, and it considers all the input attributes to be independent, given the predictable attribute.)
Example: Predict whether a customer is a good credit risk.
Looking at the diagram above, what Naïve Bayes says is that we should never extend credit to small companies and should always extend credit to large companies.
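The "each attribute on its own" idea can be sketched as a small categorical naive Bayes classifier. This is a generic textbook version with add-one smoothing, not the Microsoft implementation; the attributes (company size, company age) and the training rows are invented.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count label frequencies and, per label, the values seen for each attribute."""
    label_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, label) -> value counts
    vocab = defaultdict(set)             # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return label_counts, value_counts, vocab


def predict(model, row):
    """Pick the label with the highest P(label) * product of P(value | label)."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total
        for i, value in enumerate(row):
            # Add-one smoothing so an unseen combination does not zero out the score.
            score *= (value_counts[(i, label)][value] + 1) / (count + len(vocab[i]))
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data: (company size, company age) -> credit risk.
rows = [("large", "old"), ("large", "new"), ("small", "old"),
        ("small", "new"), ("large", "old"), ("small", "new")]
labels = ["good", "good", "bad", "bad", "good", "bad"]
model = train_naive_bayes(rows, labels)
print(predict(model, ("large", "new")))
```

Each attribute contributes its own factor to the score, and no factor looks at combinations of attributes, which is the independence assumption mentioned above.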
4) Microsoft Clustering: Groups data into clusters of entities with similar characteristics. It builds clusters of entities as it processes the training data set. Once the clusters are created, the algorithm analyzes the makeup of each cluster by looking at the values of each attribute for the entities in the cluster. The main purpose of the Microsoft Clustering algorithm is segmentation.
Example: A daily life representation
NB: Clustering inside Mining structures in SSAS
Cluster Profiles provides too much information, and Cluster Diagram provides too little, but together they provide the topology of your cluster model. The Cluster Profiles view displays a column for each cluster in your model and a row for each attribute. This setup makes it easy to see interesting differences across the cluster space. Using this view, you can choose an attribute of interest and visually scan horizontally to see its distribution across all clusters. When an item catches your interest, you can look at neighboring cells or other cells of the same cluster to learn more about what that cluster means. The Cluster Profiles view displays everything in your model in a manner that is easy to see. Binary and continuous attributes are particularly easy to discern, as are discrete attributes with a small number of states. Clicking any cell in the grid provides details on that cell's contents in the mining legend. Exploring your clusters through the Cluster Profiles view is a good way to find a starting point for further exploration.
Clusters that are similar show the strongest links in the Cluster Diagram. One method for picking a cluster is to determine which clusters have the strongest link and choose one of them; another method is to pick a cluster that seems far removed from the rest.
The Cluster Characteristics view describes the characteristics of the cluster cases by displaying attributes in order of decreasing probability.
The bars of the Cluster Discrimination view indicate which cluster an attribute favors. A bar does not indicate that the other cluster lacks the attribute.
How do you determine the number of clusters to choose?
To accomplish this, go to the Cluster Diagram view and determine which clusters are close to the cluster of interest. If no links to the cluster are very strong, it is probably safe to stop.
5) Microsoft Association Rules: Helps identify relationships between various elements. We must have entities that are grouped into sets within our data. The algorithm creates its own sets of entities and then determines how often those sets occur in the training data set. The Microsoft Association Rules algorithm can only be used for association.
6) Microsoft Sequence Clustering: Groups or clusters data based on a sequence of previous events. It examines the training data set to identify transitions from one state to another. The training data set contains data such as a web log showing navigation from one page to another, or routing and approval data showing the path taken to approve each request. The algorithm uses the training data set to determine, as a ratio, how many times each of the possible paths is taken. The main purposes of the Microsoft Sequence Clustering algorithm are sequence analysis and segmentation.
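The "ratio of paths taken" step can be sketched by counting page-to-page transitions in a web log. This covers only the transition counting, not the clustering that the full algorithm performs afterwards; the function name `transition_ratios` and the log entries are assumptions.

```python
from collections import Counter, defaultdict


def transition_ratios(sessions):
    """For each page, the fraction of its observed next steps that go to each page."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Pair every page with the page visited immediately after it.
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return {
        page: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for page, nexts in counts.items()
    }


# Hypothetical web log: one list of visited pages per session.
web_log = [
    ["home", "products", "cart"],
    ["home", "products", "home"],
    ["home", "about"],
    ["home", "products", "cart"],
]
ratios = transition_ratios(web_log)
print(ratios["home"])
```

In this log, three of the four moves away from the home page go to the products page, so that path gets a ratio of 0.75.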
Example:
Consider a group of people who share similar demographic information and who buy similar products from an Indian grocery store. This group of people (North Indians / South Indians) represents a cluster of data.
More example present here:
7) Microsoft Time Series: The Microsoft Time Series algorithm is used for analyzing and predicting time-dependent data. It can only be used for regression.
8) Microsoft Neural Network: Seeks to uncover non-intuitive
relationships in data. It creates a web of nodes that connect inputs derived from attribute values to a final output. The main purposes of the Microsoft Neural Network algorithm are Classification and Regression. (The Microsoft Neural Network algorithm uses a gradient method to optimize parameters of multilayer networks to predict multiple attributes. It can be used for classification of discrete attributes as well as regression of continuous attributes.)
Example: Text mining
9) Microsoft Logistic Regression Algorithm: Determines the relationship between columns in order to evaluate the probability that a column will contain a specific state. It is a special form of the Microsoft Neural Network algorithm. Logistic regression is used to model situations where there is one of two possible outcomes: a customer will or will not buy a given product; a person will or will not develop a certain medical condition. The Microsoft Logistic Regression algorithm uses a neural network to model the influence of a number of factors on a true/false outcome. The magnitude of the various influences is weighed to determine which factors are the best predictors of a given outcome. The “logistic” part of the name comes from a mathematical transformation, called a logistic transformation, that is used to minimize the effect of extreme values in the model. The Microsoft Logistic Regression algorithm can be used for regression.
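The logistic transformation itself is short enough to show directly. The sketch below uses the standard logistic function; the weights, bias, and factor values are made-up numbers, and `purchase_probability` is a hypothetical helper rather than part of any product API.

```python
import math


def logistic(z):
    """Squash a real number into the range between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def purchase_probability(factors, weights, bias):
    """Weighted sum of the factors, passed through the logistic transformation."""
    z = bias + sum(w * x for w, x in zip(weights, factors))
    return logistic(z)


# Hypothetical model: two factors with assumed weights.
probability = purchase_probability([1, 0], weights=[2.0, -1.0], bias=-2.0)
print(probability)
print(logistic(5), logistic(10))  # large inputs flatten out close to 1
```

Very large inputs all land close to 1 and very small ones close to 0, which is how the transformation damps the effect of extreme values on a true/false outcome.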
Usage of data mining structure
The first step in doing any mining is to select the data being mined. We select a data
source: either relational data or an OLAP cube. Once this is done, we need to select a
table, if this is a relational data source, or a dimension, if this is an OLAP data source.
Finally, we must select the table columns or dimension attributes to be used as the data
columns for our data mining.
Data Mining Model
The data mining model combines the data columns with a data mining algorithm. In
addition, we must determine how each data column should be used by the data mining
algorithm. This process determines how the data mining algorithm functions and what
it predicts.
Data Column Usage
Key: The key is the unique identifier for a table or a dimension.
Input: These columns are used by the data mining algorithm when making a prediction.
Predict Only: A predict only is a data column whose value is being predicted by the data mining algorithm.
Ignore: This data column is not used by the data mining algorithm.
Predict: A predict is a data column whose value is being predicted by the data mining algorithm. This column can also be used as an input column.
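The five usages can be modelled as a small lookup to show which columns end up as inputs and which are predicted. This is an illustrative sketch only; the `Usage` enum, `split_columns`, and the column names are hypothetical and do not correspond to any SSAS API.

```python
from enum import Enum


class Usage(Enum):
    KEY = "Key"
    INPUT = "Input"
    PREDICT = "Predict"
    PREDICT_ONLY = "Predict Only"
    IGNORE = "Ignore"


def split_columns(usage_by_column):
    """Return (input columns, predicted columns) for a column-to-usage mapping."""
    # Predict columns count as inputs too; Predict Only columns do not.
    inputs = [c for c, u in usage_by_column.items() if u in (Usage.INPUT, Usage.PREDICT)]
    predicted = [c for c, u in usage_by_column.items()
                 if u in (Usage.PREDICT, Usage.PREDICT_ONLY)]
    return inputs, predicted


# Hypothetical customer table.
columns = {
    "CustomerId": Usage.KEY,
    "Age": Usage.INPUT,
    "Income": Usage.PREDICT,
    "WillBuy": Usage.PREDICT_ONLY,
    "Notes": Usage.IGNORE,
}
inputs, predicted = split_columns(columns)
print(inputs, predicted)
```

The only difference between Predict and Predict Only is visible in the first list: a Predict column also feeds the algorithm as an input, while a Predict Only column never does.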