Data Warehousing and Database: November 2017

Data mining (knowledge discovery in databases, KDD): the use of mathematical algorithms to sift through detailed data and identify patterns, correlations, and clusters within it. It is the act of excavating the data so that patterns can be extracted, and the process of analyzing data to find hidden patterns using automatic methodologies. Data mining is often referred to by other terms, such as machine learning, knowledge discovery in databases (KDD), or predictive analytics. Strictly speaking, data mining is limited to the discovery of patterns, whereas predictive analytics applies those patterns to new data to impute (or predict) unknown values.

Another definition of data mining

"Data mining is the process of discovering meaningful new correlations, patterns and trends by 'mining' large amounts of stored data using pattern recognition technologies, as well as statistical and mathematical techniques." (Ashby, Simms (1998))

Some facts about data mining

1) Data mining works much like a topographic map: it makes connections within the data that may not be plain to the human observer.

2) The more data we have to extrapolate from, the better our chance of making a correct prediction.

3) We can use the patterns found in our data to predict what will happen next. Data mining may find patterns in the way our clients make use of our services. Based on these patterns, we can predict which clients may need additional services in the future.

4) While data warehousing provides an enterprise with a memory, data mining provides it with intelligence and the power of prediction. The data warehouse is the natural source for data mining, because the data in it is already clean and verified. Data mining is how you come to understand the data and the correlations in it that apply to your line of business.

5) There are several data mining add-ins for Excel. Standard Excel allows direct interaction with the raw data and can show facts, but it cannot show hidden relationships or patterns. Add-ins such as the Table Analysis Tools and the Data Mining Client fill that gap.

So, what does data mining do, and why do you need it?

An enormous amount of data is generated daily and stored in various databases. Much of this data comes from business software, such as financial applications, enterprise resource planning (ERP) systems, and customer relationship management (CRM) systems, and from server logs produced by web servers or even by the database servers hosting the data. So what are companies doing with this huge volume of data? This is where data mining becomes important. Data mining extracts knowledge from the data, which is what makes the data useful. Data mining techniques can be used in virtually all business applications and can answer several types of business questions.

Typical business problems for Data Mining

1) Recommendation generation - Eg: Amazon

2) Anomaly detection - Eg: credit card companies

3) Churn analysis - Eg: Comcast vs AT&T: which customers are likely to switch from Comcast to AT&T?

4) Risk management - Eg: the chance that an individual will repay a loan granted by a loan officer

5) Customer segmentation - Eg: determining behavioral and descriptive profiles for your customers. These profiles are then used to provide personalized marketing programs and strategies that are appropriate for each group.

6) Targeted ads - Eg: ads displayed on websites based on the interests of the individual

7) Forecasting - Eg: chance of rain next week

Data mining tasks

Determining the data mining task is the most important step. Next, select an appropriate algorithm to perform that task. Then build the data mining model: create a mining structure by selecting columns from the data sources and apply the algorithm to that data. This step is called training. After training, validate the model, interpret the results, and deploy them based on the business decision, the type of issue, the type of task, and so on.

Descriptive	Predictive
Clustering	Classification
Association	Prediction
	Forecasting

Descriptive: No initial knowledge is needed; the algorithms do all the work. Descriptive tasks show what there is to discover in the data, but their results cannot be used directly to make decisions.

Clustering

Clustering is used to group similar data together to form a set of cohesive clusters. It is one of the most popular analysis techniques.

Example:

Buying recommendations for customers in Amazon website

Demographics of social media

Finding characteristics of a person based on social media behavior: https://www.technologyreview.com/s/427744/psychologists-use-social-networking-behavior-to-predict-personality-type/
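To make the grouping idea concrete, here is a minimal k-means sketch in plain Python. It is a generic textbook version, not the algorithm any particular product uses, and the customer points (age, monthly spend) and starting centroids are invented for illustration.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # An empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical customers: (age, monthly spend). Values are made up.
customers = [(22, 40), (25, 35), (24, 50), (58, 210), (61, 190), (63, 220)]
centroids, clusters = kmeans(customers, centroids=[customers[0], customers[3]])
```

With these toy values the younger, low-spend customers end up in one cluster and the older, high-spend customers in the other. Real data would normally need the attributes scaled first so that one column does not dominate the distance.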

Association analysis

Uncovers hidden patterns, correlations, or other structures among a set of items or objects.

Eg: cross selling / market basket analysis

(Keeping beer aisle near diaper aisle in a shopping market)

(Items frequently bought together)

Other examples: Multiple items could be grouped together in a single sales transaction. Multiple services could be provided to a single family unit. Multiple classes could be taken by a student.
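Market basket analysis usually boils down to two numbers per item pair: support (how often the pair appears together) and confidence (how often the second item appears given the first). The sketch below computes both for item pairs only. The baskets are invented, and `pair_rules` is a hypothetical helper name, not part of any library.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for every
    item pair that appears in at least min_support of the transactions."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(set(t)), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support >= min_support:
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

# Hypothetical baskets, made up for illustration.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["milk", "diapers"],
    ["beer", "chips"],
]
rules = pair_rules(baskets)
```

Here every basket containing chips also contains beer, so the rule chips -> beer has confidence 1.0, while beer -> chips is weaker because beer is also bought without chips.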

Predictive: Ability to classify/forecast/predict things based on historical perspective.

Classification is used to predict the value for a discrete attribute, meaning an attribute that has one of a set number of distinct values. Regression, on the other hand, is used to predict a continuous value. Like classification, regression also looks at relationships between the value being predicted and other continuous values available in the data.

Classification

Act of assigning a category to a given entity we examine.

Example:

All 5 fruits on the left side are apples, so we decide that the unknown fruit on the right-hand side is also an apple.

Another example:

What percentage of women will continue in their dream job after marriage?

Criteria: supportive husband, supportive family, good health, passion for the work => continue job: Yes

Regression

A predictive technique that estimates the value of a given variable (where values are continuous) in terms of other variables.

Example

Predict the sales amount of winter clothes in the month of November based on climate, Thanksgiving sales, and gender.

If the weather is very cold and snowy and the Thanksgiving sales take place in the third week of November, women will buy women's winter clothes and men will buy men's winter clothes, so sales will go up in November.

Algorithm selection table

Task	Algorithms to use
Predicting a discrete attribute. For example, predict whether a targeted customer will buy a product. (Amazon’s “Customers who bought this item also bought”)	Decision Trees Algorithm, Naive Bayes Algorithm, Clustering Algorithm, Neural Network Algorithm
Predicting a continuous attribute. For example, forecast sales for the next 3 consecutive years.	Decision Trees Algorithm, Time Series Algorithm
Predicting a sequence. For example, analyze the web traffic of a particular Web site.	Sequence Clustering Algorithm
Finding groups of common items in transactions. For example, use market basket analysis to suggest additional products to a customer for purchase. (Amazon’s “Frequently bought together”)	Association Algorithm, Decision Trees Algorithm
Finding groups of similar items. For example, segment demographic data into groups to better understand the relationships between attributes.	Clustering Algorithm, Sequence Clustering Algorithm

Data Mining Algorithms

1) Microsoft Decision Trees: creates a tree structure during its training process.

Example

The main purpose of the Microsoft Decision Trees algorithm is Classification. In the example above, the income of a person depends on age as well as on status, that is, whether the person is a student. If age is under 30, income is low; if age is 31-40, income is high. For a student, the tree checks the age further to identify the income, since some students earn an income by working outside study hours. A more detailed representation is given below.

Source: https://www.edureka.co/blog/decision-trees/
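One common way a tree decides which attribute to split on first is information gain: pick the attribute whose split leaves the purest groups. The sketch below is a generic, ID3-style illustration and should not be read as how Microsoft Decision Trees scores splits internally. The rows and column names (`age_band`, `student`, `income`) are invented to mirror the age/student/income example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_split(rows, attributes, target):
    """Attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))

# Hypothetical training rows, made up for illustration.
people = [
    {"age_band": "<=30", "student": "yes", "income": "low"},
    {"age_band": "<=30", "student": "no", "income": "low"},
    {"age_band": "31-40", "student": "no", "income": "high"},
    {"age_band": "31-40", "student": "yes", "income": "high"},
    {"age_band": "<=30", "student": "yes", "income": "low"},
    {"age_band": "31-40", "student": "no", "income": "high"},
]
root_attribute = best_split(people, ["age_band", "student"], "income")
```

In this toy data the age band separates low from high income perfectly, so it becomes the root of the tree, and the student flag would only matter further down.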

2) Microsoft Linear Regression: A specialized implementation of the Microsoft Decision Trees algorithm. This algorithm is used to model a linear relationship between two numeric variables. If we know the value of one variable, called the independent variable, we can predict the value of the other variable, called the dependent variable.

Microsoft Linear Regression algorithm can only be used for Regression.

Example:

Frozen yogurt sales can be predicted using linear regression. Source: https://docs.microsoft.com/en-us/azure/machine-learning/studio/algorithm-choice
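The relationship between an independent and a dependent variable can be sketched with ordinary least squares for a single input. This is the textbook formula rather than the Microsoft implementation, and the numbers are invented; using temperature as the independent variable is an assumption made for the example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(x, slope, intercept):
    """Predicted dependent value for a new independent value."""
    return slope * x + intercept

# Hypothetical data: temperature vs. cups of frozen yogurt sold.
temperatures = [10, 15, 20, 25, 30]
cups_sold = [20, 30, 40, 50, 60]
slope, intercept = fit_line(temperatures, cups_sold)
forecast = predict(35, slope, intercept)
```

Because the toy data lies exactly on a line, the fit recovers it exactly. Real sales data would scatter around the line, and the fit would minimize the squared distance to it.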

3) Microsoft Naïve Bayes: The Naïve Bayes algorithm can only be used for Classification. It clearly shows the differences in a particular variable for various data elements. It looks at each attribute of the entity in question and determines how that attribute, on its own, affects the attribute we are looking to predict. It does not consider combinations of attributes. (The Microsoft Naive Bayes algorithm is a classification algorithm that is quick to build and works well for predictive modeling. The algorithm supports only discrete or discretized attributes, and it considers all the input attributes to be independent, given the predictable attribute.)

Example: Predict whether a customer is a good credit risk.

Looking at the diagram above, what Naïve Bayes tells us is that we should never extend credit to small companies and should always extend credit to large companies.
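The "each attribute on its own" idea can be sketched as a small categorical Naive Bayes classifier. This is a generic version with add-one smoothing, not the SSAS implementation, and the company records and column names (`size`, `age`, `risk`) are invented for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Count how often each attribute value occurs within each class."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)  # (attribute, class) -> value counts
    distinct_values = defaultdict(set)   # attribute -> values seen in training
    for row in rows:
        for attribute, value in row.items():
            if attribute == target:
                continue
            value_counts[(attribute, row[target])][value] += 1
            distinct_values[attribute].add(value)
    return class_counts, value_counts, distinct_values

def classify(model, case):
    """Return normalized class probabilities for one case."""
    class_counts, value_counts, distinct_values = model
    total = sum(class_counts.values())
    scores = {}
    for cls, class_count in class_counts.items():
        score = class_count / total  # prior probability of the class
        for attribute, value in case.items():
            seen = value_counts[(attribute, cls)][value]
            # Add-one smoothing so an unseen value never gives zero.
            score *= (seen + 1) / (class_count + len(distinct_values[attribute]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: score / norm for cls, score in scores.items()}

# Hypothetical credit history, made up for illustration.
companies = [
    {"size": "large", "age": "established", "risk": "good"},
    {"size": "large", "age": "established", "risk": "good"},
    {"size": "large", "age": "new", "risk": "good"},
    {"size": "small", "age": "established", "risk": "bad"},
    {"size": "small", "age": "new", "risk": "bad"},
    {"size": "small", "age": "new", "risk": "bad"},
]
model = train_naive_bayes(companies, target="risk")
probabilities = classify(model, {"size": "small", "age": "new"})
```

Each attribute contributes its own factor to the score and the factors are simply multiplied, which is exactly the independence assumption described above. With smoothing the model leans strongly toward "bad" for a small, new company without ever saying "never".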

4) Microsoft Clustering: Groups entities that share similar characteristics. It builds clusters of entities as it processes the training data set. Once the clusters are created, the algorithm analyzes the makeup of each cluster, looking at the values of each attribute for the entities in the cluster. The main purpose of the Microsoft Clustering algorithm is Segmentation.

Example: A daily life representation

NB: Clustering inside Mining structures in SSAS

Cluster Profiles provides too much information, and Cluster Diagram provides too little, but together they provide the topology of your cluster model. The Cluster Profiles view displays a column for each cluster in your model and a row for each attribute. This setup makes it easy to see interesting differences across the cluster space. Using this view, you can choose an attribute of interest and visually scan horizontally to see its distribution across all clusters. When an item catches your interest, you can look at neighboring cells or other cells of the same cluster to learn more about what that cluster means. The Cluster Profiles view displays everything in your model in a manner that is easy to see. Binary and continuous attributes are particularly easy to discern, as are discrete attributes with a small number of states. Clicking any cell in the grid shows details about it in the mining legend. Exploring your clusters through the Cluster Profiles view is a good way to find a starting point for further exploration.

Clusters that are similar show the strongest links in the Cluster Diagram. One method for picking a cluster is to determine which clusters have the strongest link and choose one of them; another method is to pick a cluster that seems far removed from the rest.

The Cluster Characteristics view describes the characteristics of the cluster cases by displaying attributes in order of decreasing probability.

The bars of the Cluster Discrimination view indicate which cluster the attribute favors. They do not indicate that the other cluster doesn’t contain the attribute.

How to determine the number of clusters to choose?

To accomplish this, go to the Cluster Diagram view and determine which clusters are close to the cluster of interest. If no links to the cluster are very strong, it is probably safe to stop.

5) Microsoft Association Rules: Helps identify relationships between various elements. To use it, we must have entities that are grouped into sets within our data. The algorithm creates its own sets of entities and then determines how often those sets occur in the test data set. The Microsoft Association Rules algorithm can only be used for Association.

6) Microsoft Sequence Clustering: Groups or clusters data based on a sequence of previous events. It examines the test data set to identify transitions from one state to another. The test data set contains data such as a web log showing navigation from one page to another, or routing and approval data showing the path taken to approve each request. The algorithm uses the test data set to determine, as a ratio, how many times each of the possible paths is taken. The main purposes of the Microsoft Sequence Clustering algorithm are Sequence Analysis and Segmentation.
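The "ratio of paths taken" part can be sketched by counting page-to-page transitions in a web log. The sketch below covers only that transition-counting step, not the clustering of sequences, and the sessions and page names are invented.

```python
from collections import Counter, defaultdict

def transition_ratios(sessions):
    """For each state, the fraction of times each next state follows it."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # Pair every page with the page visited immediately after it.
        for current, following in zip(pages, pages[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for state, nexts in counts.items()
    }

# Hypothetical web log sessions, made up for illustration.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "home"],
    ["home", "about"],
    ["products", "cart", "checkout"],
]
ratios = transition_ratios(sessions)
```

In this toy log, two out of three visits to the home page continue to the products page, and every visit to the cart continues to checkout.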

Example:

Consider a group of people who share similar demographic information and who buy similar products from an Indian grocery store. This group of people (for example, North Indians or South Indians) represents a cluster of data.

More examples are available here:

https://docs.microsoft.com/en-us/sql/analysis-services/data-mining/microsoft-clustering-algorithm

7) Microsoft Time Series: The Microsoft Time Series algorithm is used for analyzing and predicting time-dependent data. It can only be used for Regression.

8) Microsoft Neural Network: Seeks to uncover non-intuitive relationships in data. It creates a web of nodes that connect inputs derived from attribute values to a final output. The main purposes of the Microsoft Neural Network algorithm are Classification and Regression. (The Microsoft Neural Network algorithm uses a gradient method to optimize parameters of multilayer networks to predict multiple attributes. It can be used for classification of discrete attributes as well as regression of continuous attributes.)

Example: Text mining

9) Microsoft Logistic Regression Algorithm:

Determines the relationship between columns in order to evaluate the probability that a column will contain a specific state. It is a special form of the Microsoft Neural Network algorithm. Logistic regression is used to model situations where there is one of two possible outcomes: a customer will or will not buy a given product; a person will or will not develop a certain medical condition. The Microsoft Logistic Regression algorithm uses a neural network to model the influence of a number of factors on a true/false outcome. The magnitude of the various influences is weighed to determine which factors are the best predictors of a given outcome. The “logistic” part of the name comes from a mathematical transformation, called a logistic transformation, that is used to minimize the effect of extreme values in the model. The Microsoft Logistic Regression algorithm can be used for Regression.
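The logistic transformation itself is short enough to show. The sketch below squashes any weighted sum of factors into a probability between 0 and 1, which is why extreme input values have a limited effect. The weights, bias, and feature values are invented rather than learned from data, and `buy_probability` is a hypothetical helper name.

```python
import math

def logistic(z):
    """Logistic transformation: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def buy_probability(features, weights, bias):
    """Probability of the 'true' outcome for one case."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical weights for two factors, e.g. past purchases and site visits.
weights = [0.8, 0.3]
bias = -2.0
likely_buyer = buy_probability([3, 4], weights, bias)    # z = 1.6
unlikely_buyer = buy_probability([0, 1], weights, bias)  # z = -1.7
```

A score of zero maps to a probability of exactly 0.5. Very large positive or negative scores flatten out near 1 and 0 instead of growing without bound.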

Usage of data mining structure

The first step in doing any mining is to select the data being mined. We select a data
source: either relational data or an OLAP cube. Once this is done, we need to select a
table, if this is a relational data source, or a dimension, if this is an OLAP data source.
Finally, we must select the table columns or dimension attributes to be used as the data
columns for our data mining.

Data Mining Model

The data mining model combines the data columns with a data mining algorithm. In
addition, we must determine how each data column should be used by the data mining
algorithm. This process determines how the data mining algorithm functions and what
it predicts.

Data Column Usage

Key: The key is the unique identifier for a table or a dimension.

Input: These columns are used by the data mining algorithm when making a prediction.

Predict Only: A predict only is a data column whose value is being predicted by the data mining algorithm.

Ignore: This data column is not used by the data mining algorithm.

Predict: A predict is a data column whose value is being predicted by the data mining algorithm. This column can also be used as an input column.
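The difference between Predict and Predict Only is easiest to see when the usages are written down as data. The sketch below is not an SSAS API. It is a plain Python dictionary with hypothetical column names, showing which columns end up as inputs and which as prediction targets.

```python
# Hypothetical columns and their usage; names are made up for illustration.
COLUMN_USAGE = {
    "CustomerId": "key",
    "Age": "input",
    "Income": "input",
    "Notes": "ignore",
    "Region": "predict",        # predicted, and also usable as an input
    "WillBuy": "predict_only",  # predicted, never used as an input
}

def split_columns(usage):
    """Split column names into keys, inputs, and prediction targets."""
    keys = [c for c, u in usage.items() if u == "key"]
    inputs = [c for c, u in usage.items() if u in ("input", "predict")]
    targets = [c for c, u in usage.items() if u in ("predict", "predict_only")]
    return keys, inputs, targets

keys, inputs, targets = split_columns(COLUMN_USAGE)
```

Here `Region` appears in both the input list and the target list, `WillBuy` appears only as a target, and `Notes` appears nowhere.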

SQL	PL/SQL
Used to access data within Oracle databases	Used to access data within Oracle databases
Does not include all normal programming language features	Includes all normal programming language features, e.g. loops and IF…THEN…ELSE statements
Tells the database what to do (declarative)	Tells the database how to do things (procedural)
Used to code queries, DML and DDL statements	Used to code program blocks, triggers, functions, procedures and packages
Executed one statement at a time	Executed as a block of code
Can be embedded in a PL/SQL program	Can’t be embedded in a SQL statement

PL/SQL has the ability to easily integrate with SQL.

Eg 1:

SET SERVEROUTPUT ON SIZE 4000;
DECLARE
  todays_date DATE;
BEGIN
  -- SYSDATE is a SQL function used directly inside the PL/SQL block
  todays_date := SYSDATE;
  DBMS_OUTPUT.PUT_LINE('Today''s date is ' || todays_date);
END;
/

Eg 2:

SET SERVEROUTPUT ON SIZE 4000;
DECLARE
  todays_date DATE;
BEGIN
  todays_date := SYSDATE;
  -- TO_CHAR formats the date before it is printed
  DBMS_OUTPUT.PUT_LINE('Today''s date is ' || TO_CHAR(todays_date, 'Month DD, YYYY'));
END;
/


Saturday, November 25, 2017

SQL vs PL/SQL

Saturday, November 11, 2017

Data Mining
