As part of the data mart design process, you should identify which of your dimensions will change over time and which will be static. Once you have identified the changing dimensions, the next design decision centers on the tracking of history in those dimensions. You have two options: you can store the history of changes in a dimension, or you can keep only the most recent information and discard the history. The right decision depends on the nature of the change and on how the data mart should reflect it. Sometimes storing history is appropriate; in other cases it's not necessary. Whichever decision you make, it shouldn't be influenced by your ability to create the necessary mappings.
There are five basic implementations for dimension tables in a data mart, and each has a corresponding mapping model in PowerCenter. The following table summarizes the characteristics of these basic types.
| Dimension Type | Change Processing | History Tracking | Notes |
|---|---|---|---|
| A. Static | None | None | |
| B. Slowly Growing | New records added | None | |
| C. Slowly Changing, Update with Changes | Content updated | None | Commonly known as a "Type 1" slowly changing dimension |
| D. Slowly Changing, Versioning of Changes | Content updated, new records added | Full | Commonly known as a "Type 2" slowly changing dimension |
| E. Slowly Changing, Keep Current and Previous Values | Content updated | Partial | Commonly known as a "Type 3" slowly changing dimension |
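To make the three history-tracking options concrete, here is a small hypothetical illustration (plain Python, not PowerCenter) of how a single customer address change might end up stored under types C, D, and E; all column names and values are invented.

```python
# Hypothetical illustration (plain Python, invented column names) of how one
# customer address change ends up stored under the three slowly changing
# options above.

# Existing dimension row before the change.
before = {"customer_key": 101, "customer_id": "C-1001",
          "name": "Jane Smith", "address": "12 Elm St"}
new_address = "98 Oak Ave"

# Type C ("Type 1"): overwrite in place, no history kept.
type1_row = dict(before, address=new_address)

# Type D ("Type 2"): keep the old row as a closed-out version and add a new
# row with a new surrogate key for the same customer.
type2_rows = [
    dict(before, current_flag="N"),
    {"customer_key": 102, "customer_id": "C-1001",
     "name": "Jane Smith", "address": new_address, "current_flag": "Y"},
]

# Type E ("Type 3"): keep only the previous value, in an extra column.
type3_row = dict(before, address=new_address, previous_address=before["address"])

if __name__ == "__main__":
    print("Type 1:", type1_row)
    print("Type 2:", type2_rows)
    print("Type 3:", type3_row)
```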
Usually, a dimension like time will be static (although it may need to be refreshed occasionally to extend it with new entries). Other common static dimensions include transaction type, shipping method, and status dimensions of various kinds. These dimensions fall under type A in the table above.
Examples of dimensions that change over time are the customer dimension, product dimensions, and many dimensions that include organizational hierarchy information. This last group is important, since companies regularly reorganize themselves. In the event of a reorganization, it's necessary to update the dimension to reflect the new organization, and sometimes to reflect the old one as well.
Changes in dimensions fall into two subcategories. The first type of change is growth: simply adding new dimension records over time, such as new products or customers. This is type B in the table. The second type of change is a change in the content of the dimension records, and this is where a decision needs to be made about tracking history. Types C, D, and E in the table all deal with changes in the content of the dimension and provide different options for tracking history.
To demonstrate the mapping logic used in these situations, this article will cover examples of types B, C, D, and E. They will all be variations of a customer dimension, since it's probably the most easily understood example, allowing us to focus on the implementation. In addition, a customer dimension is often required in a data mart project, so it provides a real-world reference.
Basic Building Blocks for Dimensions in PowerCenter
The next issue we have to deal with is how to implement these mappings in PowerCenter. Let's take a look at the same dimension types again, but this time in terms of the per-row operations required at transformation time in each case.
| Dimension Type | Row-Level Operation | Change Detection | Notes |
|---|---|---|---|
| A. Static | Insert | None | |
| B. Slowly Growing | Insert if new | None | |
| C. Slowly Changing, Update with Changes | Insert if new, update if changed | Optional | Can update the entire record instead of detecting changes |
| D. Slowly Changing, Versioning of Changes | Insert if new, update and insert if changed | Required | Requires up to two operations at the target per source row |
| E. Slowly Changing, Keep Current and Previous Values | Insert if new, update if changed | Required | |
All these dimensions have some characteristics in common that provide clues about the general structure of the mappings. First of all, on a row-by-row basis, the mapping needs to decide the appropriate operation at the target: either insert or update. This means the mapping needs to take advantage of the per-row operation control available in PowerCenter, using an Update Strategy transformation. In addition, when updates of an existing row are involved, the primary key of the row that is already in the target needs to be determined. Normally, you use a Lookup transformation to find out whether a record already exists in a target.
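Outside PowerCenter, the per-row decision can be sketched in plain Python. In this hypothetical sketch a dictionary stands in for the Lookup transformation against the existing dimension, and the returned flag plays the role of the Update Strategy decision; all key and column names are invented for illustration.

```python
# Rough Python analogue of the Lookup + Update Strategy decision described
# above: the existing dimension is simulated with a dict keyed on the natural
# key (customer_id); a real mapping would look up the target table instead.

existing_dim = {
    "C-1001": {"customer_key": 101, "name": "Jane Smith", "address": "12 Elm St"},
}

def row_operation(source_row):
    """Return ('insert', row) or ('update', target_key, row) for one source row."""
    match = existing_dim.get(source_row["customer_id"])  # the 'lookup'
    if match is None:
        return ("insert", source_row)
    return ("update", match["customer_key"], source_row)

if __name__ == "__main__":
    incoming = [
        {"customer_id": "C-1001", "name": "Jane Smith", "address": "98 Oak Ave"},
        {"customer_id": "C-2002", "name": "Raj Patel", "address": "5 Hill Rd"},
    ]
    for row in incoming:
        print(row_operation(row))
```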
One of the trickier issues is identifying whether or not there has been a change since the last refresh. If the source actually provides a change indicator, this is simple. It's also rare: in most situations, it will be necessary to compare the new source data with the existing dimension data to determine if and where a change has occurred. We can do this with a Lookup transformation that retrieves the 'old' data from the target, and an Expression transformation to compare the current and old values. As a general rule, not all changes in the source are relevant, only some, so 'change' is application specific. For example, a customer note field in the source might be irrelevant, but an address change is important.
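The comparison done in the Expression transformation might look something like the following sketch, assuming the business has decided that only a handful of columns count as a relevant change; the column names are made up.

```python
# Sketch of the change-detection comparison: only the columns the business
# cares about are compared, so a changed note field is ignored. Column names
# are invented.

RELEVANT_COLUMNS = ("name", "address", "phone")  # application-specific choice

def has_relevant_change(old_row, new_row):
    """True if any relevant column differs between the old and new rows."""
    return any(old_row.get(col) != new_row.get(col) for col in RELEVANT_COLUMNS)

if __name__ == "__main__":
    old = {"name": "Jane Smith", "address": "12 Elm St",
           "phone": "555-0101", "notes": "VIP"}
    new = {"name": "Jane Smith", "address": "12 Elm St",
           "phone": "555-0101", "notes": "called 12 Jan"}
    print(has_relevant_change(old, new))  # False: only the note changed
```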
Finally, for type D in the table, we may need to perform two operations on the target for a single source row. This means bringing two instances of the same target into the mapping and splitting the data flow so it goes to both instances. Filter transformations provide control over which paths the data takes.
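For type D, the two operations produced per changed source row could be sketched as follows (again plain Python, with a simple counter standing in for whatever would supply new surrogate keys in a real mapping):

```python
# Sketch of the type D (versioning) flow: a changed source row produces two
# target operations: an update that retires the current version and an
# insert that adds the new version.
from itertools import count

surrogate_keys = count(start=200)

def type2_operations(existing_row, new_row):
    """Return the two target operations generated by one changed source row."""
    retire_old = ("update", existing_row["customer_key"], {"current_flag": "N"})
    insert_new = ("insert", {**new_row,
                             "customer_key": next(surrogate_keys),
                             "current_flag": "Y"})
    return [retire_old, insert_new]

if __name__ == "__main__":
    existing = {"customer_key": 101, "customer_id": "C-1001",
                "address": "12 Elm St", "current_flag": "Y"}
    changed = {"customer_id": "C-1001", "address": "98 Oak Ave"}
    for op in type2_operations(existing, changed):
        print(op)
```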
This gives us our basic building blocks for dimensional mappings:
- Update Strategy transformation to decide insert or update
- Lookup transformation to determine whether records already exist in the target
- Lookup transformation to get the previous values from the target
- Filter transformations to control the routing of the data to the targets
- Expression transformations to implement decision logic like comparisons
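Putting a few of these blocks together, here is a rough Python sketch of the type E behaviour, where a detected change moves the current value into a 'previous' column; the column names are invented, and this only illustrates the logic, not a PowerCenter mapping.

```python
# Sketch of the type E behaviour built from the same blocks: when a relevant
# change is detected, the update moves the current value into a 'previous'
# column and puts the new value into the 'current' column.

def type3_update(existing_row, new_row, tracked_column="address"):
    """Build the update row that keeps both current and previous values."""
    return {
        "customer_key": existing_row["customer_key"],
        tracked_column: new_row[tracked_column],
        "previous_" + tracked_column: existing_row[tracked_column],
    }

if __name__ == "__main__":
    existing = {"customer_key": 101, "customer_id": "C-1001", "address": "12 Elm St"}
    changed = {"customer_id": "C-1001", "address": "98 Oak Ave"}
    print(("update", type3_update(existing, changed)))
```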
Friday, January 22, 2010
Starting with Informatica 8.6
PowerCenter provides an environment that allows you to load data into a centralized location, such as a data warehouse or operational data store (ODS). You can extract data from multiple sources, transform the data according to business logic you build in the client application, and load the transformed data into file and relational targets.
Some of the important tools in Informatica are:
Repository Manager. Use the Repository Manager to create folders to store the metadata you create.
Designer. Use the Designer to create mappings that contain transformation instructions for the Integration Service. Before you can create mappings, you must add source and target definitions to the repository. The Designer includes the following tools:
- Source Analyzer. Import or create source definitions.
- Target Designer. Import or create target definitions. You also create tables in the target database based on the target definitions.
- Mapping Designer. Create mappings that the Integration Service uses to extract, transform, and load data.
Workflow Manager. Use the Workflow Manager to create and run workflows and tasks. A workflow is a set of instructions that describes how and when to run tasks related to extracting, transforming, and loading data.
Workflow Monitor. Use the Workflow Monitor to monitor scheduled and running workflows for each Integration Service.
Monday, January 18, 2010
Starting with tools in DWH
A data warehouse is a repository of data picked from transaction systems, filtered and transformed to make it available for data analysis and reporting.
A data warehouse is a repository of data that can provide most or all of the data and information requirements of an enterprise. It pulls data from production and other sources into an offline staging area, where it is cleansed, transformed, and loaded in a sanitized, uniform, and well-organized manner, so that you can run queries, reports, and all kinds of analysis on the data.
There are a number of important tools connected to data warehouses, and one of these is data aggregation. Data warehousing requires a hardware platform, a database, a data loading method (ETL tools), and a data extract method (reporting and business intelligence tools). Informatica is a well-known ETL tool and Business Objects is a reporting tool.
Tuesday, December 15, 2009
Data Analysis Tools
Data analysis tools are used to perform complex analysis of data. They normally have a rich set of analytic functions, which allow sophisticated analysis of data. These tools are designed for business analysis.
Data analysis tools can be further subdivided into MOLAP tools and ROLAP tools. They came into existence because the term OLAP (On-Line Analytical Processing) has become an industry buzzword.
On-Line Analytical Processing is decision support software that allows the user to quickly analyze information that has been summarized into multidimensional views and hierarchies.
On Line – Emphasizes live access to data, not static reporting
Analytical Processing – Ad-hoc queries, drill-down, roll-up, and reporting across various dimensions.
OLAP databases are also known as multidimensional databases or MDDBs.
Multidimensional analysis is a technique whereby data can be analyzed in many dimensions at once. The term dimension here means an attribute such as cost, duration, or name; dimensions are equivalent to columns in a relational table. The idea is that instead of analyzing the data in a two-dimensional table, the data is loaded into a multidimensional hypercube to be analyzed.
MOLAP (Multidimensional OLAP) tools have an SQL interface that allows them to extract data from a relational database. On a predefined set of data, MOLAP gives user-friendly, fast access to powerful analytical and statistical functions. These tools are good for analyzing aggregated data along with the dimension data, and they do very well at dimensional slicing. If data has been loaded into a cube dimensioned by office, region, sales quantity, sales value, and product, the tool will be able to switch from displaying data by region to displaying data by product or by sales value.
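To illustrate the slicing idea outside any particular MOLAP product, the following sketch uses pandas (an assumption made purely for the example) to summarize toy sales data first by region and then by product; the column names and figures are invented.

```python
# Illustration of dimensional slicing using pandas rather than a MOLAP
# product: the same toy sales data is summarized by region and then by
# product.
import pandas as pd

sales = pd.DataFrame({
    "office":      ["NY", "NY", "LON", "LON"],
    "region":      ["East", "East", "Europe", "Europe"],
    "product":     ["Widget", "Gadget", "Widget", "Gadget"],
    "sales_qty":   [10, 5, 7, 3],
    "sales_value": [100.0, 75.0, 70.0, 45.0],
})

by_region = sales.pivot_table(values="sales_value", index="region", aggfunc="sum")
by_product = sales.pivot_table(values="sales_value", index="product", aggfunc="sum")

print(by_region)
print(by_product)
```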
ROLAP (Relational OLAP) tools are the traditional SQL-oriented tools that have tight integration with the relational model. They are changing all the time. They use metadata to isolate the user from the complexities of the data warehouse and to present a business perspective of the data. They can be used as data browsers and have good drill-down capability from aggregated to detailed data.
Operational Data Store (ODS)
An operational data store (ODS) is a type of database that's often used as an interim logical area for a data warehouse. An ODS is meant to serve as the point of integration for operational systems.
For example, banks had several independent systems: loans, checking accounts, savings accounts, and so on. Teller support computers and ATMs helped push many banks to create an ODS to integrate current balances and recent history from these separate accounts under one customer number.
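As a toy sketch of that integration (plain Python, with invented system names, fields, and figures), the records from the separate systems can be brought together under one customer number like this:

```python
# Toy sketch of the banking example: balances from separate systems are
# brought together under one customer number.
from collections import defaultdict

loans    = [{"customer_no": "1001", "balance": -5000.0}]
checking = [{"customer_no": "1001", "balance": 1200.0},
            {"customer_no": "1002", "balance": 300.0}]
savings  = [{"customer_no": "1001", "balance": 8000.0}]

integrated = defaultdict(dict)
for system, rows in (("loan", loans), ("checking", checking), ("savings", savings)):
    for row in rows:
        integrated[row["customer_no"]][system + "_balance"] = row["balance"]

for customer_no, balances in sorted(integrated.items()):
    print(customer_no, balances)
```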
An ODS can be used to integrate disparate data from multiple sources so that business operations, analysis, and reporting can be carried out while business operations are occurring. This is where most of the data used in current operations is housed before it's transferred to the data warehouse for longer-term storage or archiving.
An ODS is designed for relatively simple queries on small amounts of data (such as finding the status of a customer order), rather than the complex queries on large amounts of data typical of the data warehouse.
Characteristics of an ODS:
- Provides recent data
- Detailed and lightly summarized data
- Subject-oriented
- Homogeneous
- Supports day-to-day decisions and operational activities
- Individual records, transaction or analysis driven
Monday, December 14, 2009
Data Modelling in Data Warehousing Part 2
In a data warehouse, we process large amounts of data with very few updates or deletes.
The data warehouse takes a mixed modeling approach, combining relational modeling and dimensional modeling.
Relational modeling, or entity-relationship (ER) modeling, is used to illuminate the microscopic relationships among data elements. It removes data redundancy and makes transaction processing simple and deterministic. It is the traditional modeling technique and suits OLTP systems and the corporate data warehouse, but for enterprise reporting needs it becomes complex, and end users cannot understand or remember an ER model. So the dimensional model comes into the picture.
From the data warehouse, data is moved to a data mart, which is specific to a business process. The data there is kept in denormalized form, mostly in a dimensional or flat model (a small star-schema sketch appears after the list below). From the data mart, reports are created, or the data can be pulled by an OLAP system for reporting.
The dimensional modeling approach is:
- the best way to model decision support data.
- the only technique for delivering data to end users in a data warehouse.
- a logical design technique that seeks to present the data in a standard framework.
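To make the dimensional idea concrete, here is a minimal, hypothetical sketch of a star-schema layout: one fact table keyed to small, denormalized dimension tables, with invented names and values.

```python
# Minimal, hypothetical sketch of a star-schema (dimensional) layout: one
# fact table keyed to small denormalized dimension tables.

dim_date     = {1: {"date": "2009-12-01", "month": "Dec", "year": 2009}}
dim_product  = {10: {"product": "Widget", "category": "Hardware"}}
dim_customer = {100: {"customer": "Acme Corp", "region": "East"}}

fact_sales = [
    {"date_key": 1, "product_key": 10, "customer_key": 100, "qty": 5, "amount": 50.0},
]

# A simple "report": sales amount broken out by product category and region.
for fact in fact_sales:
    category = dim_product[fact["product_key"]]["category"]
    region = dim_customer[fact["customer_key"]]["region"]
    print(category, region, fact["amount"])
```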
Sunday, December 13, 2009
Data Modelling Basics Part 1
Data Modeling is a process by which the enterprise business model is defined in terms of data elements and the relationships existing among those data elements.
There are three types of data models:
1) The conceptual model, which mostly suits business people.
2) The logical model, which is derived from the conceptual model. It suits designers and contains entity and attribute details.
3) The physical model, which is derived from the logical model. It mostly suits implementers and is database specific, with the most detailed level of abstraction.