Wednesday, December 20, 2017

Data warehousing terms

Common BI queries

Relational queries vs. OLAP queries


Before moving to the terms used in a data warehouse, let us look at the architecture of a data warehouse.



Data warehouse – storage of an organization's data for analytical purposes, to support decision making.
Good data and smart analysis lead to the right decisions.

A DWH stores enterprise data for measurement and reporting. It supports reporting and analysis, maintains the history of specific areas of a company, and answers high-level (aggregated) queries from dashboards.
Data sources (OLTP, historical data repositories, external sources such as flat files, Excel, etc.) -> Data staging (ETL) -> Data storage (DWH, data marts) -> Data presentation (reporting, OLAP, data mining)
A data mart is specific to a business area; a data warehouse consists of multiple data marts.
Source system (OLTP) -> normalized for transaction processing. Holds only current data.
Data warehouse (OLAP) -> contains historical (aggregated) data, so it is denormalized.
A star schema is denormalized: one fact table with dimension tables around it.
Snowflake -> one fact table, with dimensions further split up into other dimension tables.
ETL Architecture

Dimensional Modeling: Dimensional data modeling comprises one or more dimension tables and fact tables. Good examples of dimensions are location, product, time, promotion, and organization. A fact (measure) table contains measures (sales gross value, total units sold) and dimension key columns. It is a consistent design technique used to build a data warehouse: DM uses the facts and dimensions of the warehouse for its design. Star and snowflake schemas are the typical dimensional data models.

Fact
Who, what, when, where
How much – quantitative measures
Dimension key columns and a surrogate key
The aggregation level (grain) of the data is important.
A fact is something that is measurable. Facts are typically (but not always) numerical values that can be aggregated.

Dimension
Describes facts
Lookup values from traditional OLTP – geography, customers, employees
Changing dimensions
Slowly changing dimensions (employees getting married, having a child, moving house) – dimensions that change over time.
Static data – data that never changes. E.g.: color
Type 0 dimension
Retain original – the value never changes once loaded.
Type 1 dimension
No history – overwrite. Used when no history is needed; the impact is that business people pulling reports can be confused by the rewritten past.
Type 2 dimension
Used when tracking change is important. When a dimension row changes, a new record is inserted and the old one is end-dated, with a new surrogate key. This allows the same natural key to appear multiple times (see the sketch after this list).
Type 4 dimension – separating history from day-to-day needs
The history table has its own surrogate key.
Conformed dimensions: when pulling data from multiple systems, the different source primary keys have to be reconciled, so that, for example, the same product gets the same surrogate key everywhere.
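
As a hedged illustration of a Type 2 change, here is a minimal T-SQL sketch against a hypothetical DimCustomer table (all table and column names are illustrative, not from any real system):

    -- Type 2: expire the current row, then insert the new version.
    -- CustomerKey is the surrogate key (IDENTITY); CustomerID is the natural key,
    -- which may now appear in multiple rows.
    UPDATE DimCustomer
    SET EndDate = GETDATE(), IsCurrent = 0
    WHERE CustomerID = 1001 AND IsCurrent = 1;

    INSERT INTO DimCustomer (CustomerID, CustomerName, City, StartDate, EndDate, IsCurrent)
    VALUES (1001, 'Albert', 'Ann Arbor', GETDATE(), NULL, 1);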
To summarize, a dimension is something that qualifies a quantity (measure).

For example, consider this: if I just say "5 kg", it does not mean anything. But if I say "5 kg of wheat (product) was sold to Albert (customer) on Dec 25th (date)", then it makes meaningful sense. Product, customer, and date are dimensions that qualify the measure – 5 kg.

Dimensions are mutually independent; a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

Star Schema
Fact table in the center, with dimensions surrounding it.

It describes the denormalized structure of a data warehouse, which contains dimension tables and fact tables. Each dimension is a single table, and there is a key column to uniquely identify each record in it; in a DWH this primary key is also called a surrogate key.
In a star schema, the fact table holds the desired facts together with the primary keys of all the dimension tables, so the fact table's key is the combination of its dimension table keys. In a star schema, dimension tables are usually not in BCNF. Snowflake schemas normalize the dimensions to remove any redundancies.
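
A minimal star schema sketch in T-SQL, assuming a hypothetical sales subject area (names are illustrative):

    -- Dimension tables, each with its own surrogate key.
    CREATE TABLE DimProduct  (ProductKey  INT IDENTITY PRIMARY KEY, ProductName  VARCHAR(50));
    CREATE TABLE DimCustomer (CustomerKey INT IDENTITY PRIMARY KEY, CustomerName VARCHAR(50));
    CREATE TABLE DimDate     (DateKey     INT PRIMARY KEY,          FullDate     DATE);

    -- Fact table: measures plus the surrogate keys of all its dimensions.
    CREATE TABLE FactSales (
        ProductKey   INT NOT NULL REFERENCES DimProduct(ProductKey),
        CustomerKey  INT NOT NULL REFERENCES DimCustomer(CustomerKey),
        DateKey      INT NOT NULL REFERENCES DimDate(DateKey),
        QuantitySold INT,
        GrossValue   DECIMAL(12,2) NOT NULL
    );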





Snowflake Schema
Attributes of dimensions become dimensions themselves, which makes reporting harder and can hurt query performance.

It is almost like a star schema, but here the dimension tables are in 3NF and dimensions are further connected to other dimension tables, linked by primary/foreign key relationships. The process of removing data redundancy, which helps to overcome typical data modeling problems, is known as normalization. For 3NF, the table must first be in 2NF, and in addition every non-key field must depend ONLY on the PK, and on no other non-key fields, for its existence. An entity is in second normal form if all its attributes depend on the whole (primary) key; in relational terms, every column in a table must be functionally dependent on the whole primary key of that table. The visual representation of objects in a relational database (usually normalized) is called relational data modeling. A table contains rows and columns.
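
To make that concrete, here is a hedged sketch of snowflaking the hypothetical product dimension from the star schema above: the category attribute moves out into its own dimension table, linked back by a foreign key.

    -- The product dimension is normalized: category becomes its own table.
    CREATE TABLE DimCategory (
        CategoryKey  INT IDENTITY PRIMARY KEY,
        CategoryName VARCHAR(50)
    );

    CREATE TABLE DimProductSnowflaked (
        ProductKey  INT IDENTITY PRIMARY KEY,
        ProductName VARCHAR(50),
        CategoryKey INT NOT NULL REFERENCES DimCategory(CategoryKey)  -- dimension-to-dimension link
    );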




To understand the difference between a star schema and a snowflake schema:


Measures are numeric values that are the basis of analysis. E.g.: sales amount, number of items sold, cost of the items sold. The rule for combining multiple measures into a measure group is that all measures in the same measure group must share consistent granularity: the data for each measure is at the same level.

Non-additive Measures: Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG()). One example of a non-additive fact is any kind of ratio or percentage: a 5% profit margin, a revenue-to-asset ratio, etc. Non-numerical data can also be a non-additive measure when it is stored in a fact table, e.g. varchar flags.

Semi-additive Measures: Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over time does not give a useful result, but the MAX() or MIN() balance might be useful. Similarly for a price or currency rate: SUM is meaningless on a rate, but an average might be useful.

Additive Measures: Additive measures can be used with any aggregation function, such as SUM(), AVG(), etc. Sales quantity is an example.
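
A small T-SQL sketch of the difference, reusing the hypothetical FactSales table above and assuming a FactAccountBalance snapshot table (both illustrative):

    -- Additive: sales quantity can be summed across any dimension.
    SELECT DateKey, SUM(QuantitySold) AS TotalQty
    FROM FactSales
    GROUP BY DateKey;

    -- Semi-additive: summing balances across time is meaningless, so take
    -- each account's balance at its latest snapshot date instead.
    SELECT f.AccountKey, f.Balance AS ClosingBalance
    FROM FactAccountBalance f
    WHERE f.DateKey = (SELECT MAX(b.DateKey)
                       FROM FactAccountBalance b
                       WHERE b.AccountKey = f.AccountKey);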

Junk dimension

A grouping of typically low-cardinality attributes (flags, indicators, etc.) so that they can be removed from other tables and "junked" into an abstract dimension table.

These junk dimension attributes might not be related to each other. The only purpose of the table is to store all the combinations of the dimensional attributes that would not otherwise fit into the other dimension tables. Junk dimensions are often used to implement rapidly changing dimensions in a data warehouse.
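
A hedged sketch of building such a table in T-SQL: cross join the domains of a few hypothetical order flags so that every combination gets one surrogate key.

    -- One row per combination of flags; the fact table then stores a single
    -- OrderFlagsKey instead of three separate low-cardinality columns.
    CREATE TABLE DimOrderFlags (
        OrderFlagsKey INT IDENTITY PRIMARY KEY,
        IsGift        CHAR(1),
        IsExpress     CHAR(1),
        PaymentType   VARCHAR(10)
    );

    INSERT INTO DimOrderFlags (IsGift, IsExpress, PaymentType)
    SELECT g.v, e.v, p.v
    FROM (VALUES ('Y'), ('N'))             g(v)
    CROSS JOIN (VALUES ('Y'), ('N'))       e(v)
    CROSS JOIN (VALUES ('Cash'), ('Card')) p(v);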

ERD- Entity Relationship Diagram

Explaining an ERD based on an academic project, 'PFEP', at PSMI, Saline. The DBA and project manager explained the data requirements and assumptions in the PFEP system from a high-level perspective. Based on that explanation, we created three basic elements.

Entities: “things” about which we seek information. E.g.: Department is an entity.
Attributes: data we collect about the entities. E.g.: department id, supervisor name, email, phone number, plant_id.
Relationships: provide the structure needed to draw information from multiple entities. E.g.: the department–plant relationship (each plant has one or many departments; each department exists in only one plant), connected through a foreign key relationship.

Creating an ERD from these basic elements helps with the physical design of the tables.

Surrogate Key

A surrogate key is a substitute for natural keys. Instead of using a natural primary key or a composite primary key, the data modeler creates a surrogate key; this is very useful for writing SQL queries, uniquely identifies each record, and gives superior join performance.
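
A minimal T-SQL sketch (illustrative names): the surrogate key is a system-generated integer, while the natural key from the source system is kept as an ordinary attribute.

    CREATE TABLE DimEmployee (
        EmployeeKey  INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, no business meaning
        EmployeeSSN  CHAR(11),                       -- natural key from the source system
        EmployeeName VARCHAR(50)
    );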

Conceptual | Logical | Physical Data Modeling:


A conceptual data model includes all major entities and relationships, does not contain much detail about attributes, and is often used in the initial planning phase. Data modelers create the conceptual data model and forward it to the functional team for review. The approach by which conceptual data models are created is called conceptual data modeling.
When a data modeler works with a client, his title may be logical data modeler, physical data modeler, or a combination of both. A logical data modeler designs the data model to suit business requirements, creates and maintains the lookup data, compares versions of the data model, maintains the change log, and generates reports from the data model, whereas a physical data modeler has to know the properties of the source and target databases.
A physical data modeler should have the technical know-how to create data models from existing databases, to tune the data models with referential integrity, alternate keys and indexes, and to match indexes to SQL code. It helps if the physical data modeler also knows about replication, clustering and so on.

Primary key constraint

A primary key constraint is imposed on a column to forbid null values and duplicate values. Primary key = unique + not null. Examples: social security number, bank account number, bank routing number.

Foreign key constraint

The parent table has a primary key, and a foreign key constraint is imposed on a column in the child table. The foreign key column values in the child table always refer to primary key values in the parent table.
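
A hedged sketch of both constraints in T-SQL, reusing the plant/department example from the ERD section above (names illustrative):

    -- Parent table: the primary key is unique and not null by definition.
    CREATE TABLE Plant (
        PlantID   INT PRIMARY KEY,
        PlantName VARCHAR(50)
    );

    -- Child table: every PlantID here must exist in the parent table.
    CREATE TABLE Department (
        DepartmentID INT PRIMARY KEY,
        PlantID      INT NOT NULL,
        CONSTRAINT FK_Department_Plant
            FOREIGN KEY (PlantID) REFERENCES Plant(PlantID)
    );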

Relational vs Dimensional

Relational data modeling is used in OLTP systems, which are transaction oriented; dimensional data modeling is used in OLAP systems, which are analytical. In a data warehouse environment, the staging area is designed on OLTP concepts, since data must be normalized, cleansed and profiled before being loaded into a data warehouse or data mart. In an OLTP environment, lookups are stored as independent detail tables, whereas in an OLAP environment like a data warehouse those independent tables are merged into a single dimension.

Steps to create a Data Model

1. Get Business requirements.
2. Create High Level Conceptual Data Model.
3. Create Logical Data Model.
4. Select target DBMS where data modeling tool creates the physical schema.
5. Create standard abbreviation document according to business standard.
6. Create domain.
7. Create Entity and add definitions.
8. Create attribute and add definitions.
9. Based on the analysis, try to create surrogate keys, super types and sub types.
10. Assign datatypes to attributes. If a domain is already present, attach the attribute to the domain.
11. Create primary or unique keys on attributes.
12. Create check constraints or defaults on attributes.
13. Create unique or bitmap indexes on attributes.
14. Create foreign key relationship between entities.
15. Create Physical Data Model.
16. Add database properties to physical data model.
17. Create SQL Scripts from Physical Data Model and forward that to DBA.
18. Maintain Logical and Physical Data Model.

Data Modeler Role

  • Business requirement analysis:
  • Interact with business analysts to get the functional requirements.
  • Interact with end users and find out the reporting needs.
  • Conduct interviews and brainstorming discussions with the project team to gather additional requirements.
  • Gather accurate data through data analysis and functional analysis.
  • Development of the data model:
  • Create a standard abbreviation document for logical, physical and dimensional data models.
  • Create logical, physical and dimensional data models (data warehouse data modelling).
  • Document the logical, physical and dimensional data models (data warehouse data modelling).

Reports

Generate reports from data model.

Review:

Review the data model with functional and technical team.

Creation of database:

Create SQL code from the data model and coordinate with DBAs to create the database.

Check that the data models and databases are in sync.

Support & Maintenance:

Assist developers, ETL, BI team and end users to understand the data model.

Maintain change log for each data model.

Data mart

Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in data warehouse and each department may have separate data marts. These data marts can be built on top of the data warehouse.

Slowly changing dimensions (SCD)
SCD is an abbreviation of slowly changing dimension. SCD applies to cases where an attribute of a record varies over time.
There are three commonly used types of SCD.
1) SCD1: The new record replaces the original record. Only one record exists in the database – the current data.
2) SCD2: A new record is added to the dimension table. Two records exist in the database – the current data and the previous history data.
3) SCD3: The original record is modified to include the new data. One record exists in the database – the new information is stored alongside the old information in the same row.
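
A Type 2 sketch appears in the dimension section above; for comparison, here is a hedged T-SQL sketch of SCD1 and SCD3 against the same hypothetical DimCustomer table:

    -- SCD1: overwrite in place; history is lost.
    UPDATE DimCustomer SET City = 'Ann Arbor' WHERE CustomerID = 1001;

    -- SCD3: keep the prior value in an extra column of the same row.
    ALTER TABLE DimCustomer ADD PreviousCity VARCHAR(50);
    GO  -- the new column must be committed in its own batch before it is referenced

    UPDATE DimCustomer
    SET PreviousCity = City, City = 'Saline'
    WHERE CustomerID = 1001;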

ER Modeling and Dimensional Modeling

ER modeling is used for normalizing the OLTP database design. Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

Cube

Cubes are multi-dimensional objects made of measures and dimensions.
Dimensions provide the structure of the cube; they come from the dimension tables in the star schema. Measures provide the data in the cube; they come from the fact tables in the star schema and are aggregated at multiple levels along each dimension.
Date Dimension: Year-> Quarter -> Month -> Day -> Hour
Understanding Hierarchies
All level (All Products) -> Category level (Accessories, Components) -> Product level (Accessories – bike stand, bike wash; Components – chain, front brakes)
Understanding Measures
Fact table data become cube measures (e.g. sales), analyzed against dimensions such as product, order date, and geography.
Understanding Aggregations
Performed across levels; roll-ups occur on all dimensions. Aggregations are calculated and the results are stored in the cube; they are performed across all levels of all hierarchies along all dimensions.

In other words, cubes are a logical representation of multidimensional data. The edges of the cube contain the dimension members and the body of the cube contains the data values.
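
A relational approximation of cube aggregation, as a hedged T-SQL sketch over the hypothetical FactSales/DimDate tables above (assuming DimDate also carries year/quarter/month columns): GROUP BY ROLLUP produces subtotals at every level of the hierarchy, much like a cube's pre-computed roll-ups.

    -- Subtotals at month, quarter, year, and grand-total levels in one pass.
    SELECT d.CalendarYear, d.CalendarQuarter, d.MonthNumber,
           SUM(f.GrossValue) AS SalesAmount
    FROM FactSales f
    JOIN DimDate   d ON d.DateKey = f.DateKey
    GROUP BY ROLLUP (d.CalendarYear, d.CalendarQuarter, d.MonthNumber);
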
Data warehousing and business intelligence
Data warehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart including meta data management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of their business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term “business intelligence” is used to encompass OLAP, data visualization, data mining and query/reporting tools.
View vs. Materialized View
A view takes the output of a query and makes it appear like a virtual table and it can be used in place of tables.
A materialized view provides indirect access to table data by storing the results of a query in a separate schema object.
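
A hedged T-SQL sketch over the hypothetical FactSales table above. SQL Server implements materialized views as indexed views: the view must be schema-bound, include COUNT_BIG(*) when it aggregates, and receive a unique clustered index that persists the result (further SET option and determinism requirements apply).

    -- Ordinary view: just a saved query; nothing is stored.
    CREATE VIEW dbo.vSalesByProduct AS
    SELECT ProductKey, SUM(GrossValue) AS TotalSales
    FROM dbo.FactSales
    GROUP BY ProductKey;
    GO

    -- Indexed (materialized) view: the aggregated rows are physically stored.
    CREATE VIEW dbo.vSalesByProductMat WITH SCHEMABINDING AS
    SELECT ProductKey, SUM(GrossValue) AS TotalSales, COUNT_BIG(*) AS RowCnt
    FROM dbo.FactSales
    GROUP BY ProductKey;
    GO

    CREATE UNIQUE CLUSTERED INDEX IX_vSalesByProductMat
        ON dbo.vSalesByProductMat (ProductKey);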
ODS
ODS is an abbreviation of Operational Data Store: a database structure that is a repository for near-real-time operational data rather than long-term trend data. The ODS may further become the enterprise's shared operational database, allowing operational systems that are being re-engineered to use the ODS as their operational database.

ETL
ETL is an abbreviation of extract, transform, and load. ETL software enables businesses to consolidate their disparate data while moving it from place to place; it does not really matter that the data is in different forms or formats. The data can come from any source – ETL is powerful enough to handle such disparities. First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data – using rules or lookup tables, or creating combinations with other data – to convert it to the desired state. Finally, the load function writes the resulting data to a target database.
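
As a hedged sketch, the transform-and-load steps might look like this in plain T-SQL, assuming extraction has already landed raw rows in a hypothetical StagingSales table (all names illustrative):

    -- Transform: look up surrogate keys and derive the measure.
    -- Load: write the result into the fact table.
    INSERT INTO FactSales (ProductKey, CustomerKey, DateKey, QuantitySold, GrossValue)
    SELECT p.ProductKey,
           c.CustomerKey,
           d.DateKey,
           s.Quantity,
           s.Quantity * s.UnitPrice                         -- derived during transform
    FROM StagingSales s
    JOIN DimProduct  p ON p.ProductName  = s.ProductName    -- lookups map natural
    JOIN DimCustomer c ON c.CustomerName = s.CustomerName   -- keys to surrogate keys
    JOIN DimDate     d ON d.FullDate     = s.SaleDate;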

Summary
1)     ETL – Extract, Transform and Load
           SSIS – SQL Server Integration Services
2)     Analytics – Aggregation – Trending – Correlations
SSAS – SQL Server Analysis Services – develop the Analysis Services database, dimensions, and cubes using SSAS, deploy the cube, then connect to the server in SSMS and open the deployed cube. Select all dimensions and process them. During processing, data is pulled from the DWH and stored in the SSAS dimensions and cubes. Then browse the cube using the cube browser in BIDS.

Database development flow in analysis services

BIDS (Business Intelligence Development Studio) -> Analysis Services project

3)     Reporting
SSRS – SQL Server Reporting Services
SharePoint PerformancePoint
4)     PowerPivot – add-in for Excel
5)     SQL Server Management Studio – manage SQL Server
Whenever querying a specific table on a standby (read-only) server, use:
      SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED


Integration Services fundamentals
Integration Services uses – populate the DWH, clean and standardize data, merge data from multiple sources, automate administrative tasks.
Architecture
Performs data-related tasks.
The run-time engine executes packages in the design environment; logging is done through log providers. The package is the main executable item. Sub-items in a package are individual tasks, tasks inside containers, and event handlers (which execute only when an event is triggered). There are also predefined tasks which run as-is, as well as custom tasks.
The engine specifically for the data flow task is known as the pipeline engine. It is responsible for the physical movement of data: Source -> Transformation -> Destination
     
Lookup tables

A lookup table is a table matched against the target table on the target's primary key; it is used to update the target by letting through only modified (new or updated) records, based on the lookup condition.
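
A hedged T-SQL sketch of that pattern using MERGE, with a hypothetical StagingProduct feed (names illustrative, and assuming DimProduct also carries the natural key ProductCode): the lookup condition decides whether a source row updates an existing target row or is inserted as new.

    MERGE DimProduct AS tgt
    USING StagingProduct AS src
        ON tgt.ProductCode = src.ProductCode              -- lookup condition
    WHEN MATCHED AND tgt.ProductName <> src.ProductName THEN
        UPDATE SET tgt.ProductName = src.ProductName      -- modified record
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ProductCode, ProductName)                 -- new record
        VALUES (src.ProductCode, src.ProductName);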

Aggregate tables

An aggregate table contains a summary of existing warehouse data, grouped to certain levels of dimensions. It is always easier to retrieve data from an aggregate table than to visit the original table with millions of records. Aggregate tables reduce the load on the database server, increase query performance, and return results quickly.
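
A hedged T-SQL sketch: pre-summarize the hypothetical FactSales table at the year/month/product level so that dashboard queries read a few thousand summary rows instead of millions of detail rows (again assuming year/month columns on DimDate).

    -- SELECT ... INTO creates and fills the aggregate table in one statement.
    SELECT d.CalendarYear, d.MonthNumber, f.ProductKey,
           SUM(f.QuantitySold) AS TotalQty,
           SUM(f.GrossValue)   AS TotalSales
    INTO   AggSalesMonthlyProduct
    FROM   FactSales f
    JOIN   DimDate d ON d.DateKey = f.DateKey
    GROUP BY d.CalendarYear, d.MonthNumber, f.ProductKey;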


GEN AI

  Stay Tuned....