Wednesday, December 20, 2017

Data warehousing terms

Common BI queries

Relational queries vs. OLAP queries


Before moving to the terms used in a data warehouse, let us look at the architecture of a data warehouse.



Data warehouse – storage of an organization's data for analytical purposes, to support decision making.
Good data and smart analysis lead to the right decisions.

A DWH stores enterprise data for measurement and reporting. It supports reporting and analysis, maintains the history of specific areas of a company, and answers high-level (aggregated) queries from dashboards.
Data sources (OLTP, historical data repositories, external sources such as flat files, Excel, etc.) -> Data staging (ETL) -> Data storage (DWH, data marts) -> Data presentation (reporting, OLAP, data mining)
A data mart is specific to a business area; a data warehouse consists of multiple data marts.
Source system (OLTP) -> normalized for transaction processing. Holds only current data.
Data warehouse (OLAP) -> contains historical (aggregated) data, so it is denormalized.
A star schema is denormalized: one fact table with dimension tables around it.
Snowflake -> one fact table, with dimensions further split up into other dimension tables.
ETL Architecture

Dimensional Modeling: Dimensional data modeling comprises one or more dimension tables and fact tables. Good examples of dimensions are location, product, time, promotion, and organization. A fact (measure) table contains measures (sales gross value, total units sold) and dimension key columns. It is a consistent design technique used to build a data warehouse: DM uses the facts and dimensions of the warehouse for its design. Star and snowflake schemas are the typical dimensional data models.

Fact
Who, what, when, where
How much – quantitative measures
Dimension key columns and a surrogate key
The aggregation level (grain) of the data is important.
A fact is something that is measurable. Facts are typically (but not always) numerical values that can be aggregated.

Dimension
Describes facts
Lookup values from traditional OLTP – geography, customers, employees
Changing dimensions
Slowly changing dimensions (employees getting married, having a child, moving house) – dimensions that change over time.
Static data – data that never changes. E.g.: color
Type 0 dimension
Retain original – the value never changes once loaded.
Type 1 dimension
No history – overwrite. Used when no history is needed; the impact is that business people pulling reports can be confused by the rewritten past.
Type 2 dimension
Used when tracking change is important. When a dimension row changes, a new record is inserted and the old one is end-dated, with a new surrogate key. This allows the same natural key to appear multiple times (see the sketch after this list).
Type 4 dimension – separating history from day-to-day needs
The history table has its own surrogate key.
Conformed dimensions: when pulling data from multiple systems, the different source primary keys have to be reconciled, so that, for example, the same product gets the same surrogate key everywhere.
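
As a hedged illustration of a Type 2 change, here is a minimal T-SQL sketch against a hypothetical DimCustomer table (all table and column names are illustrative, not from any real system):

    -- Type 2: expire the current row, then insert the new version.
    -- CustomerKey is the surrogate key (IDENTITY); CustomerID is the natural key,
    -- which may now appear in multiple rows.
    UPDATE DimCustomer
    SET EndDate = GETDATE(), IsCurrent = 0
    WHERE CustomerID = 1001 AND IsCurrent = 1;

    INSERT INTO DimCustomer (CustomerID, CustomerName, City, StartDate, EndDate, IsCurrent)
    VALUES (1001, 'Albert', 'Ann Arbor', GETDATE(), NULL, 1);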
To summarize, a dimension is something that qualifies a quantity (measure).

For example, consider this: if I just say "5 kg", it does not mean anything. But if I say "5 kg of wheat (product) was sold to Albert (customer) on Dec 25th (date)", then it makes meaningful sense. Product, customer, and date are dimensions that qualify the measure – 5 kg.

Dimensions are mutually independent; a dimension is a data element that categorizes each item in a data set into non-overlapping regions.

Star Schema
Fact table in the center, with dimensions surrounding it.

It describes the denormalized structure of a data warehouse, which contains dimension tables and fact tables. Each dimension is a single table, and there is a key column to uniquely identify each record in it; in a DWH this primary key is also called a surrogate key.
In a star schema, the fact table holds the desired facts together with the primary keys of all the dimension tables, so the fact table's key is the combination of its dimension table keys. In a star schema, dimension tables are usually not in BCNF. Snowflake schemas normalize the dimensions to remove any redundancies.
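
A minimal star schema sketch in T-SQL, assuming a hypothetical sales subject area (names are illustrative):

    -- Dimension tables, each with its own surrogate key.
    CREATE TABLE DimProduct  (ProductKey  INT IDENTITY PRIMARY KEY, ProductName  VARCHAR(50));
    CREATE TABLE DimCustomer (CustomerKey INT IDENTITY PRIMARY KEY, CustomerName VARCHAR(50));
    CREATE TABLE DimDate     (DateKey     INT PRIMARY KEY,          FullDate     DATE);

    -- Fact table: measures plus the surrogate keys of all its dimensions.
    CREATE TABLE FactSales (
        ProductKey   INT NOT NULL REFERENCES DimProduct(ProductKey),
        CustomerKey  INT NOT NULL REFERENCES DimCustomer(CustomerKey),
        DateKey      INT NOT NULL REFERENCES DimDate(DateKey),
        QuantitySold INT,
        GrossValue   DECIMAL(12,2) NOT NULL
    );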





Snowflake Schema
Attributes of dimensions become dimensions themselves, which makes reporting harder and can hurt query performance.

It is almost like a star schema, but here the dimension tables are in 3NF and dimensions are further connected to other dimension tables, linked by primary/foreign key relationships. The process of removing data redundancy, which helps to overcome typical data modeling problems, is known as normalization. For 3NF, the table must first be in 2NF, and in addition every non-key field must depend ONLY on the PK, and on no other non-key fields, for its existence. An entity is in second normal form if all its attributes depend on the whole (primary) key; in relational terms, every column in a table must be functionally dependent on the whole primary key of that table. The visual representation of objects in a relational database (usually normalized) is called relational data modeling. A table contains rows and columns.
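
To make that concrete, here is a hedged sketch of snowflaking the hypothetical product dimension from the star schema above: the category attribute moves out into its own dimension table, linked back by a foreign key.

    -- The product dimension is normalized: category becomes its own table.
    CREATE TABLE DimCategory (
        CategoryKey  INT IDENTITY PRIMARY KEY,
        CategoryName VARCHAR(50)
    );

    CREATE TABLE DimProductSnowflaked (
        ProductKey  INT IDENTITY PRIMARY KEY,
        ProductName VARCHAR(50),
        CategoryKey INT NOT NULL REFERENCES DimCategory(CategoryKey)  -- dimension-to-dimension link
    );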




To understand the difference between a star schema and a snowflake schema:


Measures are numeric values that are the basis of analysis. E.g.: sales amount, number of items sold, cost of the items sold. The rule for combining multiple measures into a measure group is that all measures in the same measure group must share consistent granularity: the data for each measure is at the same level.

Non-additive Measures: Non-additive measures are those which cannot be used inside any numeric aggregation function (e.g. SUM(), AVG()). One example of a non-additive fact is any kind of ratio or percentage: a 5% profit margin, a revenue-to-asset ratio, etc. Non-numerical data can also be a non-additive measure when it is stored in a fact table, e.g. varchar flags.

Semi-additive Measures: Semi-additive measures are those where only a subset of aggregation functions can be applied. Take account balance: a SUM() over time does not give a useful result, but the MAX() or MIN() balance might be useful. Similarly for a price or currency rate: SUM is meaningless on a rate, but an average might be useful.

Additive Measures: Additive measures can be used with any aggregation function, such as SUM(), AVG(), etc. Sales quantity is an example.
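
A small T-SQL sketch of the difference, reusing the hypothetical FactSales table above and assuming a FactAccountBalance snapshot table (both illustrative):

    -- Additive: sales quantity can be summed across any dimension.
    SELECT DateKey, SUM(QuantitySold) AS TotalQty
    FROM FactSales
    GROUP BY DateKey;

    -- Semi-additive: summing balances across time is meaningless, so take
    -- each account's balance at its latest snapshot date instead.
    SELECT f.AccountKey, f.Balance AS ClosingBalance
    FROM FactAccountBalance f
    WHERE f.DateKey = (SELECT MAX(b.DateKey)
                       FROM FactAccountBalance b
                       WHERE b.AccountKey = f.AccountKey);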

Junk dimension

A grouping of typically low-cardinality attributes (flags, indicators, etc.) so that they can be removed from other tables and "junked" into an abstract dimension table.

These junk dimension attributes might not be related to each other. The only purpose of the table is to store all the combinations of the dimensional attributes that would not otherwise fit into the other dimension tables. Junk dimensions are often used to implement rapidly changing dimensions in a data warehouse.
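
A hedged sketch of building such a table in T-SQL: cross join the domains of a few hypothetical order flags so that every combination gets one surrogate key.

    -- One row per combination of flags; the fact table then stores a single
    -- OrderFlagsKey instead of three separate low-cardinality columns.
    CREATE TABLE DimOrderFlags (
        OrderFlagsKey INT IDENTITY PRIMARY KEY,
        IsGift        CHAR(1),
        IsExpress     CHAR(1),
        PaymentType   VARCHAR(10)
    );

    INSERT INTO DimOrderFlags (IsGift, IsExpress, PaymentType)
    SELECT g.v, e.v, p.v
    FROM (VALUES ('Y'), ('N'))             g(v)
    CROSS JOIN (VALUES ('Y'), ('N'))       e(v)
    CROSS JOIN (VALUES ('Cash'), ('Card')) p(v);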

ERD- Entity Relationship Diagram

Explaining an ERD based on an academic project, 'PFEP', at PSMI, Saline. The DBA and project manager explained the data requirements and assumptions in the PFEP system from a high-level perspective. Based on that explanation, we created three basic elements.

Entities: “things” about which we seek information. E.g.: Department is an entity.
Attributes: data we collect about the entities. E.g.: department id, supervisor name, email, phone number, plant_id.
Relationships: provide the structure needed to draw information from multiple entities. E.g.: the department–plant relationship (each plant has one or many departments; each department exists in only one plant), connected through a foreign key relationship.

Creating an ERD from these basic elements helps with the physical design of the tables.

Surrogate Key

A surrogate key is a substitute for natural keys. Instead of using a natural primary key or a composite primary key, the data modeler creates a surrogate key; this is very useful for writing SQL queries, uniquely identifies each record, and gives superior join performance.
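
A minimal T-SQL sketch (illustrative names): the surrogate key is a system-generated integer, while the natural key from the source system is kept as an ordinary attribute.

    CREATE TABLE DimEmployee (
        EmployeeKey  INT IDENTITY(1,1) PRIMARY KEY,  -- surrogate key, no business meaning
        EmployeeSSN  CHAR(11),                       -- natural key from the source system
        EmployeeName VARCHAR(50)
    );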

Conceptual | Logical | Physical Data Modeling:


A conceptual data model includes all major entities and relationships, does not contain much detail about attributes, and is often used in the initial planning phase. Data modelers create the conceptual data model and forward it to the functional team for review. The approach by which conceptual data models are created is called conceptual data modeling.
When a data modeler works with a client, his title may be logical data modeler, physical data modeler, or a combination of both. A logical data modeler designs the data model to suit business requirements, creates and maintains the lookup data, compares versions of the data model, maintains the change log, and generates reports from the data model, whereas a physical data modeler has to know the properties of the source and target databases.
A physical data modeler should have the technical know-how to create data models from existing databases, to tune the data models with referential integrity, alternate keys and indexes, and to match indexes to SQL code. It helps if the physical data modeler also knows about replication, clustering and so on.

Primary key constraint

A primary key constraint is imposed on a column to forbid null values and duplicate values. Primary key = unique + not null. Examples: social security number, bank account number, bank routing number.

Foreign key constraint

The parent table has a primary key, and a foreign key constraint is imposed on a column in the child table. The foreign key column values in the child table always refer to primary key values in the parent table.
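
A hedged sketch of both constraints in T-SQL, reusing the plant/department example from the ERD section above (names illustrative):

    -- Parent table: the primary key is unique and not null by definition.
    CREATE TABLE Plant (
        PlantID   INT PRIMARY KEY,
        PlantName VARCHAR(50)
    );

    -- Child table: every PlantID here must exist in the parent table.
    CREATE TABLE Department (
        DepartmentID INT PRIMARY KEY,
        PlantID      INT NOT NULL,
        CONSTRAINT FK_Department_Plant
            FOREIGN KEY (PlantID) REFERENCES Plant(PlantID)
    );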

Relational vs Dimensional

Relational data modeling is used in OLTP systems, which are transaction oriented; dimensional data modeling is used in OLAP systems, which are analytical. In a data warehouse environment, the staging area is designed on OLTP concepts, since data must be normalized, cleansed and profiled before being loaded into a data warehouse or data mart. In an OLTP environment, lookups are stored as independent detail tables, whereas in an OLAP environment like a data warehouse those independent tables are merged into a single dimension.

Steps to create a Data Model

1. Get Business requirements.
2. Create High Level Conceptual Data Model.
3. Create Logical Data Model.
4. Select target DBMS where data modeling tool creates the physical schema.
5. Create standard abbreviation document according to business standard.
6. Create domain.
7. Create Entity and add definitions.
8. Create attribute and add definitions.
9. Based on the analysis, try to create surrogate keys, super types and sub types.
10. Assign datatypes to attributes. If a domain is already present, attach the attribute to the domain.
11. Create primary or unique keys on attributes.
12. Create check constraints or defaults on attributes.
13. Create unique or bitmap indexes on attributes.
14. Create foreign key relationship between entities.
15. Create Physical Data Model.
16. Add database properties to physical data model.
17. Create SQL Scripts from Physical Data Model and forward that to DBA.
18. Maintain Logical and Physical Data Model.

Data Modeler Role

  • Business requirement analysis:
  • Interact with business analysts to get the functional requirements.
  • Interact with end users and find out the reporting needs.
  • Conduct interviews and brainstorming discussions with the project team to gather additional requirements.
  • Gather accurate data through data analysis and functional analysis.
  • Development of the data model:
  • Create a standard abbreviation document for logical, physical and dimensional data models.
  • Create logical, physical and dimensional data models (data warehouse data modelling).
  • Document the logical, physical and dimensional data models (data warehouse data modelling).

Reports

Generate reports from data model.

Review:

Review the data model with functional and technical team.

Creation of database:

Create SQL code from the data model and coordinate with DBAs to create the database.

Check that the data models and databases are in sync.

Support & Maintenance:

Assist developers, ETL, BI team and end users to understand the data model.

Maintain change log for each data model.

Data mart

Data marts are generally designed for a single subject area. An organization may have data pertaining to different departments like Finance, HR, Marketing etc. stored in data warehouse and each department may have separate data marts. These data marts can be built on top of the data warehouse.

Slowly changing dimensions (SCD)
SCD is an abbreviation of slowly changing dimension. SCD applies to cases where an attribute of a record varies over time.
There are three commonly used types of SCD.
1) SCD1: The new record replaces the original record. Only one record exists in the database – the current data.
2) SCD2: A new record is added to the dimension table. Two records exist in the database – the current data and the previous history data.
3) SCD3: The original record is modified to include the new data. One record exists in the database – the new information is stored alongside the old information in the same row.
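
A Type 2 sketch appears in the dimension section above; for comparison, here is a hedged T-SQL sketch of SCD1 and SCD3 against the same hypothetical DimCustomer table:

    -- SCD1: overwrite in place; history is lost.
    UPDATE DimCustomer SET City = 'Ann Arbor' WHERE CustomerID = 1001;

    -- SCD3: keep the prior value in an extra column of the same row.
    ALTER TABLE DimCustomer ADD PreviousCity VARCHAR(50);
    GO  -- the new column must be committed in its own batch before it is referenced

    UPDATE DimCustomer
    SET PreviousCity = City, City = 'Saline'
    WHERE CustomerID = 1001;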

ER Modeling and Dimensional Modeling

ER modeling is used for normalizing the OLTP database design. Dimensional modeling is used for de-normalizing the ROLAP/MOLAP design.

Cube

Cubes are multi-dimensional objects made of measures and dimensions.
Dimensions provide the structure of the cube; they come from the dimension tables in the star schema. Measures provide the data in the cube; they come from the fact tables in the star schema and are aggregated at multiple levels along each dimension.
Date Dimension: Year-> Quarter -> Month -> Day -> Hour
Understanding Hierarchies
All level (All Products) -> Category level (Accessories, Components) -> Product level (Accessories – bike stand, bike wash; Components – chain, front brakes)
Understanding Measures
Fact table data become cube measures (e.g. sales), analyzed against dimensions such as product, order date, and geography.
Understanding Aggregations
Performed across levels; roll-ups occur on all dimensions. Aggregations are calculated and the results are stored in the cube; they are performed across all levels of all hierarchies along all dimensions.

In other words, cubes are a logical representation of multidimensional data. The edges of the cube contain the dimension members and the body of the cube contains the data values.
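
A relational approximation of cube aggregation, as a hedged T-SQL sketch over the hypothetical FactSales/DimDate tables above (assuming DimDate also carries year/quarter/month columns): GROUP BY ROLLUP produces subtotals at every level of the hierarchy, much like a cube's pre-computed roll-ups.

    -- Subtotals at month, quarter, year, and grand-total levels in one pass.
    SELECT d.CalendarYear, d.CalendarQuarter, d.MonthNumber,
           SUM(f.GrossValue) AS SalesAmount
    FROM FactSales f
    JOIN DimDate   d ON d.DateKey = f.DateKey
    GROUP BY ROLLUP (d.CalendarYear, d.CalendarQuarter, d.MonthNumber);
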
Data warehousing and business intelligence
Data warehousing deals with all aspects of managing the development, implementation and operation of a data warehouse or data mart including meta data management, data acquisition, data cleansing, data transformation, storage management, data distribution, data archiving, operational reporting, analytical reporting, security management, backup/recovery planning, etc. Business intelligence, on the other hand, is a set of software tools that enable an organization to analyze measurable aspects of their business such as sales performance, profitability, operational efficiency, effectiveness of marketing campaigns, market penetration among certain customer groups, cost trends, anomalies and exceptions, etc. Typically, the term “business intelligence” is used to encompass OLAP, data visualization, data mining and query/reporting tools.
View vs. Materialized View
A view takes the output of a query and makes it appear like a virtual table and it can be used in place of tables.
A materialized view provides indirect access to table data by storing the results of a query in a separate schema object.
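
A hedged T-SQL sketch over the hypothetical FactSales table above. SQL Server implements materialized views as indexed views: the view must be schema-bound, include COUNT_BIG(*) when it aggregates, and receive a unique clustered index that persists the result (further SET option and determinism requirements apply).

    -- Ordinary view: just a saved query; nothing is stored.
    CREATE VIEW dbo.vSalesByProduct AS
    SELECT ProductKey, SUM(GrossValue) AS TotalSales
    FROM dbo.FactSales
    GROUP BY ProductKey;
    GO

    -- Indexed (materialized) view: the aggregated rows are physically stored.
    CREATE VIEW dbo.vSalesByProductMat WITH SCHEMABINDING AS
    SELECT ProductKey, SUM(GrossValue) AS TotalSales, COUNT_BIG(*) AS RowCnt
    FROM dbo.FactSales
    GROUP BY ProductKey;
    GO

    CREATE UNIQUE CLUSTERED INDEX IX_vSalesByProductMat
        ON dbo.vSalesByProductMat (ProductKey);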
ODS
ODS is an abbreviation of Operational Data Store: a database structure that is a repository for near-real-time operational data rather than long-term trend data. The ODS may further become the enterprise's shared operational database, allowing operational systems that are being re-engineered to use the ODS as their operational database.

ETL
ETL is an abbreviation of extract, transform, and load. ETL software enables businesses to consolidate their disparate data while moving it from place to place; it does not really matter that the data is in different forms or formats. The data can come from any source – ETL is powerful enough to handle such disparities. First, the extract function reads data from a specified source database and extracts a desired subset of data. Next, the transform function works with the acquired data – using rules or lookup tables, or creating combinations with other data – to convert it to the desired state. Finally, the load function writes the resulting data to a target database.
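
As a hedged sketch, the transform-and-load steps might look like this in plain T-SQL, assuming extraction has already landed raw rows in a hypothetical StagingSales table (all names illustrative):

    -- Transform: look up surrogate keys and derive the measure.
    -- Load: write the result into the fact table.
    INSERT INTO FactSales (ProductKey, CustomerKey, DateKey, QuantitySold, GrossValue)
    SELECT p.ProductKey,
           c.CustomerKey,
           d.DateKey,
           s.Quantity,
           s.Quantity * s.UnitPrice                         -- derived during transform
    FROM StagingSales s
    JOIN DimProduct  p ON p.ProductName  = s.ProductName    -- lookups map natural
    JOIN DimCustomer c ON c.CustomerName = s.CustomerName   -- keys to surrogate keys
    JOIN DimDate     d ON d.FullDate     = s.SaleDate;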

Summary
1)     ETL – Extract, Transform and Load
           SSIS – SQL Server Integration Services
2)     Analytics – Aggregation – Trending – Correlations
SSAS – SQL Server Analysis Services – develop the Analysis Services database, dimensions, and cubes using SSAS, deploy the cube, then connect to the server in SSMS and open the deployed cube. Select all dimensions and process them. During processing, data is pulled from the DWH and stored in the SSAS dimensions and cubes. Then browse the cube using the cube browser in BIDS.

Database development flow in analysis services

BIDS (Business Intelligence Development Studio) -> Analysis Services project

3)     Reporting
SSRS – SQL Server Reporting Services
SharePoint PerformancePoint
4)     PowerPivot – add-in for Excel
5)     SQL Server Management Studio – manage SQL Server
Whenever querying a specific table on a standby (read-only) server, use:
      SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED


Integration Services fundamentals
Integration Services uses – populate the DWH, clean and standardize data, merge data from multiple sources, automate administrative tasks.
Architecture
Performs data-related tasks.
The run-time engine executes packages in the design environment; logging is done through log providers. The package is the main executable item. Sub-items in a package are individual tasks, tasks inside containers, and event handlers (which execute only when an event is triggered). There are also predefined tasks which run as-is, as well as custom tasks.
The engine specifically for the data flow task is known as the pipeline engine. It is responsible for the physical movement of data: Source -> Transformation -> Destination
     
Lookup tables

A lookup table is a table matched against the target table on the target's primary key; it is used to update the target by letting through only modified (new or updated) records, based on the lookup condition.
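
A hedged T-SQL sketch of that pattern using MERGE, with a hypothetical StagingProduct feed (names illustrative, and assuming DimProduct also carries the natural key ProductCode): the lookup condition decides whether a source row updates an existing target row or is inserted as new.

    MERGE DimProduct AS tgt
    USING StagingProduct AS src
        ON tgt.ProductCode = src.ProductCode              -- lookup condition
    WHEN MATCHED AND tgt.ProductName <> src.ProductName THEN
        UPDATE SET tgt.ProductName = src.ProductName      -- modified record
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (ProductCode, ProductName)                 -- new record
        VALUES (src.ProductCode, src.ProductName);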

Aggregate tables

An aggregate table contains a summary of existing warehouse data, grouped to certain levels of dimensions. It is always easier to retrieve data from an aggregate table than to visit the original table with millions of records. Aggregate tables reduce the load on the database server, increase query performance, and return results quickly.
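
A hedged T-SQL sketch: pre-summarize the hypothetical FactSales table at the year/month/product level so that dashboard queries read a few thousand summary rows instead of millions of detail rows (again assuming year/month columns on DimDate).

    -- SELECT ... INTO creates and fills the aggregate table in one statement.
    SELECT d.CalendarYear, d.MonthNumber, f.ProductKey,
           SUM(f.QuantitySold) AS TotalQty,
           SUM(f.GrossValue)   AS TotalSales
    INTO   AggSalesMonthlyProduct
    FROM   FactSales f
    JOIN   DimDate d ON d.DateKey = f.DateKey
    GROUP BY d.CalendarYear, d.MonthNumber, f.ProductKey;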


GEN AI

  Stay Tuned....