Thursday, June 6, 2019

Python Beginner

Python is a programming language that lets you work more quickly and integrate your systems more effectively. (Source: https://www.python.org/)

Install Python (latest version)
https://www.python.org/->Downloads->Python 3.7.3

Best Tutorial to learn


Python ships with a simple IDE called IDLE, which includes an interactive Python shell.
Function:  A block of organized, reusable code used to perform a single, related action.
Argument:  A value you pass to a function when calling it. A value is a piece of data such as a letter or a number. A variable is a name that refers to a value; it's a container to hold data. A variable name begins with a letter or an underscore. An assignment statement creates a new variable and gives it a value.

Holding values in variables
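A small sketch tying these terms together (the names message, greet, and name are illustrative, not from any tutorial):

# An assignment statement: the variable message now refers to a string value
message = "Hello"

# A function: a block of reusable code that performs one related action
def greet(name):                   # name is the parameter
    print(message + ", " + name)   # uses the variable assigned above

greet("Python")                    # "Python" is the argument passed to the function
# prints: Hello, Python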

Hint in Prompts

Fun with python

Python Data Types: int, float, and complex belong to Numbers.
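A quick check in the IDLE shell (the literals are arbitrary examples):

type(1)        # <class 'int'>
type(1.0)      # <class 'float'>
type(1 + 2j)   # <class 'complex'>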



Thursday, May 9, 2019

SSIS interview Questions 2019

1) Difference between Char and VARCHAR
VARCHAR is variable-length and uses dynamic memory allocation. For example, if you declare a variable/column of VARCHAR(100), it takes a number of bytes equal to the number of characters actually stored in it (plus two bytes of length overhead). So if you store only one character in this column it takes only one byte of data space, and if you store 100 characters it takes 100 bytes. It can store a maximum of 8,000 non-Unicode characters. (See the DATALENGTH sketch after the list below.)
CHAR is fixed-length; the string length in bytes must be a value from 1 through 8,000. It uses static memory allocation. For example, if you declare a variable/column of CHAR(100), it always takes 100 bytes regardless of whether you store 1 character or 100 characters in it. It can store a maximum of 8,000 characters.
CHAR(x) column can only have exactly x characters.
VARCHAR(x) column can have up to x characters.
  • Use char when the sizes of the column data entries are consistent.
  • Use varchar when the sizes of the column data entries vary considerably.
  • Use varchar(max) when the sizes of the column data entries vary considerably, and the string length might exceed 8,000 bytes.
  • NCHAR[(n)] - fixed-length Unicode character data type. Maximum of 4,000 characters; each character takes 2 bytes.
  • NVARCHAR[(n)] - variable-length Unicode character data type. Maximum of 4,000 characters; each character takes 2 bytes.
  • Unicode covers double-byte character sets such as Japanese and Korean.
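A quick way to see the storage difference described above (a sketch; the variable names are illustrative):

DECLARE @fixed CHAR(100) = 'a', @variable VARCHAR(100) = 'a';
-- CHAR pads to the declared length; VARCHAR stores only the actual data
SELECT DATALENGTH(@fixed)    AS char_bytes,     -- 100
       DATALENGTH(@variable) AS varchar_bytes;  -- 1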
2) Difference between PRIMARY KEY and UNIQUE KEY:

Primary key- It doesn’t allow NULL values. PRIMARY KEY = UNIQUE KEY + NOT NULL constraint. By default it adds a clustered index. A table can have only one PRIMARY KEY, which may span multiple columns.
Unique Key- Allows a NULL value, but only one. By default it adds a unique non-clustered index. A table can have more than one UNIQUE key column/constraint.
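A minimal sketch showing both constraints on one table (the table and column names are illustrative):

CREATE TABLE Employee
(
    EmpID INT PRIMARY KEY,     -- NOT NULL + unique; clustered index by default
    Email VARCHAR(100) UNIQUE  -- allows a single NULL; non-clustered index by default
);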
3)
 create table table1
(
col1 int,
col2 varchar(100)
)
can we create clustered index in this table?

Yes, we can create a clustered index; a primary key is not required first.
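For example (a sketch; the index name is illustrative):

CREATE CLUSTERED INDEX IX_table1_col1 ON table1 (col1);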

Heap tables are tables without a clustered index. A table in SQL Server that has a clustered index is called a clustered table; without one, it is called a heap table. In a heap table, the data is not sorted in any way; it is just a pile of unordered, unstructured records.

4) Clustered AND NON clustered index
https://www.youtube.com/watch?v=NGslt99VOCw
5) Lookup vs Left Join
In a star schema design, to fetch dimension surrogate keys from the dimension tables into the fact table, choose between the “Lookup” component in SSIS and a T-SQL “JOIN”.

 SSIS handles this well even when the reference dataset comes from a heterogeneous source such as a text file or DB2. T-SQL is better suited to more complicated business logic, especially where there is a dependency between two datasets. Use Full Cache mode for good performance, and do NOT send unnecessary columns downstream. Full Cache mode caches the reference data in RAM for fast access and manipulation, avoiding repeated trips to disk.


A left join is recommended when the SSIS buffer cannot hold all the dimension data (>1M rows), when sorts, aggregations, etc. are needed, or when the load has complicated dependencies. Consider an index on the source table in the ETL solution when using a left join (see the sketch below).
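A hedged sketch of the left-join approach to surrogate-key lookup (all table and column names are illustrative):

-- Stage rows are matched to the dimension; unmatched rows get the unknown member (-1)
SELECT s.OrderID,
       ISNULL(d.CustomerSK, -1) AS CustomerSK,  -- surrogate key from the dimension
       s.Amount
FROM StageSales s
LEFT JOIN DimCustomer d
    ON d.CustomerBusinessKey = s.CustomerID;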


6) DECLARE @i CHAR(4000)
SET @i = 'jis'
Is this possible?
Yes, it is. But CHAR(4000) always allocates the full 4,000 bytes, so it is not recommended.
7) What is table partition?
Table partitioning is a way to divide a large table into smaller, more manageable parts without having to create separate tables for each part. Data in a partitioned table is physically stored in groups of rows called partitions and each partition can be accessed and maintained separately.
Why partition?


Primarily, performance reasons. Because the table is split up, data can be retrieved faster and in a more organized way when the query optimizer can target a single partition. For example, a WHERE clause on the partitioning column (say, a BIT column = 1) searches only that partition, which cuts back seeks and scans dramatically. Partitioning is best used on tables with an extremely high number of records.
How?
Right click on table-> Storage->Create partition
choose the column to partition.
create partition function.
create partition scheme.
Decide between a Left or Right boundary. With a Left boundary, each boundary value belongs to the partition on its left (the partition holds values less than or equal to the boundary); with a Right boundary, each boundary value belongs to the partition on its right. Choose a File Group to partition to, and ensure you have more than one File Group: one for your conditional boundary records and another for your non-conditional records. Finally, choose to either script out your selections, run the partitioning immediately, or schedule the partitioning for the future. The same steps can be written in T-SQL, as in the sketch below.
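A T-SQL sketch of the same steps (the function, scheme, table, and filegroup names are illustrative, and the filegroups must already exist):

-- 1. Partition function: RANGE RIGHT puts each boundary value into the partition on its right
CREATE PARTITION FUNCTION pfOrderYear (DATE)
AS RANGE RIGHT FOR VALUES ('2018-01-01', '2019-01-01');

-- 2. Partition scheme: maps the three resulting partitions to filegroups
CREATE PARTITION SCHEME psOrderYear
AS PARTITION pfOrderYear TO (FG2017, FG2018, FG2019);

-- 3. Create the table on the scheme, partitioned by OrderDate
CREATE TABLE Orders
(
    OrderID   INT  NOT NULL,
    OrderDate DATE NOT NULL
) ON psOrderYear (OrderDate);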

8) Why Backup and restore?
Backup- To protect databases against data loss caused by a variety of failures such as media failure, user errors, hardware failures, natural disasters etc. Test your strategy by restoring a set of backups and then recovering your database to prepare you to respond effectively to a disaster.
Restore-A multi-phase process that copies all the data and log pages from a specified SQL Server backup to a specified database, and then rolls forward all the transactions that are logged in the backup by applying logged changes to bring the data forward in time.

Advantages: maximizes data availability and minimizes data loss.

We can create and restore backups using SSMS and T-SQL.
The RPO (Recovery-Point-Objective) and RTO (Recovery-Time-Objective) are vital parameters to consider when planning the backup strategy.
The recovery point objective (RPO) is the point in time that data can be restored to; the recovery time objective (RTO) is the amount of time taken to perform the restoration.
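Both operations in T-SQL (a sketch; the database name and path are illustrative):

-- Full backup to disk
BACKUP DATABASE SalesDB
TO DISK = 'D:\Backups\SalesDB_Full.bak';

-- Restore from that backup, replacing the existing database
RESTORE DATABASE SalesDB
FROM DISK = 'D:\Backups\SalesDB_Full.bak'
WITH REPLACE;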

9)Stored Procedures and functions


Functions: A function allows only SELECT statements; it does not allow DML statements. It allows only input parameters and does not support output parameters. Transactions are not allowed within functions. Stored procedures cannot be called from a function, but a function can be called from a SELECT statement. User-defined functions can be used in a JOIN clause as a result set. They are useful with specialized data types such as images and geometric objects.

Stored procedures can have SELECT statements as well as DML statements such as INSERT, UPDATE, and DELETE. They can have both input and output parameters. Transactions can be used in SPs. Table variables and temporary tables can be used. SPs can call functions. Procedures cannot be called from SELECT/WHERE/HAVING statements; the EXECUTE/EXEC statement is used to call a stored procedure, and a procedure cannot be used in a JOIN clause. A stored procedure is prepared SQL code saved so you can reuse it repeatedly: instead of running the same SQL code again and again, you pass parameters, and the procedure acts according to the parameter values passed.
Usage of Stored Procedures in SSIS - preparing tables (truncate), audit tasks (usually part of SSIS framework), getting configuration values for loops 
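A minimal pair illustrating the difference (all names are illustrative):

-- A scalar user-defined function: input parameters only, callable from a SELECT
CREATE FUNCTION dbo.fnGrossPrice (@price MONEY, @taxRate DECIMAL(4,2))
RETURNS MONEY
AS
BEGIN
    RETURN @price * (1 + @taxRate);
END;
GO

-- A stored procedure: DML allowed, OUTPUT parameter supported, called with EXEC
CREATE PROCEDURE dbo.uspRaiseSalary
    @EmpID INT,
    @NewSalary MONEY OUTPUT
AS
BEGIN
    UPDATE Employee SET Salary = Salary * 1.10 WHERE EmpID = @EmpID;
    SELECT @NewSalary = Salary FROM Employee WHERE EmpID = @EmpID;
END;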

10) difference between parameters and variables in ssis
In simple words
value that you pass is a PARAMETER
value that you declare is a VARIABLE
Parameters: the value of a parameter can be set at development time, after the package is deployed to the SSIS server (via SSMS), or at run time (SSMS or a SQL Server Agent job).

Project: available to all packages within the SSIS project (like global parameters). Project parameters are new with SSIS 2012.  A project parameter can be shared among all of the packages in an SSIS project.  You want to use a project parameter when, at run time, the value is the same for all packages.
All parameters can be specified in project.params.
Usage -   easy to share a given value, such as the path to a file share or the name of a server, across all packages in a project.

 Package. These parameters affect only the package on which they were defined. Package parameters are also new with SSIS 2012.  A package parameter is exactly the same as a project parameter, except that its scope is the individual package it resides in.  You want to use a package parameter when, at run time, the value is different for each package.
 Parameters do *not* have an expressions property to define their value. They are intended to hold a literal value which does not change while the package executes; the value of a parameter remains the same for the entire execution of the package.
Variables: the value of a variable can be set in many ways within a package and can be referenced by containers, expressions, tasks, components, or event handlers. The value of a variable can change during the execution of the package, which makes variables suitable for dynamic scenarios.
Eg: used in conditional, looping tasks
SSIS->VARIABLES

Parameters do not support the data types Char, DBNull, and Object.

Within a package, SSIS variables have an expression property.  The expression property, and the ability to change values during the execution of a package, are the two fundamental differences variables have from parameters.  A variable can be scoped to the package or to an object within the package. Variables often consume values from parameters within their expressions, and can pass values to objects like Execute SQL Tasks, Foreach Loop containers, Send Mail tasks, and so on.
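For instance, the expression property of a variable can consume a project parameter like this (a sketch; the parameter and variable names are illustrative):

Expression for the variable User::ArchivePath:
@[$Project::RootFolder] + "\\Archive\\" + @[User::FileName]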


https://www.sqlchick.com/entries/2013/9/15/getting-started-with-parameters-variables-configurations-in.html


11) difference between "fuzzy lookup" and "fuzzy grouping"

The Fuzzy Grouping task performs the same operations as the Fuzzy Lookup task but instead of evaluating input records against an outside reference table, the input set becomes the reference. Input records are therefore evaluated against other records in the input set and evaluated for similarity and assigned to a group.
The very basic purpose of the Fuzzy Lookup is to match input data to a lookup table whose columns you are matching across that do not necessarily match exactly. The Fuzzy Lookup Transformation is therefore very similar to the Lookup Transformation, except you are not joining with identical values; you are joining with similar values. 
12) Difference between timestamp, smalldatetime [4-byte storage, YYYY-MM-DD hh:mm:ss, accuracy 1 minute], datetime [8-byte storage, YYYY-MM-DD hh:mm:ss.nnn, accuracy 0.0033 seconds], and datetime2 [6 to 8-byte storage, accuracy 100 nanoseconds]. DateTime2 is the new data type introduced in SQL Server 2008 for storing date and time values.

WHY DATETIME2?
Use the time, date, datetime2, and datetimeoffset data types for new work. These types align with the SQL Standard and are more portable. time, datetime2, and datetimeoffset provide more seconds precision, and datetimeoffset provides time zone support for globally deployed applications. [MSDN] datetime2 has a larger date range, a larger default fractional precision, and optional user-specified precision. Also, depending on the user-specified precision, it may use less storage. (Ref: Stack Overflow)
 DATETIME2 HAS
  1. larger range of values
  2. better Accuracy (100 nanoseconds)
  3. smaller storage space (6 to 8 bytes)
  4. SELECT SYSDATETIME() returns the DB's current date and time as a DateTime2 value, whereas SELECT GETDATE() returns it as a DateTime value.

SELECT CURRENT_TIMESTAMP
2019-05-09 05:49:06.090
SELECT SYSDATETIME() AS DATETIME2
2019-05-09 05:49:06.0905358
SELECT GETDATE() AS DATETIME
2019-05-09 05:49:06.090


TIMESTAMP
Note: despite its name, CURRENT_TIMESTAMP returns a datetime value (it is the ANSI SQL equivalent of GETDATE()), while the SQL Server timestamp data type is a synonym for rowversion, a binary row-version number unrelated to dates and times.
SELECT 
    CURRENT_TIMESTAMP AS current_date_time;
current_date_time
2019-05-08 15:30:57.543
Examples:

DECLARE @dt DATETIME
SELECT @dt = '2019-05-08 23:22:12'
SELECT @dt AS [DateTime],
CAST(@dt AS DATE) AS [Date]

DECLARE @Date DATETIME
SET @Date = GETDATE()
PRINT @Date

May  6 2019  8:22PM

SELECT GETDATE()
2019-05-06 20:46:34.327

SELECT SYSDATETIME()

2019-05-06 20:47:13.1734039

DECLARE @Date DATETIME2
SET @Date = GETDATE()
PRINT @Date

2019-05-06 20:25:15.4330000
To get the time from today's date:
SELECT CONVERT(TIME, GETDATE())
      AS 'Current TIME using GETDATE()'
Current TIME using GETDATE() 22:30:14.5170000
SELECT CONVERT (DATE, GETDATE()) 'Date Part Only'
Date Part Only 2019-05-09

We cannot add or subtract days with the + operator on DateTime2, but we can with DateTime. For DateTime2, use the DATEADD function (see the sketch after the example below).

DECLARE @nowDateTime DATETIME = GETDATE()
SELECT @nowDateTime + 1

2019-05-07 20:49:43.287
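The DateTime2 equivalent uses DATEADD (a sketch):

DECLARE @now DATETIME2 = SYSDATETIME()
-- datetime2 does not support the + operator, so use DATEADD
SELECT DATEADD(DAY, 1, @now) AS Tomorrow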

DateTime2 with a fractional-seconds precision of 3 has the same precision as the DateTime data type. DateTime2(3) uses 7 bytes of storage instead of the 8 bytes the old DateTime type uses, and it also provides a larger date range (0001-01-01 to 9999-12-31) than DateTime.
SELECT CONVERT(DATE,GETDATE())
2019-05-08
sp_help 'table_name' will show all the details.
13) How to handle SQL server errors?
It includes using common functions to return information about the error and using the TRY...CATCH block in stored procedures and transactions, as sketched below.
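A minimal sketch of that pattern with the standard error functions:

BEGIN TRY
    SELECT 1 / 0;  -- force a divide-by-zero error
END TRY
BEGIN CATCH
    SELECT ERROR_NUMBER()   AS ErrNumber,
           ERROR_SEVERITY() AS ErrSeverity,
           ERROR_LINE()     AS ErrLine,
           ERROR_MESSAGE()  AS ErrMessage;
END CATCH;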

14) Will Triggers affect performance of a table?

 Yes. The database will have more work to do, so performance will be impacted, though the impact may be unnoticeable to your end users. Triggers are mostly used for maintaining the integrity of the information in the database. They affect the performance of insert, update, and delete operations, but they don't affect read performance.
Triggers help performance most when they replace several SQL statements that would otherwise cross a comparatively slow network. Whenever the database server executes an SQL statement, it must also determine whether any triggers must be fired.
Magic tables in triggers
Usually, the inserted and deleted tables are called magic tables. These tables are automatically created and managed by SQL Server internally to hold recently inserted, deleted, and updated values during DML operations (insert, update, and delete) on a database table.
Inserted magic table
The inserted table holds the recently inserted values, in other words the new data values. Hence recently added records are placed in the inserted table.
Deleted magic table
The deleted table holds the recently deleted or updated values, in other words the old data values. Hence the old versions of updated and deleted records are placed in the deleted table.
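A sketch of an audit trigger reading both magic tables (the table and column names are illustrative):

CREATE TRIGGER trgEmployeeSalaryAudit
ON Employee
AFTER UPDATE
AS
BEGIN
    -- inserted holds the new values, deleted holds the old values
    INSERT INTO SalaryAudit (EmpID, OldSalary, NewSalary, ChangedAt)
    SELECT d.EmpID, d.Salary, i.Salary, GETDATE()
    FROM inserted i
    JOIN deleted d ON d.EmpID = i.EmpID
    WHERE i.Salary <> d.Salary;
END;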





15) maximum number of columns in a table?
Wide tables are considered denormalized tables; non-wide tables are considered normalized tables. A wide table can contain 30,000 columns; a non-wide (basic) table can contain only 1,024 columns.
SQL Server supports a maximum of 1,024 Columns in a base table (non wide)
SQL Server supports a maximum of 30,000 Columns in a base table (wide) 
Maximum number of triggers on a table-300
Clustered indexes per table-1
16) will insert work in functions?
no
17) will insert work in procedures?
yes
18) Second Highest Salary of an employee
select MAX(Salary) from Customer where Salary < (Select max(Salary) from Customer)
SELECT TOP 1 Salary from
(SELECT DISTINCT TOP 2 Salary from Customer ORDER BY Salary desc) T 
ORDER BY Salary ASC

To get all details,
SELECT TOP 1* from (SELECT DISTINCT TOP 2* FROM Customer order by Salary DESC) T
order by Salary ASC

4th highest salary of employee?


SELECT TOP 1* from (SELECT DISTINCT TOP 4* FROM Customer order by Salary DESC) T

order by Salary ASC



nth highest salary of employee (replace n with the desired rank; "TOP n" is a placeholder, not valid T-SQL). An alternative using DENSE_RANK follows below.

SELECT TOP 1* from (SELECT DISTINCT TOP n* FROM Customer order by Salary DESC) T
order by Salary ASC
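The DENSE_RANK alternative, shown here for the 4th highest (a sketch against the same Customer table):

SELECT *
FROM (SELECT *, DENSE_RANK() OVER (ORDER BY Salary DESC) AS rnk
      FROM Customer) T
WHERE rnk = 4;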


19) How many clustered index in one table?
1. But we can use more than one column to make a composite unique clustered index.
How many non clustered index in one table?
999 - SQL Server 2008 and later
249 - SQL Server 2005

20) Can we compare datetime and timestamp columns?
No. Despite its name, the timestamp data type is a synonym for rowversion, a binary row-version number rather than a date/time value, so it cannot be meaningfully compared with datetime.

21) Transformations in SSIS
DATACONVERSION: Converts column data types from one type to another. It performs an explicit column conversion.
DATAMININGQUERY: Used to perform data mining query against analysis services and manage Predictions Graphs and Controls.
DERIVEDCOLUMN: Create a new (computed) column from given expressions.
EXPORTCOLUMN: Used to export an Image specific column from the database to a flat file.
FUZZYGROUPING: Used for data cleansing by finding rows that are likely duplicates.
FUZZYLOOKUP: Used for Pattern Matching and Ranking based on fuzzy logic.
AGGREGATE: It applies aggregate functions to Record Sets to produce new output records from aggregated values.
AUDIT: Adds Package and Task level Metadata: such as Machine Name, Execution Instance, Package Name, Package ID, etc.
CHARACTERMAP: Performs SQL Server column level string operations such as changing data from lower case to upper case.
MULTICAST: Sends a copy of the supplied data source to multiple destinations. This transformation sends output to multiple output paths with no conditions, unlike Conditional Split. It takes ONE input, makes copies of the data, and passes the same data through many outputs. In short: give one input and take many outputs of the same data.
CONDITIONALSPLIT: Separates the available input into separate output pipelines based on Boolean expressions configured for each output. It functions as an if...then...else construct, sending input rows to whichever conditional branch they satisfy. For example, you may want to split employees by salary: less than 50000 versus greater than or equal to 50000. You can give each condition a name that easily identifies its purpose; rows matching no condition go to the Default Output. After you configure the component and connect it to a subsequent transformation/destination, a dialog box pops up to let you choose which conditional output applies to that destination.
COPYCOLUMN: Add a copy of column to the output we can later transform the copy keeping the original for auditing.
IMPORTCOLUMN: Reads data (such as images) from files and adds it to columns in the data flow; the reverse of Export Column.
LOOKUP: Performs the lookup (searching) of a given reference object set to a data source. It is used for exact matches only.
MERGE: Merges two sorted data sets into a single data set in a single data flow. The Merge transformation merges two paths into one path. It is useful when you want to break data out into a path that handles errors and then, after the errors are handled, merge it back into the downstream flow, or when you want to merge two data sources. It is similar to the Union All transformation, but the data must be sorted, and data type, data length, and other metadata attributes must match before merging.
MERGEJOIN: Merges two data sets into a single dataset using a join. The Merge Join transformation merges output from two inputs, performing an INNER or OUTER join on the data. But if the data comes from one OLE DB data source, it is better to join via a SQL query rather than a Merge Join transformation; Merge Join is intended for joining two different data sources.
ROWCOUNT: Stores the resulting row count from the data flow / transformation into a variable.
ROWSAMPLING: Captures sample data by using a row count of the total rows in dataflow specified by rows or percentage.
UNIONALL: Merge multiple data sets into a single dataset.
PIVOT: Used for Normalization of data sources to reduce anomalies by converting rows into columns
UNPIVOT: Used for de-normalizing the data structure by converts columns into rows in case of building Data Warehouses
22) What are Precedence Constraints?
A task will only execute if the condition set by the precedence constraint preceding it is met.  Using these constraints, the package can choose different execution paths depending on the success or failure of other tasks.
  • Success – Workflow will proceed when the preceding container executes successfully. Indicated in control flow by a solid green line.
  • Failure – Workflow will proceed when the preceding container’s execution results in a failure. Indicated in control flow by a solid red line.
  • Completion – Workflow will proceed when the preceding container’s execution completes, regardless of success or failure. Indicated in control flow by a solid blue line.
  • Expression/Constraint with logical AND – workflow will proceed when specified expression and constraints evaluate to true. Indicated in control flow by a solid color line along with a small ‘fx’ icon next to it.  Color of line depends on logical constraint chosen (e.g. success = green, completion = blue).
23) what is Checkpoint?
A checkpoint is the property in SSIS which enables a package to restart from the point of failure. When we set the property to true, the package creates a checkpoint file which stores information about package execution and is used to restart the package from the point of failure. If the package runs successfully, the checkpoint file is deleted and then re-created the next time the package runs. Note that checkpoints do not save the state of Foreach and For loops. They're like safeguards for the project, in case something doesn't go according to plan.
24) Explain what breakpoints are and how you would use them.
Breakpoints put pauses in your package. They are a great tool for debugging a package because you can place a breakpoint on a task and it will pause the package based on execution events.
One situation in which I have used breakpoints is when I have a looping container and want to see how my variables are changed by the loop. I add a watch window on the package, type the variable name in, and set a breakpoint on the container to stop after each iteration of the loop.
25) How do you handle errors in ssis?
When a data flow component applies a transformation to column data, extracts data from sources, or loads data into destinations, errors can occur. Errors frequently occur because of unexpected data values.
Errors typically fall into one of the following categories:
Data conversion errors: occur if a conversion results in the loss of significant digits, the loss of insignificant digits, or the truncation of strings, or if the requested conversion is not supported.
Expression evaluation errors: occur if expressions that are evaluated at run time perform invalid operations or become syntactically incorrect because of missing or incorrect data values.
Lookup errors: occur if a lookup operation fails to locate a match in the lookup table.
Many data flow components support error outputs, which let you control how the component handles row-level errors in both incoming and outgoing data. You specify how the component behaves when truncation or an error occurs by setting options on individual columns in the input or output.


26) Mention how would you deploy an SSIS package on production?
To deploy an SSIS package we need to execute the manifest file and decide whether to deploy it into the file system or onto SQL Server. Alternatively, you can import the package from SQL Server or the file system via SSMS.

 27) Differentiate between Union All and Merge operations.

Merge transformation can accept only two inputs whereas Union all can take more than two inputs.
Data has to be sorted before Merge Transformation whereas Union all doesn’t have any condition like that.
Data from two paths can be merged into a single path by the Merge transformation. This transform is useful when data flows through a path that handles certain errors and is then merged back into the main data source.
Prior to the Merge transformation the data must be sorted, which can be done with a Sort transformation. The metadata must be the same for all paths; for example, the customer id field cannot be numeric in one path and character in another.
The Union All transformation works like the Merge transformation, but sorted data is not required. The output from multiple sources is combined into a single result set.
28) What are Lookup Caches?
The full cache mode (The database is queried once during the pre-execute phase of the data flow. The entire reference set is pulled into memory.) setting is the default cache mode selection in the SSIS lookup transformation. Like the name implies, full cache mode will cause the lookup transformation to retrieve and store in SSIS cache the entire set of data from the specified lookup location. As a result, the data flow in which the lookup transformation resides will not start processing any data buffers until all of the rows from the lookup query have been cached in SSIS. (With a moderately sized set of reference data, a lookup transformation using full cache mode usually performs well. Full cache mode does not require multiple round trips to the database, since the entire reference result set is cached prior to data flow execution.)
Full cache mode is ideally useful when one or all of the following conditions exist:
  • The size of the reference data set is small to moderately sized
  • The size of the pipeline data set (the data you are comparing to the lookup table) is large, is unknown at design time, or is unpredictable
  • Each distinct key value(s) in the pipeline data set is expected to be found multiple times in that set of data
When using the partial cache (In this mode, the lookup cache starts off empty at the beginning of the data flow. When a new row comes in, the lookup transform checks its cache for the matching values. If no match is found, it queries the database. If the match is found at the database, the values are cached so they can be used the next time a matching row comes in.) setting, lookup values will still be cached, but only as each distinct value is encountered in the data flow. Initially, each distinct value will be retrieved individually from the specified source, and then cached. To be clear, this is a row-by-row lookup for each distinct key value(s).


Using partial cache mode is ideally suited for the conditions below:

  • The size of the data in the pipeline (more specifically, the number of distinct key values) is relatively small
  • The size of the lookup data is too large to effectively store in cache
  • The lookup source is well indexed to allow for fast retrieval of row-by-row values

Selecting no-cache mode will not add any values to the lookup cache in SSIS. As a result, every single row in the pipeline data set will require a query against the lookup source. Since no data is cached, it is possible to save a small amount of overhead in SSIS memory in cases where key values are not reused.


Use no-cache mode only when all of the conditions below are true:

  • The reference data set is too large to reasonably be loaded into SSIS memory
  • The pipeline data set is small and is not expected to grow
  • There are expected to be very few or no duplicates of the key values(s) in the pipeline data set (i.e., there would be no benefit from caching these values)

29) Difference between Synchronous Transformation and Asynchronous Transformation

Synchronous Transformations:
  • A synchronous transformation processes incoming rows and passes them on in the data flow one row at a time. Output is synchronous with input, it occurs at the same time. Eg: Data Conversion transformation. For each incoming row, it converts the value in the specified column and sends the row on its way.
Asynchronous Transformations:
  • The output buffer or output rows are not in sync with the input buffer; output rows use a new buffer. Eg: Sort, Merge, Aggregate.
  • Asynchronous components can further be divided into the two types described below:
    Partially Blocking Transformation – the output set may differ in terms of quantity from the input set. Thus new buffers need to be created to accommodate the newly created set.
    Blocking Transformation – a transformation that must hold all of its input buffers before it can pass any buffer down the pipeline. All input records must be read and processed before any output records are created. For example, a Sort transformation must see all rows before sorting, and it blocks any data buffers from being passed down the pipeline until the output is generated.
30) Difference between Temporary Table (#) and Table Variable (@) in Sql Server
Both are used for temporarily holding/storing a result set for further processing, and both are stored in tempdb. A temporary table's structure can be changed after its creation, meaning we can use DDL statements (ALTER, CREATE, DROP); table variables don't support DDL statements. Temporary tables are not allowed in user-defined functions, but table variables can be used there. A temporary table supports adding indexes after creation; a table variable doesn't. There are local temporary tables (#), dropped automatically once the session ends, and global temporary tables (##), visible to all other sessions.
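Side by side (a sketch):

-- Temporary table: DDL allowed, indexes can be added after creation
CREATE TABLE #Totals (CustomerID INT, Total MONEY);
CREATE INDEX IX_Totals ON #Totals (CustomerID);

-- Table variable: declared in one statement, no DDL afterwards, usable in functions
DECLARE @Totals TABLE (CustomerID INT, Total MONEY);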


31)  what is the maximum value which a char datatype can hold?
Variables
DECLARE @i char(8000) -----maximum value which char can hold is 8000
set @i='jis'
print @i
Result: jis
32) types of Datatypes in sql server
Numeric, datetime, character, binary
33) Types of system Databases

System Databases
master- Records all system-level information for an instance of SQL Server.
msdb- Used by SQL Server Agent for scheduling alerts and jobs.
model- The template for new databases. Modifications made to the model database, such as database size, collation, recovery model, and other database options, are applied to any databases created afterward.
tempdb- Used for holding temporary objects or intermediate result-sets.


34) What are the Types of joins
  1. INNER JOIN
  2. LEFT OUTER JOIN
  3. RIGHT OUTER JOIN
  4. FULL OUTER JOIN
  5. CROSS JOIN - Cross join is also referred to as a Cartesian product. For every row in the LEFT table of the CROSS JOIN, all the rows from the RIGHT table are returned, and vice versa (i.e. the result has the Cartesian product of the rows from the joined tables).
  6. SELF JOIN-If a Table is joined to itself using one of the join types explained above, then such a type of join is called SELF JOIN
The FULL OUTER JOIN keyword returns all records when there is a match in either the left (table1) or the right (table2) table.

SELECT column-names

FROM table-name1 FULL JOIN table-name2
ON column-name1 = column-name2
WHERE condition
Note: FULL OUTER JOIN can potentially return very large result-sets!
The LEFT JOIN keyword returns all records from the left table (table1), and the matched records from the right table (table2). The result is NULL from the right side, if there is no match.

SELECT column-names

FROM table-name1 LEFT JOIN table-name2
ON column-name1 = column-name2
WHERE condition
The RIGHT JOIN keyword returns all records from the right table (table2), and the matched records from the left table (table1). The result is NULL from the left side, when there is no match.
A self JOIN is a regular join, but the table is joined with itself.
The INNER JOIN keyword selects records that have matching values in both tables.



SELECT column-names
FROM table-name1 INNER JOIN table-name2
ON column-name1 = column-name2
WHERE condition

Eg:

Table 1


+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Jis      |  32 | Bhandup   |  2000.00 |
|  2 | Jomini   |  25 | Delhi     |  1500.00 |
|  3 | Komalam  |  23 | Kota      |  2000.00 |
|  4 | Chaithan |  25 | Mumbai    |  6500.00 |
|  5 | Haven    |  27 | Bhopal    |  8500.00 |
|  6 | Kimma    |  22 | MP        |  4500.00 |
+----+----------+-----+-----------+----------+

Table 2:
+-----+------------+-------------+--------+
| OID | DATE       | CUSTOMER_ID | AMOUNT |
+-----+------------+-------------+--------+
| 102 | 2009-10-08 |           3 |   3000 |
| 100 | 2009-10-08 |           3 |   1500 |
| 101 | 2009-11-20 |           2 |   1560 |
| 103 | 2008-05-20 |           4 |   2060 |
+-----+------------+-------------+--------+
SELECT ID, NAME, AGE, AMOUNT
FROM Table1, Table2
WHERE Table1.ID = Table2.CUSTOMER_ID;


+----+----------+-----+--------+
| ID | NAME     | AGE | AMOUNT |
+----+----------+-----+--------+
|  3 | Komalam  |  23 |   3000 |
|  3 | Komalam  |  23 |   1500 |
|  2 | Jomini   |  25 |   1560 |
|  4 | Chaithan |  25 |   2060 |
+----+----------+-----+--------+
SELECT ID, NAME, AMOUNT, DATE
FROM Table1 LEFT JOIN Table2
ON Table1.ID = Table2.CUSTOMER_ID;


+----+----------+--------+------------+
| ID | NAME     | AMOUNT | DATE       |
+----+----------+--------+------------+
|  1 | Jis      |   NULL | NULL       |
|  2 | Jomini   |   1560 | 2009-11-20 |
|  3 | Komalam  |   3000 | 2009-10-08 |
|  3 | Komalam  |   1500 | 2009-10-08 |
|  4 | Chaithan |   2060 | 2008-05-20 |
|  5 | Haven    |   NULL | NULL       |
|  6 | Kimma    |   NULL | NULL       |
+----+----------+--------+------------+
SELECT ID, NAME, AMOUNT, DATE
FROM Table1 RIGHT JOIN Table2
ON Table1.ID = Table2.CUSTOMER_ID;


+----+----------+--------+------------+
| ID | NAME     | AMOUNT | DATE       |
+----+----------+--------+------------+
|  3 | Komalam  |   3000 | 2009-10-08 |
|  3 | Komalam  |   1500 | 2009-10-08 |
|  2 | Jomini   |   1560 | 2009-11-20 |
|  4 | Chaithan |   2060 | 2008-05-20 |
+----+----------+--------+------------+
SELECT ID, NAME, AMOUNT, DATE
FROM Table1 FULL JOIN Table2
ON Table1.ID = Table2.CUSTOMER_ID;


+------+----------+--------+------------+
| ID   | NAME     | AMOUNT | DATE       |
+------+----------+--------+------------+
|    1 | Jis      |   NULL | NULL       |
|    2 | Jomini   |   1560 | 2009-11-20 |
|    3 | Komalam  |   3000 | 2009-10-08 |
|    3 | Komalam  |   1500 | 2009-10-08 |
|    4 | Chaithan |   2060 | 2008-05-20 |
|    5 | Haven    |   NULL | NULL       |
|    6 | Kimma    |   NULL | NULL       |
+------+----------+--------+------------+
(Every order here has a matching customer, so the FULL JOIN returns the same rows as the LEFT JOIN; unmatched rows from either side would appear with NULLs.)
35) Union Vs Union All

Union  is used to combine multiple result sets into one result set and will remove any duplicates rows that exist.
Union All - is used to combine multiple result sets into one result set, but it does not remove any duplicate rows. 
If we know that all the records returned by our query are unique, use the UNION ALL operator instead of UNION to improve performance: UNION must sort and remove duplicates, which affects performance.
Both require the same number of columns with compatible data types; an ORDER BY clause applies to the overall result set.

36) Clustered and Non Clustered Index

A clustered index is a special type of index that reorders the way records in the table are physically stored. Therefore a table can have only one clustered index, because the data can only be stored and sorted one way per table. The leaf nodes of a clustered index contain the data pages. It is like a dictionary, where all words are sorted in alphabetical order.
The Non-Clustered index is an index structure separate from the data stored in a table that reorders one or more selected columns. The non-clustered index is created to improve the performance of frequently used queries not covered by clustered index.
 The number of columns in the clustered (or non clustered) index can have significant performance implications with heavy INSERT, UPDATE and DELETE activity in your database.

 A table that does not have a clustered index is referred to as a HEAP and a table that has a clustered index is referred to as a clustered table.


In heap tables, data is not stored in any particular order, so data cannot be retrieved easily, and data pages are not linked. In clustered tables, data is stored based on the clustered index key, data can be retrieved easily via the index columns, and data pages are linked; however, additional time and space are needed to maintain the indexes.


A non clustered index can consist of one or more columns, but the data storage is not dependent on this create index statement as is the case with the clustered index.  For a table without a clustered index, which is called a heap, the non clustered index points the row (data).  In the circumstance where the table has a clustered index, then the non clustered index points to the clustered index for the row (data).


A single table can have up to 999 non-clustered indexes (249 before SQL Server 2008). Follow this link to see how a clustered index works:

https://www.red-gate.com/simple-talk/sql/learn-sql-server/effective-clustered-indexes/
Users can create an index on each table to retrieve data rows quickly. Non-clustered indexes are often the practical choice for additional access paths, because a clustered index stores and sorts the data in physical order and a table can have only one clustered index.
Clustered indexes and nonclustered indexes often perform best under different circumstances. 
http://www.interviewquestionspdf.com/2014/09/ms-sql-server-index-interview-questions.html

37) what is View?

Saved SQL statements or virtual tables.
A view can be useful when there are multiple users with different levels of access, who all need to see portions of the data in the database.
Views can restrict access to specific rows or columns in a table , can join multiple tables , can have aggregate information such as count, sum etc
CREATE VIEW view_name
AS
SELECT statement;
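A concrete sketch (the view, table, and column names are illustrative):

-- Expose only non-sensitive columns of high earners
CREATE VIEW vwHighEarners
AS
SELECT EmpID, Name, Department
FROM Employee
WHERE Salary > 50000;
GO
SELECT * FROM vwHighEarners;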
38) Can we INSERT, DELETE, UPDATE in VIEW? 
A view can be updated, deleted, or inserted into if it is based on a single base table. If the view is based on columns from more than one table, then insert, update, and delete are generally not possible directly.
https://www.codeproject.com/Articles/236425/How-to-Insert-Data-Using-SQL-Views-Created-Using-M
39) Difference between the ‘Execute TSQL’ and ‘Execute SQL’ commands?
The TSQL tasks take up much less memory; however, SQL-based tasks support a wider variety of connections.
40) Difference between Views and user defined functions
 A user-defined function can accept parameters, whereas a view cannot. Also, the output of a user-defined function can be used directly in the SELECT clause, whereas you cannot do that with a view.
41) How to improve the performance of an SSIS package?
1- Utilize parallelism: It is easy to utilize parallelism in SSIS. All you need to do is to recognize which Data Flow Tasks (DFTs) could be started at the same time and set the control flow constraints of your package in the way that they all can run simultaneously.
 2- Synchronous vs. Asynchronous components: A synchronous transformation in SSIS takes a buffer, processes it, and passes the result through without waiting for the next buffer to come in. On the other hand, an asynchronous transformation needs to process all of its input data before it can give out any output. This can cause serious performance issues when the input to the asynchronous transformation is too big to fit into memory and must be spilled to disk at multiple stages.
 3- Execution tree: An execution tree starts where a buffer starts and ends where the same buffer ends. These execution trees specify how buffers and threads are allocated in the package. Each tree creates a new buffer and may execute on a different thread. When a new buffer is created such as when a partially blocking or blocking transformation is added to the pipeline, additional memory is required to handle the data transformation; however, it is important to note that each new tree may also give you an additional worker thread.
 4- OLE DB Command transformation: OLE DB Command is a row-by-row transformation, meaning that it runs its command on each one of its input rows. This makes it far too slow when the number of rows goes up. The solution for boosting performance is to stage the data into a temporary table and use an Execute SQL Task outside that DFT.
 5- SQL Server Destination vs. OLE DB Destination: There are multiple reasons to use the OLE DB Destination and not the SQL Server Destination:
OLE DB Destination is mostly faster,
OLE DB Destination is a lot clearer when it fails (The error message is more helpful),
SQL Server Destination works only when SSIS is installed on the destination server.
6- Change Data Capture (CDC): Try to reduce the amount of data to be transferred to the maximum level you can, and do it as close to the source as you can. A Modified On column on the source table(s) helps a lot in this case.
 7- Slowly Changing Dimension (SCD) transformation: There is only one piece of advice about SSIS’s Slowly Changing Dimension transformation, and that is: get rid of it! The reasons are:
It doesn’t use any cached data, and goes to the data source every single time it is called,
It uses many OLE DB Command transformations,
Fast Data Load is off by default on its OLE DB Destination.
 8. Choose the best split of work between SQL and SSIS when designing the data flow: remember that SSIS is good at row-by-row operations, whereas SQL is not. So, depending on the situation, design the data flow using DFT components instead of executing a query with an “Execute SQL Task”.
 9. Use queries for selecting data rather than selecting a table and checking off the columns you want. This will reduce the initial record set before SSIS gets it rather than ignoring the fields
 10. Carefully deal with your connections. By default, your connection manager will connect to the database as many times as it wants to. You can set the RetainSameConnection property so it will only connect once. This can allow you to manage transactions using an ExecuteSQL task and BEGIN TRAN / COMMIT TRAN statements avoiding the overhead of DTC.
 11. While running the package within BIDS, ensure you set the package to run in optimized mode.
 12. While loading data into destination tables, it’s helpful to use the “Fast Load” option.
13. Wherever possible, consider aggregating and (un)pivoting in SQL Server instead of doing it in the SSIS package; SQL Server outperforms Integration Services in these tasks.
 14. Avoid manipulating large datasets using T-SQL statements. All T-SQL statements cause changed data to write out to the transaction log even if you use Simple Recovery Model.
 15. For large datasets, do data sorts at the source if possible.
 16. Use the SQL Server Destination if you know your package is going to run on the destination server, since it offers roughly 15% performance increase over OLE DB because it shares memory with SQL Server.
 17. Increase the network packet size to 32767 on your database connection managers. This allows large volumes of data to move faster from the source servers.
 18. If using Lookup transforms, experiment with cache sizes – between using a Cache connection or Full Cache mode for smaller lookup datasets, and Partial / No Cache for larger datasets. This can free up much needed RAM.
 19. Make sure the table lock option is used while loading very large datasets, as a bulk insert happens when the conditions below are satisfied:
a) the destination table is empty
b) the destination database recovery model is either simple or bulk-logged
c) the table lock option is specified
20. Experiment with the DefaultBufferSize and DefaulBufferMaxRows properties. You’ll need to monitor your package’s “Buffers Spooled” performance counter using Perfmon.exe, and adjust the buffer sizes upwards until you see buffers being spooled (paged to disk), then back off a little.
 21. Do all set based, aggregations and sort operations at source or destination using T-SQL.
 22. If possible always use “NOLOCK” at source and “LOCK” at destination.
 23. While loading to data warehouses try to disable the indexes while loading.

42) How to enable execution plan in sql server?


Before running the query, press Ctrl+M (Include Actual Execution Plan).


43) How to show literal strings in SQL Server

Double-byte (Unicode) characters are stored/displayed as question marks if the prefix N is not used.
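For example (the Japanese literal is an arbitrary illustration):

SELECT 'こんにちは'  AS WithoutPrefix,  -- may display as ?????
       N'こんにちは' AS WithPrefix     -- stored and displayed correctly as Unicode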


44) Transactions in SQL Server

A group of commands that change the data stored in a database. A transaction is treated as a single unit: the program succeeds only if all commands in the transaction succeed; otherwise it rolls back.


https://www.youtube.com/watch?v=shkt9Z5Gz-U
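A minimal sketch of explicit transaction control (the table name is illustrative):

BEGIN TRY
    BEGIN TRAN;
    UPDATE Account SET Balance = Balance - 100 WHERE AccountID = 1;
    UPDATE Account SET Balance = Balance + 100 WHERE AccountID = 2;
    COMMIT TRAN;   -- both updates succeed together
END TRY
BEGIN CATCH
    ROLLBACK TRAN; -- or neither is applied
END CATCH;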


SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED

Use this when you need to read changes that other sessions have not yet committed (dirty reads).


Settings in the Flat File Source to handle errors:

Error - specify what should happen when an error occurs: ignore the failure, redirect the row, or fail the component.


  1. indexed view
  2. Architecture of sql server
  3. query optimization techniques
  4. ssis window components
  5. how to optimize stored procedure if there are multiple statements like cursors ?
  6. why indexes are slow during heavy insert, update, deletes?
  7. compare between sql server 2012 and sql server 2016
  8. difference between nchar(n) and nvarchar(n)
  9. what is role playing dimension?
  10. difference between surrogate key and primary key?
  11. difference between star schema and snowflake schema?
  12. DT_WSTR-NVARCHAR MEANS UNICODE AND DT_STR-VARCHAR
  13. Do you know DIMENSIONAL MODELLING?
  14. What are KPI's?
  15. Selfjoin question based on employee manager relationship
  16. Explain the architecture of your last project diagram?
  17. Can you write pseudocode to get the result below in any of your favorite languages? (One possible answer follows the pattern.)
1,2,3,4
1,2,3
1,2
1
?
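One possible Python answer for item 17, printing the decreasing pattern above (a sketch):

# Print 1..n on the first line, then shrink the line by one number each time
n = 4
for length in range(n, 0, -1):
    print(",".join(str(x) for x in range(1, length + 1)))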




GEN AI

  Stay Tuned....