Friday, July 13, 2018

Query Processing in DBMS

SQL is a high-level language created as a bridge between the user and the DBMS. The underlying system, however, does not understand SQL directly: any query written in SQL is converted, using relational algebra, into a low-level form the system can execute. The DBMS verifies the query written by the user, translates it, selects the best execution path, executes the query, and fetches the data from storage. Together, these steps are known as query processing.

When a query is submitted to the database, it is translated into its internal form and received by the query compiler. The parser checks the syntax of the user's query, verifies that the relations it references exist in the database, and so on. The query optimizer then identifies the best query plan: it consults the system catalog for constraints and indexes, generates different execution plans, and picks the cheapest one. Once a query plan is chosen, the query is evaluated with that plan and the result is returned. So the four phases in query processing are:
  • Parsing and Translation
  • Query Optimization
  • Evaluation or query code generation
  • Execution in DB’s runtime processor
Search algorithms locate and retrieve records that fulfill a selection condition. This is the lowest level of query processing. There are two basic methods:
  • Linear search: each record is read from the beginning of the file until the sought record is reached, checking the filter condition one record at a time. Records can be fetched by linear search irrespective of the filter condition, sort order, or indexes.
  • Binary search: applicable only when the records are sorted on the search key value and the condition is an equality, i.e., 'search key column = value'. It is not suitable for range conditions or any other kind.
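The two basic search methods can be sketched in memory; this is an illustrative Python version (the `students` data and column names are made up for the example, and lists of dicts stand in for file records):

```python
from bisect import bisect_left

def linear_search(records, key, value):
    """Scan every record; works for any condition, sorted or not."""
    return [r for r in records if r[key] == value]

def binary_search(sorted_records, key, value):
    """Requires records sorted on `key` and an equality condition."""
    keys = [r[key] for r in sorted_records]
    i = bisect_left(keys, value)          # jump straight to the first match
    out = []
    while i < len(sorted_records) and sorted_records[i][key] == value:
        out.append(sorted_records[i])
        i += 1
    return out

students = sorted(
    [{"empID": 101, "age": 22}, {"empID": 99, "age": 30}, {"empID": 150, "age": 21}],
    key=lambda r: r["empID"],
)
# Both methods return the same records for an equality condition,
# but binary search touches far fewer records on a sorted file.
assert linear_search(students, "empID", 101) == binary_search(students, "empID", 101)
```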
 Conjunction
These are conditions in the WHERE clause combined with the AND operator. For example: SELECT * FROM STUDENT WHERE empID = 101 AND AGE >= 21;
  • If any single condition provides a better access path, use it first; e.g., if AGE >= 21 alone can produce all the records that match both AGE >= 21 and empID = 101 at the lowest cost (or vice versa), evaluate that condition via its access path and apply the other as a filter.
  • More generally, use whichever of the index and selection methods described above provides the better cost.
Disjunction
This is the reference to the conditions in the WHERE clause combined with OR operator. 
SELECT * FROM STUDENT WHERE empID = 101 OR AGE>=21;

SELECT * FROM STUDENT WHERE empID = 101 UNION SELECT * FROM STUDENT WHERE AGE>=21;

Such a query can also be treated as a query with the UNION operator. Indexes are used only when all of the search key columns have indexes: the query with OR is converted into a UNION of queries, each of which uses the index on its own search key column. Otherwise, a linear search is used.
Negation
This is a not-equal condition in the WHERE clause. In most cases a linear search is used to fetch the records; if an index is present on the search key column, the index is used to search for the records.
SELECT * FROM STUDENT WHERE empID <>101
Sorting
Relational operations such as joins can be implemented efficiently if the input relations are first sorted.
One option is to build an index on the sort key and use that index to read the relation in sorted order logically. This may lead to a disk access for each record, which is expensive, so it is usually better to order the records physically.
Eg:
SELECT salary FROM instructor WHERE salary < 75000;
can be written in relational algebra as σ_salary<75000(instructor). Several equivalent plans exist for such an expression, and it is the responsibility of the system to construct a query evaluation plan that minimizes the cost of query evaluation (query optimization).

Query cost measures

* Disk accesses, CPU time to execute the query, and, in a distributed or parallel database system, the cost of communication.
When calculating disk I/O time, usually only two factors are considered: seek time and transfer time. Seek time, represented by tS, is the time taken to position the disk head over the block that holds a record. Transfer time, represented by tT, is the time taken by the disk to return a fetched block to the processor/user.
Suppose a query needs S seeks to fetch its records and B blocks need to be returned to the user. Then the disk I/O cost is calculated as:
(S * tS) + (B * tT)
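The formula is simple enough to express directly; the default timing constants below are illustrative (a few milliseconds per seek, a fraction of a millisecond per block transfer), not figures from the original post:

```python
def disk_io_cost(seeks, blocks, t_seek=0.004, t_transfer=0.0001):
    """(S * tS) + (B * tT): seeks dominate, so good plans minimize random access."""
    return seeks * t_seek + blocks * t_transfer

# e.g. one seek followed by a sequential scan of 100 blocks
cost = disk_io_cost(1, 100)   # 1*0.004 + 100*0.0001 = 0.014 seconds
```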
Different types of selection methods of a query which affects the cost of the query
File Scan


 Index Scan

 Uses indexes to traverse the records. For an index scan to work, the search key must be indexed and used in the filter condition.



Conditions in the WHERE clause combined with AND operator.
Eg: select * from employee where empID=101 AND age>=21;
Check whether any of these applies:
  • Conjunctive selection using one index.
  • Conjunctive selection using a composite index (an index on multiple attributes), when the selection specifies an equality condition on two or more attributes.
  • Conjunctive selection by intersection of identifiers (record pointers or record identifiers): scan each index for pointers to tuples that satisfy its individual condition, then use the intersected pointers to retrieve the actual records.
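The "intersection of identifiers" strategy can be sketched in memory. Here `indexes` maps column → value → set of record ids and `heap_file` stands in for the table's stored records; all names and data are illustrative, not from the original post:

```python
def conjunctive_selection(indexes, conditions, heap_file):
    """Intersect record-id sets from each usable index, then fetch the records."""
    rid_sets = [indexes[col].get(val, set()) for col, val in conditions]
    rids = set.intersection(*rid_sets) if rid_sets else set()
    return [heap_file[rid] for rid in sorted(rids)]   # fetch actual tuples last

heap = {0: {"empID": 101, "age": 22},
        1: {"empID": 101, "age": 19},
        2: {"empID": 102, "age": 25}}
indexes = {
    "empID": {101: {0, 1}, 102: {2}},
    "age":   {22: {0}, 19: {1}, 25: {2}},
}
# empID = 101 AND age = 22  → only record 0 satisfies both conditions
result = conjunctive_selection(indexes, [("empID", 101), ("age", 22)], heap)
```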
Conditions in the WHERE clause combined with OR operator/queries with UNION operator.

Quick sort: if a table is small enough to fit in the available memory, quick sort can be used. As the name suggests, it is a simple and fast in-memory sorting method.
Merge sort: for larger tables that cannot be accommodated in memory, external merge sort is used. It performs better than quick sort in this setting.
Memory cannot hold all the records of a large table. Suppose memory can hold only 4 records at a time and the table has 12 records. When merge sort is applied, the records are first read in groups that fit in memory and sorted to produce sorted runs. In pass 1 the runs are merged into longer sorted runs, which are merged again in pass 2; the final merge produces the fully sorted result.
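The run-then-merge idea can be sketched with Python's standard library; `memory_size` models how many records fit in memory at once (the 12-record/4-slot figures mirror the example above):

```python
import heapq

def external_merge_sort(records, memory_size):
    """Split into runs that fit in memory, sort each run, then k-way merge them."""
    runs = [sorted(records[i:i + memory_size])          # one sorted run per chunk
            for i in range(0, len(records), memory_size)]
    return list(heapq.merge(*runs))                     # streaming merge of all runs

data = [9, 1, 7, 3, 12, 5, 2, 11, 8, 4, 10, 6]          # 12 records, memory holds 4
assert external_merge_sort(data, 4) == sorted(data)
```

A real external sort writes each run to disk and merges block by block; this in-memory version only shows the control flow.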
 Joins: when a query uses joins, the query optimizer decides which join method to use. Based on the estimated cost, one of the following join methods is chosen by the query optimizer.
Nested Loop Join:
Consider the theta join r ⋈θ s, where r is the outer relation and s is the inner relation of the join. The result is built by examining every pair of tuples (tr, ts) and outputting the concatenation tr·ts whenever the join condition holds.
* Requires no indexes.
* Expensive, as it examines every pair of tuples in the two relations.
Cost: nr * bs + br block transfers, where nr is the number of records of r and br, bs are the number of blocks of r and s. It is always better to put the smaller table in the inner loop if it fits in memory; in that best case the cost reduces to 2 seeks and br + bs block transfers. The cost of a nested loop join depends on the number of records, the number of blocks, and the positioning of the tables; in general it is very expensive compared to the other joins.
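The algorithm itself is just two nested loops; this is an illustrative in-memory sketch (the sample relations and the `theta` predicate are placeholders, not from the original post):

```python
def nested_loop_join(r, s, theta):
    """Examine every pair (tr, ts); emit the concatenated pair when theta holds."""
    out = []
    for tr in r:            # outer relation
        for ts in s:        # inner relation: scanned once per outer tuple
            if theta(tr, ts):
                out.append((tr, ts))
    return out

takes = [("A", 1), ("B", 2)]
student = [(1, "x"), (3, "y")]
result = nested_loop_join(takes, student, lambda tr, ts: tr[1] == ts[0])
# only ("A", 1) joins with (1, "x")
```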

Block Nested Loop Join (processes relations on a per-block basis rather than per tuple):
Every block of the inner relation is paired with every block of the outer relation. In addition to looping over the records of the outer and inner tables, the loops are organized around their blocks: the method reads a block of records from each table and compares the records within them, so the processing is done block-wise.
Each outer block reads the inner table only once, so the cost is better than NLJ: the number of seeks required is 2 * br, and the number of block transfers is (br * bs) + br. In the best case, where the records of both tables fit in memory, the cost is 2 seeks and br + bs block transfers. If the join condition is on a candidate key of the inner table, the inner scan need not traverse all the records or blocks: it can stop at the first match.
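The block-wise structure can be sketched by slicing each relation into fixed-size chunks; `block_size` and the sample data are illustrative:

```python
def block_nested_loop_join(r, s, theta, block_size):
    """Pair blocks, not tuples: each outer block reads the inner table only once."""
    out = []
    for i in range(0, len(r), block_size):            # one outer block at a time
        r_block = r[i:i + block_size]
        for j in range(0, len(s), block_size):        # scan inner table block-wise
            s_block = s[j:j + block_size]
            for tr in r_block:                        # compare records inside the
                for ts in s_block:                    # current pair of blocks
                    if theta(tr, ts):
                        out.append((tr, ts))
    return out

pairs = block_nested_loop_join([1, 2, 3, 4], [2, 3, 5], lambda a, b: a == b, 2)
# matches: (2, 2) and (3, 3)
```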
Indexed Nested Loop Join:
If an index is available on the inner relation's join attribute in an NLJ, index lookups can replace file scans: for each tuple tr in the outer relation r, the index is used to look up the tuples in s that satisfy the join condition with tr.
* Can be used with existing indexes as well as with temporary indexes built for the join.
* When the index on the join column is used, there is no sequential scan of the inner relation: instead of scanning every record of the inner table, the matching records are fetched directly.
* Indexes are useful when natural joins or equi-joins are used.
In the worst case, the cost of INLJ is br * (tT + tS) + nr * c, where nr is the number of records in relation r, tT and tS are the transfer and seek times, and c is the cost of a single index lookup and record fetch. When a suitable index exists, INLJ is often the most efficient of these join methods.
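A dictionary can stand in for the index on the inner relation's join attribute; this sketch is illustrative (a real system would use a B+-tree or hash index on disk):

```python
from collections import defaultdict

def indexed_nested_loop_join(r, s, r_key, s_key):
    """Look up matching s-tuples through an index instead of scanning s."""
    index = defaultdict(list)              # stands in for an index on s's join column
    for ts in s:
        index[ts[s_key]].append(ts)
    out = []
    for tr in r:
        for ts in index.get(tr[r_key], []):   # one index lookup per outer tuple
            out.append((tr, ts))
    return out

result = indexed_nested_loop_join([("A", 1)], [(1, "x"), (1, "y")], 1, 0)
```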
Merge Join:

*The input tables need not already be sorted; if they are not, sort them first, typically with the merge-sort technique, as it is cost-effective. After sorting, the join condition is applied by merging the two sorted inputs.
*Sort-merge join is used to compute natural joins and equi-joins.
Eg: for r(R) natural join s(S), let R ∩ S be the common attributes. If both relations are sorted on these attributes, the join can be computed with a merge step just like the merge phase of the merge-sort algorithm.
Assuming the tuple groups for each key fit in memory, cost of seeks: ⌈br/bb⌉ + ⌈bs/bb⌉, where bb buffer blocks are allocated to each relation.
Cost of block transfers: br + bs.
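The merge step over two sorted inputs can be sketched with two cursors; the tuples and `key` position are illustrative:

```python
def merge_join(r, s, key):
    """Both inputs sorted on `key`; advance two cursors as in merge sort."""
    i = j = 0
    out = []
    while i < len(r) and j < len(s):
        if r[i][key] < s[j][key]:
            i += 1
        elif r[i][key] > s[j][key]:
            j += 1
        else:
            # join the current r-tuple with every s-tuple sharing this key value
            k = j
            while k < len(s) and s[k][key] == r[i][key]:
                out.append((r[i], s[k]))
                k += 1
            i += 1
    return out

result = merge_join([(1,), (2,), (2,)], [(2, "a"), (2, "b"), (3, "c")], 0)
# the two r-tuples with key 2 each pair with both matching s-tuples: 4 output rows
```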


Hybrid Merge Join:



Used if one relation is sorted and the other has a secondary B+-tree index on the join attribute:
* Merge the sorted relation with the leaf entries of the B+-tree.
* Sort the result on the addresses of the unsorted relation's tuples.
* Scan the unsorted relation in physical address order and merge it with the previous result, replacing addresses by the actual tuples. This sequential scan is more efficient than random lookups.

Hash Join

*used to implement natural joins and equi joins
*hash function h is used to partition tuples of both relation.
*Partition tuples into sets that have same hash value on the join attribute.


For matching tuples, tr natural join ts is the concatenation of tuples tr and ts.
The algorithm performs a separate build-and-probe join on each pair of partitions for i = 0..nh: it first builds a hash index on each si and then probes it (looks up si) with the tuples from ri. The relation s is the build input and r is the probe input.
Cost:
"Bucketize": read r + write its partitions; read s + write its partitions.
Join: read the r and s partitions. Total cost = 3 * (br + bs) block transfers.
Note: this is an approximation, since buckets vary in size and each must be rounded up to whole blocks.
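The partition-then-probe structure can be sketched in memory; the number of partitions, sample tuples, and use of Python's built-in `hash` are all illustrative:

```python
from collections import defaultdict

def hash_join(r, s, r_key, s_key, n_parts=4):
    """Partition both inputs with the same hash function, then join per partition."""
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for tr in r:
        r_parts[hash(tr[r_key]) % n_parts].append(tr)   # bucketize probe input
    for ts in s:
        s_parts[hash(ts[s_key]) % n_parts].append(ts)   # bucketize build input
    out = []
    for p in range(n_parts):
        build = defaultdict(list)                       # in-memory hash index on s_i
        for ts in s_parts[p]:
            build[ts[s_key]].append(ts)
        for tr in r_parts[p]:                           # probe with tuples of r_i
            for ts in build.get(tr[r_key], []):
                out.append((tr, ts))
    return out

result = hash_join([("A", 1), ("B", 2)], [(1, "x"), (3, "y")], 1, 0)
```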
Issues with Hash Join

*A bad hash function, or many tuples with the same join-attribute value, can send a large number of records to the same partition. When a partition no longer fits in its memory block, this is called hash table overflow.

*It can be avoided by carefully partitioning the tables. It can also be resolved during partitioning itself by creating sub-partitions with another hash function; both tables must be partitioned the same way.
*If the tables have many duplicate values in the join attributes, use the BNLJ method instead.

Skewed partitioning: some partitions receive a relatively large number of records compared to others. This is also bad for joining the tables.
Hybrid Hash Join
*Useful when memory is relatively large, but the build relation still does not fit entirely in memory.
For complex joins (queries with more than two tables), combine conjunctions and disjunctions with effective joining techniques, and create indexes on the join columns: the matching records can then be fetched directly rather than by traversing every record of each table.
Eliminating Duplicates and Aggregation
If the user wants to eliminate duplicate records, use DISTINCT keyword in the query.
SELECT DISTINCT * FROM student; -- retrieves unique records from the student table and shows only one record for multiple occurrence of the records.
The cost of duplicate elimination depends on the technique used to group the duplicate values and on the cost of identifying and deleting those records; it is expensive.
Aggregation uses sorting or hashing techniques to bring similar records together into groups and then applies SUM, AVG, etc. The cost estimate for aggregation is the same as the cost of duplicate elimination for aggregate functions such as MIN, MAX, SUM, COUNT and AVG. These aggregates can be computed on the fly as the groups are being constructed.
In the sorting method, aggregation can be folded into run generation and the merge passes: partial aggregate values are computed within each run and combined in the next merge pass.


With on-the-fly aggregation, only one tuple needs to be stored per group, so the sorted tree structure or hash index fits in memory and the aggregation can be processed with br block transfers (and 1 seek) instead of 3*br transfers.
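On-the-fly aggregation is essentially one running value per group; a minimal sketch for SUM (the `dept`/`salary` data is invented for the example):

```python
def group_sum(records, group_key, value_key):
    """Keep one running aggregate per group while scanning the input once."""
    groups = {}
    for rec in records:
        g = rec[group_key]
        groups[g] = groups.get(g, 0) + rec[value_key]   # update the group's aggregate
    return groups

rows = [{"dept": "CS", "salary": 100},
        {"dept": "EE", "salary": 80},
        {"dept": "CS", "salary": 50}]
totals = group_sum(rows, "dept", "salary")
# → {"CS": 150, "EE": 80}
```

MIN, MAX and COUNT follow the same pattern with a different update step; AVG keeps a (sum, count) pair per group.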
Set Operations
Set operations such as UNION, INTERSECT and MINUS are implemented using merge-join or hashing methods. The merge-based variants require the tables to be sorted first; the hash-based variants work as follows:
*Partition the two relations by same hash function
*Create partitions r0, r1, ..., rnh and s0, s1, ..., snh.
*The system then takes the following steps on each partition i = 0, 1, ..., nh.
r union s
*Build in memory hash index on ri, add tuples in si to the hash index if not present  and add tuples in the hash index to result.
r intersection s
*Build in memory hash index on ri, for each tuple in si, probe the hash index and output the tuple to the result only if it is already present in the hash index.
r minus s
*Build an in-memory hash index on ri; for each tuple in si, probe the hash index and delete the tuple if it is present. Add the remaining tuples in the hash index to the result.
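The three per-partition steps can be sketched directly; for brevity this treats each whole relation as a single partition (a real implementation repeats this for every ri/si pair):

```python
def hash_union(r, s):
    index = {t: True for t in r}           # build in-memory hash index on r_i
    for t in s:
        index.setdefault(t, True)          # add s-tuples not already present
    return list(index)

def hash_intersection(r, s):
    index = set(r)                         # build hash index on r_i
    return [t for t in s if t in index]    # output only the probes that hit

def hash_difference(r, s):                 # r MINUS s
    index = set(r)
    for t in s:
        index.discard(t)                   # delete tuples that also appear in s
    return list(index)                     # remaining tuples form the result
```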
Join
The FULL OUTER JOIN keyword returns all records when there is a match in either the left (table1) or the right (table2) table.

SELECT column-names
FROM table-name1 FULL JOIN table-name2
ON column-name1 = column-name2
WHERE condition
Note: FULL OUTER JOIN can potentially return very large result-sets!
The LEFT JOIN keyword returns all records from the left table (table1), and the matched records from the right table (table2). The result is NULL from the right side, if there is no match.

SELECT column-names
FROM table-name1 LEFT JOIN table-name2
ON column-name1 = column-name2
WHERE condition
The RIGHT JOIN keyword returns all records from the right table (table2), and the matched records from the left table (table1). The result is NULL from the left side, when there is no match.
A self JOIN is a regular join, but the table is joined with itself.
The INNER JOIN keyword selects records that have matching values in both tables.



SELECT column-names

FROM table-name1 INNER JOIN table-name2

ON column-name1 = column-name2

WHERE condition

Eg:

Table 1


+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Jis      |  32 | Bhandup   |  2000.00 |
|  2 | Jomini   |  25 | Delhi     |  1500.00 |
|  3 | komalam  |  23 | Kota      |  2000.00 |
|  4 | Chaithan |  25 | Mumbai    |  6500.00 |
|  5 | Haven    |  27 | Bhopal    |  8500.00 |
|  6 | Kimma    |  22 | MP        |  4500.00 |
+----+----------+-----+-----------+----------+
Table 2:
+-----+------------+-------------+--------+
| OID | DATE       | CUSTOMER_ID | AMOUNT |
+-----+------------+-------------+--------+
| 102 | 2009-10-08 |           3 |   3000 |
| 100 | 2009-10-08 |           3 |   1500 |
| 101 | 2009-11-20 |           2 |   1560 |
| 103 | 2008-05-20 |           4 |   2060 |
+-----+------------+-------------+--------+
SELECT ID, NAME, AGE, AMOUNT
 FROM Table 1, Table 2
WHERE  Table 1.ID = Table 2.CUSTOMER_ID;


+----+----------+-----+--------+
| ID | NAME     | AGE | AMOUNT |
+----+----------+-----+--------+
|  3 | komalam  |  23 |   3000 |
|  3 | komalam  |  23 |   1500 |
|  2 | Jomini   |  25 |   1560 |
|  4 | Chaithan |  25 |   2060 |
+----+----------+-----+--------+
SELECT ID, NAME, AMOUNT, DATE
FROM Table 1 LEFT JOIN Table 2
ON Table 1.ID = Table 2.CUSTOMER_ID;


+----+----------+--------+------------+
| ID | NAME     | AMOUNT | DATE       |
+----+----------+--------+------------+
|  1 | Jis      |   NULL | NULL       |
|  2 | Jomini   |   1560 | 2009-11-20 |
|  3 | komalam  |   3000 | 2009-10-08 |
|  3 | komalam  |   1500 | 2009-10-08 |
|  4 | Chaithan |   2060 | 2008-05-20 |
|  5 | Haven    |   NULL | NULL       |
|  6 | Kimma    |   NULL | NULL       |
+----+----------+--------+------------+
SELECT ID, NAME, AMOUNT, DATE
FROM Table 1 RIGHT JOIN Table 2
ON Table 1.ID = Table 2.CUSTOMER_ID;


+----+----------+--------+------------+
| ID | NAME     | AMOUNT | DATE       |
+----+----------+--------+------------+
|  3 | komalam  |   3000 | 2009-10-08 |
|  3 | komalam  |   1500 | 2009-10-08 |
|  2 | Jomini   |   1560 | 2009-11-20 |
|  4 | Chaithan |   2060 | 2008-05-20 |
+----+----------+--------+------------+
SELECT ID, NAME, AMOUNT, DATE
FROM Table 1 FULL JOIN Table 2
ON Table 1.ID = Table 2.CUSTOMER_ID;


+------+----------+--------+------------+
| ID   | NAME     | AMOUNT | DATE       |
+------+----------+--------+------------+
|    1 | Jis      |   NULL | NULL       |
|    2 | Jomini   |   1560 | 2009-11-20 |
|    3 | komalam  |   3000 | 2009-10-08 |
|    3 | komalam  |   1500 | 2009-10-08 |
|    4 | Chaithan |   2060 | 2008-05-20 |
|    5 | Haven    |   NULL | NULL       |
|    6 | Kimma    |   NULL | NULL       |
|    3 | komalam  |   3000 | 2009-10-08 |
|    3 | komalam  |   1500 | 2009-10-08 |
|    2 | Jomini   |   1560 | 2009-11-20 |
|    4 | Chaithan |   2060 | 2008-05-20 |
+------+----------+--------+------------+


MySQL does not support FULL JOIN. Instead we can emulate it by combining a LEFT JOIN and a RIGHT JOIN. UNION ALL produces the result shown above, but it duplicates the matching rows; using UNION instead removes those duplicates and gives the proper full-join result.

SELECT ID, NAME, AMOUNT, DATE
FROM Table 1 LEFT JOIN Table 2
ON Table 1.ID = Table 2.CUSTOMER_ID
UNION ALL
SELECT ID, NAME, AMOUNT, DATE
FROM Table 1 RIGHT JOIN Table 2
ON Table 1.ID = Table 2.CUSTOMER_ID;

SELECT ID, NAME, AMOUNT, DATE
FROM Table 1 LEFT JOIN Table 2
ON Table 1.ID = Table 2.CUSTOMER_ID
UNION
SELECT ID, NAME, AMOUNT, DATE
FROM Table 1 RIGHT JOIN Table 2
ON Table 1.ID = Table 2.CUSTOMER_ID;
+------+----------+--------+------------+
| ID   | NAME     | AMOUNT | DATE       |
+------+----------+--------+------------+
|    1 | Jis      |   NULL | NULL       |
|    2 | Jomini   |   1560 | 2009-11-20 |
|    3 | komalam  |   3000 | 2009-10-08 |
|    3 | komalam  |   1500 | 2009-10-08 |
|    4 | Chaithan |   2060 | 2008-05-20 |
|    5 | Haven    |   NULL | NULL       |
|    6 | Kimma    |   NULL | NULL       |
+------+----------+--------+------------+
In the first step, use hash join or merge join to perform the equi-join. In other words, natural outer joins and outer joins with an equi-join condition can be computed by extensions of the merge-join and hash-join algorithms.
Let us have query like below:
SELECT * FROM takes r LEFT OUTER JOIN student s ON r.ID = s.ID;
The processor picks all matching records of takes and student using the best join algorithm; it then takes the records of takes that were not matched by the hash join or merge join and adds them to the result set, padding the attributes of student with NULLs.

Read this for understanding more about joins
http://www.sql-join.com/sql-join-types/
Methods of evaluating a query written in SQL
1) Materialization -generate results of an expression whose inputs are relations or are already computed, materialize(store) it on disk. Repeat.
To be specific, queries are broken into individual queries and then the results of which are used to get the final result.
Eg: SELECT c.name FROM department s, instructor c
WHERE s.building = 'Watson' AND s.dept_name = c.dept_name;
The DBMS breaks the query into two: it evaluates the selection on department first and stores the result in a temporary table in memory. That temporary table is then used to evaluate the join with instructor. This is an example of two levels of queries in the materialization method; there can be any number of levels, with correspondingly many temporary tables. It is called materialized evaluation because the result of each intermediate operation is created (materialized) and used to evaluate the next-level operations.
The cost of materialized evaluation is always higher: it includes the time to evaluate each level and write its result into a temporary table, then retrieve that table for the next level of the query, and so on. Hence the cost of evaluation in this method is:
    cost = cost of the individual operations + cost of writing the intermediate results into temporary tables

2) Pipelining
In this method, the DBMS does not store intermediate records in temporary tables. Instead, each operation passes its result tuples directly to the next operation in the plan, and each operation uses the output of the previous one as its input.
*No extra cost of writing into temporary tables; only the cost of evaluating the individual operations. Hence it performs better than materialization.
*Quick query results
*Low memory requirements, as the results of an operation are not stored for long.

Pipeline Implementation
*construct a single complex operation that combines the operations that constitute the pipeline.

1) Demand -driven pipeline (Lazy evaluation)
In this method, the result of lower level queries are not passed to the higher level automatically. It will be passed to higher level only when it is requested by the higher level. In this method, it retains the result value and state with it and it will be transferred to the next level only when it is requested.
In our example above, building 'Watson' will be retrieved first from the department table, but it will not be passed to select query immediately. It will pass only when it is requested. Once it gets the request, it is passed to instructor query to retrieve name of instructor and that query will be processed.
System makes repeated requests for tuples from the operation at the top of the pipeline. Each time an operation receives a request for tuples, it computes the next tuples to be returned and then returns the tuple.
Functions in the iterator interface:
  • open() - initialize the operation's state
  • next() - return the next output tuple of the operation
  • close() - signal that no more tuples are required
The iterator maintains the state of its execution between calls, so that successive next() requests receive successive result tuples.
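Python generators implement exactly this lazy, state-retaining iterator contract, so the pipeline can be sketched directly; the operator names and `department` data are invented for the example:

```python
def seq_scan(table):
    """Leaf operator: each next() request yields one stored tuple."""
    for row in table:
        yield row

def select(rows, predicate):
    """Pulls from its child only when asked for a tuple."""
    for row in rows:
        if predicate(row):
            yield row

def project(rows, cols):
    for row in rows:
        yield {c: row[c] for c in cols}

department = [{"dept_name": "Physics", "building": "Watson"},
              {"dept_name": "Music", "building": "Packard"}]

# Nothing executes yet: building the plan is cheap.
plan = project(select(seq_scan(department),
                      lambda r: r["building"] == "Watson"),
               ["dept_name"])
# Each next(plan) computes just enough of the pipeline to return one tuple.
```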
2) Producer-driven pipeline (Eager Pipelining/Push and Pull pipelining)
* Lower-level operations eagerly pass their results to higher-level operations; they do not wait for the higher level to request them.
* The lower-level operation fills a buffer with its results, and the higher-level operation pulls tuples from that buffer. If the buffer is full, the lower-level operation waits for the higher level to empty it.
Operations do not wait for requests to produce tuples; they generate them eagerly. Each operation in a producer-driven pipeline is modelled as a separate thread within the system that takes tuples from its pipelined inputs and generates a stream of tuples for its output.
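The buffer-and-thread scheme can be sketched with a bounded queue; the scan thread pushes eagerly and blocks when the buffer is full (the operator, data and buffer size are illustrative):

```python
import queue
import threading

def producer_pipeline(table, predicate, buffer_size=2):
    """A scan thread eagerly pushes matching tuples; it blocks on a full buffer."""
    buf = queue.Queue(maxsize=buffer_size)
    DONE = object()                        # sentinel marking end of the stream

    def scan():
        for row in table:
            if predicate(row):
                buf.put(row)               # blocks until the consumer makes room
        buf.put(DONE)

    threading.Thread(target=scan, daemon=True).start()
    while True:                            # consumer side: pull from the buffer
        item = buf.get()
        if item is DONE:
            break
        yield item

evens = list(producer_pipeline(range(10), lambda n: n % 2 == 0))
# → [0, 2, 4, 6, 8]
```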
Other pipelining methods: Linear and non-linear methods of pipelining, left deep tree, right deep tree

*Producer driven pipeline- pushing data up an operation tree from below, tuples are generated eagerly
*Demand-driven pipeline: pulling data up an operation tree from the top; tuples are generated lazily, on demand. This is used more commonly than producer-driven pipelining.
Double-pipelined hash join: used when the two inputs to the join are not already sorted but both arrive pipelined. Input tuples from both relations r and s are queued for processing in a single queue.
