You can nest queries within the predicates of other queries. This makes it easier to express complex queries, and it makes it possible to
formulate queries in several different ways.
Thus you can create different queries
which return the same rows but which do so with vastly different
performance. In the following discussion, queries 1, 2, and 3 are all
equivalent.
Most subqueries can
also be expressed in an equivalent join form. For example, assuming
that the PartNumber column in PurchDB.Parts is unique, the query:
(1) SELECT * FROM PurchDB.SupplyPrice
WHERE PartNumber IN (SELECT PartNumber from PurchDB.Parts
WHERE PartName = 'Cache Memory Unit')
|
can also be written:
(2) SELECT PurchDB.SupplyPrice.*
FROM PurchDB.SupplyPrice, PurchDB.Parts
WHERE PurchDB.SupplyPrice.PartNumber = PurchDB.Parts.PartNumber
AND PurchDB.Parts.PartName = 'Cache Memory Unit'
|
Both queries return the same information. In general, while the
subquery is more easily understood and easier to formulate, the join actually
improves performance because it gives the optimizer more efficient choices
of how to execute the query.
There are two types of subqueries: correlated and
non-correlated.
A correlated subquery is one in which the subquery makes reference to
one or more columns of an outer query. For example,
(3) SELECT * FROM PurchDB.SupplyPrice
WHERE EXISTS (SELECT PartNumber
FROM PurchDB.Parts
WHERE PartName = 'Cache Memory Unit'
AND PartNumber = PurchDB.SupplyPrice.PartNumber)
|
This is a correlated subquery, since the subquery makes reference to
PartNumber from the outer query. In this case, the subquery must be executed
for each row returned from the outer query. For the above query, since the
SupplyPrice table has 69 rows in it, there must be one
scan on the outer query and 69 scans on the inner query
for a total of 70 scans. The equivalent join shown in query (2)
can be executed using a sort/merge
technique, which takes two sorts and two scans.
It should be noted that the predicate WHERE PartNumber IN of query (1) is
internally the same as the predicate WHERE EXISTS in query (3).
All
quantified predicates involving subqueries get transformed into
EXISTS predicates internally.
Therefore the nested query using the IN
predicate also takes 70 scans.
In general, non-correlated subqueries are faster
than joins, and correlated subqueries are slower than joins.
You should know whether the subquery will return a single row or not
so as to take advantage of the speed of non-correlated subqueries.
Since we know that each part in the Parts table is unique, there is no reason
why we could not express query (1) as a non-correlated subquery by simply
replacing the IN with an equal sign (=).
Then the subquery does not depend on the
outer query and the subquery can be executed only once. In this case,
there is one scan for the outer query and one scan for the
subquery for a total of two scans. This is even faster than the
sort/merge join since there is no sorting involved.
One exception to the rule that correlated subqueries are slower
than an equivalent join is when the equivalent join involves an
aggregate. For example, consider the following query, which
is transformed into a correlated subquery internally by the query
processor:
SELECT VendorNumber, PartNumber, DiscountQty
FROM PurchDB.SupplyPrice
WHERE DiscountQty < ALL (SELECT DiscountQty
FROM PurchDB.SupplyPrice
WHERE VendorNumber = 9010)
|
becomes
SELECT VendorNumber, PartNumber, DiscountQty
FROM PurchDB.SupplyPrice sp1
WHERE NOT EXISTS (SELECT DiscountQty
FROM PurchDB.SupplyPrice sp2
WHERE VendorNumber =9010
AND sp2.DiscountQty <= sp1.DiscountQty)
|
The equivalent join would require the use of an aggregate, and
would therefore be slower:
SELECT sp1.VendorNumber, sp1.PartNumber, sp1.DiscountQty
FROM PurchDB.SupplyPrice sp1, PurchDB.SupplyPrice sp2
WHERE sp2.VendorNumber = 9010
GROUP BY sp1.VendorNumber, sp1.PartNumber, sp1.DiscountQty
HAVING sp1.DiscountQty < MIN(sp2.DiscountQty)
|
In this case, because of the complexity of the query, the subquery is
a better choice than the join. The best solution, however, is an equivalent
non-correlated subquery:
SELECT VendorNumber, PartNumber,
DiscountQty from PurchDB.SupplyPrice
WHERE DiscountQty < (SELECT MIN(DiscountQty)
FROM PurchDB.SupplyPrice
WHERE VendorNumber = 9010)
|
When Not to Use DISTINCT in Subqueries |
 |
In general, you should avoid using the DISTINCT keyword in subqueries. DISTINCT does not change the query result and,
in fact, hinders performance.
The following two queries return the same result:
(4) SELECT * FROM T1 WHERE C1 IN (SELECT DISTINCT C2 FROM T2);
(5) SELECT * FROM T1 WHERE C1 IN (SELECT C2 FROM T2);
|
ALLBASE/SQL transforms IN predicates and also other quantified
predicates where the subquery can return multiple values
to make them similar to an EXISTS predicate.
Therefore, queries (4) and (5) would be transformed to
the two following queries:
(6) SELECT * FROM T1 WHERE EXISTS (SELECT DISTINCT C2 FROM T2
WHERE T1.C1 = T2.C2);
(7) SELECT * FROM T1 WHERE EXISTS (SELECT C2 FROM T2
WHERE T1.C1 = T2.C2);
|
These transformed queries are correlated in nature, meaning that
the subquery result depends on values of the outer query.
This means that the subquery needs to be reevaluated for each row
of the outer query.
Therefore, the performance of the subquery is critical.
When you use the DISTINCT keyword, the subquery must do a complete
scan of the table, sort the values, eliminate duplicates,
and then return TRUE if any rows are returned.
Without the DISTINCT keyword, the subquery can
scan until it finds the first qualifying row, return the row,
return TRUE, and then terminate.
As you can see, subqueries without the DISTINCT keyword are faster.