Optimization

Example:

Compare two strategies:

  1. evaluate the query 'directly'
  2. use a different procedure

Differences:

  1. constructs a large intermediate table which is then restricted, i.e. most rows are thrown away
  2. constructs a much smaller intermediate table and accesses only the suppliers needed for the final result
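The difference can be sketched in Python with hypothetical toy tables (S for suppliers, SP for shipments; all names and data made up):

```python
# Hypothetical toy tables, not from the original example.
S  = [{"sid": s, "city": c} for s, c in [(1, "London"), (2, "Paris"), (3, "London")]]
SP = [{"sid": s, "pid": p} for s, p in [(1, "P1"), (2, "P1"), (3, "P2"), (1, "P2")]]

# Strategy 1: join first, restrict afterwards -> large intermediate table.
joined = [{**sp, **s} for sp in SP for s in S if sp["sid"] == s["sid"]]
result1 = [r for r in joined if r["pid"] == "P1"]

# Strategy 2: restrict first, then join -> much smaller intermediate table.
wanted = [sp for sp in SP if sp["pid"] == "P1"]
result2 = [{**sp, **s} for sp in wanted for s in S if sp["sid"] == s["sid"]]

print(len(joined), len(wanted))   # 4 vs 2 intermediate rows
assert result1 == result2
```

On real tables the gap between the two intermediate sizes is what makes strategy 2 dramatically cheaper.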

Stages:

  1. cast query to internal form
  2. convert to canonical form
  3. choose candidate low-level procedures
  4. generate query plan

Query Transformation

Transformation rules: convert query into equivalent but more efficient form

The following sections show some examples for each type.

Restrictions and Projections

Distributivity

E.g. sqrt(x * y) = sqrt(x) * sqrt(y) (for nonnegative x and y)
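In the relational setting the same idea lets a restriction be pushed through a union and applied to the smaller operands first; a quick sketch over hypothetical row lists (bag union for simplicity):

```python
from itertools import chain

# Hypothetical operand tables and restriction predicate.
A = [{"x": i} for i in range(0, 10)]
B = [{"x": i} for i in range(5, 15)]
p = lambda r: r["x"] % 2 == 0

# restrict(A union B) == restrict(A) union restrict(B)
lhs = [r for r in chain(A, B) if p(r)]
rhs = list(chain((r for r in A if p(r)), (r for r in B if p(r))))
assert lhs == rhs
```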

Semantic Transformations

Integrity constraints can be used in semantic optimization. E.g. with the referential constraint from SP to S, every SP row is guaranteed a matching S row, so the join cannot eliminate any tuples:

( SP JOIN S ) { PID } ⇒ SP { PID }
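A quick check of this equivalence on hypothetical toy tables (column names assumed):

```python
# Hypothetical toy tables satisfying the referential constraint SP -> S.
S  = [{"sid": 1}, {"sid": 2}, {"sid": 3}]
SP = [{"sid": 1, "pid": "P1"}, {"sid": 2, "pid": "P2"}, {"sid": 3, "pid": "P1"}]

# Referential constraint: every SP.sid appears in S.sid.
assert all(any(sp["sid"] == s["sid"] for s in S) for sp in SP)

# Projection over PID, with and without the join (sets model duplicate removal).
join_then_project = {sp["pid"] for sp in SP for s in S if sp["sid"] == s["sid"]}
project_only      = {sp["pid"] for sp in SP}
assert join_then_project == project_only
```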

DB Statistics

Used in stages 3 and 4, e.g. table cardinalities and numbers of distinct values per column.

Statistics are updated according to some scheme, not with each SQL operation
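A sketch of how such statistics might feed plan choice, using the classic uniform-distribution estimate (all numbers and cost weights hypothetical):

```python
# Hypothetical catalog statistics for one table.
stats = {"cardinality": 1_000_000, "distinct_city": 50}

def estimated_rows_for_equality(stats, distinct_key):
    # Uniform-distribution estimate: cardinality / number of distinct values.
    return stats["cardinality"] / stats[distinct_key]

def choose_access_path(stats, distinct_key, seq_cost_per_row=1.0, index_cost_per_row=4.0):
    # A full scan touches every row cheaply; an index touches only the
    # estimated matching rows, but each access is more expensive (seeks).
    scan_cost = stats["cardinality"] * seq_cost_per_row
    index_cost = estimated_rows_for_equality(stats, distinct_key) * index_cost_per_row
    return "index" if index_cost < scan_cost else "scan"

print(choose_access_path(stats, "distinct_city"))  # index: 20000 * 4 < 1000000
```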

Query Decomposition

Allows for execution in a parallel or distributed environment
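A minimal sketch of horizontal decomposition, assuming a table split into partitions and a thread pool standing in for parallel workers:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table, split horizontally into three partitions.
rows = [{"sid": i, "city": "London" if i % 3 == 0 else "Paris"} for i in range(9)]
partitions = [rows[0:3], rows[3:6], rows[6:9]]

def restrict(part):
    # The same restriction runs on each partition independently.
    return [r for r in part if r["city"] == "London"]

with ThreadPoolExecutor() as pool:
    partial = list(pool.map(restrict, partitions))
merged = [r for part in partial for r in part]

assert merged == restrict(rows)   # same result as the undistributed query
```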

Implementation of Relational Operators

E.g. R JOIN S with common attribute(s) C, m = cardinality(R), n = cardinality(S):

Brute Force ('nested loops'):

for i = 1 to m:
  for j = 1 to n:
    if R[i].C == S[j].C: output R[i], S[j]
Cost: high (m * n comparisons)

Index Lookup: assume index X on S.C

for i = 1 to m:
  js = lookup R[i].C in X
  for j in js:
    output R[i], S[j]
Cost: low (one index lookup per row of R)

Hash: Similar to Index Lookup, but build hash table on demand instead

for j = 1 to n:
  k = hash S[j].C
  add j to H[k]
for i = 1 to m:
  js = lookup hash R[i].C in H
  for j in js:
    if S[j].C == R[i].C: output R[i], S[j]
Cost: low (roughly m + n hash operations)
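The three variants above translate directly into runnable Python (toy tables assumed; tuples of matching rows stand in for 'output'):

```python
from collections import defaultdict

# Hypothetical toy tables with common attribute "c".
R = [{"c": k, "r": i} for i, k in enumerate("ababc")]
S = [{"c": k, "s": j} for j, k in enumerate("bccd")]

def nested_loops(R, S):
    # Brute force: compare every pair, O(m * n).
    return [(r, s) for r in R for s in S if r["c"] == s["c"]]

def index_join(R, S):
    # Index X on S.c built once, then one lookup per row of R.
    X = defaultdict(list)
    for j, s in enumerate(S):
        X[s["c"]].append(j)
    return [(r, S[j]) for r in R for j in X.get(r["c"], [])]

def hash_join(R, S):
    # Hash table built on demand; re-check equality to filter collisions.
    H = defaultdict(list)
    for j, s in enumerate(S):
        H[hash(s["c"])].append(j)
    return [(r, S[j]) for r in R
            for j in H.get(hash(r["c"]), []) if S[j]["c"] == r["c"]]

assert nested_loops(R, S) == index_join(R, S) == hash_join(R, S)
```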

Here is an illustration of a very simple indexing scheme implemented as a C program:

% time db0 create 30000000 200

First the data file is scanned sequentially for lines starting with ABCD:

% time db0 df
ABCDWLMHIBYBTSQCQNJWUB...GDD  422524BFI
ABCDBAQJGUAQXCQAAVLPHD...WPI  589062MDR
ABCDNQMKOJVWKFSAXYICVZ...RAT  631901GGI
...

real	1m28.125s

Next the index is used to find matching rows which are then retrieved from the data file by seeking directly to the desired position:

% time db0 idx seek 
ABCD 422524
ABCDWLMHIBYBTSQCQNJWUB...GDD  422524BFI
ABCD 589062
ABCDBAQJGUAQXCQAAVLPHD...WPI  589062MDR
...

real	0m5.987s

A second run finds the index file still in the buffer cache, so performance improves:

% time db0 idx seek 
ABCD 422524
ABCDWLMHIBYBTSQCQNJWUB...GDD  422524BFI
ABCD 589062
ABCDBAQJGUAQXCQAAVLPHD...WPI  589062MDR
...

real	0m1.072s

For 50 million records the data file is 9.9 GB. The index is 380 MB, which still fits into memory. The timings are:

scan data file: 2m27.176s
use index: 14.504s
2nd run: 2.528s
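The scheme can be sketched in Python: fixed-length records, an index from the 4-character key prefix to byte offsets, and direct seeks for retrieval (an illustrative sketch, not the actual db0 program):

```python
import io, random, string

random.seed(0)
REC = 32  # fixed record length, newline included

def record():
    # One fixed-length line of random uppercase letters (toy stand-in for a data row).
    return "".join(random.choices(string.ascii_uppercase, k=REC - 1)) + "\n"

# Small in-memory "data file" of 1000 fixed-length records.
data = io.BytesIO("".join(record() for _ in range(1000)).encode())

# Index: map each record's 4-character key prefix to its byte offset.
index = {}
for off in range(0, 1000 * REC, REC):
    data.seek(off)
    key = data.read(4).decode()
    index.setdefault(key, []).append(off)

def lookup(key):
    # Seek directly to each matching record instead of scanning the whole file.
    out = []
    for off in index.get(key, []):
        data.seek(off)
        out.append(data.read(REC).decode())
    return out

some_key = next(iter(index))
assert all(rec.startswith(some_key) for rec in lookup(some_key))
```

The speedup in the transcripts comes from exactly this: the index is far smaller than the data file, so it can be searched quickly (or cached in memory), and only the matching records are read via seeks.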