One method of a relational database design

There no need in special homepage - all that i had to say on the subject i have already said in the article - just read this article you have see below.

and if you have anything to say on this article, say it to me :-) sad@pisem.net, sad@bankir.ru

DAMN ! don't know how to force browsers to prints MONOSPACE FONT !!!


comment:
Using uppercased words "ANY" "EXISTS" "NOT EXISTS" 
instead of quantors in predicates.
Using lowercased word "from" instead of element to set relation "belongs".
Using lowercased words as table names and relation names 
in prefix and infix notation as well.




                   
  Suppose, the problem is to choose a tree representation. Either nested-tree 
(fig.1) or by reference to parent (fig.2) or may be another way.
(every record represents a single tree node, here)

fig.1                                                                 (1)
                  tree
                .---------.
                | #id     |        
                | l    INT|        
                | r    INT|        
                | info    |
                ._________.
                
  where:
   ANY t from tree (t.l<t.r)
   ANY t1,t2 from tree ((t1.l<t2.l & t1.r>t2.r) <=> t1 is_forfather_of t2)
   NOT EXISTS t1,t2 from tree (t1.l<t2.l<t1.r<t2.r)
    

fig.2                                                                 (2)
                  tree
                .---------.
                | #id     |<---\                        N->1
                | parent  |----/  reference is:  parent ----> id
                | info    |
                ._________.


(2) is very simple, but most problems are recursive, such a subtree selection,
or an all forfathers branch selection.
vise-versa with (1) most bproblems have simple one-select solutions, but the
simpliest problems (such as a brother selection) are more difficult, and yet 
another serious problem - a nest maintaimance - is specific.
 
  Looking for compromise (composite?) solution use the following method.

For the sake of effective design. Split _all_ data attributes (what we intended
to represent in a DB) onto exactly two non-intersect classes. Call them
"main" and "additional" (bad names) (initially "additional" may be empty, 
generally doesn't matter, but in practice initial state of an "additional" 
class is obvious enough to be non-empty). Then we hypotetically remove the 
"additional" attributes, and then _normalize_ data structure. If redundancy 
occured then move a redundant attribute to the "additional" class. When the 
data normilized we return the removed attributes, and design triggers to 
control them totally.
  Advantages are:
the data _substructure_ exists wich is in a _normalized_ state and its 
attributes class maximized;
and a question "when to stop normalization?" is obsolete.
The whole data is not in normilized state, but we exactly determine what to do
to avoid anomalies. In other words we bounded a normalization procedure in
rather different way, clearer to further use.
  As a result of this procedure the "main" class should consists a sufficient
minimum of attributes causes the data integral - that's for check. And we 
reserve (rather to say achieve) a right to add fields to the "additional" class 
further.

  Obviously this method attended to raise perfomance of SELECTs and may cause 
serious calculations in INSERTs or UPDATEs. So you must realize how many 
"additional" fields to have, and where to use this method. It is good for 
public access DBs - reference-books, e-shops, transport schedules - everywhere 
number of SELECTs much higher than number of UPDATEs. But it does not mean that
there is no way to use the method  elsewhere, - DBMS perfomance generally 
allows much more load, nowadays, and moreover you may separate a whole data 
structure onto parts of different UPDATEs' rate.

  As application of the method, described above, the heading problem of tree
representation has the following solution. (Note: the method much more general)
    
fig.3                                                                    (3) 
                  tree
                .-----------.
                | #id       |                          N->1
                | parent    | reference is:    parent -----> id
                | l      INT|
                | r      INT|
                | info      |
                .___________.

where:                                                                   (*)
   ANY t from tree (t.l<t.r)
   ANY t1,t2 from tree ((t1.l<t2.l & t1.r>t2.r) <=> t1 is_forfather_of t2)
   NOT EXISTS t1,t2 from tree (t1.l<t2.l<t1.r<t2.r)  
   
It is clear to see the pairs (id,parent) and (l,r) represent exactly the same
information (tree layout), let the (l,r) to be "additional" (that because
(id,parent) is clearer to programmer). The tree(id,parent,info) is normilized.
Then the function to calculate (l,r) from (id,parent) is:

TRIGGER BEFORE INSERT INTO TABLE tree;
-- new record is represented by the variable "new"
BEGIN 
 -- input parameters here are new.id, new.parent as declared
 -- _and_ the field "r" of the parent record.
 -- the lawfulness of that trick is the result of the condition of the table:
 -- all the records already stored in the table are in the correct state.
 -- to achieve this condition it is sufficient to insert all records
 -- with this function and _initially_ have one correct record (e.g. 
 -- just a record satisfied (l<r)) 
 
 SELECT tree.r INTO new.l FROM tree WHERE tree.id = new.parent;
 new.r := new.l+1;
 
 -- recalculate existing l,r to satisfy (*)   

 -- shift right boundaries of all right relatives and all forfathers
 -- to right by 2

 UPDATE tree SET tree.r=tree.r+2 WHERE tree.r>=new.l;
 
 -- shift to right by 2 all left boundaries of right relatives (not forfathers)
 
 UPDATE tree SET tree.l=tree.l+2 WHERE dept.l>new.l;

 RETURN new;
END;  

The function redefines new.l, new.r ignoring a user input of them. 
So, to keep the tree integrity, just forbid users to update (id,parent,l,r).

If you have no need in leaf property l+1==r, then nothing to be done
(except foreign key check) when deleting rows.

  The function do only guarantee enough nest size to insert a new nest,
so nests are growing wider. For the integer type is bounded in a few bytes
the following service procedure may be useful.


CREATE FUNCTION tree_renum() RETURNS INT 
DECLARE c INT;
BEGIN
 c=subtree_renum(0,0);
 RETURN c/2;
END;

-- parameters
-- $1 - current node id 
-- $2 - next l value
-- return - next r value

FUNCTION subtree_renum(INT,INT) RETURNS INT
DECLARE k record; c int4;
BEGIN
  c=$2;
  UPDATE dept SET l=c WHERE id=$1;
  c=c+1;
  FOR k IN SELECT id FROM dept WHERE parent=$1
  LOOP
   c=subtree_renum(k.id,c);  -- really don't know how to comment this!
  END LOOP;
  UPDATE dept SET r=c WHERE id=$1;
  RETURN c+1;
END;


For the sake of functionality here is the procedure to move a subtree
within the tree.

-- parameters
-- $1 - subtree root-node id
-- $2 - desired father node id (to be parent of $1) (e.g. "move to")
FUNCTION move_subtree(INT,INT) RETURNS INT AS '
DECLARE 
 sh INT; nil INT;
 tmp record;
 old record;
BEGIN
 SELECT * INTO old FROM tree WHERE id=$1; 
 SELECT * INTO tmp FROM tree WHERE id=$2; 
 
 -- if ($2 is already father of $1)
 IF tmp.id=old.parent THEN
  RETURN -old.id;
 END IF;
 
 -- forbid cycles! (if the desired father belongs to the subtree)
 IF tmp.id IN (SELECT id FROM tree WHERE l>=old.l AND r<=old.r) THEN
  RETURN -1; 
 END IF;  
 
 -- else to work

 UPDATE tree SET parent=tmp.id WHERE id=old.id;

 -- shift all subtree nests to right to lay whole subtree to right from
 -- the left boundary of its new parent
 -- and simultaneously mark whole subtree (by negation of l) (!!!)
 UPDATE tree SET l= -(l-old.l+tmp.r), r=r-old.l+tmp.r 
  WHERE l>=old.l AND r<=old.r;

 -- compute enough new parent nest width to include subtree 
 sh=old.r-old.l+1;
 
 -- recalculate l and r,
 -- but do not touch the subtree (marked with l<0)
 UPDATE tree SET r=r+sh WHERE l>=0 AND r>=tmp.r;
 UPDATE tree SET l=l+sh WHERE l>=0 AND l>tmp.r;
 
 -- unmark the subtree
 UPDATE tree SET l= -l WHERE l<0;

 RETURN old.id;
END;

This procedure may also be defined as a trigger on update 
when new.parent<>old.parent (no serious upgrade of the procedure body is 
needed).

  Finally we have a good way to avoid mass UPDATEs of nests - use REAL numbers
as a nest boundaries. (Note: we loose a leaf property _only_). The advantage
is based on property:
ANY a,b from D, a<b (EXISTS c from D (a<c<b)).

  The physical boundary is the "machine zero". The tree_renum procedure still
remains, and even without upgrade at all. And a trigger on INSERT must be
similar to the following.

TRIGGER BEFORE INSERT INTO TABLE tree;
DECLARE 
 tmp REAL; -- or FLOAT as you wish
 tmp2 REAL;
 E REAL; -- a small number > 0
 D REAL; -- magic divider
BEGIN
 E:=10E-10; -- the smaller - the better
 D:=5;      -- we discuss later
 
 -- look for segment from rightmost nest on the desired level to right boundary
 -- of parent one
 SELECT max(r) INTO tmp FROM tree WHERE parent=new.parent;
 IF NOT FOUND THEN -- if parent nest is empty
  SELECT l INTO tmp FROM tree WHERE id=new.parent;
 END IF;
 SELECT r INTO tmp2 FROM tree WHERE id=new.parent;

 new.l:=tmp+E;
 new.r:=(tmp2-new.l)/D+new.l;

 RETURN new;
END;

Note: In PostgreSQL you may use type NUMERIC instead of REAL 
The E used to generate a correct ordering (remember, we declared exact 
unequations in (*)). And the D demands a discussion.

  It seems the grater D causes nests to growing small faster. Nest width
is (available_width)/D, so when we build a tree similar to fig.4, nests width
is estimated with o(D^(-n)), where n is level.

fig.4
             o root
             |
        |    o
       n|    |       
        v    .
             .
             |
             o leaf

  But every nest truncates available_width of its parent nest. And because
we not recalculate previously allocated nests, the brother nests are growing
narrower from left to right, estimation is o(((D-1)/D)^n) where n is number
of brothers in the brotherhood. It is obviously, the grater D causes nests to 
growing small slowly.

fig.5
             o root
            /|\
           / | \
          /  |  \
         /   |   \
        o    o... o leaves

            --->
             n

  So I propose to discuss a worst case - full k-tree of n-depth.
To simulate the case use a simple perl-script (or similar):

 $n=10; # tree depth set what you want
 $k=10; # tree width set what you want
 
 $l=1; # initial nest width 
       # (no need to set larger bacause of real-types implementation)

 $d=5; # Divider. VARY HIM ! and see results

 $ll=0; $last=0; # temp.

 for ($j=1;$j<=n;$j++)
  {
   for ($i=1;$i<=k;$i++)
    {
     $ll=$l/$d;
     print $j.' '.$i.' '.$ll.' '.(1/($l-$ll))."\n";
     $l=$l-$ll; # the current child takes place within the parent nest
    };
   $last=$l; # this for final output only
   $l=$ll;   # fall down to the last nest to allocate next level nests within
  };
 print $d.' '.$l.' '.(1/($l*$d)).' '.(1/$last)."\n";
 print 'D, last nest, first/last ratio, initial to final available_width ratio';

This script displays the weaker nests only. As described in fig.6

fig.6
                 o   root 
                /|\
               o o o        |
                  /|.       |j
                 o o .      |
                      o     v
                     /|\
                    o o o
                       /|\
                      o o o <-- last (narrowest nest all over the full tree)
                       
                      --->
                       i

  Try the script for desired tree parameters and different D. You'll see that 
lager D causes lower ratio first_nest_width/last_nest_width
_and_ it is as brighter effect as tree is wider.

  Unfortunately no extremum found for 
f(d)=first_nest_width(d)/last_nest_width(d) 
anywhere in the plane n*k.
So look at the script output and decide intuitive what D you need.

  As soon as E defined to be the closest positive number to a "machine zero" 
(to avoid reaching it itself), E represents an infinum of all operating 
segments length. Then to allocate a segment larger than E in some parent 
segment, parent segment must satisfy: 
    (l-E)/D > E  where l is parent segment width
then
    l > E*(D+1) that's the criteria!

If so then we can allocate Yet another segment in the nest, else we _must_
renumber the whole tree to balance segments width. (The tree_renum proc does
naturally balance all unused segments of all non-leaf nests.
Note: the criteria _precast_ a possible error, no error appears in the data
if use this criteria.

Note: in this article we took a look on positive numbers, only. It's just a
matter of simplicity of explaination. Actually, you also could place your
nested-tree below zero or even around... but you would advance nests width
two times maximum. (If you wish, simply set initial segment [-1,1])

And finally, what E is better to choose?
If we have a segment (l1,r1) and we have to allocate (l2,r2) the child
segment then l2=l1+E r2=l1+E+(r1-l1-E)/D => r2=l1+E+r1/D-l1/D-E/D
!!! E is divided by D each iteration so numbers l and r demand higher
exactness of machine representation than E.
But larger E cause loses of segments length.
I propose to choose E=0. If we append nodes to the tree to right of 
rightmost brother, then the equality of l1 and l2 will not cause problems,
but sort order (simple use r to sort nodes).

Then predicate "forfather" is (t1.l<=t2.l & t1.r>t2.r) 

  

  Described method still demands more more complex example, to illustrate a
normalization procedure and how its result looks when the method applied, but 
it cannot be bounded in a single article.


  The method successfully applied when design the databases (and examples also 
work well under PostgreSQL) on sites: 
market.23region.ru, some similar marketplace-sites,
org.kuban.info
and 
set.kuban.info (B2B-system ordered by local government)