Recursion-less storage of hierarchical data in a relational database

One of the main problems with using a relational database such as MySQL is that it can be tricky to store and retrieve hierarchical information. Hierarchies are expressed with tree structures whereas relational databases use flat tables. So how can one fit this square peg into a round hole? Fortunately, there are several methods out there. Unfortunately most of these methods rely on recursion for either populating, updating, or deleting parts of tree. Check out this evolt article to learn more about the pitfalls of recursion (and even more ways to store hierarchical data). Here’s a quick overview of two methods which require recursion and one which never requires it.

Method 1: Parent/child

This is the most straightforward method.

id element parent
1 fruit null
2 orange 1
3 apple 1
4 granny-smith 3
5 red delicious 3


Each row contains the name of the element, its unique id and the unique id of its parent element. Every child has one and only one parent but a parent may have any number of children.
a hierarchy of fruit
A php function for converting such a table into a hierarchical tree could look like this:

< ?php
require_once "dbconnect.php";
 
function show_hierarchy_1($rootid){
 
$query = "SELECT * FROM table WHERE parent  '$rootid'";
$result = mysql_query($query);
while ($row = mysql_fetch_array($result)){
 
     echo $row[element];
     echo "";
 
     show_hierarchy_1($row[id]);
 
 }
}
?>

Notice that this function is recursive. While dead-simple to understand, this function will use more and more processing resources as the database grows in size. There must be a better way. Enter method two.

Method 2: the modified preorder tree traversal algorithm

When I first read Sitepoint’s article about this, I was excited to find a method that did not require recursion to retrieve the hierarchy. Have a look-see:

element lt rt
fruit 1 10
orange 2 3
apple 4 9
granny-smith 5 6
red-delicious 7 8


Each element has a left value and a right value. The values are determined by traversing the tree starting from the left of the root element. This diagram can explain the process far easier than words can:
using the modified preorder tree traversal algorithm

Now this is where the magic happens. A php function to retrieve the whole tree would look like this:

< ?php
require_once "dbconnect.php";
 
function show_hierarchy_2($rootid){
 
$query = "SELECT * FROM table WHERE id = '$rootid'";
$result = mysql_query($query);
while ($row = mysql_fetch_array($result)){
 
     $query2 = "SELECT * FROM table WHERE lt > $row[lt] AND rt < $row[rt]";
     $result2 = mysql_query($query2);
     while ($row2 = mysql_fetch_array($result2)){
 
         echo $row2[element];
         echo ""; 
 
  }
 }
}
?>

The simplicity of retrieving a hierarchy given a parent element’s left and right values is astounding. With a single query of the database, an entire tree is displayed, no recursion necessary. Unfortunately the modified preorder tree traversal algorithm has some potentially deal-breaking downsides. The first downside is the issue of populating a table with left and right values. Gijs Van Tulder (author of the Sitepoint article) provides such a method, given a table that already has parent/child values for each row.
The following is Tulder’s work:

< ?php
function rebuild_tree($parent, $left) {
   // the right value of this node is the left value + 1
   $right = $left+1;
 
   // get all children of this node
   $result = mysql_query('SELECT title FROM tree '.
                          'WHERE parent="'.$parent.'";');
   while ($row = mysql_fetch_array($result)) {
       // recursive execution of this function for each
       // child of this node
       // $right is the current right value, which is
       // incremented by the rebuild_tree function
       $right = rebuild_tree($row['title'], $right);
   }
 
   // we've got the left value, and now that we've processed
   // the children of this node we also know the right value
   mysql_query('UPDATE tree SET lft='.$left.', rgt='.
                $right.' WHERE title="'.$parent.'";');
 
   // return the right value of this node + 1
   return $right+1;
}
?>

Tulder writes:

The recursion makes this a fairly complex function to understand. However, this function achieves the same result we did by hand at the beginning of this section. It walks around the tree, adding one for each node it sees. After you’ve run this function, you’ll see that the left and right values are still the same (a quick check: the right value of the root node should be twice the number of nodes).

Remember, we want to avoid recursion. Recursion is slow. Furthermore, modifying the tree requires recursion. Tulder explains two methods:

The first option is simple. You use the adjacency list method for updating, and the modified preorder tree traversal algorithm for retrieval. If you want to add a new node, you just add it to the table and set the parent column. Then, you simply rerun the rebuild_tree() function. This is easy, but not very efficient with large trees.

The second way to add, and delete nodes is to update the left and right values of all nodes to the right of the new node.

Tulder’s second method may be very light on system resources in one circumstance, and very heavy in another. For example, adding an element whose parent is the root element will not spare much processing power.

And this brings us to the wonderful third way.

Method 3: David Chandler’s patent

David Chandler, apparently is a programmer. Google him around if you want. The important thing is that he devised a method of storing hierarchical data in a relational database where recursion is never needed.

You can read the patent here. (Requires free registration)

Here’s the table:

element L1 L2 L3
fruit 1 0 0
orange 1 1 0
apple 1 2 0
granny-smith 1 2 1
red-delicious 1 2 2

With this method, there will be a database column for each level of depth. The root element will only have the L1 column set. Its children will have the L1 and L2 columns set and their children will have the L1, L2 and L3 columns set and so on and so forth. If you have no idea how “deep” your hierarchy might be, this method may not be for you. Otherwise, keep reading.
using chandler\
And here’s a php script to display the whole tree:

< ?php
require_once "dbconnect.php";
 
function show_hierarchy_3(){
$query = "SELECT * FROM table WHERE l1 != '0' ORDER BY l1, l2, l3, l4";
$result = mysql_query($query);
while ($row = mysql_fetch_array($result)){
 
	$level = 1;
 
	if ($row[l1] != '0' ){
		$level = $level + 1;
	}
 
	if ($row[l2] != '0' ){
		$level = $level + 1;
	}
 
	if ($row[l3] != '0' ){
		$level = $level + 1;
	}
 
	echo str_repeat('___',$level);
	echo $row[element];
	echo "";
 
	}
}
?>

Let’s say I want to add a child to the “orange” node:

< ?php
require_once "dbconnect.php";
 
mysql_query("INSERT INTO `table` (
		`element` ,
		`l1` ,
  		`l2` ,
  		`l3`
		)
		VALUES (
		'valencia', '1', '1', '1'
		);");
?>

using chandler\
And that’s it. The tree was updated without recursion. Deleting nodes is very similar to adding, just use DELETE FROM. There’s no update_tree() function to run, you can just run the show_hierarchy_3() script.

Chandler’s method could be considered an adaptation of the path enumeration model. The main difference between the two is that the path enumeration model uses a single database column with a delimiter to show an element’s hierarchy, whereas Chandler’s method uses multiple columns. In the patent, Candler writes:

Storing the digits of the Outline Number in separate columns allows for constructing very efficient queries of a type not possible with the prior art methods. In particular reference to the method shown in FIG. 3 , consider the steps required to move a data element within a tree (i.e. changing the Outline Number). First the Outline Number string needs to be parsed into the respective digits, then adding and subtracting the appropriate digits as necessary would need to be performed, and then the digits need to be converted back to string values and the new Outline Number reconstructed. By storing the digits of the Outline Number in separate columns the present invention substantially simplifies these operations by avoiding performing string operations that are typically slow and cumbersome to write, and by quickly isolating and operating on the appropriate digit in the Outline Number.

So there you have it. Enjoy.

Comments 10

  1. Andrew wrote:

    David Chandler’s patent seems to only be good if you know the depth limit of the tree - e.g. in your example, the limit is 3, so if you want a 4th child, you will have to create a new column. Therefore, this method is only good in certain cases, and not others, such as a discussion forum, because this method could create hundreds of columns in your table.

    Posted 16 Jul 2008 at 2:12 am
  2. admin wrote:

    I agree. For a forum, I would probably end up using Chandler’s method to store topics down to the level of an individual thread. Within each thread, discussions would be stored using the flat table model mentioned in the evolt article.

    Posted 17 Jul 2008 at 8:51 pm
  3. DQNOK wrote:

    I can’t believe he was granted a patent for this! It’s basic art; like building a picket fence. Good grief.

    This method is effectively the “Lineage” method mentioned elsewhere, but is more limited as the max depth must be known ahead of time. In the Lineage approach, an extremely large VARCHAR can be used which will allow MANY layers, but will only use the amount of storage actually required. This patented approach ALWAYS stores the maximum depth, even when it’s not needed. I considered this method for a BOM (Bill Of Materials) program (before I knew the stupid approach was patented!), but ultimately decided it was way too wasteful, and went with the Lineage approach instead. Presently I’m considering rewriting the app to use the Nested Set (or MPTTA) approach.

    Posted 18 Jul 2008 at 4:25 pm
  4. admin wrote:

    Once again, I agree that the need to know the depth is the main disadvantage of this method. However, while there are many cases in which the depth is unknown/impossible to predict, there are just as many cases where the depth will be constant. Take for example the storage of a course catalog at a college. I would have one column for semester, one for campus, one for department, one for course name, and one for sections. I have tried both the mptta and Chandler’s method for storing course data. The mptta becomes VERY slow.

    The two comments I’ve received for this post are essentially examples in which Chandler’s method will not work. That still does not negate the usefulness of it. It is one of many options, including the mptta and the path enumeration model.

    Posted 18 Jul 2008 at 5:21 pm
  5. DQNOK wrote:

    The admin said:
    “That still does not negate the usefulness of it. It is one of many options”

    How can it be an option when it is patented?

    I’ve not used patented software before (unless perhaps by accident because the patent office was so obtuse at to grant a patent to something that any reasonable programmer like myself just “came up with”), but it occurs to me that attempting to manage the rights to use it would become an absolute nightmare unless you KNEW it was only going to be used in one app, and you yourself entirely controlled the distribution. Even then, without some automated way of keeping your distribution mechanism synchronized with your patented code use, you still run the risk of being in violation. A violation COULD spell financial ruin for your employer, and ultimately for yourself.

    I appreciate your point about it being a valid technical option in certain limited circumstances. But for me, the complexity of introducing patented software into my projects far outweighs any performance benefit I might get. In my app with over a quarter million records, the Lineage method (or “path enumeration method”) yielded queries so fast I didn’t even bother to time them. Virtually instantaneous.

    Thanks for the article though. Good explanations and diagrams. I appreciate the work you put in on it.

    Posted 21 Jul 2008 at 1:44 pm
  6. Keli Hlodversson wrote:

    The modified preorder tree traversal way of storing the hierarchy (also called nested set model) does not require recursion to update the tree. The rebuild_tree() function shown above converts a parent/child hierarchy table to a modified preorder tree traversal table.

    To update the nested set table directly, one only has to update the left/right indexes appropriately. See this article at mysql.org for more information on how: http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

    Posted 28 Jul 2008 at 8:46 am
  7. MValdez wrote:

    This is so sad… I have just wrote a piece of code to maintain a very large organizational chart (more than 100,000 users) with a fixed depth (12 levels). After used an adjacency list (the first method) without success (too slow) I decided to use a flat list, with a field for each level… and then I discovered the damn thing has already been patented. Oh, my…

    I agree with DQNOK, using patented algorithms is not that simple if you want to avoid any violation.

    I searched to check if the patent was not really just an application (meaning no patent was granted) but it seems the patent was granted. The algorithm is so simple, why did they do this?

    Urgh.

    Anyway, back to the rewrite my code. I think the mptt will be my answer.

    Regards,

    MV

    Posted 29 Jul 2008 at 4:46 am
  8. PMC wrote:

    I’m in search of the “golden” algorithm to store hierarchical data in a relational database, probably a futile quest, I know. Any one got any ideas?!

    p.s. David Chandler’s patent is a joke, let me guess it’s an USA patent.

    Posted 30 Jul 2008 at 1:35 am
  9. MValdez wrote:

    Cool. As I said before, after trying to use the adjacency list method (one of the recursive methods) and failing due to performance issues, and after discovering my “new” simpler method has been patented by somebody, I started to search for new ways to do build my organization-tree manager.

    After searching (thanks to the ACM for the ACM Digital Library) I discovered several variants of the Lineage method (or model) also called Path Enumeration or Materialized Path (already mentioned by DQNOK), which indeed are part of a family of models called Tree Encodings.

    Anyway, now my software fly: not tens of times, not hundreds of times, but thousands of times faster. From several minutes to few milliseconds with minimal memory usage. (Hey, this is a web application, memory usage is critical.)

    I also learned a bit about some variants of the nested set model, but honestly, maybe I am too stupid but got a headache trying to understand them. Ok, maybe I’m not stupid, but I was in a hurry and the Path Enumeration thing is so simple and intuitive.

    Regards,

    MV

    Posted 20 Aug 2008 at 7:43 pm
  10. GregoryD wrote:

    I did something similar to Chandler’s solution, but it gets more and more unwieldy as you add more levels. Actually, my original concept was independently derived and identical to his.
    I have no idea how someone could get a patent for something so completely obvious.

    But we’re a very small web shop with a very limited amount of time (as in, I’m the only developer, but my non-CS trained manager has to have a hand in everything lest I leave the company and he remains clueless), and one of the snippets of code we use is a non-ajax driven javascript that uses option values and CSS class identifiers to drill down to the appropriate data.

    So if Fruit gives me an option to pick a fruit, then Apple will allow me to use javascript to match the Fruit option in one select box with the Apple option in the second. and Granny Smith will allow me to do the same between a second and third box.

    Chandler’s patent has another problem (given my example is more a problem of manager oversight than theory): it requires you to maintain a count of the items on a specific level.

    So, given the two issues, here was my solution:

    id parent l1 l2 l3
    1 fruit 1 0 0
    2 apple 1 2 0
    3 orange 1 3 0
    4 pear 1 4 0
    5 granny s 1 2 5
    6 blood or 1 3 6

    Basically, the appearance of an id at a level delimits the item to be at that level. I know that fruit is a top level item because it has 1 0 0. An apple is a 2nd level item because its l2 value equals its id. a granny s is a child of apple which is a child of fruit because its levels are 1 2 5. And so on and so forth, as many levels as you want to use.

    I no longer need to know how many varieties of apples I have, I just need to know that I want to place one in the table.

    No matter how many levels you have, all the nth-level child needs to worry about is who its ancestors are. The value at ITS level doesn’t matter unless it has children.

    Posted 18 Apr 2009 at 7:32 pm

Post a Comment

Your email is never published nor shared. Required fields are marked *