Build your own Penn State community site using Dapper

It has been an observation of mine that some of the best online communities are the ones that most effectively mimic offline communities. Take Facebook for example. In my opinion, Facebook owes much of its success to its members’ use of real names. My news feed doesn’t say “OgReSLAYER91 is friends with Xx1337haXzorxX.” Instead, it could say “Steve Ballmer is friends with Bill Gates.” This adds tremendous value to a social network. To tap into this value, many developers turn to the Facebook API. Sometimes it makes no sense to start a social network from scratch. However, for Penn State-oriented developers, there is a decent alternative.

Penn State has an online directory, psu.edu/ph, where anyone can enter information about a student or faculty member at Penn State and find out information such as their full name, major, address, and a few other bits of miscellany. Using Dapper, I made a script which outputs a Penn State student’s or faculty member’s full name given their user id.

So a Penn State-oriented community site or social network would accept only those with a “userid”@psu.edu email address. Here’s a little proof of concept script to get started:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
< ?php
$userid = explode("@", $_GET[userid]); //user inputs email address
$domains = explode(".", $userid[1]);
if ($domains[0] !== "psu"){ //checks for psu domain
echo "<p>Sorry, we are only accepting registration from Penn State students.";
}else{ 
$xml = simplexml_load_file("http://www.dapper.net/RunDapp?dappName=psuph&v=1&v_userid=".$userid[0].""); //runs my dapp
$name = $xml->name;
$pieces = explode(" ", $name);
function uppercase($string){
$string = ucfirst(strtolower($string)); //converts all uppercase names to only first letter uppercased 
echo $string;
}
echo "<p>Welcome, ";
uppercase($pieces[0]); //outputs first name
echo " ";
uppercase($pieces[1]); //outputs middle name
echo " ";
uppercase($pieces[2]); //outputs last name
echo "!</p><p><i>a confirmation code has been sent to ".$_GET[userid]."</i></p>";
}
?>

The script accepts an email address then checks if it’s from Penn State. If it is from Penn State, the script (via Dapper) scrapes the person’s name from Penn State’s ph server, parses some xml and then outputs a nicely formatted name.

I’ve been playing around with this Dapp, trying to have it collect other data, such as addresses, but it is much less accurate.

I’d like to get a flash widget of this Dapp up, but have thus far experienced some difficulty. Until then, you can run the actual script here. Feel free to change the userid in the address bar of your browser to test it out yourself.

Penn State’s Schedule of Courses — un-”dapped” potential

Daehee Park recently wrote about scaping Penn State’s Schedule of Courses to create a free API of course data. His method of choice was to use Beautiful Soup for Python. As I have yet to learn Python, I will be taking a different approach towards the same end: Dapper. Dapper bills itself as a tool for creating an API for any website–hopefully this includes schedule.psu.edu.

To use Dapper, you must “train” it how to scrape your desired data. It supposedly works best on templated pages. You simply supply the “Dapp factory” tool with several urls of pages that use the same template. Then you highlight the desired data to scrape from each url. This process is very hit-or-miss. Selecting one piece of data may cause other data to be sporadically selected. Sometimes this data is what you wanted (in which case you may find yourself praising its algorithm as genius) and sometimes the data isn’t quite what you had in mind (in which case you’ll be cursing the algorithm for wasting so much of your time). Ilan Flint, of Dapper Support wrote:

Dapper’s method of action is finding an algorithm that will “understand” , in machine terms, what is the content that you wanted to choose.

The purpose and meaning of each element in a web page, something that is very intuitively understood by you, is completely incoherent to a computer, and that’s the gap that Dapper bridges.

Once your “dapp” has been created, you can choose from a wide variety of output formats including xml, rss, iCal, Google Maps, and html to name a few. There is even a feature where you can “link” the output of one dapp to be the input of another dapp.

While Dapper has huge potential, it can be very unrefined at times. It has been slow/crashing more than half the time I’ve been using it. I found that it works best using the Opera browser instead of the more mainstream Firefox and Safari. The folks over at Dapper are aware of these issues and are working on them. This is what Flint wrote to me in a support email reply to a question about the site’s performance issues:

What you experienced was a temporary issue - one that hopefully
will not repeat itself.

Dapper is aware of these issues popping-up every now and then,
and is constantly working hard to improve the stability and reliability.

You’ve still gotta give the Dapper people credit. Their interface uses ajax–something which can be prone to crashing and is still full of compatibility issues. I think they will continue to face an uphill battle when it comes to ensuring a smooth user experience.

Nevertheless, I can see many neat possibilities using the Penn State Schedule of Courses and Dapper. Here are some that sprang to mind:

  • an rss feed for the number of seat openings left in a given course
  • a Google Maps mashup of a student’s course schedule, showing schedule possibilities that require the least amount of walking (or the most amount of walking if you’d like to stave off those freshmen 15)
  • an iCal or Google Calendar mashup in which a student’s schedule can be easily imported into one of those formats

I’ll be trying to bring these ideas to fruition, but feel free to beat me to it.

In the meantime, you can play with a flash widget of a dapp I made. Simply type in a semester, campus, and department and the widget will return a list of course numbers. Don’t know where to begin? Try typing “fall” for semester, “up” for campus, and “pl_sc” for department. There may be some errors and I make no claims about the accuracy of the widget.

 Add to your site   powered by Dapper 

Recursion-less storage of hierarchical data in a relational database

One of the main problems with using a relational database such as MySQL is that it can be tricky to store and retrieve hierarchical information. Hierarchies are expressed with tree structures whereas relational databases use flat tables. So how can one fit this square peg into a round hole? Fortunately, there are several methods out there. Unfortunately most of these methods rely on recursion for either populating, updating, or deleting parts of tree. Check out this evolt article to learn more about the pitfalls of recursion (and even more ways to store hierarchical data). Here’s a quick overview of two methods which require recursion and one which never requires it.

Method 1: Parent/child

This is the most straightforward method.

id element parent
1 fruit null
2 orange 1
3 apple 1
4 granny-smith 3
5 red delicious 3


Each row contains the name of the element, its unique id and the unique id of its parent element. Every child has one and only one parent but a parent may have any number of children.
a hierarchy of fruit
A php function for converting such a table into a hierarchical tree could look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
< ?php
require_once "dbconnect.php";
 
function show_hierarchy_1($rootid){
 
$query = "SELECT * FROM table WHERE parent  '$rootid'";
$result = mysql_query($query);
while ($row = mysql_fetch_array($result)){
 
     echo $row[element];
     echo "<br />";
 
     show_hierarchy_1($row[id]);
 
 }
}
?>

Notice that this function is recursive. While dead-simple to understand, this function will use more and more processing resources as the database grows in size. There must be a better way. Enter method two.

Method 2: the modified preorder tree traversal algorithm

When I first read Sitepoint’s article about this, I was excited to find a method that did not require recursion to retrieve the hierarchy. Have a look-see:

element lt rt
fruit 1 10
orange 2 3
apple 4 9
granny-smith 5 6
red-delicious 7 8


Each element has a left value and a right value. The values are determined by traversing the tree starting from the left of the root element. This diagram can explain the process far easier than words can:
using the modified preorder tree traversal algorithm

Now this is where the magic happens. A php function to retrieve the whole tree would look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
< ?php
require_once "dbconnect.php";
 
function show_hierarchy_2($rootid){
 
$query = "SELECT * FROM table WHERE id = '$rootid'";
$result = mysql_query($query);
while ($row = mysql_fetch_array($result)){
 
     $query2 = "SELECT * FROM table WHERE lt &gt; $row[lt] AND rt &lt; $row[rt]";
     $result2 = mysql_query($query2);
     while ($row2 = mysql_fetch_array($result2)){
 
         echo $row2[element];
         echo "<br />"; 
 
  }
 }
}
?>

The simplicity of retrieving a hierarchy given a parent element’s left and right values is astounding. With a single query of the database, an entire tree is displayed, no recursion necessary. Unfortunately the modified preorder tree traversal algorithm has some potentially deal-breaking downsides. The first downside is the issue of populating a table with left and right values. Gijs Van Tulder (author of the Sitepoint article) provides such a method, given a table that already has parent/child values for each row.
The following is Tulder’s work:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
< ?php
function rebuild_tree($parent, $left) {
   // the right value of this node is the left value + 1
   $right = $left+1;
 
   // get all children of this node
   $result = mysql_query('SELECT title FROM tree '.
                          'WHERE parent="'.$parent.'";');
   while ($row = mysql_fetch_array($result)) {
       // recursive execution of this function for each
       // child of this node
       // $right is the current right value, which is
       // incremented by the rebuild_tree function
       $right = rebuild_tree($row['title'], $right);
   }
 
   // we've got the left value, and now that we've processed
   // the children of this node we also know the right value
   mysql_query('UPDATE tree SET lft='.$left.', rgt='.
                $right.' WHERE title="'.$parent.'";');
 
   // return the right value of this node + 1
   return $right+1;
}
?>

Tulder writes:

The recursion makes this a fairly complex function to understand. However, this function achieves the same result we did by hand at the beginning of this section. It walks around the tree, adding one for each node it sees. After you’ve run this function, you’ll see that the left and right values are still the same (a quick check: the right value of the root node should be twice the number of nodes).

Remember, we want to avoid recursion. Recursion is slow. Furthermore, modifying the tree requires recursion. Tulder explains two methods:

The first option is simple. You use the adjacency list method for updating, and the modified preorder tree traversal algorithm for retrieval. If you want to add a new node, you just add it to the table and set the parent column. Then, you simply rerun the rebuild_tree() function. This is easy, but not very efficient with large trees.

The second way to add, and delete nodes is to update the left and right values of all nodes to the right of the new node.

Tulder’s second method may be very light on system resources in one circumstance, and very heavy in another. For example, adding an element whose parent is the root element will not spare much processing power.

And this brings us to the wonderful third way.

Method 3: David Chandler’s patent

David Chandler, apparently is a programmer. Google him around if you want. The important thing is that he devised a method of storing hierarchical data in a relational database where recursion is never needed.

You can read the patent here. (Requires free registration)

Here’s the table:

element L1 L2 L3
fruit 1 0 0
orange 1 1 0
apple 1 2 0
granny-smith 1 2 1
red-delicious 1 2 2

With this method, there will be a database column for each level of depth. The root element will only have the L1 column set. Its children will have the L1 and L2 columns set and their children will have the L1, L2 and L3 columns set and so on and so forth. If you have no idea how “deep” your hierarchy might be, this method may not be for you. Otherwise, keep reading.
using chandler\
And here’s a php script to display the whole tree:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
< ?php
require_once "dbconnect.php";
 
function show_hierarchy_3(){
$query = "SELECT * FROM table WHERE l1 != '0' ORDER BY l1, l2, l3, l4";
$result = mysql_query($query);
while ($row = mysql_fetch_array($result)){
 
	$level = 1;
 
	if ($row[l1] != '0' ){
		$level = $level + 1;
	}
 
	if ($row[l2] != '0' ){
		$level = $level + 1;
	}
 
	if ($row[l3] != '0' ){
		$level = $level + 1;
	}
 
	echo str_repeat('___',$level);
	echo $row[element];
	echo "<br />";
 
	}
}
?>

Let’s say I want to add a child to the “orange” node:

1
2
3
4
5
6
7
8
9
10
11
12
13
< ?php
require_once "dbconnect.php";
 
mysql_query("INSERT INTO `table` (
		`element` ,
		`l1` ,
  		`l2` ,
  		`l3`
		)
		VALUES (
		'valencia', '1', '1', '1'
		);");
?>

using chandler\
And that’s it. The tree was updated without recursion. Deleting nodes is very similar to adding, just use DELETE FROM. There’s no update_tree() function to run, you can just run the show_hierarchy_3() script.

Chandler’s method could be considered an adaptation of the path enumeration model. The main difference between the two is that the path enumeration model uses a single database column with a delimiter to show an element’s hierarchy, whereas Chandler’s method uses multiple columns. In the patent, Candler writes:

Storing the digits of the Outline Number in separate columns allows for constructing very efficient queries of a type not possible with the prior art methods. In particular reference to the method shown in FIG. 3 , consider the steps required to move a data element within a tree (i.e. changing the Outline Number). First the Outline Number string needs to be parsed into the respective digits, then adding and subtracting the appropriate digits as necessary would need to be performed, and then the digits need to be converted back to string values and the new Outline Number reconstructed. By storing the digits of the Outline Number in separate columns the present invention substantially simplifies these operations by avoiding performing string operations that are typically slow and cumbersome to write, and by quickly isolating and operating on the appropriate digit in the Outline Number.

So there you have it. Enjoy.