A simple way around comparing two XML files

Well, I’m finally getting back around to some work on the next set of fubnub features. This set includes the ability to do some more advanced things like group and sort the data within your feeds (or at least that’s the plan). To allow for this ‘advanced’ stuff, I had to transition the service from a simple pass-through service to one that stores or aggregates the feeds until publish time. And since you can schedule your posting for just about any time (including days/weeks/months down the road), I needed to figure out a ‘simple’ storage system that would let me store as little data as possible.

For the initial version I’m storing the data in a table within the fubnub database, but the longer-term plan is to dump the feeds out to S3 or some other cloud storage system. However, even with the ‘limitless’ storage space in the cloud, there is an issue of cost: you still have to pay for every transfer and for the space you take up (if Office Space taught us one thing, it’s that fractions of pennies do add up!).

So, the basic plan is to regularly hit the feed provider (right now once an hour) and pull the latest feed, then compare that version to the version I already have in storage, append any additions, and re-store it (if the feed hasn’t changed, I don’t touch my stored feed). Here’s a quick example to show what we want to do:

(feed 1):

<data>
  <name>Kevin</name>
  <name>Catherine</name>
</data>

(feed 2):

<data>
  <name>Brady</name>
  <name>Timothy</name>
  <name>Kevin</name>
</data>

should become (master feed):

<data>
  <name>Brady</name>
  <name>Timothy</name>
  <name>Kevin</name>
  <name>Catherine</name>
</data>

Simple, right? For a human yes, but for a computer…not so much. Let’s break it down a bit and see…

The first problem I need to solve is comparing two XML feeds (one from storage and the latest one from the service). While it seems like a simple task, the problem really comes down to the fact that each feed carries two sets of data (the XML markup and the actual data). The bad news is that generic XML documents don’t require any particular formatting or order, so we cannot reliably count on a simple diff algorithm. The good news is that, in my case, things are a little easier than a generic XML document compare, since I know I’ll be comparing feeds that all use the same tags. However, the feeds do change incrementally (check the example above), and I can’t assume that the text outside of the XML structure (i.e. spaces and line feeds) is going to be consistent.

So the first thing we need to do is get all of our XML feeds into the same format, ideally with one ‘record’ per line. Then all we have to do is compare the feeds line by line to see which ‘records’ are missing from one or the other and merge them. So first, a simple (and very generic) Perl solution for converting our feed into one line per ‘record’:

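sub standardizefeed {
  # format feed so that we can compare apples to apples
  my $class = shift;
  my $feed = shift;
  my $tag = shift;
  # collapse the whole feed onto a single line
  $feed =~ s/\n//g;
  # drop any whitespace sitting between two tags
  $feed =~ s/>\s+</></g;
  # start a new line before each opening record tag
  # (\Q..\E matches the tag literally; $1 keeps the feed's original case)
  $feed =~ s/(<\Q$tag\E\b)/\n$1/gi;
  # end the line after each closing record tag
  $feed =~ s/(<\/\Q$tag\E>)/$1\n/gi;
  # squeeze the doubled newlines between adjacent records
  $feed =~ s/\n+/\n/g;
  return $feed;
}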
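To see what the normalization buys us, here is a minimal sketch that runs the same substitutions over two differently formatted copies of the same feed. The standalone `standardize` function (no `$class` argument) and the variable names are my own assumptions for illustration, not part of the fubnub code.

```perl
use strict;
use warnings;

# Hypothetical standalone copy of the normalization steps (assumption: same
# regexes as the routine above, just without the $class argument).
sub standardize {
    my ($feed, $tag) = @_;
    $feed =~ s/\n//g;                    # collapse onto one line
    $feed =~ s/>\s+</></g;               # drop whitespace between tags
    $feed =~ s/(<\Q$tag\E\b)/\n$1/gi;    # newline before each opening record tag
    $feed =~ s/(<\/\Q$tag\E>)/$1\n/gi;   # newline after each closing record tag
    $feed =~ s/\n+/\n/g;                 # squeeze repeated newlines
    return $feed;
}

# The same records, once packed tight and once spread over several lines.
my $compact = "<data><name>Kevin</name><name>Catherine</name></data>";
my $spread  = "<data>\n   <name>Kevin</name>\n\n   <name>Catherine</name>\n</data>\n";

my $norm_compact = standardize($compact, 'name');
my $norm_spread  = standardize($spread,  'name');

# Both copies should now be identical, with one record per line.
print "same after normalizing\n" if $norm_compact eq $norm_spread;
print "$norm_compact\n";
```

Once both copies come out identical, a plain line-by-line string comparison is enough to spot which records are new.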

And now a simple (and very generic) Perl solution that uses the standardizefeed routine above to handle the actual comparison of my two feeds:

sub mergefeeds {
  # we want to merge two feeds together
  # expects two feeds and two text values of the tag to wrap lines by
  my $class = shift;
  my $feed1 = shift;
  my $feed1tag = shift;
  my $feed2 = shift;
  my $feed2tag = shift;
  my $mergedfeed = "";
  # first format each feed so that they are apples to apples
  $feed1 = $class->standardizefeed($feed1, $feed1tag);
  # do the same thing to the 2nd feed
  $feed2 = $class->standardizefeed($feed2, $feed2tag);
  # now go line by line, creating a master document
  my @lines1 = split(/\n/, $feed1);
  my @lines2 = split(/\n/, $feed2);
  # pop the last line off the 2nd feed so we can add it back at the end
  my $lastline = pop(@lines2);
  # also remove the first and last line from the 1st feed so we don't duplicate them
  shift(@lines1);
  pop(@lines1);
  foreach my $thisline (@lines1) {
    my $dup = 0;
    foreach my $compline (@lines2) {
      if ($thisline eq $compline) {
        $dup = 1;
        last;
      }
    }
    if (!$dup) {
      push(@lines2, $thisline);
    }
  }
  push(@lines2, $lastline);
  # put it all together
  $mergedfeed = join("\n", @lines2);
  # and return it
  return $mergedfeed;
}

and you would merge our example with something simple like:

$masterfeed = feedstorage->mergefeeds($feed1, 'name', $feed2, 'name');
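For anyone who wants to try the idea end to end, here is a self-contained sketch that runs the example feeds from the top of the post. It is a variation, not the code above: it swaps the nested loop for a hash lookup, takes a single tag instead of two, and the names `standardize` and `merge_feeds` are hypothetical. It also assumes both feeds share the same single wrapper element (`<data>` here).

```perl
use strict;
use warnings;

# Hypothetical standalone normalizer: one record per line.
sub standardize {
    my ($feed, $tag) = @_;
    $feed =~ s/\n//g;                    # collapse onto one line
    $feed =~ s/>\s+</></g;               # drop whitespace between tags
    $feed =~ s/(<\Q$tag\E\b)/\n$1/gi;    # newline before each opening record tag
    $feed =~ s/(<\/\Q$tag\E>)/$1\n/gi;   # newline after each closing record tag
    $feed =~ s/\n+/\n/g;                 # squeeze repeated newlines
    return $feed;
}

# Hypothetical merge: keep the latest feed as-is, then append any records
# that only exist in the stored feed.
sub merge_feeds {
    my ($stored, $latest, $tag) = @_;
    my @stored_lines = split /\n/, standardize($stored, $tag);
    my @latest_lines = split /\n/, standardize($latest, $tag);

    my $closing = pop @latest_lines;   # hold back the closing wrapper tag
    shift @stored_lines;               # drop the stored feed's opening wrapper
    pop @stored_lines;                 # ...and its closing wrapper

    # Remember every line already in the latest feed for quick lookups.
    my %seen = map { $_ => 1 } @latest_lines;
    for my $record (@stored_lines) {
        next if $seen{$record}++;      # skip records we already have
        push @latest_lines, $record;
    }
    return join "\n", @latest_lines, $closing;
}

my $feed1 = "<data>   <name>Kevin</name>   <name>Catherine</name> </data>";
my $feed2 = "<data>   <name>Brady</name>   <name>Timothy</name>   <name>Kevin</name> </data>";

my $master = merge_feeds($feed1, $feed2, 'name');
print "$master\n";
```

A handy side effect: merging a feed with itself just gives back its standardized form, so comparing the merged result to the standardized stored copy is one cheap way to decide whether anything needs to be re-stored at all.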

It’s not bulletproof (by a long shot)…but it gets the basic job done for my needs (so far), and since it all runs within a closed/controlled environment I can feel pretty good about leaving it for now and moving on to the next part of the challenge (doing something with this larger stored feed)…

This is the personal blog of Kevin Marshall (a.k.a Falicon) where he often digs into side projects he's working on for digdownlabs.com and other random thoughts he's got on his mind.
