<data> <name>Kevin</name> <name>Catherine</name> </data>
(feed 2): <data> <name>Brady</name> <name>Timothy</name> <name>Kevin</name> </data>
should become (master feed): <data> <name>Brady</name> <name>Timothy</name> <name>Kevin</name> <name>Catherine</name> </data>
Simple, right? For a human, yes, but for a computer…not so much. Let's break it down a bit and see.

The first problem I need to solve is comparing two XML feeds (one from storage and the latest one from the service). While it seems like a simple task, the problem really comes down to the fact that each feed carries two sets of data: the XML markup and the actual data. The bad news is that generic XML documents don't require any particular formatting or element order, so we cannot reliably count on a simple diff algorithm. The good news is that my case is a little easier than a generic XML document compare, since I know I'll be comparing feeds that all use the same tags. However, the feeds do change incrementally (check the example above), and I can't assume that whatever sits outside of the XML structure (spaces and line feeds, for example) is going to be consistent.

So the first thing we need to do is get all of our XML feeds into the same format, ideally with one 'record' per line. Then all we have to do is compare the feeds line by line to see which 'records' are missing from one or the other and merge them.

So first, a simple (and very generic) Perl solution for converting our feed into one line per 'record':

sub standardizefeed {
    # format feed so that we can compare apples to apples
    my $class = shift;
    my $feed  = shift;
    my $tag   = shift;
    $feed =~ s/\n//g;                     # remove any existing line feeds
    $feed =~ s/>(\s+)</></g;              # remove whitespace between tags
    $feed =~ s/<$tag/\n<$tag/gi;          # line feed before each opening record tag
    $feed =~ s/<\/$tag>/<\/$tag>\n/gi;    # line feed after each closing record tag
    $feed =~ s/\n\n/\n/g;                 # collapse the doubled line feeds between records
    return $feed;
}
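To see what that normalization buys us, here is a minimal standalone sketch. It assumes a plain function (the hypothetical `standardize_feed`, not the class method above) and two made-up copies of the example feed that differ only in whitespace. It is an illustration of the idea, not a general XML normalizer.

```perl
use strict;
use warnings;

# Hypothetical standalone version of the same regex steps, written as a
# plain function so the example runs on its own.
sub standardize_feed {
    my ($feed, $tag) = @_;
    $feed =~ s/\n//g;                     # remove existing line feeds
    $feed =~ s/>\s+</></g;                # remove whitespace between tags
    $feed =~ s/<$tag/\n<$tag/gi;          # line feed before each opening record tag
    $feed =~ s/<\/$tag>/<\/$tag>\n/gi;    # line feed after each closing record tag
    $feed =~ s/\n\n/\n/g;                 # collapse doubled line feeds
    return $feed;
}

# Two copies of the same data with different whitespace (sample input).
my $messy   = "<data>\n  <name>Kevin</name>\n\n  <name>Catherine</name>\n</data>\n";
my $compact = "<data><name>Kevin</name><name>Catherine</name></data>";

my $messy_clean   = standardize_feed($messy,   'name');
my $compact_clean = standardize_feed($compact, 'name');

# As raw strings the two feeds differ; once normalized they match.
my $verdict = $messy_clean eq $compact_clean ? "identical" : "different";
print "After normalizing: $verdict\n";
print "$messy_clean\n";
```

Both inputs come out as four lines: the opening `<data>`, one line per `<name>` record, and the closing `</data>`, which is exactly the shape the line-by-line compare needs.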
And now a simple (and very generic) Perl solution that uses standardizefeed to handle the actual compare of my two feeds:

sub mergefeeds {
    # we want to merge two feeds together
    # expects two feeds and, for each one, the text value of the tag to wrap lines by
    my $class    = shift;
    my $feed1    = shift;
    my $feed1tag = shift;
    my $feed2    = shift;
    my $feed2tag = shift;
    my $mergedfeed = "";

    # first format each feed so that they are apples to apples
    $feed1 = $class->standardizefeed($feed1, $feed1tag);
    # do the same thing to the 2nd feed
    $feed2 = $class->standardizefeed($feed2, $feed2tag);

    # now go line by line, creating a master document
    my @lines1 = split(/\n/, $feed1);
    my @lines2 = split(/\n/, $feed2);

    # pop the last line off the 2nd feed so we can add it back at the end
    my $lastline = pop(@lines2);

    # also remove the first and last line from the 1st feed so we don't duplicate them
    shift(@lines1);
    pop(@lines1);

    foreach my $thisline (@lines1) {
        my $dup = 0;
        foreach my $compline (@lines2) {
            if ($thisline eq $compline) {
                $dup = 1;
                last;    # no need to keep looking once we find a match
            }
        }
        if (!$dup) {
            push(@lines2, $thisline);
        }
    }
    push(@lines2, $lastline);

    # put it all together
    $mergedfeed = join("\n", @lines2);

    # and return it
    return $mergedfeed;
}
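The nested loop compares every line of one feed against every line of the other, which is fine for small feeds. As a sketch of one possible variation (not what the code above does), the duplicate check can be swapped for a hash lookup. The function name `merge_normalized` is hypothetical, and it assumes both feeds are already in the one-'record'-per-line format with the wrapper tags on the first and last lines.

```perl
use strict;
use warnings;

# Hypothetical helper: merge two feeds that are ALREADY normalized to one
# record per line, with the wrapper tags on the first and last lines.
sub merge_normalized {
    my ($feed1, $feed2) = @_;
    my @lines1 = split /\n/, $feed1;
    my @lines2 = split /\n/, $feed2;

    my $lastline = pop @lines2;    # closing wrapper tag, added back at the end
    shift @lines1;                 # drop feed 1's opening wrapper tag
    pop @lines1;                   # drop feed 1's closing wrapper tag

    # Index the lines already in feed 2 so each duplicate check is a hash lookup.
    my %seen = map { $_ => 1 } @lines2;

    for my $line (@lines1) {
        next if $seen{$line}++;    # skip records that are already in the merged set
        push @lines2, $line;
    }
    return join "\n", @lines2, $lastline;
}

# The example feeds from the top of the post, already normalized.
my $feed1 = "<data>\n<name>Kevin</name>\n<name>Catherine</name>\n</data>";
my $feed2 = "<data>\n<name>Brady</name>\n<name>Timothy</name>\n<name>Kevin</name>\n</data>";

my $merged = merge_normalized($feed1, $feed2);
print "$merged\n";
```

For the example feeds this prints Brady, Timothy, Kevin and Catherine (one record per line, wrapped in `<data>`), matching the master feed shown earlier. Records from the first feed are appended in the order they appear, the same as in the nested-loop version.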
And you would merge our example with something simple like:

$masterfeed = feedstorage->mergefeeds($feed1, 'name', $feed2, 'name');
It's not bulletproof (by a long shot), but it gets the basic job done for my needs (so far). Since it's all within a closed/controlled environment, I can feel pretty good about leaving it for now and moving on to the next part of the challenge (doing something with this larger stored feed)…
This is the personal blog of Kevin Marshall (a.k.a Falicon) where he often digs into side projects he's working on for digdownlabs.com and other random thoughts he's got on his mind.