A simple diff algorithm in PHP
A diff algorithm in its most basic form takes two strings, and returns the changes needed to make the old string into the new one. They are useful in comparing different versions of a document or file, to see at a glance what the differences are between the two. Wikipedia, for example, uses diffs to compare the changes between two revisions of the same article.
Solving the problem is not as simple as it seems, and the problem bothered me for about a year before I figured it out. I managed to write my algorithm in PHP, in 18 lines of code. It is not the most efficient way to do a diff, but it is probably the easiest to understand.
It works by finding the longest sequence of words common to both strings, and recursively finding the longest sequences of the remainders of the string until the substrings have no words in common. At this point it adds the remaining new words as an insertion and the remaining old words as a deletion.
You can download the source here: PHP SimpleDiff
WOW very impressive!
Comment by Jeff Kee — May 17, 2007 @ 1:39 pm
This is excellent work, Paul. Thanks for offering it. We’ve been looking for a nice, clean implementation of diff in PHP that is unconstrained by line breaks (we don’t care about where the line breaks are in HTML). We operate a non-profit project that allows volunteers to update our web site, and I need to see where the changes are so I can approve and activate them. I’ve chopped my users’ website updates on whitespace and ‘>’ tag ends, and used your code to present a very clear diff of their changes.
Comment by Bob Wildfong — May 31, 2007 @ 7:22 am
Nice work Paul, this could come in handy sometime
Comment by Hamish M — June 6, 2007 @ 5:42 pm
This is a great algorithm Paul, but I am trying to figure out how this is different from using PHP’s built-in array_diff() and array_intersect() functions to achieve this.
Your post reminds me of the Edit Distance dynamic problem I first came across in this video from MIT:
http://people.csail.mit.edu/bdean/6.046/dp/dp_8.swf
Comment by Rajesh Kumar — February 13, 2008 @ 10:01 am
Rajesh, interesting video. I’ve wondered how to best solve that problem before. I have a hunch that diffing two strings and finding the edit distance reduces to the same problem. However, I used a different algorithm than they used in the video. Their algorithm guarantees an optimal solution, so it might be fun to implement as well.
array_diff() and array_intersect() give the intersection and difference of two sets. When you are comparing two pieces of text, for example, it doesn’t do you much good to know what words have been inserted or removed without knowing where they were inserted and removed from.
Comment by Paul Butler — February 13, 2008 @ 1:40 pm
Ahh! I get it now. Thanks. I’m going to go away and try to implement a recursive algorithm that does the same thing but uses str_word_count($str, 2) which gives me the “where” bit.
Comment by Rajesh Kumar — February 13, 2008 @ 4:36 pm
Rajesh, neat, I’ve never seen str_word_count before, but it should do the trick nicely. I used array_keys() to do the same thing since I was working with an array, but if you just want to diff strings it makes more sense to keep them in string format. Let me know how the implementation goes.
Comment by Paul Butler — February 13, 2008 @ 6:33 pm
I’ve been studying your code and it’s very clever but unfortunately I am not as clever apparently! I am trying to figure out the correct (easy) method to use the diff array it generates to “roll back” the newer text to the old text so just the difference can to be stored.
I think I am getting caught up on how the array key is the word position of the OLD text and not the NEW text? Is that correct? Is there a better way to handle this problem?
Comment by _ck_ — April 12, 2008 @ 8:11 pm
Ck, the code is admittedly a bit hard to read. I found I had trouble understanding it myself when I recently read over it to make the Python version (which you should study instead if you know Python, since it is the better of the two). By easiest to understand, I meant the algorithm itself, not my cryptic implementation of it.
The big for loop is to find the largest substring between new and old. The string variables could be swapped for one another in this loop and the output should still be the same.
Comment by Paul Butler — April 12, 2008 @ 8:35 pm
Well essentially what I’m doing is taking the resulting array your subroutine generates. Then I loop through it an unset anything that’s not an array, so that leaves just the array of “diffs” which are indexed by the word position. That’s what I want to store so if there’s a 1000 word document, and one word inserted and one word deleted, only two small arrays are stored.
The problem is, when I have only the new document to work backwards from, the index positions in your array are based on the old document. It’s occurred to me I could recurse and recount the words every time I effect a change from the array before continuing to the next set and that might “line-up” the word positions but it doesn’t seem to be working and the complexity of the resulting code is giving me a migrane…
So I guess what I am really asking, is there a way to modify your routine so that it stores the index positions based on the new document which would make the reversal (undo) much easier for me? Perhaps after the forward-looking array is generated it could be converted into a reverse-looking one somehow?
(sorry I don’t know the correct terminology, completely self-taught and an amateur at best!)
Comment by _ck_ — April 12, 2008 @ 10:23 pm
I think calling diff with the arguments reversed (diff($new_string, $old_string)) might do what you want.
You might also want to look into delta encoding: (http://en.wikipedia.org/wiki/Delta_encoding).
Comment by Paul Butler — April 13, 2008 @ 6:46 am
With fresh eyes (and migraine finally gone) I figured it out by studying some samples step-by-step. The resulting code to do rollbacks is so simple it’s embarrassing:
($diffdiff is the $diff with all non-array and blank elements removed but all other keys kept intact. $new is the current text, where we want rollback to $old which is our previous edit.)
$old=explode(” “,$new);
$offset=-1;
foreach ($diffdiff as $key=>$value) {
array_splice ( $old , $key+$offset , count($diffdiff[$key][’i']) , $diffdiff[$key][’d'] );
$offset+=count($diffdiff[$key][’d']) -1;
}
$old=implode(” “,$old);
tada!
(I hope the blog software doesn’t eat the code)
Comment by _ck_ — April 13, 2008 @ 11:50 pm
Wow, that is simple.. Thanks for posting it.
Comment by Paul Butler — April 14, 2008 @ 7:59 am