SimpleDiff in Python

A while ago I posted a PHP implementation of a Diff algorithm I came up with[1]. Since it was well received, and it’s a useful little algorithm to have, I created a Python version as well.

There are a few performance improvements as well. The PHP version creates an array in memory proportional to the square of the size of the input, while the Python version’s array is directly proportional to the size of the input. I also sped up how the algorithm finds the indexes of the “new” elements in the “old” array.

Download simplediff.txt (re-name the file to simplediff.py to use)

[1] It is probably the same algorithm that others use, but I haven’t gotten around to getting an ACM membership to access the related papers

Posted on Feb 13th, 2008 in Python

Tail recursion in Python

After spending a lot of time in Scheme, it’s hard not to think in recursion. So recently when I started to improve my Python skills, I missed having Scheme optimize my tail recursive calls.

For example, consider the mutually recursive functions even and odd. You know a number, n, is even if it is 0, or if n - 1 is odd. Similarly, you know a number is not odd if it is 0, and that it is odd if n - 1 is even. This translates to the python code:

def even(x):
  if x == 0:
    return True
  else:
    return odd(x - 1)

def odd(x):
  if x == 0:
    return False
  else:
    return even(x - 1)

This code works, but only for x < 1000, because Python limits the recursion depth to 1000. As it turns out, it is easy to get around this limitation. Included below is a generic tail_rec function that could be used for most cases where you need tail recursion, and an example of it used for the odd/even problem.

def tail_rec(fun):
   def tail(fun):
      a = fun
      while type(a) == type(tail):
         a = a()
      return a
   return (lambda x: tail(fun(x)))

def tail_even(x):
  if x == 0:
    return True
  else:
    return (lambda: tail_odd(x - 1))

def tail_odd(x):
  if x == 0:
    return False
  else:
    return (lambda: tail_even(x - 1))

even = tail_rec(tail_even)
odd = tail_rec(tail_odd)

It’s not as pretty as the Scheme version, but it does the trick. Of course, the odd/even functions are just for the sake of a simple example and have no real-world use, but the tail_rec function could be used in practice.

Posted on Dec 13th, 2007 in Python

JSSpamBlock 2.0, ImageScaler 1.1

JSSpamBlock and ImageScaler were both originally one-day projects that turned out to be a bit more popular than I expected. Recently I have neglected to update them at all, but with reports of ImageScaler not working on WordPress 2.3, I decided to put a day aside and make some changes I had been meaning to make for a while.

A new version of ImageScaler was released last week (thanks to David Karlsson for doing most if not all of the work). I still got comments that it didn’t work with WordPress 2.3, so I installed WordPress 2.3 myself to see what the problem is. I didn’t have any issues, but I made some changes to ImageScaler that might make it more likely to work. If you still have problems with WordPress 2.3, let me know. I also made another major change - images hosted on other servers were previously ignored by ImageScaler and left as-is. Now they are mirrored on the server and can be re-sized properly. Also, images are now always resized so that the aspect ratio is preserved. You can download ImageScaler 1.1 from WordPress.

The new version of JSSpamBlock doesn’t need a database. It uses sessions instead. I also cleaned up the code a bit and tested it with WordPress 2.3. You can download JSSpamBlock 2.0 from WordPress.

Posted on Oct 20th, 2007 in JSSpamBlock, Image Scaler

JSSpamBlock-like protection for any website

I just noticed a trackback from Brandon Cheketts about a PHP script he has released that lets you incorperate functionality similar to JSSpamBlock in any website, called bcSpamBlock. He also released a WordPress plugin based on JSSpamBlock that uses the script.

Although both plugins take advantage of the same limitation of spam bots - that they ignore JavaScript, the way they verify the codes is different. While JSSpamBlock uses a database, bcSpamBlock uses one-way encryption to verify the codes. Although this is a clever way to do it, I chose not to do it in JSSpamBlock for a reason: Storing the code in a database ensures that, even if a spammer were to write a bot targeting sites with JSSpamBlock, each comment posted would require the bot to parse another page from the server. Each code sent to the browser can only be used once. The problem with not using a database is that you have no way to verify that the codes sent from the browser are being used for the first time, and not the 10th.

Georg Kaindl made similar comments about the database being unnecessary, and I wrote a more lengthy response explaining why it was. He then came up with a clever solution - including the post’s ID in the hash. It still isn’t quite as secure as JSSpamBlock (I hate to use the word “secure” to describe what I admit is “security-by-inconvenience”, but I can’t think of another word that fits), but for all practical purposes it should be just as good. The only difference is that spammers could post multiple comments to any given post while only parsing the page once, while JSSpamBlock would require the page to be parsed once for each comment. The other advantage is that I do not have to rely on the JSSpamBlock user to come up with a unique salt in order for the protection to be secure. bcSpamBlock gets around this in a clever way, by using unchanging environment variables to generate the salt.

Another way to look at it is that generating a random code for each page view does not actually increase security (over using the same code for each page view) unless you use a database. So for a plugin that doesn’t use a database, this only gives the illusion of security. You might as well use the code “4422″ for everything, and it would be just as secure. This might sound bad, but any bot that is currently blocked by JSSpamBlock would be blocked by this as well. The only reason JSSpamBlock does more is to make it harder to write a bot that specifically targets JSSpamBlock. It may sound egotistical to suggest that a spammer would ever bother to write a bot specifically targeting the plugin, but for the extra cost (milliseconds of CPU time), I think it is worth making the plugin future-proof.

Posted on Oct 11th, 2007 in JSSpamBlock

ImageScaler 1.0

This blog has been a bit slow since I started school, partly because of the extra work but also partly because the “just for fun” projects I have been working on have gotten larger. At the same time, I hate to neglect my existing projects to start other ones. Given that, I was very lucky to have David Karlsson, who had released a modified version of Image Scaler, agree to incorperate the original functionality back in so that I could make it an official release. The biggest improvement is that you can now set a maximum width and height, which are used to resize all the images. So if your theme breaks with images over 600 pixels in width, Image Scaler is a graceful way to stop this from happening.

You can download Image Scaler 1.0 from WordPress, where it is hosted.

Posted on Oct 04th, 2007 in Image Scaler

Image Re-hoster

I have been following the progress of a start-up called microPledge for a while now. The site lets you post an idea for software you would use and pledge money towards its creation. I came across a proposal for a piece of software that lets you copy an image from any website to your server in a few clicks. It is a problem I have had before, and I figured I could use the software too, so I made a bid to create it for $10 and I got the project. Before I finished, a few more people pledged on it, so I ended up making $23 for it. This was still not much money for the work I did, but the cool part is that at the end of the project I can release it to the public (in fact, microPledge does this automatically). So the way I look at it is, I would have spent that time writing open-source code anyway, I might as well get a free lunch or two out of it.

hoster_2.png

As far as I can tell, this was the first open-source project to be completed through microPledge. We ran into a few bugs, but Ben Hoyt, one of the site’s founders, was very helpful and fixed all the bugs we ran into within hours. If you want to pitch an open-source project idea to programmers, microPledge is the place to go. Greg, the guy who pitched the idea and pledged towards it, had a good experience as well.

As for the script, basically it sits on your server and lets you add a little button (a bookmarklet) to your browser. When you click the button, all the images on the page you are on have a little cross-hair over them when you mouse-over them, and when you click them they are downloaded to your server and you are shown the URL. This is great for when you come across a GPL or fair-use image so you don’t have to go through the process of manually uploading it to your server, or hotlinking it. You can find the latest version hosted at microPledge.

Posted on Aug 19th, 2007 in PHP

Proper Image Resizing for WordPress

WordPress has a cool WYSIWIG editor that lets you easily resize images by dragging the corner around. The problem is that WordPress does not actually resize the image, it just tells the browser to display it smaller. This means that the full sized image is being sent to the browser, which makes the page load slower and take up more bandwidth. Additionally, most browsers are bad at resizing images, so the images look worse than if they were properly resized.

To get around this, I wrote a WordPress plugin called ImageScaler. I am still waiting for it to be approved by WordPress for hosting, so I have hosted it myself for now. It requires GD (almost all web hosts with PHP will have GD). It should work with PHP 4, but it has only been tested on PHP 5.

Here is an example of a scaled image (click it for the original):

fractal1.png

Download ImageScaler 0.1

Update: the plugin is now hosted by WordPress.

Posted on Jul 30th, 2007 in WordPress, PHP

Garden Path Sentences

I came across a post on the Powerset Blog recently about garden path sentences, or sentences that lead you down the wrong path through a string of words with multiple meanings. For example,

The complex houses married and single students and their families

In this case, most readers would probably think complex was an adjective that modified the plural noun houses. The post ended with a challenge - how easy would it be to create a program to automatically generate these sentences. Since school is out and I have some free time, I tried it myself. I found a decent free xml dictionary, and wrote a Ruby script to parse the important bits (the type of word and alternate forms) into an SQL database. I cross-checked all the words against a word frequency table to make sure there were no obscure words. I then wrote a Python script to put the words together into a (hopefully meaningful, but not often) sentence. I put the Python script onto my server so you can play with it here.

gardenpath.png

As you can see, the sentences that it comes up with are far from meaningful. However, in most cases you can at least see how a reader could be taken down the wrong path (at least in the cases where there is a right path). In the above example, concrete could be an adjective or a noun, and spheres could be a noun or a verb (to form a sphere). Foster could be an adjective or a noun depending on the context, but I couldn’t see the reader seeing it as an adjective here. Certainly the sentence generator leaves a lot to be desired (especially considering that this was one of the better sentences), but I got about as far with it as I expected to. I think it could be improved further with a few modifications:

  • Words in the database are already cross-checked to make sure they aren’t obscure, but often a word will be common as a noun and uncommon as a verb, or vice versa. I didn’t have a dataset that allowed me to determine if this was the case for a particular word.
  • The valency of verbs is ignored. All verbs are assumed to be transitive, even though this information is given in the database.
  • I underestimated the difficulty of having a computer generate a meaningful sentence. It is difficult to determine what verbs are compatible with what nouns, I guess you would need to parse a large amount of English text (perhaps some of Project Gutenberg - I think Wikipedia would not be varied enough but I could be wrong).

I noticed later that Ero Carrera had taken a similar approach to what I did, but with his linguistics experience he better anticipated the problems I ran into. He has some good ideas, and his post is an interesting read.

Posted on Jun 27th, 2007 in Ruby, Python

Endless Google Search

I felt like coding today, so I put together a little hack from an idea I have had for a while. What I came up with is a web search (powered by Google), that loads new search results as you scroll the page down. Try it, it’s actually pretty cool.

Here is how it works: there is a large div element at the bottom of the page just to take up space. When it comes onto the screen, an ajax request is made to the server to get the next 10 results from Google. The requests are made through Google’s SOAP api, which is no longer available, but I had an old API key so I was able to get it to work. I had all the client stuff working within an hour, but Google’s API took a while to figure out. Google uses SOAP, which is powerful but hard to code for compared to a simple GET API. It took me a couple of hours to get the server-side stuff working but it is still a hack, so don’t be surprised if you get an error or some unexpected behaviour.

It was designed for FireFox/Mozilla browsers. The only other browser I have tried it with is IE, which it does not work with. So if you are using Internet Explorer, you won’t see anything interesting.

Try it here

Posted on Jun 06th, 2007 in Web Apps, PHP

JSSpamBlock Modifications

The way JSSpamBlock has evolved since I first released it has reminded me why I love open-source. From day one, I had users pointing out bugs and features they would like added, sometimes even submitting a fix for the bug or adding a new feature in themselves. Here are some modifications I have come across on other blogs:

After Georg Kaindl and I had a discussion on whether a database was really neccesary (he made some excellent points on why this is not the case, though I still maintain that the extra protection is worth the small cost of time), he released a JSSpamBlock modification as a new plugin called simpleAntiSpam. He also came up with a clever way to require that the form be parsed once by the bot for each post (although the bot can make unlimited comments to a post once it has parsed the form). I have considered making this functionality the default in an upcoming version of JSSpamBlock, since it will be more than enough protection for the average user.

More recently, I got a comment from Brandon Checketts, who had modified JSSpamBlock so that the comment field names were different than the defaults. The reason was that even if spam bots adapt to JSSpamBlock, modified field names will throw them off. Although I can’t see anyone modifying their spam bots to specifically get around my plugin, I have always tried to design it as if they eventually would, so this will likely be a feature in future versions as well.

Kevin Pendleton, another user, has ported JSSpamBlock to Perl. His version is a bit simpler; it uses a hard-coded value instead of a randomly generated one. In my experience with bots, this should be enough to block out the vast majority of spam bots.

Posted on May 21st, 2007 in JSSpamBlock
Next Page »