Programming

OpenID

Just added OpenID support to Feed Me. Feed readers are pretty much the perfect use case for OpenID: You really only need identity and nothing more.

It was actually a bit more complicated than I thought, but that might just be because of the learning curve. I think the Python OpenID Library could do with a nicer tutorial, examples, etc.

I tried two different Django implementations (django-openid and django-openid-consumer) and neither of them worked properly with the version of Python OpenID in Ubuntu. I applied some patches, but in the end I still had to patch things manually. Then I realised that I’m not using all the middleware and views, so I just made my own basic library based on that code. It only implements OpenIDStore and it is now part of the Seymour codebase.

It just felt a bit frustrating, because in the end I had to learn quite a bit about how OpenID works internally (although I’m still no expert) just to act as a consumer..

Anyway.. hope it works - I’ve only tested with myOpenID so far.

Thoughts on the modding communities

I started trying out some mods for Civilization 4 like Fall from heaven 2 which is absolutely amazing. I picked up on some things, though:

  • I hate forums. I just can’t understand how anyone actually gets anything done using them. I can never find what I want / need, information gets duplicated all over the place, most responses are absolutely useless and just clutter up everything, conversations aren’t even threaded.. Have these people never heard of good old IRC and mailing lists? Much better signal vs noise ratio. Even blogs would be better because they make it much easier to ignore comments. But that would make it more difficult for multiple people to participate. On the other hand, the audience might get bigger than if it was a forum. (I’m sure that if an article is actually an article it will be picked up more easily by someone outside of the forum like digg or reddit, for example)

  • No one seems to use wikis. A good wiki per project would really help sort out most of my gripes with forums: All the information in one place, always up to date, easy to search, easy to navigate, etc. I see there have been attempts at unofficial ones, but those failed probably because the core team didn’t embrace them. Trac would be a good place to start because it integrates almost everything a project needs.

  • Where’s the source code repository? Has no one heard of cvs, subversion, bazaar, mercurial, git… the list goes on. Why can’t we get up-to-date versions? I understand that these mods have lots of data, but it would be cool if you can at least get the latest versions of the code.. Would make it much easier to contribute.

Mods are generally free and community based efforts so you would think it would be a bit more open than this. I’m not just picking on Civilization based mods, though - I’ve seen this elsewhere too.

Anyway.. I started thinking about this because I have some theories on why the AI in Fall from heaven 2 feels a bit lacking compared to Beyond the sword and maybe it would be easy to try and fix, but now I have no idea where to start. Unless I join the forum madness..

Keywords

Sean B. Palmer wrote something on Keywords and Cadences that I find brilliant. Still processing it all, but at least now I have something written by someone that’s obviously much smarter (and definitely much more articulate) than I am to refer to next time this subject comes up.

I feel that keywords are fairly useless in this day and age for searching. It can be useful to categorise things, but that’s only really for finding related things. This is often the case when people categorise or tag blog posts or photos. I think quite often the category or tag names are more useful to the author and not so much the reader - the reader is probably more interested in “related” items, see A very different approach to tagging. But I’m going off at a tangent and I’m not doing his article justice - there are many interesting things in there.

Dealing with comment spam

We’ve been having trouble with spam on some Dynamo sites, notably e-vent.co.za which has been around in some form for many years and therefore was already well known by spammers when we launched it.

I’ve tried different things to deal with the spam over the months and I thought I might as well post them here so that other people might benefit from it.

Abort on unknown form fields

The first thing I did was to make sure only the form fields that were actually on the form got posted. This is to block the most basic (and abundant) spam bots that just seem to broadcast every possible fieldname. This has the advantage of being cheap and easy to do and it is probably a good idea for security in general. Also, no human being filling in the form can possibly add extra fields. The only way things can go wrong is if the template coder makes a mistake and accidentally adds a field with an unknown name.
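A minimal sketch of this check in Python, assuming the posted data arrives as a dict. The field names and both helper functions are made up for illustration and would need to match whatever the real comment form defines.

```python
# Hypothetical field names; list whatever your own comment form defines.
ALLOWED_FIELDS = frozenset({"name", "email", "website", "comment"})


def unknown_fields(posted, allowed=ALLOWED_FIELDS):
    """Return the sorted list of posted field names that are not on the form."""
    return sorted(set(posted) - allowed)


def accept_post(posted, allowed=ALLOWED_FIELDS):
    """Reject the whole submission as soon as one unexpected field shows up."""
    return not unknown_fields(posted, allowed)
```

Returning the list of offending names (rather than just True/False) makes it easy to log which fields the bots are trying, and to spot a template that accidentally adds a field the server doesn’t know about.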

Add some honeypot fields to forms

I added some extra hidden form fields to catch those scripts that just fill in every field. Basically if it is filled in, I know it was filled in by a bot. Most of them are smart enough to ignore hidden fields, though, and you’ll get better mileage if you make it a normal field. This is of course a usability issue and you’ll have to add an ugly label that says something like “Leave this field blank”. You can try putting it in a div with style="display: none", but then smart scripts can probably figure it out and your HTML is still polluted with useless stuff which annoys standards freaks like me.
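The server-side half of the honeypot could look roughly like this. It is only a sketch: the decoy field names are invented, and the posted data is again assumed to be a plain dict.

```python
# Hypothetical decoy field names; a human is told to leave these blank.
HONEYPOT_FIELDS = ("homepage_url", "phone")


def tripped_honeypot(posted, honeypots=HONEYPOT_FIELDS):
    """True if any decoy field came back with non-whitespace content."""
    return any(str(posted.get(name, "")).strip() for name in honeypots)
```

Note that if you combine this with the unknown-fields check, the decoy names have to be on the list of allowed fields, otherwise every submission that includes them gets rejected for the wrong reason.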

Check comment syntax

We already don’t allow most HTML tags and we strip these out when we process the comment. We run some de facto standard filters on the body of the comment to turn things into paragraphs and to make links clickable. I noticed that many bots just send a whole lot of links, once for every type of syntax that is commonly used on blogs and forums. This includes the bbcode style [url] tags and also <a> tags. Instead of just stripping them out, I now drop the user back at the form with an error message similar to any other server-side validation error message, saying something like “This comment contains tags that are not allowed.”

Therefore it is not even classified as spam, because many humans might make the same mistake. The user can just change his comment and submit it again whereas a spam bot is not likely to try. This has the added bonus that comments end up looking better on average, because link tags don’t just get stripped.
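As a rough illustration, the check can be a single regular expression run before any stripping happens. The pattern and the function name below are assumptions, not the code running on the site.

```python
import re

# Matches bbcode-style [url], [url=...], [/url] and HTML <a ...>, </a>.
LINK_MARKUP = re.compile(r"\[/?url\b[^\]]*\]|</?a\b[^>]*>", re.IGNORECASE)


def validate_comment(body):
    """Return an error message for the form, or None if the body is acceptable."""
    if LINK_MARKUP.search(body):
        return "This comment contains tags that are not allowed."
    return None
```

Bare URLs still pass, because the normal filters turn those into clickable links later on.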

Compare the number of links vs the number of paragraphs

I noticed that most humans will only include one or two links for every paragraph of text. People very rarely paste 10 or 20 links straight after each other, but spammers seem to think this is normal. So, after the comment’s been converted to HTML, I count the paragraphs and the number of links. Then I work out a ratio for the number of links per paragraph and if it exceeds a certain number, I assume it is spam, drop the user back at the form and then he can change it before submitting again. A spam bot is not likely to bother ;) I don’t only allow a fixed number of links, because I feel that if a human takes the time to write a long response and references a lot of things, then that’s obviously valid.

I know this is the step that can most easily block real comments, but I feel it is necessary, because the worst spam comments are the ones that contain half a screen full of links. A user can always edit his comment and try again.
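A sketch of the ratio check, run on the comment after it has been converted to HTML. The cutoff of 2.0 is a placeholder (the text above only says “a certain number”), and counting opening tags with a regex is a simplification that assumes the HTML comes from your own filters.

```python
import re

PARAGRAPH = re.compile(r"<p\b", re.IGNORECASE)
LINK = re.compile(r"<a\b", re.IGNORECASE)

# Assumed cutoff; tune it against your own logged spam and real comments.
MAX_LINKS_PER_PARAGRAPH = 2.0


def links_per_paragraph(html):
    """Number of <a> tags divided by number of <p> tags (at least one)."""
    paragraphs = max(len(PARAGRAPH.findall(html)), 1)
    links = len(LINK.findall(html))
    return links / paragraphs


def too_many_links(html, limit=MAX_LINKS_PER_PARAGRAPH):
    """True if the comment should be sent back to the form."""
    return links_per_paragraph(html) > limit
```

Because it is a ratio and not a fixed count, a long response with ten paragraphs and ten links passes, while one paragraph with ten links does not.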

Use Bayesian filtering methods

Lately spam bots got smarter and you see more and more comments that contain one paragraph of text that reads something like “Cool site. Thank you:-)” followed by a link or comments that only consist of one link with the user’s name being the spammy keywords. This made it through my filters, so I had to look at something a bit more complicated.

I went looking for some Python-based Bayesian code and found DivMod’s excellent Reverend. I quickly did some tests using the spams that I logged and the legitimate comments in the database and I am very very impressed with the way it works.

I now have a “ham/spam Bayesian db” file which I load whenever the app’s processes start up and I check comments against that. I have a separate process to periodically update this db with the latest spams and hams. (This is basically so that I don’t suffer the performance penalty of reading the database on every comment and also so that I don’t run into issues with multiple processes loading and updating the same file).

I haven’t rolled this out to the live server yet, because the changes are tied up with other new work, but it looks like it will solve my spam problems completely. (for now) It was a lot easier to do than I thought and I wish I implemented it a long time ago.
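To show the idea behind the filtering, here is a toy naive Bayes classifier using only the standard library. This is not Reverend and does not use its API; the class, the tokenizer and the smoothing are all a simplified stand-in, and a real setup would also persist the trained counts to the db file mentioned above.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z0-9']+")


def tokens(text):
    """Lowercase the text and split it into simple word tokens."""
    return TOKEN.findall(text.lower())


class TinyBayes:
    """Toy two-class naive Bayes with add-one smoothing (not Reverend's API)."""

    def __init__(self):
        self.words = {"ham": Counter(), "spam": Counter()}
        self.docs = Counter()

    def train(self, label, text):
        """Add one known ham or spam comment to the counts."""
        self.words[label].update(tokens(text))
        self.docs[label] += 1

    def _log_score(self, label, toks, vocab_size):
        counts = self.words[label]
        total = sum(counts.values())
        # Class prior, then one smoothed likelihood term per token.
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for tok in toks:
            score += math.log((counts[tok] + 1) / (total + vocab_size))
        return score

    def spam_probability(self, text):
        """Posterior probability of spam; both classes must have been trained."""
        toks = tokens(text)
        vocab_size = len(set(self.words["ham"]) | set(self.words["spam"]))
        ham = self._log_score("ham", toks, vocab_size)
        spam = self._log_score("spam", toks, vocab_size)
        # Clamp the difference so math.exp cannot overflow on long comments.
        diff = max(min(ham - spam, 700.0), -700.0)
        return 1.0 / (1.0 + math.exp(diff))
```

Usage would be along the lines of calling train("spam", ...) for every logged spam, train("ham", ...) for every legitimate comment, and then comparing spam_probability(new_comment) against a cutoff of your choosing.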

Ideas I haven’t tried

  • You can check the HTTP referer to see if it matches your comment form’s URL, but I suspect most spam bots behave a lot like humans nowadays and will populate this correctly.

  • You can do an extra keyword check on the commenter’s name. Recent bots started putting their keywords here.

  • You can always require captchas or even site-specific passwords that you publish near the comment form. This is annoying to frequent commenters, though and apparently some smart bots are quite good at dealing with captcha images. It will probably put extra unnecessary load on your server to generate the images too. I think it is a messy solution.

  • I thought of implementing a double opt-in for comments with scores close to the cutoff point. Basically you send the commenter an email with a link he has to visit in order to activate the comment or maybe his entire session. This requires that you track extra state, though and further complicates things, but I’m fairly sure that few bots would receive the email and visit the link. You can combine this with setting something in the user’s session even.

  • Kind of similar to the method above, but instead of doing a double opt-in you present the user with an extra step after commenting, which is a captcha form. Very intelligent spam bots might handle the captcha on the initial comment form, but a second form will probably confuse them. Most users will never see this and if you’re smart they will only see this once at most. It doesn’t require that you store extra state with the comment (is_approved) and it means that human beings would be allowed to moderate their own false positives without changing their comments.

  • You can implement an automatic whitelist. Once a person added multiple successful comments you can become more lenient in what you allow from him. There are various ways this can be implemented. See this article for an example of how it can be implemented for email.
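The automatic whitelist in the last idea could be as small as a function that relaxes one of the earlier limits for people with a track record. Everything here is hypothetical: the parameter names, the numbers, and the assumption that you can count a commenter’s previously accepted comments.

```python
def link_limit_for(approved_comments, base_limit=2.0, trusted_after=3, bonus=2.0):
    """Links-per-paragraph limit, relaxed once a commenter has a track record."""
    if approved_comments >= trusted_after:
        return base_limit + bonus
    return base_limit
```

How you recognise a returning commenter (session, email address, OpenID) is the hard part and is left out of this sketch.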

Do frameworks discourage thinking?

Have been working quite a bit on a Django powered app the past two months or so. Over that time I realised that it doesn’t feel like I’m using Python on a daily basis. I think most of the interesting Python coding happens inside the framework’s source code, so the bit of Python coding I do is basically just playing around inside whatever framework I’m using.

The bulk of the work I’ve been doing has been Javascript and CSS (the admin interface is very ajaxy). The server-side stuff (apart from maybe the models and manipulators) has been a no-brainer so far - simply routine stuff. The most trouble I had was fiddling around with some deployment options like LigHTTPd, FastCGI, proxying to the dev server, rewrite rules, etc.

I don’t really know if this is a good or a bad thing. On the one hand, I feel like I’m not getting my daily healthy dosage of thinking in Python code, but on the other I can see how this can be a good thing in bigger teams. Juniors can easily pick up what’s going on, others will know where to slot in their code and things will just make sense because it all follows the same pattern (assuming your javascript and css is of the same standard).

url mapping/routing/dispatching in CakePHP

I ran into a limitation in cakePHP’s dispatcher - it is not domain sensitive. You can’t point /news/ or /projects/ or whatever to different controllers for different domains. Sure - you can check the domain in “news controller” or “projects controller” or whatever, but keeping them separate is what I’m after. Basically.. I have one cms that powers a bunch of related sites. These sites all have their own quirks, but a large part of the functionality is common.

This problem doesn’t only affect cakePHP - it seems to affect all the major web frameworks. I know Django’s url dispatcher and Routes (a python implementation of Rails’ routes) don’t allow you to treat different domains differently either. As far as I know (and I speak under correction here) cherrypy (and therefore turbogears) also doesn’t really allow you to send things to different places based on the domain name.

I suppose I have a bit of a “fringe case”, but I did have this problem before as well. The solution there was different, though. I knew the admin interface would only be on one domain and for all the others I didn’t know the urls beforehand anyway (and therefore had to calculate it off my data), so I just sent everything for the frontend to one controller action and then worked from there.

This time, I’ll just hack the dispatcher to include a route configuration file per domain and I’ll put all the controllers and views in domain-specific folders. (something like that)
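The general shape of a domain-sensitive dispatcher, sketched in Python rather than PHP. None of this is cakePHP’s API; the domains, the route tables and the controller names are invented to show the lookup order (domain-specific routes first, shared routes as the fallback).

```python
# Hypothetical per-domain route tables: host -> {path prefix -> controller name}.
ROUTES = {
    "site-a.example": {"/news/": "site_a.NewsController"},
    "site-b.example": {"/news/": "site_b.PressController"},
}

# Shared routes, used when a domain has no override of its own.
DEFAULT_ROUTES = {
    "/news/": "common.NewsController",
    "/projects/": "common.ProjectsController",
}


def resolve(host, path):
    """Pick a controller by domain first, then fall back to the shared table."""
    host = host.lower().split(":")[0]  # normalise case and drop any port
    for table in (ROUTES.get(host, {}), DEFAULT_ROUTES):
        # Longest prefix first so a more specific route wins.
        for prefix in sorted(table, key=len, reverse=True):
            if path.startswith(prefix):
                return table[prefix]
    return None
```

With one table per domain, the quirks of each site stay in that site’s own controllers while the common functionality lives in the shared ones.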

I wonder if other people had the same problem and how they tackled it..

on frameworks

Guido van Rossum recently blogged about the difference between libraries and frameworks. He gave Neil Schemenauer’s idea: a framework is basically an application with lots of hooks.

I just realised that a framework is what arises as a side-effect of refactoring your code. This is what happened with Rails and Django - they wrote apps and then extracted frameworks from them. These frameworks might be a good starting point for your own app, but if you do anything non-trivial, you’re likely to extend the framework and not just “use” it.

Therefore, worrying about frameworks is likely to just be a waste of time. Something I noticed recently when using cakePHP for a project at work is that I quickly begin changing cake (the framework) to fit my needs and if I don’t keep a close eye on it, then pretty soon I’ll have made so many changes that I might as well fork the project. My project is now the framework and I just hook things into it here and there. Which highlights again exactly how fine the line is between applications and frameworks..

PHP: My 2 cents

PHP bashing is trendy

I’ve seen quite a few links bashing PHP lately. I haven’t been pointing out PHP’s flaws myself, because that would just detract from me complaining about how much Coldfusion sucks ;) Luckily from the end of the month I won’t have to deal with Coldfusion anymore, so I can jump on the PHP bashing bandwagon again.

There are some relevant links here, here and here. I’m going to attempt to add to that and hopefully my arguments won’t be too muddled.

I’m not going to try and convince anyone to shift over to the other extreme (Java or .NET) because I’m a great fan of dynamic languages - I prefer Python to Ruby purely for aesthetic reasons, but both of them have some good frameworks and libraries to help with web development. Most of it only came out in the past year or two. For Python there is (among others) Django and Turbogears, for Ruby there is Ruby On Rails. No single framework will work for all types of projects and maybe you should rather just use some libraries, but that’s a separate discussion altogether.

Magic Quotes, Register Globals and other incompatibilities

The magic quotes setting really is a pain. You cannot just set the ini setting inside your script, because by then it already did the damage. If you use a database engine that requires values escaped differently (postgresql or sqlite, for example), then you have to unescape everything (if magic quotes is on) and then escape the values again.

register_globals is a minor annoyance. Anyone with a brain will develop code with register_globals off, but you never know what will happen when your code gets deployed on someone else’s server that’s got it turned on.

The fact that they sometimes make big changes between minor versions is really bad as well. 4.x.2 might not even be backwards compatible with 4.x.1 and by the time you reach 4.y.0 everything breaks, only for half of everything to be changed back to the way it was by the time they reach 4.y.3.

Sure you can code around all of this, but you really shouldn’t have to. The fact that PHP can differ so much between different minor versions, servers and deployments makes development and testing a lot more difficult than it should be.

Error handling

The fact that by default PHP just continues on errors and guesses values for things that are not defined, etc. is absolutely terrible. This is usually the first thing I disable and try and protect against. I usually try and catch absolutely any error, warning or notice and (if it is a development server) rather crash as loudly as possible, but this is actually a lot harder and less obvious than it should be. I’ve seen catastrophic things happen because (for example) two bits of string got appended together, the second variable wasn’t defined, PHP decided to use a blank string and this string then later formed part of a file path.

To add to the annoyance, there’s no proper/standard way for errors to be reported and handled. I really prefer exceptions and stack traces with as much debugging and context-related info as possible. You know - what any decent modern language and framework provides. Programming in an environment that doesn’t have good exception handling (try and except) can be very painful once you’re used to it. This is probably a bit off topic, but I really miss introspection and an interactive interpreter as well.

Suitability to other types of programming

PHP is only suited (arguably, at least) for writing scripts that run inside a webserver. In my experience, most systems usually require some other types of programming like some cron jobs, import and export of data, automated testing, debugging and analysis, maybe some services/daemons, etc. PHP is definitely not well suited for any of those.

This might not be relevant to everyone or even most people, but I ended up writing a big chunk of a system in some other language on more than one occasion. This meant that I couldn’t reuse any of my code and it introduced more dependencies and generally just more things that can break.

Shared nothing

PHP’s “shared nothing” style is probably a good thing in that it is safe for people that aren’t really programmers, but it is definitely not always the best solution. In fact, it seldom is. I really miss an application scope where things that don’t need to be retrieved or calculated on every request can live. It is often a serious problem how everything you need gets loaded up, included, compiled, etc. on every request. Yes there are accelerators that cache templates, but they are a workaround and don’t fix the underlying problem in my opinion.

Ugly urls by default

PHP is still a lot like CGI which effectively maps urls to files. Urls that end with .php?a=1&b=2, etc. are ugly and they tie things to the underlying implementation language. What if you rewrite your site in a different language one day? Then you have all those incorrect links to your content lying all over the web. You can get around this by using mod_rewrite or something similar, but in my opinion it often locks people into a “urls point to files” instead of “urls point to resources” way of thinking which is inhibiting imho.

The language and libraries suck ass

I read how someone once said that PHP syntax looks like the result of a drunken alley fondle between c and Perl. That’s a pretty accurate description in my opinion.

Similar to the way I see asp and Coldfusion, PHP started as a kind of template engine where things like database connectivity and some (slightly) more powerful programming concepts got added bit by bit later without any vision or attempt to work towards some consistent goal. The result is that it doesn’t have all the cool language features the other popular dynamic languages (Python and Ruby) have and the naming style of functions and things are as inconsistent as… I can’t even think of an analogy. That means you constantly have to refer to the documentation for things that really should be intuitive.

The builtin data types are just weird, no nice design patterns like iterators, no concept of convention in general, scoping is just plain terrible, etc.

PHP 5

PHP 5 fixes some of these things (but not the most important ones), but is so incompatible with PHP 4 that it really should just be treated as learning a new language. And as far as learning a new language goes, you can do much better.

My two cents on Ruby vs Python comparisons

Update: I changed the article a bit because after reading through it I realised that for someone who moans about uninformed comparisons, I definitely make too many uninformed comparisons myself. Initially this was just going to be a link to Ian Bicking’s blog with maybe my own conclusion, but then I just started yapping on and on. I also added a link to TurboGears.

I’ve seen a lot of (usually biased or at least uninformed) Python vs Ruby comparisons on the web lately. Some of them were even titled something along the lines of “Ruby on Rails vs Python”. uhum.. Rails is a framework and Python is a language.. I don’t even bother following links like that. Python nowadays has Django, anyway, which looks very interesting and is probably better for comparisons with Rails, but from what I’ve heard they aren’t even that alike. (Update: TurboGears also looks very interesting. I like the fact that it doesn’t suffer from “not invented here” syndrome.)

Rails was really the first thing that put Ruby onto the map. I’m pretty sure the number of Ruby developers increased by an order of magnitude since the release of Rails. If you don’t believe me then make a list of apps written in Ruby that don’t use Rails and then make a list of apps written in Python. If you still don’t believe me, check wikipedia. Often I’ve used an app for a long time and then only later realized that it was written in something other than C. Maybe this is because I’m an Ubuntu Linux and Gnome user (they do seem to favour Python), but I’ve also seen quite a few things like shell utilities written in Python (things that used to typically be written in Perl). Python also seems to be embedded as a scripting language into apps more often. I know this is probably irrelevant when comparing languages, but more projects mean more programmers mean more libraries, etc.

More programmers also mean more bloggers. But then why do I see so many more pro-ruby opinions on the net? Are Ruby supporters just generally Zealots while Python programmers shut up and get on with writing code, or is Ruby really better than Python? Is it all just a stupid waste of time like the editor wars? This bugged me for a while and I started thinking that maybe I’m missing out on something.

So I thought I would learn Ruby and then use it if it is better or write a big informed comparison between the two. I set out to learn Ruby and quickly became bored and discouraged the moment I realised that I don’t like the syntax. So much for me adding anything of value to the discussion. (I also think that it is near impossible to learn a language properly unless you are actually using it)

Ruby people usually quickly mention closures. I was pretty sure that I could retaliate with “well… Python has x, y and z language features that Ruby doesn’t!”, so I started searching for just that and then I found this. I think it is a really good comparison by someone who’s obviously a lot smarter and less biased than me. It basically leaves both languages tied.

So.. I have decided to conclude that it boils down to things like personal taste, what you’re used to, which language you learned first, what you’re comfortable with, etc. I discovered sites like lesscode.org and I’m starting to see a bigger picture: Exposure to dynamic languages and alternative platforms in general is probably a good thing. The Java, .NET, etc. guys that discover Ruby or Python and then start to read up about it and maybe try it out are likely to discover both languages and then compare them for themselves. In other words: exposure to one will soften them up for both. Besides.. discussing whether or not I like Ruby is distracting from more important things like “How much I hate almost all other languages” ;) If I get to choose between using Ruby or anything other than Python, I would definitely choose Ruby.

At least until the next new (better) language comes along.

dirt simple

Phillip J. Eby is a smart guy. From his post Chandler begins recovery from XML (which was also featured here):

The cost of adding things you don’t need is really, really high. Luckily, OSAF believes that it’s more important to get things right, than it is to keep throwing money down a rathole to justify the money already spent. I’ve certainly worked for organizations where the reverse is true, though, including one that threw away tens of millions of dollars trying to replace a small, well-designed Python application with an expensive piece of “enterprise” crapware. Ah, the things I could’ve done with that budget! Well, probably I just would’ve given everybody raises and maybe hired a few more people. Or maybe spun off my group as a company that would sell the software to other companies. Heck, we could’ve used it to buy free sodas for life for everybody working in the company and got more value for the investors than what was actually done with the money!

But I digress. The point is this: delaying feature investments good, sunk cost fallacy bad. Any questions?

Which reminds me: I want to start and publish a list of people that get it.

Temperature Zone script

I quickly wrote a little script to strip out the temperature reading from files in /proc/acpi/thermal_zone. Clearly this is just a quick little hack and there must be a better way (HAL?), but it works for me.

Maybe I’ll roll my own little gnome panel applet that displays cute little icons and things.

#!/usr/bin/env python3

def tempFromFile(filename):
    # The file is expected to hold one line like "temperature:   45 C".
    with open(filename) as f:
        l = f.read()
    try:
        # Drop the spaces and keep whatever follows the colon, e.g. "45C".
        return l.replace(" ", "").strip().split(":")[1]
    except IndexError:
        return "Unknown"

zones = (
    ("CPU", "/proc/acpi/thermal_zone/TZ1/temperature"),
    ("Video", "/proc/acpi/thermal_zone/TZ2/temperature"),
    ("Harddrive", "/proc/acpi/thermal_zone/TZ3/temperature")
)

for zone in zones:
    print(zone[0] + ": " + tempFromFile(zone[1]))

eval() considered harmful

Before I started writing this post, I googled around a bit thinking that someone would have blogged about this before. I was surprised to only find articles about how great and powerful the JavaScript function eval() can be. I’m sure there are some uses for eval(), but I have never seen it being used where there wasn’t a cleaner, better way to do the same thing.

I found a fairly good article here that covers eval(), but I don’t think it emphasizes enough that there is probably a better way of doing things in most (almost all) cases.

The reason I started writing this entry was because I’m currently in the process of making a very large web-app, one that literally has tens of thousands of lines of javascript code, Firefox compatible. Before I could replace something like document.all.([a-zA-Z_0-9#]*) with document.getElementById("$1") I had to fix up the hundreds of lines that unnecessarily use eval(), because often document.all is used inside eval and my regular expression would have failed miserably. That’s my main concern - eval() is one of those things that makes it more difficult to refactor your code. I’m also fairly sure that most of the times it was used it had a little performance impact on the code (not like it really mattered - none of the javascript was in a tight loop) because I’m sure it is slower than the conventional methods for most operations. People also committed the sin of building up a string that represents the path to the element they want to use and then running eval() on the same string probably 20 times in the same function..
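To make the refactoring problem concrete, here is the search-and-replace as a small Python sketch. The pattern is the one from the text with the dots escaped; the function name is made up, and the real job was presumably done with an editor or another tool.

```python
import re

# The pattern from the text, with the dots escaped: document.all.<name>
DOCUMENT_ALL = re.compile(r"document\.all\.([a-zA-Z_0-9#]*)")


def modernise(js_source):
    """Rewrite document.all.<name> into document.getElementById("<name>")."""
    return DOCUMENT_ALL.sub(r'document.getElementById("\1")', js_source)
```

On plain code this turns document.all.myField.value into document.getElementById("myField").value. On a line like eval("document.all." + name) the element name is not in the source at all, so the pattern captures an empty string and the result is broken javascript, which is why the eval() calls had to be cleaned up by hand first.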

So, to summarize: eval() is usually not necessary, often adds unnecessary complexity to your code, usually is a sign of bad quality code in general and almost definitely makes it more difficult to maintain.

btw, the example mentioned at the link above:

var fieldList = new Array('field1','field2','field3','field4','field5','field6',
    'field7','field8');

var tempObj;

for (var count=0;count<fieldList.length;count++)
{
    // Builds a string like "document.myForm.field1" and evaluates it.
    tempObj=eval("document.myForm." + fieldList[count]);
    if (tempObj.value.length==0)
    {
        alert(fieldList[count] + " requires a value");
    }
}

can be done without eval() like this:

var fieldList = new Array('field1','field2','field3','field4','field5','field6',
    'field7','field8');

var tempObj;

for (var count=0;count<fieldList.length;count++) {
    // Look the field up by name instead of evaluating a string.
    tempObj=document.myForm.elements[fieldList[count]];
    if (tempObj.value.length==0)
        alert(fieldList[count] + " requires a value");
}

UPDATE:

I have noticed two good uses of eval: JSON and AJAX. When you serialize something as JSON on the server, the easiest and quickest way to turn that into javascript objects is to eval() the string. Also, if you return javascript code blocks inside AJAX responses, then you have to eval() it for it to execute. Use with caution..