We’ve been having trouble with spam on some Dynamo sites, notably e-vent.co.za, which has been around in some form for many years and was therefore already well known to spammers when we launched it.
I’ve tried different things to deal with the spam over the months, and I thought I might as well post them here so that other people can benefit from them.
Abort on unknown form fields
The first thing I did was to make sure that only the form fields that are actually on the form get posted. This blocks the most basic (and abundant) spam bots, which just seem to broadcast every possible field name. It has the advantage of being cheap and easy to do, and it is probably a good idea for security in general. Also, no human being filling in the form can possibly add extra fields. The only way things can go wrong is if the template coder makes a mistake and accidentally adds a field with an unknown name.
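A minimal sketch of this check, assuming the posted data arrives as a dict-like object. The field names in ALLOWED_FIELDS and the function name are made up for illustration and are not Dynamo's actual ones.

```python
# Hypothetical whitelist: the field names the comment form really contains.
ALLOWED_FIELDS = frozenset({"name", "email", "url", "comment"})


def has_unknown_fields(posted, allowed=ALLOWED_FIELDS):
    """Return True if the POST data contains a field that is not on the form."""
    return any(field not in allowed for field in posted)
```

A request for which has_unknown_fields() returns True can be aborted before any other processing happens.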
Add some honeypot fields to forms
I added some extra hidden form fields to catch those scripts that just fill in every field. Basically, if one of them is filled in, I know it was filled in by a bot. Most bots are smart enough to ignore hidden fields, though, and you’ll get better mileage if you make it a normal field. This is of course a usability issue, and you’ll have to add an ugly label that says something like “Leave this field blank”. You can try putting it in a div with style="display: none", but then smart scripts can probably figure it out, and your HTML is still polluted with useless stuff, which annoys standards freaks like me.
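The server-side half of the honeypot could look roughly like this. The field name "website_confirm" is a hypothetical example; any name that looks tempting to a bot will do.

```python
# Hypothetical honeypot field names; humans are told to leave these blank.
HONEYPOT_FIELDS = ("website_confirm",)


def honeypot_tripped(posted, honeypots=HONEYPOT_FIELDS):
    """Return True if any honeypot field contains non-whitespace text."""
    return any(posted.get(name, "").strip() for name in honeypots)
```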
Check comment syntax
We already disallow most HTML tags and strip them out when we process the comment. We run some de facto standard filters on the body of the comment to turn things into paragraphs and to make links clickable. I noticed that many bots just send a whole lot of links, once for every type of syntax that is commonly used on blogs and forums. This includes the BBCode-style [url] tags and also <a> tags. Instead of just stripping them out, I now drop the user back at the form with an error message, similar to any other server-side validation error, saying something like “This comment contains tags that are not allowed.”
Therefore it is not even classified as spam, because many humans might make the same mistake. The user can just change his comment and submit it again whereas a spam bot is not likely to try. This has the added bonus that comments end up looking better on average, because link tags don’t just get stripped.
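As a rough illustration, the check can be a single regular expression run over the raw comment body before any filters are applied. The pattern and function name below are my own sketch and only cover the two syntaxes mentioned above.

```python
import re

# Matches BBCode-style [url], [url=...], [/url] and HTML <a ...>, </a> tags.
DISALLOWED_MARKUP = re.compile(r"\[/?url\b[^\]]*\]|</?a\b[^>]*>", re.IGNORECASE)


def validate_comment_markup(body):
    """Return an error message for the form, or None if the body is acceptable."""
    if DISALLOWED_MARKUP.search(body):
        return "This comment contains tags that are not allowed."
    return None
```

A bare URL such as http://example.com passes this check, so it can still be made clickable by the normal filters.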
Compare the number of links vs the number of paragraphs
I noticed that most humans will only include one or two links for every paragraph of text. People very rarely paste 10 or 20 links straight after each other, but spammers seem to think this is normal. So, after the comment has been converted to HTML, I count the paragraphs and the links. Then I work out the number of links per paragraph, and if it exceeds a certain threshold, I assume it is spam and drop the user back at the form, where he can change the comment before submitting again. A spam bot is not likely to bother ;) I don’t simply cap the number of links at a fixed value, because I feel that if a human takes the time to write a long response and references a lot of things, then that’s obviously valid.
I know this is the step that can most easily block real comments, but I feel it is necessary, because the worst spam comments are the ones that contain half a screen full of links. A user can always edit his comment and try again.
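The ratio check might be sketched as follows, assuming the comment has already been converted to HTML with <p> and <a> tags. The threshold of 2.0 is an assumed value for illustration, not the one used on the live sites.

```python
import re

MAX_LINKS_PER_PARAGRAPH = 2.0  # assumed threshold; tune it against real comments

PARAGRAPH_RE = re.compile(r"<p\b", re.IGNORECASE)
LINK_RE = re.compile(r"<a\b", re.IGNORECASE)


def too_many_links(html, max_ratio=MAX_LINKS_PER_PARAGRAPH):
    """Return True if the rendered comment has too many links per paragraph."""
    # Treat a comment without any <p> tags as one paragraph to avoid dividing by zero.
    paragraphs = max(len(PARAGRAPH_RE.findall(html)), 1)
    links = len(LINK_RE.findall(html))
    return links / paragraphs > max_ratio
```

Because the limit scales with the number of paragraphs, a long, well-referenced comment gets through while a single paragraph stuffed with links does not.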
Use Bayesian filtering methods
Lately spam bots have become smarter, and you see more and more comments that contain one paragraph of text that reads something like “Cool site. Thank you:-)” followed by a link, or comments that consist of only one link with the user’s name being the spammy keywords. These made it through my filters, so I had to look at something a bit more complicated.
I went looking for some Python-based Bayesian code and found Divmod’s excellent Reverend. I quickly did some tests using the spam that I had logged and the legitimate comments in the database, and I am very impressed with the way it works.
I now have a “ham/spam Bayesian db” file which I load whenever the app’s processes start up, and I check comments against that. I have a separate process to periodically update this db with the latest spams and hams. (This is basically so that I don’t suffer the performance penalty of reading the database on every comment and also so that I don’t run into issues with multiple processes loading and updating the same file.)
I haven’t rolled this out to the live server yet, because the changes are tied up with other new work, but it looks like it will solve my spam problems completely (for now). It was a lot easier to do than I thought, and I wish I had implemented it a long time ago.
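To show the idea without depending on any library, here is a tiny naive Bayes classifier in plain Python. It is a stand-in for illustration only and is not Reverend's API or algorithm; the class name, tokenizer and smoothing choices are all my own assumptions.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9']+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(text.lower())


class TinyBayes:
    """Illustrative two-class naive Bayes model with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = Counter()

    def train(self, label, text):
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def spam_probability(self, text):
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        tokens = tokenize(text)
        log_scores = {}
        for label, counts in self.word_counts.items():
            total_words = sum(counts.values())
            # Smoothed class prior, then smoothed per-word likelihoods.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            for word in tokens:
                score += math.log((counts[word] + 1) / (total_words + len(vocabulary) + 1))
            log_scores[label] = score
        # Normalise in log space to avoid underflow on long comments.
        highest = max(log_scores.values())
        weights = {label: math.exp(s - highest) for label, s in log_scores.items()}
        return weights["spam"] / sum(weights.values())
```

In use, you would train the model once on the logged spam and the approved comments, then compare spam_probability() against a cutoff of your choosing for each new comment.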
Ideas I haven’t tried
You can check the HTTP referer to see if it matches your comment form’s URL, but I suspect most spam bots behave a lot like humans nowadays and will populate this correctly.
You can do an extra keyword check on the commenter’s name. Recent bots have started putting their keywords there.
You can always require captchas or even site-specific passwords that you publish near the comment form. This is annoying to frequent commenters, though, and apparently some smart bots are quite good at dealing with captcha images. Generating the images will probably put extra, unnecessary load on your server too. I think it is a messy solution.
I thought of implementing a double opt-in for comments with scores close to the cutoff point. Basically, you send the commenter an email with a link he has to visit in order to activate the comment, or maybe his entire session. This requires that you track extra state, though, and further complicates things, but I’m fairly sure that few bots would receive the email and visit the link. You can even combine this with setting something in the user’s session.
A similar method: instead of doing a double opt-in, you present the user with an extra step after commenting, which is a captcha form. Very intelligent spam bots might handle a captcha on the initial comment form, but a second form will probably confuse them. Most users will never see this, and if you’re smart they will see it once at most. It doesn’t require that you store extra state with the comment (is_approved), and it means that human beings can moderate their own false positives without changing their comments.
You can implement an automatic whitelist. Once a person has added multiple successful comments, you can become more lenient in what you allow from him. There are various ways this can be implemented. See this article for an example of how it can be implemented for email.
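One possible shape for such a whitelist, sketched under loud assumptions: commenters are identified by email address (which is easy to forge, so a real version would want something verified), the counts live in memory rather than in the database, and "more lenient" means a higher links-per-paragraph limit. All names and numbers here are hypothetical.

```python
from collections import Counter

TRUSTED_AFTER = 3          # assumed: approved comments needed to become trusted
BASE_LINK_RATIO = 2.0      # assumed limit for unknown commenters
TRUSTED_LINK_RATIO = 4.0   # assumed, more lenient limit for trusted commenters


class CommenterWhitelist:
    """Tracks approved comments per commenter and relaxes limits over time."""

    def __init__(self, trusted_after=TRUSTED_AFTER):
        self.trusted_after = trusted_after
        self.approved = Counter()

    @staticmethod
    def _key(email):
        return email.strip().lower()

    def record_approved(self, email):
        """Call this whenever a comment from this address passes all checks."""
        self.approved[self._key(email)] += 1

    def is_trusted(self, email):
        return self.approved[self._key(email)] >= self.trusted_after

    def link_ratio_limit(self, email):
        return TRUSTED_LINK_RATIO if self.is_trusted(email) else BASE_LINK_RATIO
```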