The Right Glue
June 2009
By Dean
Some things I have seen in the last couple of weeks have made me want to talk to you about my views regarding anonymity in general and this blog's comment system specifically.
Let's get this out of the way first: no, I am not just talking about 4chan. I'm talking about account-creation, moderation, commenting and other general stuff like that. Yes, I have advocated anonymity before and will certainly do so again today.
It all started when someone on Stack Overflow brought up the question "How can I moderate trolls algorithmically?" My answer was, of course, you can't. There is no way to distinguish a benevolent human from a malevolent one, at least with today's technology. The best we can do is detect non-human spammers based on their predictable behaviour. Malevolent humans will pass any Turing test by definition.
Therefore, moderation has to be something a human does. That means you have to hire moderators (or find volunteers) who spend significant portions of the day keeping discussions on track, removing spam and content that doesn't fit into the site's scope. This last part is largely subjective and, left in the hands of a single moderator, is largely undemocratic. That's why I advocate either no moderation whatsoever, or a system where users decide what to do with offending content (and moderators just execute the steps the users lay out for them).
As a libertarian I fundamentally believe in the freedom of expression, which is why I won't moderate comments on my site unless they are spam. But at the same time I believe every site should govern its own community in any way that it wants. I simply won't use most communities that have authoritarian moderation policies.
What I found most disturbing about the Stack-Overflow question was an answer advocating the abolishment of anonymity by linking user accounts with actual people, across the entire Internet. Kind of like an ID card you take with you to every site you visit, so that if you make a mistake on one site, everyone will know about it on every other site. The idea, put forward by ReadWriteWeb, is that malevolent Internet users (called "trolls") are only thus because there is no connection between their online personas and their offline ones. If there is no direct personal consequence of an action, trolls will take the action.
I disagree with this idea. ReadWriteWeb is confusing separation of person and persona with anonymity. I agree that separation of person and persona is a bad thing. It usually causes people to create astronaut personas in order to gain as much attention as possible. ReadWriteWeb says that if an actual person is attached to the persona, it will be harder to gain attention illegitimately. To me, it's more likely that people will simply create fake persons in order to back their personas. In fact from my point of view this is already being done all over the Internet.
ReadWriteWeb talks of requiring accounts on every site so that individuals can be tracked and followed wherever they visit. In order to post something to a site, they would require an account be created. I consider accounts to be a terrible thing. Accounts are a barrier to entry and instead of leaving an insightful comment, I'm more inclined to go look at something else if I need to sign up to leave it. Similarly, if I can't access a website's content because I need an account, I'm more likely to not access the content at all rather than sign up.
This is what I had in mind when I set up this site's comment system. There are no accounts, no signups. If you have something to say you can say it without any barriers other than your own typing ability. Good ideas need not be hindered by password prompts and account-profile maintenance.
What does deal with the attention-grasping problems of the Internet is anonymity. True anonymity: no barriers to anyone accessing or adding content; no way to trace content back to its author. The idea, advocated by image boards, is that authors can't be tracked. Since they have no accounts or names, they can't create astronaut personas, which makes it impossible to gather attention in any long-term way. The best you can do is gather short-term attention, and the best way to do that is to be insightful or funny.
Likewise, I accomplish the same idea in my comment system by not requiring names, and by not leaving tracking cookies of any kind on users' PCs. There is nothing linking comments together other than a name, which can be faked or changed at any time. I also don't store IP addresses in anything other than my Apache logs, but you've only my word on that one.
I realize the irony of what I'm advocating since I'm doing it non-anonymously, but even if I were posting anonymously dedicated readers would still be able to determine who the person behind the site is through IP address tracking and whatnot. If you can't be 100% anonymous then the next best thing is to remove the separation of person and persona. I could be fake, but you'll just have to trust me.
By Dean
Invariably, people on Stack Overflow ask questions about how to parse XML, HTML and their ugly daughter XHTML. Ignoring the most obvious solution to their problem (which would be to use a pre-existing XML parser), they think they should use regular expressions (regex for short). Now they have two problems, to quote the famous anti-regex saying.
Now, don't get me wrong. I like regular expressions. Parsing, formal languages, finite automata; these are among my favourite things. Regular expressions course through my veins. That's why it hurts me so when people try to misapply them and then emit quotes like the one in the previous paragraph.
Regular expressions were not designed to apply in every situation. They are not even remotely close to a universal parser (that would be a Turing machine). For some reason, inexperienced programmers equate the concept of parsing things with using regular expressions. Not only is that just plain misinformed, it's often more work to bend an algorithm to use regular expressions in the wrong situation than it is to do something more directly.
Determining if regex is right for your situation is simple. Figure out the structure of the things you want to match. The set of all words you want to match is called a language. When you define a regular expression, you're defining a language. Since the language is defined by a regular expression, the language is called "regular". Regular languages are rather limited. They can't enforce things like reversal (in other words, no palindromes) or simple counting (no strings with the same number of 'A's as 'B's). With enough experience, you can easily get a feel for what languages can and cannot be expressed using regular expressions.
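To make the distinction concrete, here's a minimal Python sketch. The language "any number of 'A's followed by any number of 'B's" is regular, so a regular expression defines it exactly. The two limited cases mentioned above (equal counts and palindromes) are checked with ordinary code instead. The function names are mine and purely illustrative.

```python
import re

# "Any number of 'A's followed by any number of 'B's" is a regular language,
# so one regular expression defines it perfectly.
A_THEN_B = re.compile(r"A*B*")


def in_a_then_b(word: str) -> bool:
    """True if the whole word is some 'A's followed by some 'B's."""
    return A_THEN_B.fullmatch(word) is not None


def same_number_of_as_and_bs(word: str) -> bool:
    """Needs unbounded counting, which a regular language cannot do."""
    return set(word) <= {"A", "B"} and word.count("A") == word.count("B")


def is_palindrome(word: str) -> bool:
    """Needs reversal, which a regular language cannot do either."""
    return word == word[::-1]


if __name__ == "__main__":
    for word in ["AAABB", "ABA", "ABAB", "ABBA"]:
        print(word, in_a_then_b(word), same_number_of_as_and_bs(word), is_palindrome(word))
```

The last two checks are one-liners in plain code, which is rather the point: when the language isn't regular, doing something more direct is usually less work.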
Until that level of experience is attained, there are formal ways of proving languages are regular. The easiest way to prove a language is regular is by coming up with a regular expression that defines it perfectly. Often, the easiest way to prove a language is not regular is using something called the pumping lemma, which is sort of a strategy for picking a word in the language and showing that repeating part of it produces words the language doesn't contain, something that can never happen in a regular language.
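For reference, this is the standard textbook statement of the pumping lemma for regular languages. I've written the pumping length as n, and the symbols x, y, z and i are the usual ones rather than anything specific to this post.

```latex
\textbf{Pumping lemma.} If $L$ is a regular language, there exists an integer
$n \ge 1$ such that every word $w \in L$ with $|w| \ge n$ can be written as
$w = xyz$ with
\begin{align*}
|y| &\ge 1, \\
|xy| &\le n, \\
x y^{i} z &\in L \quad \text{for all } i \ge 0.
\end{align*}
```

To show a language is not regular, you pick one word of length at least n and show that every allowed split into x, y and z has some i for which the pumped word falls outside the language.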
Now, if you'll humour me, I'm going to prove that XML is nonregular:
XML is very well defined, and I won't go into its (very tedious) detail here, but suffice it to say that a simple XML document is of the form <word>donuts</word>, as long as both "word" parts are the same.

Informally, the pumping lemma tells us we need to come up with a valid XML document with certain specific properties. In this case, I'll be using the document <a^n>donuts</a^n>, where a^n stands for the letter a repeated n times (for n = 3, that's <aaa>donuts</aaa>). The n in this example is the one required by the pumping lemma. This is a valid XML document for any value of n greater than or equal to 1.

Now, the next step is to note that the first n characters in the XML document contain at least one character a whenever n is two or greater.

So, it's easy to see that if you were to take any substring of a characters from the first n characters of the document and repeat it, the opening tag would have a number of a characters other than n. But this repetition did not affect the </a^n> part of the document. So the two tags no longer match and the document is no longer valid XML.

If XML were a regular language, the document would still be valid. Therefore, XML must not be a regular language.
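If you'd rather see the pumping happen than take my word for it, here's a small Python sketch that tries it for one concrete n. It's a sanity check on a single instance, not a replacement for the proof. It assumes Python's standard xml.etree.ElementTree as the judge of well-formedness, and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def make_document(n: int) -> str:
    """Build the document whose tag name is the letter 'a' repeated n times."""
    tag = "a" * n
    return f"<{tag}>donuts</{tag}>"


def is_well_formed(text: str) -> bool:
    """True if the standard-library parser accepts the text as XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def pumped_variants(word: str, n: int, repeats=(0, 2)):
    """Yield x + y*i + z for every split with |xy| <= n and |y| >= 1."""
    for start in range(n):
        for end in range(start + 1, n + 1):
            x, y, z = word[:start], word[start:end], word[end:]
            for i in repeats:
                yield x + y * i + z


n = 3
document = make_document(n)
# Collect every pumped variant that the parser still accepts.
survivors = [v for v in pumped_variants(document, n) if is_well_formed(v)]

if __name__ == "__main__":
    print("original well-formed:", is_well_formed(document))
    print("pumped variants still well-formed:", survivors)
```

The loop also covers the splits where the repeated part includes the opening angle bracket, which is one of the edge cases the informal argument skips.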
Now, I recognize this proof might be over the heads of some people and obviously under the heads of those who are as familiar with the pumping lemma as I am (I leave filling in the edge cases of the proof as an exercise to the reader), but the bottom line is XML is not regular. Therefore, regular expressions are unsuited to parsing XML, since they clearly cannot be used to define an XML document. That's why, whenever anyone tries to parse XML with regular expressions, the expressions become more unwieldy and complicated the more they are refined. Misapplying regular expressions is what causes many to believe that regular expressions themselves are useless or overcomplicated.
So, I ask you, please stop trying to parse XML (or any other non-regular language) with regular expressions. It's a bad idea, and I can prove it formally.
I recognize that XML still needs to be parsed! Luckily most languages have built-in libraries that parse XML for you without you having to worry about the details: .NET has System.Xml, Java has javax.xml, etc. There are even languages specifically designed to query and transform XML quickly and reliably, like XPath, XSL and XQuery. These tools are all very easy to use and let you avoid writing overcomplicated regular expressions that won't even work in every situation anyway.
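As a rough illustration of how little code a real parser needs, here's a sketch using Python's standard xml.etree.ElementTree (my choice of language, not one of the libraries named above). The sample document and its element and attribute names are invented for the example. The findall call uses the small XPath subset that ElementTree supports.

```python
import xml.etree.ElementTree as ET

# A made-up document; the element and attribute names are illustrative only.
SAMPLE = """<order>
  <item kind="snack">donuts</item>
  <item kind="drink">coffee</item>
</order>"""

# The parser handles nesting, attributes and matching tags for us.
root = ET.fromstring(SAMPLE)

# Every <item> directly under the root element.
all_items = [item.text for item in root.findall("item")]

# Only the items whose kind attribute is "snack", via an XPath-style predicate.
snacks = [item.text for item in root.findall("./item[@kind='snack']")]

if __name__ == "__main__":
    print(all_items)
    print(snacks)
```

No regular expressions were harmed, and the same few lines keep working when the document grows extra nesting or attributes.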
Use regular expressions only on regular languages. When in doubt, if you're finding your regex is getting convoluted, chances are you should be using something else.
No rights reserved by Dean Whelton (who is awesome) of any of the content, images, design, backend or electrons used in this site. Steal at your convenience. None of it is worth stealing anyway, so there.