I have been experimenting with the comment settings on this blog to observe how readily spammers leave their undesirable footprints, also known as spam comments, on blogs. Someone has probably said that spam is as pervasive as the web itself, and that it grows as the web evolves and reaches a larger audience. Still, first-hand experience in analyzing spam comment trends and behavior should benefit me and other website owners and bloggers, especially those considering a loose commenting policy on their sites.
Interaction in social web sites
As the web evolves to be more social, we can also observe a higher level of interactivity on it. Interactivity is characterized not only by quicker page loads, better content presentation, and improved interfaces, but also by simpler ways for the users of a website or web application to interact with one another. In the bigger picture, interactivity encompasses both user-application and user-user interaction. For user-user interaction, the goal is a platform friendly enough for users to interact and communicate. In blogging specifically, user-user interaction takes the form of commenting on posts and on other users’ comments.
Now comes the question. If we give our users an interface friendly enough even for first-timers, can we expect more quality interaction? My personal answer is that it depends on how well we implement the website’s growth strategy. The first thing that deserves thorough attention is the content. “Content is king” is an undisputed truth that tops the list of strategies for website growth. I have been analyzing the growth of this technology blog by comparing the number of posts in a given month with the incoming traffic (user visits) for the same month. I found two cases where traffic was higher than average. The first is a month with one or more quality posts that drew users’ attention, regardless of the total number of posts that month. The second is a month with more posts than the monthly average, regardless of their quality.
It is interesting to analyze the second case. The finding tempted me to think that incoming traffic is merely a numbers game: the more we publish, the more people will come to see what is happening. Yet I haven’t verified the presumption that people will keep coming even if I fill the whole content landscape with junk. Another experiment would be needed to test the numbers-game hypothesis. I have left it unanswered until now, since I am more interested in showing people something good than in littering my backyard with trash.
Having found what attracts users, it is important to ask who those users are and what they bring to the table. Since we are aiming at quality interaction, we target real users with interest or experience related to the posted content, and nothing else. In the blogging scenario, this translates into constructive comments from real users that eventually improve the overall quality of the content. Unsolicited, irrelevant comments need to be removed to keep the discussion on track with the topic of the content.
It is questionable whether adding something irrelevant is in users’ interest. However irrelevant a comment is, there must be some motive behind it. Earlier, I found that incentives trigger user participation. A random user who lands on a piece of content by accident gets little incentive to comment on it if reading it does not help him or her solve a problem. Even helpful content, though, may attract few comments if it is presented as flat factual exposition rather than as a mind-challenging, open-ended, or even controversial argument. Controversial content draws users’ attention by nature. It also gives users more incentive to comment, since they can expect other users to argue with their standpoint and to offer perspectives that do not intersect with their own.
The prevalence of zombie users
Surprisingly, I found a class of users that seeks a different incentive for interacting. Whether they are real users is debatable, since the way they start an interaction is rather awkward and unconvincing. I call them zombie users, since their existence haunts site owners and their spam comments create extra work to sterilize the content from their traces. Besides irrelevant, cryptic, or generic statements, they usually put external link bait in the comment. The motive is not always obvious. An external link can lead to anything, good or bad, and unfortunately it is bad most of the time: unsolicited marketing, phishing, viruses, backlink creation, SEO, or other hidden purposes. If website or blog owners approve this kind of interaction, they are knowingly allowing the quality of their content to degrade. Displaying such comments may create a temporary impression of actively discussed content, but in the long term it backfires on the owner because of the harm brought by the external links and the sheer irrelevance of the comments to the content.
How persistent are these zombie users in pestering blog owners with their spamming attempts? I conducted an experiment to see how the frequency of spamming attempts varies with the commenting policy. The following figures show how different commenting policies result in different volumes of comment spamming attempts.
This figure shows an immediate spike in spam comments after the policy changed from a human-challenge test to automatic comment approval. The spike is marked by the blue circle. The points before the spike reflect less successful spamming attempts, because the human challenge prevented the majority of automated spam comments from being submitted to the system. The human-challenge test embedded in the comment form was a simple arithmetic question, easy enough for a fifth-grader to solve but difficult for spam bots without sophisticated AI.
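A math-based challenge of the kind described here could be sketched roughly as follows. This is only an illustration and not the plugin actually used on this blog; the function names `make_challenge` and `verify` and the choice of single-digit addition are assumptions.

```python
import random


def make_challenge(rng):
    """Return a question string and its expected answer for a simple sum."""
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"What is {a} + {b}?", a + b


def verify(expected, submitted):
    """Accept the comment only if the submitted text parses to the expected sum."""
    try:
        return int(submitted.strip()) == expected
    except ValueError:
        # Non-numeric input is treated as a failed challenge.
        return False


rng = random.Random(42)  # seeded only to make the demo repeatable
question, answer = make_challenge(rng)
print(question)
print(verify(answer, str(answer)))     # a correct answer passes
print(verify(answer, "not a number"))  # a non-numeric answer fails
```

In a real comment form the expected answer would have to be kept on the server side (for example in the session) rather than sent to the browser, otherwise a bot could simply read it.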
The picture also shows that as soon as automatic comment approval was enabled, the number of spam comments surged, even faster than expected. In terms of user interactivity, this is the trade-off between the hope for increased user interaction and the uninvited participation of lurking spam bots. The immediate surge also suggests that an automated spam attack is always impending. Given a loose policy that lets spam comments into a website, the inflow of spam tends to be higher than anticipated.
Despite the growing inflow of spam, the comments were not displayed automatically. One nice feature of Akismet, the spam-filtering plugin used on this tech blog, is its ability to hold suspicious comments back from being published. While the loose comment policy was in place, the number of pending comments grew tremendously compared with the previous period of tighter policy. Manual moderation was needed to pick out the false positives and discard the rest of the list. Still, I found that the majority of the pending comments were spam.
The figure above shows the spam comment trend after the comment approval policy was restored to normal, that is, after a human-challenge test was re-enabled for verification. To compare the impact of different human-challenge tests, however, this time a CAPTCHA image was used to filter comments from zombie users. Suffice it to say that the policy change took effect immediately: the number of caught spam comments dropped at once. Only two local maxima were spotted after CAPTCHA was implemented. Both are around half of the maximum recorded during the period of loose comment policy, which shows the effectiveness of the policy change.
Surprisingly, interesting phenomena were observed over the extended period of the CAPTCHA implementation. The graph above covers the whole duration of the spam comment observation and is divided into three sections. The first section is the period of math-based human-challenge verification. The second section, bounded by the vertical blue lines, is the period of no verification. The rightmost section is the period of CAPTCHA-based human-challenge verification. The horizontal green line marks the maximum number of spam comments during the math-based period. Earlier, we discussed the effect of CAPTCHA in reducing the number of spam comments, yet this picture may suggest something else. The pink circles mark the days in the CAPTCHA period when spam comments exceeded the maximum of the math-based period. Of the three circles, two have values higher than the maximum of the no-verification period. This raises the question: is CAPTCHA really effective at keeping zombie users from posting spam comments?
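The comparison behind the pink circles, days in one period that exceed the maximum of another period, can be expressed in a few lines. The sketch below uses made-up daily counts purely for illustration; they are not the numbers recorded on this blog, and the function name `days_exceeding` is an assumption.

```python
def days_exceeding(counts, baseline_counts):
    """Return (day_index, count) pairs in counts that exceed the baseline maximum."""
    threshold = max(baseline_counts)  # plays the role of the horizontal green line
    return [(i, c) for i, c in enumerate(counts) if c > threshold]


# Made-up daily spam counts, for illustration only (not the blog's data).
math_period = [3, 5, 2, 6, 4]
captcha_period = [1, 2, 9, 3, 7, 2]

print(days_exceeding(captcha_period, math_period))  # [(2, 9), (4, 7)]
```

The same function can be reused with the no-verification period as the baseline to find which of those days also exceed its maximum.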
To answer this question, I rechecked the number of pending comments and the CAPTCHA configuration. The CAPTCHA difficulty was set to easy. With advances in image recognition, a spam bot equipped with sufficient AI can decode the text drawn in a CAPTCHA image, so the chance of non-human zombie users passing the human-challenge test is not negligible. Once a spam comment is successfully posted, it is checked again by Akismet and marked as pending if its content is suspicious. Contrary to my expectation, the number of pending comments was smaller than during the period of loose commenting policy. Why, then, was a higher number of spam comments detected? My hypothesis is that most of the comments that passed the image verification were detected as spam by Akismet and discarded immediately.
Having experimented with three different commenting policies, I have a better understanding of how to predict user behavior, both human and zombie. The loose commenting policy is so far the worst at discouraging irrelevant comments: it attracts automated comment posting and creates extra spam-cleaning work. Stricter commenting policies, both math-based and CAPTCHA-based, are good at reducing spamming attempts. In the past, the math-based human-challenge test was effective in suppressing spam attempts, yet it also discouraged real users from solving the calculation and posting their comments. Preliminary results show that CAPTCHA still attracts spam bots that come and break the challenge. More observation and analysis are needed to measure the effectiveness of this approach in the long run.
Spam filtering techniques
There are several techniques for steering spam away from a website. According to Heymann et al., anti-spam strategies can be classified into three groups: identification-based (detection), rank-based (demotion), and interface- or limit-based (prevention). Sample schemes for each group can be seen in the following picture.
An identification-based strategy handles spam in two phases: detecting spam-like objects and removing the objects identified as spam. Spam identification can be achieved through source analysis, text analysis, or link or behavior analysis. Akismet is an example of an anti-spam solution in this category.
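To make the two phases concrete, here is a toy sketch that combines a crude text analysis (generic phrases) with a crude link analysis (counting URLs). It is not how Akismet works; the phrase list, the `max_links` threshold, and the function names are all assumptions chosen for illustration.

```python
import re

# Assumed examples of generic filler phrases; a real filter would need far more.
GENERIC_PHRASES = ("great post", "nice blog", "thanks for sharing")
LINK_PATTERN = re.compile(r"https?://", re.IGNORECASE)


def looks_like_spam(comment, max_links=1):
    """Detection phase: flag too many links, or a generic phrase plus a link."""
    links = len(LINK_PATTERN.findall(comment))
    text = comment.lower()
    generic = any(phrase in text for phrase in GENERIC_PHRASES)
    return links > max_links or (generic and links > 0)


def remove_spam(comments):
    """Removal phase: keep only the comments that were not flagged."""
    return [c for c in comments if not looks_like_spam(c)]


comments = [
    "The math challenge blocked my screen reader; could you add an audio option?",
    "Great post! Visit http://example.com for cheap watches",
    "See http://example.com/a and http://example.com/b",
]
print(remove_spam(comments))  # only the first comment survives
```

A heuristic this simple will produce false positives, which is why held comments still need the kind of manual moderation mentioned earlier.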
A rank-based anti-spam strategy combats spam by ranking objects and showing only a smaller portion of the results. The lower an object ranks in the list, the more likely it is to be spam. However, a rank-based strategy may not fit blog commenting, since it is sometimes difficult to decide on the metrics that give a comment a higher rank unless human assistance, such as comment ratings by other users or a like button, is also provided.
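If reader votes are available, the demotion idea could look something like the sketch below. The `votes` field, the `keep_fraction` parameter, and the sample comments are hypothetical; the point is only that low-ranked comments are hidden rather than deleted.

```python
def visible_comments(comments, keep_fraction=0.5):
    """Sort by vote score (highest first) and return only the top fraction."""
    ranked = sorted(comments, key=lambda c: c["votes"], reverse=True)
    keep = max(1, int(len(ranked) * keep_fraction))  # always show at least one
    return ranked[:keep]


# Hypothetical comments with net reader votes.
comments = [
    {"text": "Detailed counter-argument", "votes": 5},
    {"text": "Buy cheap watches", "votes": -2},
    {"text": "Useful follow-up question", "votes": 3},
    {"text": "Nice blog", "votes": 0},
]

for comment in visible_comments(comments):
    print(comment["votes"], comment["text"])
# 5 Detailed counter-argument
# 3 Useful follow-up question
```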
CAPTCHA is an example of an interface-based or limit-based anti-spam strategy. An interface-based strategy primarily tries to make producing and sending spam objects more difficult by modifying the interface through which users interact with the system. The modification commonly takes the form of partial secrecy or a limitation on automated interaction. A limit-based strategy takes a different approach: it emphasizes limiting user actions, thereby hindering the generation of spam. Examples of limit-based strategies are content flood control (new content from the same user may be posted only after a certain interval) and content poster control (content can be posted only by registered users).
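The two limit-based examples, flood control and poster control, can be combined in a small in-memory sketch like the one below. The class name `CommentLimiter`, the 30-second interval, and the user names are assumptions; a real site would keep this state in its database or cache and take the current time from the clock.

```python
class CommentLimiter:
    """Allow a comment only from registered users who waited long enough."""

    def __init__(self, min_interval_seconds, registered_users):
        self.min_interval = min_interval_seconds
        self.registered = set(registered_users)
        self.last_post = {}  # user -> timestamp of last accepted comment

    def allow(self, user, now):
        # Poster control: unregistered users may not post at all.
        if user not in self.registered:
            return False
        # Flood control: reject if the previous accepted post is too recent.
        last = self.last_post.get(user)
        if last is not None and now - last < self.min_interval:
            return False
        self.last_post[user] = now
        return True


limiter = CommentLimiter(min_interval_seconds=30, registered_users=["alice"])
print(limiter.allow("alice", now=0))    # True: first comment
print(limiter.allow("alice", now=10))   # False: only 10 seconds later
print(limiter.allow("alice", now=40))   # True: 40 seconds after the last accepted one
print(limiter.allow("mallory", now=0))  # False: not registered
```

Timestamps are passed in explicitly here so the behavior is easy to follow; note that a rejected attempt does not reset the waiting interval.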
In the end, it is up to the website or blog owner to implement the most appropriate commenting policy. A policy that is too strict will drive users away from the dynamic interaction the website facilitates. A loose policy, on the other hand, comes at an extra cost: an increasing volume of irrelevant interaction and the cleaning work that follows. A careful analysis of the website’s dynamics may help owners plan a strategy that is both friendly to users and safe from unwanted intruders.
P. Heymann, G. Koutrika, H. Garcia-Molina, Fighting Spam on Social Websites: A Survey of Approaches and Future Challenges, IEEE Internet Computing 11(6):36-45, 2007. http://ilpubs.stanford.edu:8090/818/1/2007-34.pdf