Learning to Perform Moderation in Online Forums
Sponsored by:
National Science Foundation,
Division of Information & Intelligent Systems
Principal Investigator: Shlomo Zilberstein
Research Assistants: Andrew Arnt, Marek Petrik, Martin Allen
Project Description
This project is aimed at automating the moderation of Internet discussion forums. Discussion forums are web applications that allow users to post messages and hold discussions. They offer a valuable resource for people looking to find information, discuss ideas, and get advice on the Internet.
The number of forums continues to grow rapidly, covering such topics as politics, technical news and advice, medical issues, and product ratings and opinions. Consider, for example, Slashdot.org, a popular website featuring science and technology news that serves over 30 million pages per month. The editors of Slashdot post about a dozen stories a day. These stories can include news articles, book and movie reviews, interviews with people from the tech world, announcements of new software releases, and so on. Each story serves as a topic. After a story is posted, readers of the site are encouraged to add their own comments on the story, which are appended to the bottom of the page for future readers to read and possibly reply to. As the site has grown in popularity, the number of comments made on each story has grown as well. Topics now average several hundred comments each, as opposed to just a few dozen a few years ago. Furthermore, as the number of comments grew, the fraction that could be classified as high-value comments decreased. To combat this phenomenon and make the comments more readable, the editors of Slashdot developed a system of moderation.
In the current Slashdot moderation system, each logged-in user is occasionally and randomly issued a small number of "moderation points". With each point, the reader/moderator can take away or add a point to the moderation score of a single comment. The moderation score of a comment can range from -1 to 5. A user interface was developed so that a threshold score can be specified by the reader such that all comments with moderation scores less than that threshold are not displayed. This system has the advantage that while user content is never deleted, it is still possible to set the threshold so as to filter out comments that have been negatively moderated.
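The scoring and filtering rules described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Slashdot's actual implementation: the names `Comment`, `moderate`, and `visible_comments` are hypothetical, and clamping at the ends of the score range is an assumption about how a moderation that would leave the range is handled.

```python
from dataclasses import dataclass

MIN_SCORE, MAX_SCORE = -1, 5  # score range stated in the text


@dataclass
class Comment:
    text: str
    score: int


def moderate(comment: Comment, delta: int) -> None:
    """Spend one moderation point: add (+1) or take away (-1) a point."""
    if delta not in (-1, 1):
        raise ValueError("a moderation point changes the score by exactly one")
    # Assumption: the score is clamped so it never leaves the allowed range.
    comment.score = max(MIN_SCORE, min(MAX_SCORE, comment.score + delta))


def visible_comments(comments, threshold):
    """Return comments at or above the reader's threshold; nothing is deleted."""
    return [c for c in comments if c.score >= threshold]
```

A reader who sets the threshold to 1 would, under this sketch, see only comments whose current score is 1 or higher, while the hidden comments remain stored and reappear if the threshold is lowered.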
In our study, the value of a comment is based on the final moderation score it achieves. Comments with a score of -1 were classified as low value, those with scores from 0 to 2 as undetermined, and those with scores from 3 to 5 as high value. The reason for this division is that in the Slashdot moderation system, a score of -1 can be reached only through negative moderation or when the poster is a known abuser of the system. Posts by anonymous users start with a default score of 0, posts by logged-in users start at 1, and users who have made many high-scoring comments in the past can occasionally start their posts at 2. The undetermined label therefore reflects the fact that a comment has received very little, or perhaps mixed, moderation, while a score of 3 or more can be achieved only through positive moderation.
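The labeling scheme just described amounts to a simple mapping from final score to class. A minimal sketch follows; the function name `label_comment` and the label strings are hypothetical, chosen here for illustration.

```python
def label_comment(final_score: int) -> str:
    """Map a final moderation score (-1..5) to the value class used in the study."""
    if not -1 <= final_score <= 5:
        raise ValueError("moderation scores range from -1 to 5")
    if final_score == -1:
        return "low"           # reached only via negative moderation or known abusers
    if final_score <= 2:
        return "undetermined"  # scores 0 to 2: little or mixed moderation
    return "high"              # scores 3 to 5: positively moderated
```

For example, a logged-in user's comment that is never moderated keeps its starting score of 1 and is labeled undetermined rather than low or high.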
The main motivation for this project is that the moderation systems in many active forums suffer from a sparsity problem. Our study has shown that at peak levels of moderation, when a story has just been posted for discussion, just 60 percent of all comments posted will ever receive any kind of moderation. By the time the story is two hours old, fewer than 20 percent of comments are being moderated.
The main goal of this project has been to develop a framework for learning to perform machine moderation by finding patterns in the moderations made by humans. We have broken this overall goal into several key research questions:
- Feature identification and extraction:
What are the features that define a good or bad comment? How could these features be extracted automatically and efficiently? What impact do these features have on the overall quality of the moderation?
- High-precision comment classification:
How can we build a probabilistic model to predict with high accuracy the moderation that should be assigned to an arbitrary comment? How can a classifier be trained efficiently to improve its prediction based on examples?
- Real-time operation:
How can we address the computational complexity of the classifier and make it suitable for real-time operation?
- Generalizability and knowledge transfer:
How does a classifier trained in one forum perform in a different (possibly unmoderated) forum? Could the knowledge acquired through training be applied to different but similar forums? Could it be applied to new forums that have a somewhat different organizational structure or discussion topics?
Related Publications
- Learning Parallel Portfolios of Algorithms.
M. Petrik and S. Zilberstein. Annals of Mathematics and Artificial Intelligence, 48(1-2):85-106, 2006. [abs] [bib] [pdf]
- Web Page Clustering using Heuristic Search in the Web Graph.
R. Bekkerman, S. Zilberstein, and J. Allan. Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI), 2280-2285, Hyderabad, India, 2007. [abs] [bib] [pdf]
- Learning Static Parallel Portfolios of Algorithms.
M. Petrik and S. Zilberstein. Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics (ISAIM), Ft. Lauderdale, Florida, 2006. [abs] [bib] [pdf]
- Learning Policies for Sequential Time and Cost Sensitive Classification.
A. Arnt and S. Zilberstein. KDD Workshop on Utility-Based Data Mining, Chicago, Illinois, 2005. [abs] [bib] [pdf]
- Generating Admissible Heuristics by Abstraction for Search in Stochastic Domains.
N. Beliaeva and S. Zilberstein. Proceedings of the Symposium on Abstraction, Reformulation, and Approximation (SARA), 14-29, Airth Castle, Scotland, 2005. [abs] [bib] [pdf]
- Attribute Measurement Policies for Cost-Effective Classification.
A. Arnt and S. Zilberstein. SIAM/SDM Workshop on Data Mining in Resource Constrained Environments, Lake Buena Vista, Florida, 2004. [abs] [bib] [pdf]
- Attribute Measurement Policies for Cost-Effective Classification.
A. Arnt and S. Zilberstein. Proceedings of the Fourth IEEE International Conference on Data Mining, Brighton, UK, 2004. [abs] [bib] [pdf]
- Learning to Perform Moderation in Online Forums.
A. Arnt and S. Zilberstein. Proceedings of Web Intelligence, 637-641, Acapulco, Mexico, 2003. [abs] [bib] [pdf]