Proceedings

Proceedings of the 4th international workshop on Adversarial Information Retrieval on the Web, Beijing, China, April 22nd, 2008. ISBN 978-1-60558-159-0, ACM ICPS.

Session: Usage Analysis

A Large-scale Study of Automated Web Search Traffic
Greg Buehrer, Jack Stokes and Kumar Chellapilla

Pages 1-8, doi.acm.org/10.1145/1451983.1451985

As web search providers seek to improve both relevance and response times, they are challenged by the ever-increasing tax of automated search query traffic. Third party systems interact with search engines for a variety of reasons, such as monitoring a website's rank, augmenting online games, or maliciously altering click-through rates. In this paper, we investigate automated traffic in the query stream of a large search engine provider. We define automated traffic as any search query not generated by a human in real time. We first provide examples of different categories of query logs generated by bots. We then develop many different features that distinguish between queries generated by people searching for information and those generated by automated processes. We categorize these features into two classes: interpretations of the physical model of human interactions, and behavioral patterns of automated interactions. We believe these features form a basis for a production-level query stream classifier.
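The abstract does not enumerate the concrete features, but a minimal sketch of the kind of physical-model signals it alludes to (query rate, inter-query timing, click-through rate per session) could look like the following; the feature names and the example session are illustrative assumptions, not the authors' production classifier.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Query:
    timestamp: float   # seconds since session start
    clicked: bool      # whether any result was clicked

def session_features(queries: list[Query]) -> dict:
    """Toy physical-model features for one search session (illustrative only)."""
    times = [q.timestamp for q in queries]
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    return {
        "queries_per_minute": 60.0 * len(queries) / max(times[-1] - times[0], 1.0),
        "mean_inter_query_gap": mean(gaps),
        "min_inter_query_gap": min(gaps),   # humans rarely issue queries less than ~1 s apart
        "click_through_rate": sum(q.clicked for q in queries) / len(queries),
    }

# A session of 100 queries fired exactly one second apart with no clicks looks automated.
bot_like = [Query(timestamp=float(i), clicked=False) for i in range(100)]
print(session_features(bot_like))
```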

Identifying Web Spam with User Behavior Analysis
Yiqun Liu, Rongwei Cen, Min Zhang, Liyun Ru and Shaoping Ma

Pages 9-16, doi.acm.org/10.1145/1451983.1451986

Combating Web spam has become one of the top challenges for Web search engines. State-of-the-art spam detection techniques are usually designed for specific, known types of Web spam and are ineffective against newly appearing spam. By analyzing user behavior in Web access logs, we propose a spam page detection algorithm based on Bayesian learning. The main contributions of our work are: (1) user visiting patterns of spam pages are studied, and three user behavior features are proposed to separate spam pages from ordinary ones; (2) a novel spam detection framework is proposed that can detect unknown spam types and newly appearing spam with the help of user behavior analysis. Preliminary experiments on large-scale Web access log data (containing over 2.74 billion user clicks) show the effectiveness of the proposed features and detection framework.
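The abstract names Bayesian learning over user-behavior features without giving the model details; as a hedged illustration only, a Gaussian naive Bayes classifier over hypothetical per-page behavior statistics could be wired up as follows (the three features below are assumptions, not the paper's feature set).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical per-page user-behavior features derived from access logs:
# [fraction of visits arriving from search result pages,
#  mean dwell time in seconds,
#  fraction of visits that immediately leave the site again]
X_train = np.array([
    [0.95,  3.0, 0.90],   # spam-like: almost all traffic from search, very short visits
    [0.30, 45.0, 0.20],   # ordinary page
    [0.88,  5.0, 0.85],
    [0.25, 60.0, 0.15],
])
y_train = np.array([1, 0, 1, 0])   # 1 = spam, 0 = ordinary

clf = GaussianNB().fit(X_train, y_train)
print(clf.predict_proba([[0.9, 4.0, 0.8]]))   # posterior [P(ordinary), P(spam)]
```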

Query-log mining for detecting spam
Carlos Castillo, Claudio Corsi, Debora Donato, Paolo Ferragina and Aristides Gionis

Pages 17-20, doi.acm.org/10.1145/1451983.1451987

Every day millions of users search for information on the web via search engines, and provide implicit feedback to the results shown for their queries by clicking (or not) on them. This feedback is encoded in the form of a query log that consists of a sequence of search actions, one per user query, each describing the following information: (i) terms composing a query, (ii) documents returned by the search engine, (iii) documents that have been clicked, (iv) the rank of those documents in the list of results, (v) date and time of the search action/click, (vi) an anonymous identifier for each session, and more.

In this work, we investigate the idea of characterizing the documents and the queries belonging to a given query log with the goal of improving algorithms for detecting spam, both at the document level and at the query level.
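The six fields listed in the abstract map directly onto a per-action record; a minimal sketch of such a record (field names are my own, not the paper's schema) is:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SearchAction:
    """One query-log entry, following fields (i)-(vi) in the abstract."""
    terms: list[str]            # (i)   terms composing the query
    returned_docs: list[str]    # (ii)  documents returned by the search engine
    clicked_docs: list[str]     # (iii) documents that were clicked
    clicked_ranks: list[int]    # (iv)  rank of each clicked document in the results
    timestamp: datetime         # (v)   date and time of the search action/click
    session_id: str             # (vi)  anonymous identifier for the session
```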

Session: Text Analysis

Cleaning Search Results using Term Distance Features
Josh Attenberg and Torsten Suel

Pages 21-24, doi.acm.org/10.1145/1451983.1451989

The presence of Web spam in query results is one of the critical challenges facing search engines today. While search engines try to combat the impact of spam pages on their results, the incentive for spammers to use increasingly sophisticated techniques has never been higher, since the commercial success of a Web page is strongly correlated to the number of views that page receives. This paper describes a term-based technique for spam detection based on a simple new summary data structure, called Term Distance Histograms, that tries to capture the topical structure of a page. We apply this technique as a post-filtering step to a major search engine. Our experiments show that we are able to detect many of the artificially generated spam pages that remained in the results of the engine. Specifically, our method is able to detect many web pages generated by utilizing techniques such as dumping, weaving, or phrase stitching [11], which are spamming techniques designed to achieve high rankings while still exhibiting many of the individual word frequency (and even bi-gram) properties of natural human text.
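The abstract does not define the histogram precisely; one plausible reading, used here purely as an assumption, is a histogram of positional gaps between successive occurrences of the same term, which tends to look different for stitched-together spam text than for topically coherent prose.

```python
def term_gap_histogram(tokens: list[str], num_bins: int = 8, max_gap: int = 64) -> list[int]:
    """Histogram of positional gaps between successive occurrences of the same term.

    Only one plausible reading of a 'term distance' summary, not necessarily the
    paper's exact Term Distance Histogram definition.
    """
    last_pos: dict[str, int] = {}
    hist = [0] * num_bins
    for pos, tok in enumerate(tokens):
        if tok in last_pos:
            gap = min(pos - last_pos[tok], max_gap)   # clamp long gaps into the last bin
            hist[(gap - 1) * num_bins // max_gap] += 1
        last_pos[tok] = pos
    return hist

print(term_gap_histogram("the cat sat on the mat and the cat slept".split()))
```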

Exploring Linguistic Features for Web Spam Detection: A Preliminary Study
Jakub Piskorski, Marcin Sydow and Dawid Weiss

Pages 25-28, doi.acm.org/10.1145/1451983.1451990

We study the usability of linguistic features in the Web spam classification task. The features were computed on two Web spam corpora, Webspam-Uk2006 and Webspam-Uk2007, and we make them publicly available for other researchers. Preliminary analysis seems to indicate that certain linguistic features may be useful for the spam-detection task when combined with features studied elsewhere.
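The specific linguistic features are not listed in the abstract; the sketch below shows the kind of shallow lexical statistics such a study might compute, and these particular features are assumptions rather than the authors' list.

```python
def shallow_linguistic_features(text: str) -> dict:
    """Surface-level lexical statistics of the kind often tried for spam detection."""
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return {
        "avg_word_length": sum(len(w) for w in words) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        "type_token_ratio": len({w.lower() for w in words}) / max(len(words), 1),
    }

print(shallow_linguistic_features("Cheap pills cheap pills. Buy cheap pills now!"))
```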

Latent Dirichlet Allocation in Web Spam Filtering
Istvan Biro, Jacint Szabo and Andras Benczur

Pages 29-32, doi.acm.org/10.1145/1451983.1451991

Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply a modification of LDA, the novel multi-corpus LDA technique, to web spam classification. We create a bag-of-words document for every Web site and run LDA both on the corpus of sites labeled as spam and on the corpus labeled as non-spam. In this way, collections of spam and non-spam topics are created in the training phase. In the test phase we take the union of these collections, and an unseen site is deemed spam if its total spam topic probability is above a threshold. As far as we know, this is the first web retrieval application of LDA. We test this method on the UK2007-WEBSPAM corpus, and reach a relative improvement of 11% in F-measure by a logistic regression based combination with strong link and content baseline classifiers.
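As a hedged sketch of the thresholding idea (not the authors' multi-corpus LDA implementation), one could fit separate off-the-shelf LDA models on spam and non-spam sites and score an unseen site by how much better the spam-side model explains it.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

spam_sites = ["cheap pills casino pills casino", "casino bonus casino pills"]
ham_sites = ["university research seminar schedule", "research papers and lecture notes"]

vec = CountVectorizer().fit(spam_sites + ham_sites)
lda_spam = LatentDirichletAllocation(n_components=2, random_state=0).fit(vec.transform(spam_sites))
lda_ham = LatentDirichletAllocation(n_components=2, random_state=0).fit(vec.transform(ham_sites))

def spam_score(site_text: str) -> float:
    """Likelihood ratio under the two topic models, squashed to (0, 1).

    A crude proxy for the paper's 'total spam topic probability', assumed here
    for illustration only.
    """
    x = vec.transform([site_text])
    return float(1.0 / (1.0 + np.exp(lda_ham.score(x) - lda_spam.score(x))))

print(spam_score("casino pills bonus"))   # flag as spam if the score exceeds a threshold
```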

Session: General

Analysing Features of Japanese Splogs and Characteristics of Keywords
Yuuki Sato, Takehito Utsuro, Tomohiro Fukuhara, Yasuhide Kawada, Yoshiaki Murakami, Hiroshi Nakagawa and Noriko Kando

Pages 33-40, doi.acm.org/10.1145/1451983.1451993

This paper focuses on analyzing (Japanese) splogs based on various characteristics of keywords contained in them. We estimate the behavior of spammers when creating splogs from other sources by analyzing the characteristics of keywords contained in splogs. Since splogs often introduce noise into word occurrence statistics in the blogosphere, we assume that we can efficiently (manually) collect splogs by sampling blog homepages containing keywords of a certain type on the date of their most frequent occurrence. We manually examine various features of the collected blog homepages, including whether their text content is excerpted from other sources and whether they display affiliate advertisements or out-going links to affiliated sites. Among various informative results, it is important to note that more than half of the collected splogs are created by a very small number of spammers.

Webspam Identification Through Content and Hyperlinks
Jacob Abernethy, Olivier Chapelle and Carlos Castillo

Pages 41-44, doi.acm.org/10.1145/1451983.1451994

We present an algorithm, WITCH, that learns to detect spam hosts or pages on the Web. Unlike most other approaches, it simultaneously exploits the structure of the Web graph as well as page contents and features. The method is efficient, scalable, and provides state-of-the-art accuracy on a standard Web spam benchmark.
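The abstract gives no formulation, so the toy sketch below only illustrates the general idea of combining a content-based linear classifier with a penalty that pushes linked hosts toward similar predictions; the squared-loss objective and plain gradient descent are my simplifications, not the WITCH algorithm itself.

```python
import numpy as np

def train_graph_regularized(X, y, edges, lam_graph=0.5, lam_w=0.01, lr=0.05, steps=500):
    """Minimize squared loss on labeled hosts + lam_w * ||w||^2
    + lam_graph * sum over edges (i, j) of (f_i - f_j)^2, where f = X @ w."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        f = X @ w
        grad = X.T @ (f - y) / n + 2 * lam_w * w
        for i, j in edges:                       # graph smoothness term
            grad += 2 * lam_graph * (f[i] - f[j]) * (X[i] - X[j])
        w -= lr * grad
    return w

# Tiny example: 4 hosts described by 2 content features, with hyperlinks (0,1) and (2,3).
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 0.9]])
y = np.array([1.0, 1.0, -1.0, -1.0])             # +1 spam, -1 non-spam
w = train_graph_regularized(X, y, edges=[(0, 1), (2, 3)])
print(np.sign(X @ w))
```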

Session: Social Networks

Identifying Video Spammers in Online Social Networks
Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida, Chao Zhang and Keith Ross

Pages 45-52, doi.acm.org/10.1145/1451983.1451996

In many video social networks, including YouTube, users are permitted to post video responses to other users' videos. Such a response can be legitimate or can be video response spam, i.e., a video response whose content is not related to the topic being discussed. Malicious users may post video response spam for several reasons, including increasing the popularity of a video, marketing advertisements, distributing pornography, or simply polluting the system.

In this paper we consider the problem of detecting video spammers. We first construct a large test collection of YouTube users, and manually classify them as either legitimate users or spammers. We then devise a number of attributes of video users and their social behavior which could potentially be used to detect spammers. Employing these attributes, we apply machine learning to provide a heuristic for classifying an arbitrary video as either legitimate or spam. The machine learning algorithm is trained with our test collection. We then show that our approach succeeds at detecting much of the spam while only falsely classifying a small percentage of the legitimate videos as spam. Our results highlight the most important attributes for video response spam detection.

A Few Bad Votes Too Many? Towards Robust Ranking in Social Media
Eugene Agichtein, Jiang Bian, Yandong Liu, and Hongyuan Zha

Pages 53-60, doi.acm.org/10.1145/1451983.1451997

Online social media draws heavily on active reader participation, such as voting or rating of news stories, articles, or responses to a question. This user feedback is invaluable for ranking, filtering, and retrieving high quality content - tasks that are crucial with the explosive amount of social content on the web. Unfortunately, as social media moves into the mainstream and gains in popularity, the quality of the user feedback degrades. Some of this is due to noise, but, increasingly, a small fraction of malicious users are trying to "game the system" by selectively promoting or demoting content for profit or fun. Hence, an effective ranking of social media content must be robust to noise in the user interactions, and in particular to vote spam. We describe a machine learning based ranking framework for social media that integrates user interactions and content relevance, and demonstrate its effectiveness for answer retrieval in a popular community question answering portal. We consider several vote spam attacks, and introduce a method of training our ranker to increase its robustness to some common forms of vote spam attacks. The results of our large-scale experimental evaluation show that our ranker is significantly more robust to vote spam compared to a state-of-the-art baseline as well as to a ranker not explicitly trained to handle malicious interactions.
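The attack models are only named in the abstract, so the snippet below assumes one simple promotion attack (a spammer adds a batch of fake positive votes to their own answers) to show how simulated vote spam could be injected into training data.

```python
import random

def inject_vote_spam(votes, spammer_answers, k=20, seed=0):
    """Add roughly k fake positive votes to each of a spammer's answers.

    `votes` maps answer_id -> (positive_votes, negative_votes).  Models only one
    simple promotion attack; the paper considers several attack types.
    """
    rng = random.Random(seed)
    noisy = dict(votes)
    for ans in spammer_answers:
        pos, neg = noisy.get(ans, (0, 0))
        noisy[ans] = (pos + k + rng.randint(0, 5), neg)
    return noisy

clean = {"a1": (12, 1), "a2": (3, 0), "a3": (1, 4)}
print(inject_vote_spam(clean, spammer_answers=["a3"]))
```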

The Anti-Social Tagger - Detecting Spam in Social Bookmarking Systems
Beate Krause, Christoph Schmitz, Andreas Hotho and Gerd Stumme

Pages 61-68, doi.acm.org/10.1145/1451983.1451998

The annotation of web sites in social bookmarking systems has become a popular way to manage and find information on the web. The community structure of such systems attracts spammers: recent post pages, popular pages, or specific tag pages can be manipulated easily. As a result, searching or tracking recent posts does not deliver quality results annotated in the community, but rather unsolicited, often commercial, web sites. To retain the benefits of sharing one's web content, spam-fighting mechanisms that can counter the flexible strategies of spammers need to be developed.

A classical approach in machine learning is to determine relevant features that describe the system's users, train different classifiers with the selected features and choose the one with the most promising evaluation results. In this paper we will transfer this approach to a social bookmarking setting to identify spammers. We will present features considering the topological, semantic and profile-based information which people make public when using the system. The dataset used is a snapshot of the social bookmarking system BibSonomy and was built over the course of several months when cleaning the system from spam. Based on our features, we will learn a large set of different classification models and compare their performance. Our results represent the groundwork for a first application in BibSonomy and for the building of more elaborate spam detection mechanisms.
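As a hedged illustration of the "train several classifiers and pick the best" workflow described above, the features and models below are placeholders (synthetic data, not the BibSonomy feature set):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for topological, semantic, and profile-based user features.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(f"{name}: mean F1 = {scores.mean():.3f}")
```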

Session: Link Analysis

Robust PageRank and Locally Computable Spam Detection Features
Vahab Mirrokni, Reid Andersen, Christian Borgs, Jennifer Chayes, John Hopcroft, Kamal Jain and Shang-Hua Teng

Pages 69-76, doi.acm.org/10.1145/1451983.1452000

Since the link structure of the web is an important element in ranking systems on search engines, web spammers widely use the link structure of the web to increase the rank of their pages. Various link-based features of web pages have been introduced and have proven effective at identifying link spam. One particularly successful family of features (as described in the SpamRank algorithm) is based on examining the sets of pages that contribute most to the PageRank of a given vertex, called supporting sets. In a recent paper, the current authors described an algorithm for efficiently computing, for a single specified vertex, an approximation of its supporting sets. In this paper, we describe several link-based spam-detection features, both supervised and unsupervised, that can be derived from these approximate supporting sets. In particular, we examine the size of a node's supporting sets and the approximate l2 norm of the PageRank contributions from other nodes. As a supervised feature, we examine the composition of a node's supporting sets. We perform experiments on two labeled real data sets to demonstrate the effectiveness of these features for spam detection, and demonstrate that these features can be computed efficiently. Furthermore, we design a variation of PageRank (called Robust PageRank) that incorporates some of these features into its ranking, argue that this variation is more robust against link spam engineering, and give an algorithm for approximating Robust PageRank.
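A brute-force (non-local) sketch of the contribution-based features: since PageRank with a uniform restart is the average of personalized PageRanks, the contribution of node u to PageRank(v) can be read off a personalized PageRank restarted at u. The code below computes a delta-supporting set and the l2 norm of incoming contributions this way, purely for illustration; the paper's point is an efficient local approximation, which this sketch does not attempt.

```python
import numpy as np
import networkx as nx

def contribution_vector(G, v, alpha=0.85):
    """PageRank contributions to node v from every node u.

    Uses the identity PageRank(v) = (1/n) * sum_u ppr_u(v), so the contribution
    of u is ppr_u(v) / n.  One full personalized PageRank per node: brute force,
    for illustration only.
    """
    n = G.number_of_nodes()
    contrib = {}
    for u in G.nodes():
        ppr = nx.pagerank(G, alpha=alpha,
                          personalization={x: float(x == u) for x in G.nodes()})
        contrib[u] = ppr[v] / n
    return contrib

G = nx.gnp_random_graph(30, 0.15, directed=True, seed=0)
target = 0
c = contribution_vector(G, v=target)
pagerank_v = sum(c.values())
supporting_set = [u for u, val in c.items() if val >= 0.25 * pagerank_v]   # delta = 0.25
l2_contrib = float(np.sqrt(sum(val ** 2 for u, val in c.items() if u != target)))
print(len(supporting_set), l2_contrib)
```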