Proceedings of the 5th international workshop on Adversarial Information Retrieval on the Web, Madrid, Spain, April 21, 2009. The workshop proceedings were published as a volume in the ACM International Conference Proceedings with ISBN 978-1-60558-438-6.

Invited Talks

The Potential for Research and Development in Adversarial Information Retrieval — slides
Brian D. Davison

Web Spam Challenges: Looking Backward and Forward — slides
Carlos Castillo

Temporal Analysis

Looking into the Past to Better Classify Web Spamslides
Na Dai, Brian D. Davison and Xiaoguang Qi

Pages 1-8, Web spamming techniques aim to achieve undeserved rankings in search results. Research has been widely conducted on identifying such spam and neutralizing its influence. However, existing spam detection work only considers current information. We argue that historical web page information may also be important in spam classification. In this paper, we use content features from historical versions of web pages to improve spam classification. We use supervised learning techniques to combine classifiers based on current page content with classifiers based on temporal features. Experiments on the WEBSPAM-UK2007 dataset show that our approach improves spam classification F-measure performance by 30% compared to a baseline classifier which only considers current page content.

A Study of Link Farm Distribution and Evolution Using a Time Series of Web Snapshotsslides
Young-joo Chung, Masashi Toyoda and Masaru Kitsuregawa

Pages 9-16, In this paper, we study the overall link-based spam structure and its evolution which would be helpful for the development of robust analysis tools and research for Web spamming as a social activity in the cyber space. First, we use strongly connected component (SCC) decomposition to separate many link farms from the largest SCC, so called the core. We show that denser link farms in the core can be extracted by node filtering and recursive application of SCC decomposition to the core. Surprisingly, we can find new large link farms during each iteration and this trend continues until at least 10 iterations. In addition, we measure the spamicity of such link farms. Next, the evolution of link farms is examined over two years. Results show that almost all large link farms do not grow anymore while some of them shrink, and many large link farms are created in one year.

Web Spam Filtering in Internet Archivesslides
Miklós Erdélyi, András A. Benczúr, Julien Masanes and Dávid Siklósi

Pages 17-20, While Web spam is targeted for the high commercial value of topranked search-engine results, Web archives observe quality deterioration and resource waste as a side effect. So far Web spam filtering technologies are rarely used by Web archivists but planned in the future as indicated in a survey with responses from more than 20 institutions worldwide. These archives typically operate on a modest level of budget that prohibits the operation of standalone Web spam filtering but collaborative efforts could lead to a high quality solution for them. In this paper we illustrate spam filtering needs, opportunities and blockers for Internet archives via analyzing several crawl snapshots and the difficulty of migrating filter models across different crawls via the example of the 13 .uk snapshots performed by UbiCrawler that include WEBSPAM-UK2006 and WEBSPAM-UK2007.

Content Analysis

Web Spam Identification Through Language Model Analysisslides
Juan Martinez-Romo and Lourdes Araujo

Pages 21-28, This paper applies a language model approach to different sources of information extracted from a Web page, in order to provide high quality indicators in the detection of Web Spam. Two pages linked by a hyperlink should be topically related, even though this were a weak contextual relation. For this reason we have analysed different sources of information of a Web page that belongs to the context of a link and we have applied Kullback-Leibler divergence on them for characterising the relationship between two linked pages. Moreover, we combine some of these sources of information in order to obtain richer language models. Given the different nature of internal and external links, in our study we also distinguished these types of links getting a significant improvement in classification tasks. The result is a system that improves the detection of Web Spam on two large and public datasets such as WEBSPAM-UK2006 and WEBSPAM-UK2007.

An Empirical Study on Selective Sampling in Active Learning for Splog Detectionslides
Taichi Katayama, Takehito Utsuro, Yuuki Sato, Takayuki Yoshinaka, Yasuhide Kawada and Tomohiro Fukuhara

Pages 29-36, This paper studies how to reduce the amount of human supervision for identifying splogs / authentic blogs in the context of continuously updating splog data sets year by year. Following the previous works on active learning, against the task of splog / authentic blog detection, this paper empirically examines several strategies for selective sampling in active learning by Support Vector Machines (SVMs). As a confidence measure of SVMs learning, we employ the distance from the separating hyperplane to each test instance, which have been well studied in active learning for text classification. Unlike those results of applying active learning to text classification tasks, in the task of splog / authentic blog detection of this paper, it is not the case that adding least confident samples performs best.

Linked Latent Dirichlet Allocation in Web Spam Filteringslides
István Bíró, Dávid Siklósi, Jácint Szabó and András Benczúr

Pages 37-40, Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model on the content and topics of a corpus of documents. In this paper we apply an extension of LDA for web spam classification. Our linked LDA technique takes also linkage into account: topics are propagated along links in such a way that the linked document directly influences the words in the linking document. The inferred LDA model can be applied for classification as dimensionality reduction similarly to latent semantic indexing. We test linked LDA on the WEBSPAM-UK2007 corpus. By using BayesNet classifier, in terms of the AUC of classification, we achieve 3% improvement over plain LDA with BayesNet, and 8% over the public link features with C4.5. The addition of this method to a log-odds based combination of strong link and content baseline classifiers results in a 3% improvement in AUC. Our method even slightly improves over the best Web Spam Challenge 2008 result.

Social Spam

Social Spam Detectionslides
Benjamin Markines, Ciro Cattuto and Filippo Menczer

Pages 41-48, The popularity of social bookmarking sites has made them prime targets for spammers. Many of these systems require an administrator’s time and energy to manually filter or remove spam. Here we discuss the motivations of social spam, and present a study of automatic detection of spammers in a social tagging system. We identify and analyze six distinct features that address various properties of social spam, finding that each of these features provides for a helpful signal to discriminate spammers from legitimate users. These features are then used in various machine learning algorithms for classification, achieving over 98% accuracy in detecting social spammers with 2% false positives. These promising results provide a new baseline for future efforts on social spam. We make our dataset publicly available to the research community.

Tag Spam Creates Large Non-Giant Connected Componentsslides
Nicolas Neubauer, Robert Wetzker and Klaus Obermayer

Pages 49-52, Spammers in social bookmarking systems try to mimick bookmarking behaviour of real users to gain the attention of other users or search engines. Several methods have been proposed for the detection of such spam, including domain specific features (like URL terms) or similarity of users to previously identified spammers. However, as shown in our previous work, it is possible to identify a large fraction of spam users based on purely structural features. The hypergraph connecting documents, users, and tags can be decomposed into connected components, and any large, but non-giant components turned out to be almost entirely inhabited by spam users in the examined dataset. Here, we test to what degree the decomposition of the complete hypergraph is really necessary, examining the component structure of the induced user/document and user/tag graphs. While the user/tag graph's connectivity does not help in classifying spammers, the user/document graph's connectivity is already highly informative. It can however be augmented with connectivity information from the hypergraph. In our view, spam detection based on structural features, like the one proposed here, requires complex adaptation strategies from spammers and may complement other, more traditional detection approaches.

Spam Research Collections

Nullification Test Collections for Web Spam and SEOslides
Timothy Jones, David Hawking, Ramesh Sankaranarayana and Nick Craswell

Pages 53-60, Research in the area of adversarial information retrieval has been facilitated by the availability of the UK-2006/UK-2007 collections, comprising crawl data, link graph, and spam labels. However, research into nullifying the negative effect of spam or excessive search engine optimisation (SEO) on the ranking of non-spam pages is not well supported by these resources. Nor is the study of cloaking techniques or of click spam. Finally, the domain-restricted nature of a .uk crawl means that only parts of link-farm icebergs may be visible in these crawls. We introduce the term nullification which we define as "preventing problem pages from negatively affecting search results". We show some important differences between properties of current .uk-restricted crawls and those previously reported for the Web as a whole. We identify a need for an adversarial IR collection which is not domain-restricted and which is supported by a set of appropriate query sets and (optimistically) user-behaviour data. The billion-page unrestricted crawl being conducted by CMU (web09-bst) and which will be used in the 2009 TREC Web Track is assessed as a possible basis for a new AIR test collection. We discuss the pros and cons of its scale, and the feasibility of adding resources such as query lists to enhance the utility of the collection for AIR research.

Web Spam Challenge Proposal for Filtering in Archivesslides
András A. Benczúr, Miklós Erdélyi, Julien Masanes and Dávid Siklósi

Pages 61-62, In this paper we propose new tasks for a possible future Web Spam Challenge motivated by the needs of the archival community. The Web archival community consists of several relatively small institutions that operate independently and possibly over different top level domains (TLDs). Each of them may have a large set of historic crawls. Efficient filtering would hence require (1) enhanced use of the time series of domain snapshots and (2) collaboration by transferring models across different TLDs. Corresponding Challenge tasks could hence include the distribution of crawl snapshot data for feature generation as well as classification of unlabeled new crawls of the same or even different TLDs.