Proceedings of AIRWeb'07

Proceedings of the 3rd international workshop on Adversarial Information Retrieval on the Web, Banff, Alberta, Canada, May 8th, 2007. ISBN 978-1-59593-732-2, ACM ICPS Vol. 215.

Session: Temporal and Topological Factors

Splog Detection Using Self-similarity Analysis on Blog Temporal Dynamicsslides
Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura and Belle Tseng

Pages 1-8, This paper focuses on spam blog (splog) detection. Blogs are highly popular, new media social communication mechanisms. The presence of splogs degrades blog search results as well as wastes network resources. In our approach we exploit unique blog temporal dynamics to detect splogs.

There are three key ideas in our splog detection framework. We first represent the blog temporal dynamics using self-similarity matrices defined on the histogram intersection similarity measure of the time, content, and link attributes of posts. Second, we show via a novel visualization that the blog temporal characteristics reveal attribute correlation, depending on type of the blog (normal blogs and splogs). Third, we propose the use of temporal structural properties computed from self-similarity matrices across different attributes. In a splog detector, these novel features are combined with content based features. We extract a content based feature vector from different parts of the blog – URLs, post content, etc. The dimensionality of the feature vector is reduced by Fisher linear discriminant analysis. We have tested an SVM based splog detector using proposed features on real world datasets, with excellent results (90% accuracy).

Improving Web Spam Classification using Rank-time Featuresslides
Krysta Svore, Qiang Wu, Chris Burges and Aaswath Raman

Pages 9-16, In this paper, we study the classification of web spam. Web spam refers to pages that use techniques to mislead search engines into assigning them higher rank, thus increasing their site traffic. Our contributions are two fold. First, we find that the method of dataset construction is crucial for accurate spam classification and we note that this problem occurs generally in learning problems and can be hard to detect. In particular, we find that ensuring no overlapping domains between test and training sets is necessary to accurately test a web spam classifier. In our case, classification performance can differ by as much as 40% in precision when using non-domain-separated data. Second, we show ranktime features can improve the performance of a web spam classifier. Our paper is the first to investigate the use of rank-time features, and in particular query-dependent ranktime features, for web spam detection. We show that the use of rank-time and query-dependent features can lead to an increase in accuracy over a classifier trained using page-based content only.

Improving Web Spam Classifiers Using Link Structureslides
Qingqing Gan and Torsten Suel

Pages 17-20, Web spam has been recognized as one of the top challenges in the search engine industry [14]. A lot of recent work has addressed the problem of detecting or demoting web spam, including both content spam [16, 12] and link spam [22, 13]. However, any time an anti-spam technique is developed, spammers will design new spamming techniques to confuse search engine ranking methods and spam detection mechanisms. Machine learning-based classiffcation methods can quickly adapt to newly developed spam techniques. We describe a two-stage approach to improve the performance of common classiffers. We first implement a classiffer to catch a large portion of spam in our data. Then we design several heuristics to decide if a node should be relabeled based on the preclassiffed result and knowledge about the neighborhood. Our experimental results show visible improvements with respect to precision and recall.

Transductive Link Spam Detectionslides
Dengyong Zhou, Christopher Burges and Tao Tao

Pages 21-28, Web spam can significantly deteriorate the quality of search engines. Early web spamming techniques mainly manipulate page content. Since linkage information is widely used in web search, link-based spamming has also developed. So far, many techniques have been proposed to detect link spam. Those approaches are basically built on link-based web ranking methods.

In contrast, we cast the link spam detection problem into a machine learning problem of classification on directed graphs. We develop discrete analysis on directed graphs, and construct a discrete analogue of classical regularization theory via discrete analysis. A classification algorithm for directed graphs is then derived from the discrete regularization. We have applied the approach to real-world link spam detection problems, and encouraging results have been obtained.

Session: Link Farms

Using Spam Farm to Boost PageRankslides
Ye Du, Yaoyun Shi and Xin Zhao

Pages 29-36, Nowadays web spamming has emerged to take the economic advantage of high search rankings and threatened the accuracy and fairness of those rankings. Understanding spamming techniques is essential for evaluating the strength and weakness of a ranking algorithm, and for fighting against web spamming. In this paper, we identify the optimal spam farm structure under some realistic assumptions in the single target spam farm model. Our result extends the optimal spam farm claimed by GyÄongyi and Garcia-Molina through dropping the assumption that leakage is constant. We also characterize the optimal spam farms under additional constraints, which the spammer may deploy to disguise the spam farm by deviating from the unconstrained optimal structure.

Extracting Link Spam using Biased Random Walks from Spam Seed Setsslides
Baoning Wu and Kumar Chellapilla

Pages 37-44, Link spam deliberately manipulates hyperlinks between web pages in order to unduly boost the search engine ranking of one or more target pages. Link based ranking algorithms such as PageRank, HITS, and other derivatives are especially vulnerable to link spam. Link farms and link exchanges are two common instances of link spam that produce spam communities – i.e., clusters in the web graph. In this paper, we present a directed approach to extracting link spam communities when given one or more members of the community. In contrast to previous completely automated approaches to finding link spam, our method is specifically designed to be used interactively. Our approach starts with a small spam seed set provided by the user and simulates a random walk on the web graph. The random walk is biased to explore the local neighborhood around the seed set through the use of decay probabilities. Truncation is used to retain only the most frequently visited nodes. After termination, the nodes are sorted in decreasing order of their final probabilities and presented to the user. Experiments using manually labeled link spam data sets and random walks from a single seed domain show that the approach achieves over 95.12% precision in extracting large link farms and 80.46% precision in extracting link exchange centroids.

A Large-Scale Study of Link Spam Detection by Graph Algorithmsslides
Hiroo Saito, Masashi Toyoda, Masaru Kitsuregawa and Kazuyuki Aihara

Pages 45-48, Link spam refers to attempts to promote the ranking of spammers' web sites by deceiving link-based ranking algorithms in search engines. Spammers often create densely connected link structure of sites so called "link farm". In this paper, we study the overall structure and distribution of link farms in a large-scale graph of the JapaneseWeb with 5.8 million sites and 283 million links. To examine the spam structure, we apply three graph algorithms to the web graph. First, the web graph is decomposed into strongly connected components (SCC). Beside the largest SCC (core) in the center of the web, we have observed that most of large components consist of link farms. Next, to extract spam sites in the core, we enumerate maximal cliques as seeds of link farms. Finally, we expand these link farms as a reliable spam seed set by a minimum cut technique that separates links among spam and non-spam sites. We found about 0.6 million spam sites in SCCs around the core, and extracted additional 8 thousand and 49 thousand sites as spams with high precision in the core by the maximal clique enumeration and by the minimum cut technique, respectively.

Measuring Similarity to Detect Qualified Linksslides
Xiaoguang Qi, Lan Nie and Brian Davison

Pages 49-56, The early success of link-based ranking algorithms was predicated on the assumption that links imply merit of the target pages. However, today many links exist for purposes other than to confer authority. Such links bring noise into link analysis and harm the quality of retrieval. In order to provide high quality search results, it is important to detect them and reduce their influence. In this paper, a method is proposed to detect such links by considering multiple similarity measures over the source pages and target pages. With the help of a classifier, these noisy links are detected and dropped. After that, link analysis algorithms are performed on the reduced link graph. The usefulness of a number of features are also tested. Experiments across 53 query-specific datasets show our approach almost doubles the performance of Kleinberg’s HITS and boosts Bharat and Henzinger’s imp algorithm by close to 9% in terms of precision. It also outperforms a previous approach focusing on link farm detection.

Session: Tagging, P2P and Cloaking

Combating Spam in Tagging Systemsslides
Georgia Koutrika, Frans Effendi, Zoltán Gyöngyi, Paul Heymann and Hector García-Molina

Pages 57-64, Tagging systems allow users to interactively annotate a pool of shared resources using descriptive tags. As tagging systems are gaining in popularity, they become more susceptible to tag spam: misleading tags that are generated in order to increase the visibility of some resources or simply to confuse users. We introduce a framework for modeling tagging systems and user tagging behavior. We also describe a method for ranking documents matching a tag based on taggers’ reliability. Using our framework, we study the behavior of existing approaches under malicious attacks and the impact of a moderator and our ranking method.

New Metrics for Reputation Management in P2P Networksslides Debora Donato, Mario Paniccia, Maddalena Selis, Carlos Castillo, Giovanni Cortese and Stefano Leonardi

Pages 65-72, In this work we study the effectiveness of mechanisms for decentralized reputation management in P2P networks. We depart from EigenTrust, an algorithm designed for reputation management in file sharing applications over p2p networks. EigenTrust has been proved very effective against three different natural attacks from malicious coalitions while it performs poorly on particular attack organized by two different kinds of malicious peers. We propose various metrics of reputation based on ideas recently introduced for detecting and demoting Web spam. We combine these metrics with the original EigenTrust approach. Our mechanisms are more effective than EigenTrust alone for detecting malicious peers and reducing the number of inauthentic downloads not only for all the cases previously addressed but also for more sophisticated attacks.

Computing Trusted Authority Scores in Peer-to-Peer Web Search Networksslides
Josiane Xavier Parreira, Debora Donato, Carlos Castillo and Gerhard Weikum

Pages 73-80, Peer-to-peer (P2P) networks have received great attention for sharing and searching information in large user communities. The open and anonymous nature of P2P networks is one of its main strengths, but it also opens doors to manipulation of the information and of the quality ratings.

In our previous work (J. X. Parreira, D. Donato, S. Michel and G. Weikum in VLDB 2006) we presented the JXP algorithm for distributed computing PageRank scores for information units (Web pages, sites, peers, social groups, etc.) within a link- or endorsement-based graph structure. The algorithm builds on local authority computations and bilateral peer meetings with exchanges of small data structures that are relevant for gradually learning about global properties and eventually converging towards global authority rankings.

In the current paper we address the important issue of cheating peers that attempt to distort the global authority values, by providing manipulated data during the peer meetings. Our approach to this problem enhances JXP with statistical techniques for detecting suspicious behavior. Our method, coined TrustJXP, is again completely decentralized, and we demonstrate its viability and robustness in experiments with real Web data.

A Taxonomy of JavaScript Redirection Spamslides
Kumar Chellapilla and Alexey Maykov

Pages 81-88, Redirection spam presents a web page with false content to a crawler for indexing, but automatically redirects the browser to a different web page. Redirection is usually immediate (on page load) but may also be triggered by a timer or a harmless user event such as a mouse move. JavaScript redirection is the most notorious of redirection techniques and is hard to detect as many of the prevalent crawlers are script-agnostic. In this paper, we study common JavaScript redirection spam techniques on the web. Our findings indicate that obfuscation techniques are very prevalent among JavaScript redirection spam pages. These obfuscation techniques limit the effectiveness of static analysis and static feature based systems. Based on our findings, we recommend a robust counter measure using a light weight JavaScript parser and engine.

Session: Web Spam Challenge

Web Spam Detection via Commercial Intent Analysisslides
András A. Benczúr, István Bíró, Károly Csalogány and Tamás Sárlós

Pages 89-92, We propose a number of features for Web spam filtering based on the occurrence of keywords that are either of high advertisement value or highly spammed. Our features include popular words from search engine query logs as well as high cost or volume words according to Google AdWords. We also demonstrate the spam filtering power of the Online Commerical Intention (OCI) value assigned to an URL in a Microsoft adCenter Labs Demonstration and the Yahoo! Mindset classification of Web pages as either commercial or non-commercial as well as metrics based on the occurrence of Google ads on the page. We run our tests on the WEBSPAM-UK2006 dataset recently compiled by Casilllo et al. as a standard means of measuring the performance of Web spam detection algorithms. Our features improve the classification accuraracy of the publicly available WEBSPAM-UK2006 features by 3%.