Proceedings
Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web, Madrid, Spain, April 21, 2009.
The workshop proceedings were published as a volume in the ACM International Conference Proceeding Series (ISBN 978-1-60558-438-6).
Invited Talks
The Potential for Research and Development in Adversarial Information Retrieval — slides
Brian D. Davison
Web Spam Challenges: Looking Backward and Forward — slides
Carlos Castillo
Temporal Analysis
Looking into the Past to Better Classify Web Spam — slides
Na Dai, Brian D. Davison and Xiaoguang Qi
Pages 1-8, http://doi.acm.org/10.1145/1531914.1531916
Web spamming techniques aim to achieve undeserved rankings in
search results. Research has been widely conducted on identifying
such spam and neutralizing its influence. However, existing spam
detection work only considers current information. We argue that
historical web page information may also be important in spam
classification. In this paper, we use content features from historical
versions of web pages to improve spam classification. We use
supervised learning techniques to combine classifiers based on
current page content with classifiers based on temporal features.
Experiments on the WEBSPAM-UK2007 dataset show that our
approach improves spam classification F-measure performance by
30% compared to a baseline classifier which only considers current
page content.
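The abstract does not spell out the combination scheme, so the following is only a minimal sketch, assuming a simple stacking setup in scikit-learn, of one way a content-based classifier and a temporal-feature classifier could be combined. The feature matrices X_content and X_temporal and the labels y are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): combining a content-based classifier
# with a temporal-feature classifier by stacking their predicted probabilities.
# X_content, X_temporal and y are hypothetical feature matrices / labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def stack_classifiers(X_content, X_temporal, y):
    base_content = LogisticRegression(max_iter=1000)
    base_temporal = LogisticRegression(max_iter=1000)

    # Out-of-fold probabilities avoid leaking labels into the meta-classifier.
    p_content = cross_val_predict(base_content, X_content, y, cv=5,
                                  method="predict_proba")[:, 1]
    p_temporal = cross_val_predict(base_temporal, X_temporal, y, cv=5,
                                   method="predict_proba")[:, 1]

    meta = LogisticRegression()
    meta.fit(np.column_stack([p_content, p_temporal]), y)

    # Refit the base classifiers on all data for use at prediction time.
    base_content.fit(X_content, y)
    base_temporal.fit(X_temporal, y)
    return base_content, base_temporal, meta
```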
A Study of Link Farm Distribution and Evolution Using a Time Series of Web Snapshots — slides
Young-joo Chung, Masashi Toyoda and Masaru Kitsuregawa
Pages 9-16, http://doi.acm.org/10.1145/1531914.1531917
In this paper, we study the overall link-based spam structure and its evolution, which should be helpful for developing robust analysis tools and for studying Web spamming as a social activity in cyberspace. First, we use strongly connected component (SCC) decomposition to separate many link farms from the largest SCC, the so-called core. We show that denser link farms in the core can be extracted by node filtering and recursive application of SCC decomposition to the core. Surprisingly, we find new large link farms during each iteration, and this trend continues for at least 10 iterations. In addition, we measure the spamicity of such link farms. Next, we examine the evolution of link farms over two years. The results show that almost all large link farms stop growing, some of them shrink, and many new large link farms are created within one year.
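As an illustration of the recursive decomposition described above (not the authors' implementation), the sketch below peels candidate link farms out of the core with networkx; the degree-based node filter, the quantile threshold and the iteration count are assumptions.

```python
# Illustrative sketch: remove the highest-degree nodes from the giant SCC,
# re-run SCC decomposition, and collect the new non-giant components as
# link-farm candidates. Thresholds here are arbitrary assumptions.
import networkx as nx

def extract_link_farm_candidates(G, iterations=10, degree_quantile=0.99):
    core = G
    candidates = []
    for _ in range(iterations):
        sccs = sorted(nx.strongly_connected_components(core), key=len, reverse=True)
        if not sccs:
            break
        giant, rest = sccs[0], sccs[1:]
        # Non-giant SCCs of non-trivial size are link-farm candidates.
        candidates.extend(c for c in rest if len(c) > 2)

        # Filter the highest-degree nodes from the giant component and recurse.
        core = core.subgraph(giant).copy()
        degrees = sorted(d for _, d in core.degree())
        cutoff = degrees[int(degree_quantile * (len(degrees) - 1))]
        core.remove_nodes_from([n for n, d in list(core.degree()) if d >= cutoff])
    return candidates
```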
Web Spam Filtering in Internet Archives — slides
Miklós Erdélyi, András A. Benczúr, Julien Masanes and Dávid Siklósi
Pages 17-20, http://doi.acm.org/10.1145/1531914.1531918
While Web spam targets the high commercial value of top-ranked search-engine results, Web archives observe quality deterioration and resource waste as a side effect. So far, Web spam filtering technologies are rarely used by Web archivists, although a survey with responses from more than 20 institutions worldwide indicates that their use is planned. These archives typically operate on modest budgets that prohibit standalone Web spam filtering operations, but collaborative efforts could lead to a high-quality solution for them.
In this paper we illustrate the spam filtering needs, opportunities and obstacles for Internet archives by analyzing several crawl snapshots, and we illustrate the difficulty of migrating filter models across different crawls via the example of the 13 .uk snapshots performed by UbiCrawler that include WEBSPAM-UK2006 and WEBSPAM-UK2007.
Content Analysis
Web Spam Identification Through Language Model Analysis — slides
Juan Martinez-Romo and Lourdes Araujo
Pages 21-28, http://doi.acm.org/10.1145/1531914.1531920
This paper applies a language model approach to different sources of information extracted from a Web page, in order to provide high-quality indicators for the detection of Web spam. Two pages linked by a hyperlink should be topically related, even if this contextual relation is weak. For this reason we have analysed different sources of information from a Web page that belong to the context of a link, and we have applied the Kullback-Leibler divergence to them to characterise the relationship between two linked pages. Moreover, we combine some of these sources of information in order to obtain richer language models. Given the different nature of internal and external links, our study also distinguishes between these types of links, obtaining a significant improvement in classification tasks. The result is a system that improves the detection of Web spam on two large public datasets, WEBSPAM-UK2006 and WEBSPAM-UK2007.
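As a rough illustration of the divergence measure (the paper's exact smoothing and context extraction are not reproduced here), the sketch below computes a smoothed Kullback-Leibler divergence between the unigram model of a link's context and that of the linked page.

```python
# Minimal sketch (assumed, not the paper's code): KL divergence between the
# unigram language model of a link's context (e.g. anchor text) and that of
# the linked page, with additive smoothing over a shared vocabulary.
import math
from collections import Counter

def kl_divergence(source_tokens, target_tokens, alpha=0.1):
    vocab = set(source_tokens) | set(target_tokens)
    p_counts, q_counts = Counter(source_tokens), Counter(target_tokens)
    p_total = len(source_tokens) + alpha * len(vocab)
    q_total = len(target_tokens) + alpha * len(vocab)

    kl = 0.0
    for term in vocab:
        p = (p_counts[term] + alpha) / p_total
        q = (q_counts[term] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl

# A large divergence between the anchor-text context and the target page
# suggests the two pages are not topically related, a possible spam signal.
print(kl_divergence("cheap pills online".split(),
                    "university research department news".split()))
```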
An Empirical Study on Selective Sampling in Active Learning for Splog Detection — slides
Taichi Katayama, Takehito Utsuro, Yuuki Sato, Takayuki Yoshinaka, Yasuhide Kawada and Tomohiro Fukuhara
Pages 29-36, http://doi.acm.org/10.1145/1531914.1531921
This paper studies how to reduce the amount of human supervision needed to identify splogs / authentic blogs in the context of continuously updating splog data sets year by year. Following previous work on active learning, this paper empirically examines several strategies for selective sampling in active learning with Support Vector Machines (SVMs) applied to the task of splog / authentic blog detection. As a confidence measure for SVM learning, we employ the distance from the separating hyperplane to each test instance, which has been well studied in active learning for text classification. Unlike results reported for active learning in text classification tasks, in the splog / authentic blog detection task of this paper, adding the least confident samples does not perform best.
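The confidence measure lends itself to a short sketch. Assuming a scikit-learn setup (not the authors' code), margin-based selective sampling picks the unlabeled instances closest to the SVM hyperplane; alternative strategies only change how the margins are ranked.

```python
# Sketch of margin-based selective sampling: train an SVM on the labeled pool
# and pick the unlabeled instances whose distance to the separating hyperplane
# is smallest (the least confident predictions).
import numpy as np
from sklearn.svm import LinearSVC

def select_least_confident(X_labeled, y_labeled, X_unlabeled, batch_size=10):
    svm = LinearSVC()
    svm.fit(X_labeled, y_labeled)
    # decision_function gives the signed distance to the hyperplane
    # (up to a constant scale factor for LinearSVC).
    margins = np.abs(svm.decision_function(X_unlabeled))
    return np.argsort(margins)[:batch_size]  # indices to send for labeling
```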
Linked Latent Dirichlet Allocation in Web Spam Filtering — slides
István Bíró, Dávid Siklósi, Jácint Szabó and András Benczúr
Pages 37-40, http://doi.acm.org/10.1145/1531914.1531922
Latent Dirichlet allocation (LDA) (Blei, Ng, Jordan 2003) is a fully generative statistical language model of the content and topics of a corpus of documents. In this paper we apply an extension of LDA to web spam classification. Our linked LDA technique also takes linkage into account: topics are propagated along links in such a way that the linked document directly influences the words in the linking document. The inferred LDA model can be applied to classification as a dimensionality reduction step, similarly to latent semantic indexing. We test linked LDA on the WEBSPAM-UK2007 corpus. Using a BayesNet classifier, we achieve a 3% improvement in classification AUC over plain LDA with BayesNet, and 8% over the public link features with C4.5. Adding this method to a log-odds based combination of strong link and content baseline classifiers results in a 3% improvement in AUC. Our method even slightly improves over the best Web Spam Challenge 2008 result.
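The linked LDA model itself is custom, but the workflow of using inferred topic distributions as low-dimensional features for a classifier can be sketched with plain LDA. The scikit-learn pipeline below is an illustration only: it omits the topic propagation along links and stands in a logistic regression for the BayesNet/C4.5 classifiers used in the paper; documents and labels are hypothetical inputs.

```python
# Illustrative sketch only: plain LDA (not the paper's linked LDA) used as
# dimensionality reduction, with per-document topic proportions fed to a
# downstream classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def lda_spam_classifier(documents, labels, n_topics=30):
    pipeline = make_pipeline(
        CountVectorizer(max_features=20000, stop_words="english"),
        LatentDirichletAllocation(n_components=n_topics, random_state=0),
        LogisticRegression(max_iter=1000),
    )
    pipeline.fit(documents, labels)
    return pipeline
```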
Social Spam
Social Spam Detection — slides
Benjamin Markines, Ciro Cattuto and Filippo Menczer
Pages 41-48, http://doi.acm.org/10.1145/1531914.1531924
The popularity of social bookmarking sites has made them prime
targets for spammers. Many of these systems require an administrator’s
time and energy to manually filter or remove spam. Here
we discuss the motivations of social spam, and present a study
of automatic detection of spammers in a social tagging system.
We identify and analyze six distinct features that address various
properties of social spam, finding that each of these features provides a helpful signal for discriminating spammers from legitimate users. These features are then used in various machine learning
algorithms for classification, achieving over 98% accuracy in detecting
social spammers with 2% false positives. These promising
results provide a new baseline for future efforts on social spam. We
make our dataset publicly available to the research community.
Tag Spam Creates Large Non-Giant Connected Components — slides
Nicolas Neubauer, Robert Wetzker and Klaus Obermayer
Pages 49-52, http://doi.acm.org/10.1145/1531914.1531925
Spammers in social bookmarking systems try to mimic the bookmarking behaviour of real users to gain the attention of other users or search engines. Several methods have been proposed for the detection of such spam, including domain-specific features (like URL terms) or similarity of users to previously identified spammers. However, as shown in our previous work, it is possible to identify a large fraction of spam users based on purely structural features. The hypergraph connecting documents, users, and tags can be decomposed into connected components, and the large but non-giant components turned out to be almost entirely inhabited by spam users in the examined dataset. Here, we test to what degree the decomposition of the complete hypergraph is really necessary by examining the component structure of the induced user/document and user/tag graphs. While the user/tag graph's connectivity does not help in classifying spammers, the user/document graph's connectivity is already highly informative. It can, however, be augmented with connectivity information from the hypergraph. In our view, spam detection based on structural features, like the approach proposed here, requires complex adaptation strategies from spammers and may complement other, more traditional detection approaches.
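A minimal sketch of the user/document component test, under assumed size thresholds and with a hypothetical bookmarks input, could look as follows.

```python
# Sketch (not the authors' code): flag users whose user/document component is
# large but not the giant component. The min_size threshold is an assumption.
import networkx as nx

def flag_component_spam_candidates(bookmarks, min_size=20):
    """bookmarks: iterable of (user, document) pairs."""
    G = nx.Graph()
    # Prefix node names so user and document namespaces stay distinct.
    G.add_edges_from((("u", u), ("d", d)) for u, d in bookmarks)

    components = sorted(nx.connected_components(G), key=len, reverse=True)
    suspicious = set()
    for comp in components[1:]:          # skip the giant component
        if len(comp) >= min_size:        # large, non-giant component
            suspicious.update(name for kind, name in comp if kind == "u")
    return suspicious
```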
Spam Research Collections
Nullification Test Collections for Web Spam and SEO — slides
Timothy Jones, David Hawking, Ramesh Sankaranarayana and Nick Craswell
Pages 53-60, http://doi.acm.org/10.1145/1531914.1531927
Research in the area of adversarial information retrieval has
been facilitated by the availability of the UK-2006/UK-2007
collections, comprising crawl data, link graph, and spam labels.
However, research into nullifying the negative effect
of spam or excessive search engine optimisation (SEO) on
the ranking of non-spam pages is not well supported by
these resources. Nor is the study of cloaking techniques
or of click spam. Finally, the domain-restricted nature of a
.uk crawl means that only parts of link-farm icebergs may
be visible in these crawls. We introduce the term nullification, which we define as "preventing problem pages from
negatively affecting search results". We show some important
differences between properties of current .uk-restricted
crawls and those previously reported for the Web as a whole.
We identify a need for an adversarial IR collection which is
not domain-restricted and which is supported by a set of
appropriate query sets and (optimistically) user-behaviour
data. The billion-page unrestricted crawl being conducted by CMU (web09-bst), which will be used in the 2009 TREC Web Track, is assessed as a possible basis for a new
AIR test collection. We discuss the pros and cons of its scale,
and the feasibility of adding resources such as query lists to
enhance the utility of the collection for AIR research.
Web Spam Challenge Proposal for Filtering in Archives — slides
András A. Benczúr, Miklós Erdélyi, Julien Masanes and Dávid Siklósi
Pages 61-62, http://doi.acm.org/10.1145/1531914.1531928
In this paper we propose new tasks for a possible future Web Spam
Challenge motivated by the needs of the archival community. The
Web archival community consists of several relatively small institutions
that operate independently and possibly over different top-level domains (TLDs). Each of them may have a large set of historic
crawls. Efficient filtering would hence require (1) enhanced
use of the time series of domain snapshots and (2) collaboration by
transferring models across different TLDs. Corresponding Challenge tasks could therefore include the distribution of crawl snapshot
data for feature generation as well as classification of unlabeled
new crawls of the same or even different TLDs.