IPM Special Issue on Large-Scale Distributed Systems for Information Retrieval

We are pleased to announce that we are preparing a special issue on the workshop topics, to be published in the journal Information Processing and Management by Elsevier. You can find the CFP on the journal webpage and here.

Submissions are open to any contribution in the field of large-scale distributed systems for information retrieval. Given the high quality of the papers presented at this year's edition of the LSDS-IR workshop, we particularly invite extended versions of those works.

Contributions must be submitted by February 1st, 2010 at the following URL: http://ees.elsevier.com/ipm/default.asp.

News Updates

2009-11-30 We are organizing a special issue on the workshop topics, to be published in Information Processing and Management by Elsevier.
2009-07-14 The workshop proceedings have been published on-line by CEUR-WS.
2009-07-10 The workshop proceedings are available on-line.
2009-07-07 The technical program is now available here. The workshop will host two invited talks.
2009-07-02 Sample copyright box for the camera-ready papers: download.
2009-06-28 Notifications were sent to authors. Soon, we will publish the workshop program.
2009-05-08 The Information Processing & Management Journal will publish a Special Issue on LSDS-IR.
2009-04-28 The submission website is now open. Submit your paper!
2009-04-16 Call for papers available in pdf and txt formats.
2009-03-11 The workshop site goes online.

Workshop Evaluation

We thank the invited speakers and all the participants, who made the workshop a success!
If you participated in the workshop, please fill in the evaluation form at http://tiny.cc/sigir2009ws.


The Web is continuously growing. Currently, there are more than 20 billion pages (some sources suggest 100 billion), compared to fewer than 1 billion documents in 1998. Traditionally, Web-scale search engines employ large and highly replicated systems, operating on computer clusters in one or a few data centers. Coping with the increasing number of user requests and indexable pages requires adding more resources. However, data centers cannot grow indefinitely. Scalability problems in information retrieval have to be addressed in the near future, and new distributed applications are likely to drive the way in which people use the Web. Distributed IR is the point at which these two directions converge. This workshop will provide space for researchers to discuss these problems and to define new directions for the work on distributed information retrieval.


Every regular paper will have 30 minutes for the presentation, including 5-10 minutes for questions. Short papers will have 15 minutes for the presentation plus 5 minutes for questions. The workshop proceedings, including all the accepted papers, are available here.

09:00 - 09:10 Welcome

Session I: "YouTube", chaired by Wai Gen Yee.
09:10 - 10:10 Keynote: "The Youtube Video Delivery System" by Leonidas Kontothanassis (Google, Boston, USA).
10:10 - 10:30
"Are Web User Comments Useful for Search?"
by Wai Gen Yee, Andrew Yates, Shizhu Liu, Ophir Frieder (Illinois Institute of Technology).
Special guest: Martin Potthast (Bauhaus-Universität Weimar) shared with us, with great enthusiasm, his ideas and findings on "Measuring the Descriptiveness of Web Comments" (website).
10:30 - 11:00 Coffee break

Session II: "Search", chaired by Claudio Lucchese.
11:00 - 11:30
"PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search"
by Andrea Esuli (ISTI-CNR, Pisa, Italy).
11:30 - 12:00
"Collection Selection with Highly Discriminative Keys"
by Sander Bockting (Avanade), Djoerd Hiemstra (U of Twente).
12:00 - 12:20
"Peer-to-Peer clustering of Web-browsing users"
by Patrizio Dazzi, Matteo Mordacchini, Raffaele Perego (ISTI-CNR, Pisa, Italy), Pascal Felber, Lorenzo Leonini (U of Neuchatel), Martin Rajman (EPFL), Etienne Riviere (NTNU).
12:20 - 13:30 Lunch break

Session III: "Large Scale", chaired by Wai Gen Yee.
13:30 - 14:30 Keynote: "DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language" by Dennis Fetterly (Microsoft Research, Silicon Valley, USA).
14:30 - 15:00
"Strong Ties vs. Weak Ties: Studying the Clustering Paradox for Decentralized search"
by Weimao Ke, Javed Mostafa (U of North Carolina).
15:00 - 15:30 Coffee break

Session IV: "Large Scale cont.", chaired by Gleb Skobeltsyn.
15:30 - 16:00
"Sorting using BItonic netwoRk wIth CUDA"
by Gabriele Capannini, Fabrizio Silvestri, Ranieri Baraglia, Franco Maria Nardini (ISTI-CNR, Pisa, Italy).
16:00 - 16:30
"Static Index Pruning for Information Retrieval Systems: A Posting-Based Approach"
by Linh Nguyen (Illinois Institute of Technology).
16:30 - 17:00
"Comparing Distributed Indexing: To MapReduce or Not?"
by Richard McCreadie, Craig Macdonald, Iadh Ounis (U of Glasgow).
17:00 - 17:20
"The Curse of Zipf and Limits to Parallelization: A Look at the Stragglers Problem in MapReduce"
by Jimmy Lin (U of Maryland).
17:20 - 17:30 Concluding Remarks

Keynote speakers

Speaker: Leonidas Kontothanassis. He joined Google in 2006 and has worked on networking and video delivery issues ever since. He currently manages the teams working in these areas. Previously, he worked in areas such as computer architecture, parallel programming, and content delivery at multiple companies in the Kendall/MIT area, including DEC/HP/Intel Labs and Akamai. He received a PhD in computer architecture in 1996 and has served as a committee member or organizer for academic conferences and for research funding organizations such as NSF.
Title: "The Youtube Video Delivery System". This talk will cover the Youtube Video Delivery System. It will discuss access patterns and trends for both video uploads and downloads. It will describe the storage and delivery mechanisms for popular and unpopular content and the impact YouTube has on the network storage infrastructure for Google. We will also discuss the networking impact for ISPs around the world.

Speaker: Dennis Fetterly. He is a Research Software Development Engineer in Microsoft Research's Silicon Valley lab, which he joined in May 2003. His research interests span a wide variety of topics, including web crawling, the evolution and similarity of pages on the web, identifying spam web pages, and large-scale distributed systems. He is currently working on DryadLINQ, TidyFS, and a project evaluating policies for corpus selection. Interesting past projects include the MSRBot web crawler; Dryad; the Your Desktop and Your Keychain project, which utilizes flash memory devices to enable users to carry their desktop PC state with them from machine to machine; and PageTurner, a large-scale study of the evolution of web pages.
Title: "DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language". The goal of DryadLINQ is to make distributed computing on large compute clusters simple. DryadLINQ combines two important pieces of technology: the Dryad distributed execution engine and the .NET Language INtegrated Query (LINQ). Dryad provides reliable, distributed computing on thousands of servers for large-scale data-parallel applications. LINQ enables developers to write and debug their applications in a SQL-like query language, relying on the entire .NET library and using Visual Studio. DryadLINQ is a simple, powerful, and elegant programming environment for writing large-scale data-parallel applications running on large PC clusters. This talk will also describe the experience of using DryadLINQ for a series of information retrieval experiments.

Workshop Activities and Goals

The workshop aims to bring together researchers from the domains of IR and databases working on peer-to-peer information systems and to foster closer collaboration that could have a large impact on future research directions in the area of distributed and P2P IR.

This workshop continues the efforts from previous workshops.


Topics of interest include, but are not limited to:

Workshop chairs

Steering Committee

Program Committee

Workshop format

The workshop solicits scientific papers that address problems specific to IR in heterogeneous and distributed environments. Additionally, position papers outlining interesting new research domains and approaches are welcome. The selection of papers is based primarily on their potential to influence future research. Papers have to present original contributions not concurrently submitted elsewhere.

Paper Submission

Papers should not exceed 8 pages, double column, including figures, tables and references in the standard ACM Conference style (for LaTeX, use the "Option 2" style). Papers have to present original research contributions not concurrently submitted elsewhere, and must be submitted electronically in printable PDF format (other formats will be rejected) via the online submission system. Submitted papers will undergo a peer review process by at least three members of the program committee. Submission is not blind.

At least one author of an accepted paper must register for the workshop. Registration must be completed when the author sends the camera-ready copy of the paper. Here you can find a sample copyright box for the camera-ready papers. Further instructions are available through the online submission system.

Best papers will be invited to submit an extended version to the Special Issue on "Large-Scale Distributed Systems for Information Retrieval", published by the Information Processing & Management Journal.

Important Dates

Paper submission:  June 15, 2009 (closed)
Notification:  June 27, 2009
Camera-ready papers:  July 4, 2009
Workshop date:  July 23, 2009