window office - Open Source Search Engine Evaluation

Started by Jon Elsas · 1 year ago

9 comments

  • Hi, do you know any work that has addressed the design of IR relevance judgments using MechTurk? Namely on how to avoid fraud and detect bogus answers? There is a recent paper from PARC that discusses the design of user studies.
  • Sérgio -- I, unfortunately, don't have much experience with Mechanical Turk. I would suggest also looking at Panos Ipeirotis's blog -- he's posted a few things about Mechanical Turk: http://behind-the-enemy-lines.blogspot.com/sear...

    As far as exactly what to collect, I would be inclined to ask for pairwise preference information -- given a query, is document A or document B better? This is generally easier for assessors, and leads to more agreement than absolute relevance judgments (relevant vs. non-relevant). (See Carterette et al. from ECIR 2008.)

    As far as fraud & bogus answers, the only reasonable approach I can see is collecting the same judgments from a variety of users and only retaining the stable judgments. I wouldn't think there would be a strong incentive for users to provide intentionally misleading answers, but there is certainly a risk of getting lazy users who just want to complete the task quickly.
  • Jon, I'm not sure I have a 100% correct answer to your question, but let me try:

    The biggest barrier is the need to pay for the test collection. For example, the fee for GOV2 is £600 (~$1300!!). The license means you can't share it with others, of course.

    The other reason is that many of the developers just aren't interested in TREC. TREC is great if you are a researcher, but the average practitioner tweaking search algorithms isn't going to participate. They are (rightly) too busy getting their search engine up and running to worry about conferences and such. They just want to have a way to objectively compare their tweaks to what other people are doing.

    If I understand the problem, I think one barrier is the redistribution of copyrighted content. Getting a publicly available, large-scale collection of documents is difficult. Sources for this include blogs, web documents, etc., but these can't be redistributed because they are copyrighted.
  • If web content is copyrighted and can't be redistributed, then how does the BLOG06 collection get sold by U. Glasgow? I'm sure they didn't get permission from 100k different bloggers. Also, I would guess most of the GOV2 is in the public domain anyway. In all cases, original citations are preserved.

    I understand the Reuters & other newswire collections are under copyright, but I'm not so sure about the current web collections.
  • This came up when I was talking to Craig Macdonald, the creator of BLOG06, at ECIR 2008.

    They didn't get permission. I believe they are just risking it. They could be open to IP problems if a blogger decides he doesn't like his content being redistributed. A larger dataset with more copyright owners brings greater risk.
  • That's why there are "organisation agreements" that clearly state how copyright issues are handled (see copyright section of the agreements). It is possible that some people might request the deletion of some files, and there is provision in the agreements to do so. In all cases, if this happens, the users of the collections bear the associated liability. There have never been such issues since the collections were first distributed by CSIRO.

    Creating and distributing test collections is not as straightforward a task as some posts seem to suggest; that's why there are administrative fees and agreements in place. The fees are intended to contribute towards the costs of preparing and distributing the data.
  • Iadh -- Thanks for chiming in and keeping me honest. I realize that test collection creation isn't as simple as crawling the data & setting it free.
  • Iadh,

    Thanks for enlightening me about the way it works. Could you elaborate more on some of the challenges of creating/distributing collections? It would be really useful for this discussion.

    I'd be keen on trying to create a significantly larger web document test collection.
  • I actually view $1300 as a very low cost to get access to a useful testbed. I've been building some small test collections at mSpoke to evaluate components of our system, and our on-paper costs of collecting several thousand judgments are roughly 5 times the cost of accessing the .GOV2 collection. When you factor in the time I've spent building the right framework for assessors, as well as refining the assessment task to promote assessor consistency, the cost is a lot larger.

    With regard to participation at TREC, it is not uncommon for companies to participate at a minimal level of effort. You're not obligated to attend TREC or even write a paper.

    I don't want to dismiss or discredit the idea of a more open collection of relevance judgments for queries, but as Iadh points out, creating large, useful test collections is not an easy task.
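
For anyone who wants to try the redundancy idea suggested earlier in the thread (collect the same pairwise judgment from several users and retain only the stable ones), here is a rough sketch in Python. It is not taken from any of the papers or tools mentioned above. The function name `stable_preferences`, the tuple layout of a judgment, and the `min_votes` and `min_agreement` thresholds are all assumptions made up for illustration, and reasonable threshold values would depend on the task.

```python
from collections import Counter, defaultdict


def stable_preferences(judgments, min_votes=3, min_agreement=0.8):
    """Keep only pairwise preferences that enough assessors agree on.

    judgments: iterable of (assessor_id, query, doc_a, doc_b, preferred),
    where preferred is expected to be either doc_a or doc_b.
    The thresholds are illustrative assumptions, not recommended values.
    """
    votes = defaultdict(Counter)
    seen = set()
    for assessor, query, doc_a, doc_b, preferred in judgments:
        # Treat (A, B) and (B, A) as the same pair.
        pair = tuple(sorted((doc_a, doc_b)))
        if preferred not in pair:
            continue  # skip malformed answers
        if (assessor, query, pair) in seen:
            continue  # count each assessor once per query/pair
        seen.add((assessor, query, pair))
        votes[(query, pair)][preferred] += 1

    stable = {}
    for key, counter in votes.items():
        total = sum(counter.values())
        winner, count = counter.most_common(1)[0]
        # Retain the pair only if it has enough votes and a clear majority.
        if total >= min_votes and count / total >= min_agreement:
            stable[key] = winner
    return stable


if __name__ == "__main__":
    example = [
        ("w1", "q1", "A", "B", "A"),
        ("w2", "q1", "B", "A", "A"),
        ("w3", "q1", "A", "B", "A"),
        ("w1", "q2", "C", "D", "C"),
        ("w2", "q2", "C", "D", "D"),
        ("w3", "q2", "C", "D", "C"),
    ]
    print(stable_preferences(example))
```

With these made-up thresholds, the unanimous q1 pair is retained and the 2-to-1 split on q2 is dropped. A filter like this only addresses inconsistent or careless answers; it does nothing about assessors who agree with each other for the wrong reasons.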
