- all uploads finally went through OK. Not all of the authors got confirmation emails.
- Ours uploaded ok, but we never received a confirmation email. But I suppose that could be an error at our end. Let me know if you do get the confirm! (If your upload ever finishes...)
- 340K. If that's too fat, then we've got real problems.
- how fat is your paper?
- Damn, my university's web site is only worth $63,000...
1 year ago
1 year ago
As for exactly what to collect, I would be inclined to ask for pairwise preference information -- given a query, is document A or document B better? This is generally easier for assessors, and leads to more agreement than absolute relevance judgments (relevant vs. non-relevant); see Carterette et al. from ECIR 2008.
As for fraud & bogus answers, the only reasonable approach I can see is collecting the same judgments from a variety of users and retaining only the stable ones, i.e. those on which the users largely agree. I wouldn't think there would be a strong incentive for users to provide intentionally misleading answers, but there is certainly a risk of getting lazy users who just want to complete the task quickly.
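To make the "retain only the stable judgments" idea concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the function name `stable_preferences`, the tuple layout of a judgment, and the `min_votes` / `min_agreement` thresholds are hypothetical, not part of any existing tool or of the approach in the cited paper.

```python
from collections import Counter, defaultdict


def stable_preferences(judgments, min_votes=3, min_agreement=0.8):
    """Keep only pairwise preferences that assessors largely agree on.

    judgments: iterable of (query, doc_a, doc_b, preferred) tuples, where
    `preferred` is whichever of doc_a / doc_b the assessor chose.
    The thresholds are illustrative defaults, not recommended values.
    """
    votes = defaultdict(Counter)
    for query, doc_a, doc_b, preferred in judgments:
        if preferred not in (doc_a, doc_b):
            continue  # ignore malformed answers
        # Sort the pair so (A, B) and (B, A) count as the same comparison.
        key = (query,) + tuple(sorted((doc_a, doc_b)))
        votes[key][preferred] += 1

    stable = {}
    for key, counts in votes.items():
        total = sum(counts.values())
        winner, top = counts.most_common(1)[0]
        # Require enough assessors and a large enough majority.
        if total >= min_votes and top / total >= min_agreement:
            stable[key] = winner
    return stable
```

With these hypothetical defaults, a pair judged by three users who all pick the same document is kept, while a pair with a 2-to-1 split or with fewer than three votes is dropped. Lazy or random answers tend to lower agreement, so they mostly cause a pair to be discarded rather than to pollute the collection.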
1 year ago
The biggest barrier is the need to pay for the test collection. For example, the fee for GOV2 is £600 (~$1300!!). The license means you can't share it with others, of course.
The other reason is that many of the developers just aren't interested in TREC. TREC is great if you are a researcher, but the average practitioner tweaking search algorithms isn't going to participate. They are (rightly) too busy getting their search engine up and running to worry about conferences and such. They just want a way to objectively compare their tweaks to what other people are doing.
If I understand the problem, I think one barrier is the redistribution of copyrighted content. Getting a publicly available, large-scale collection of documents is difficult. Sources for this include blogs, web documents, etc., but these can't be redistributed because they are copyrighted.
1 year ago
I understand the Reuters & other newswire collections are under copyright, but I'm not so sure about the current web collections.
1 year ago
They didn't get permission. I believe they are just risking it. They could be open to IP problems if a blogger decides he doesn't like his content being redistributed. A larger dataset with more copyright owners brings greater risk.
1 year ago
Creating and distributing test collections is not as straightforward a task as some posts seem to suggest; that's why there are administrative fees and agreements in place. The fees are intended to contribute towards the costs of preparing and distributing the data.
1 year ago
1 year ago
Thanks for enlightening me about the way it works. Could you elaborate on some of the challenges of creating and distributing collections? It would be really useful for this discussion.
I'd be keen on trying to create a significantly larger web document test collection.
1 year ago
With regard to participation at TREC, it is not uncommon for companies to participate at a minimal level of effort. You're not obligated to attend TREC or even write a paper.
I don't want to dismiss or discredit the idea of a more open collection of relevance judgments for queries, but as Iadh points out, creating large, useful test collections is not an easy task.