Common evaluation suite for search tools, etc.
Evaluation, especially at the higher levels, is closely related to use scenarios. For each significant change to a search algorithm or method, we will need a way to know how well it does, both compared with other algorithms or methods and on the different kinds of task we want to address.
We should be able to set up a standard test dataset, and a suite of query sets and relevance-judgement sets, fairly easily within O2. We'll need as many query sets and relevance-judgement sets as there are principal use scenarios we envisage. This means (I believe) that to some extent we should be able to 'automate' production of precision/recall curves, for example, so they can be displayed to testers as soon as a search experiment is done.
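As a rough illustration of what 'automating' the curves could involve, here is a minimal sketch. It assumes a search experiment returns a ranked list of item ids for each query, and that a relevance-judgement set is simply the set of ids judged relevant for that query. The function name and data layout are hypothetical and not part of O2.

```python
def precision_recall_points(ranked_ids, relevant_ids):
    """Return (recall, precision) pairs for one query.

    One pair is produced at each rank position where a relevant item
    appears. The data layout is an assumption made for illustration:
    ranked_ids   -- search results, best match first
    relevant_ids -- ids judged relevant for this query
    """
    relevant = set(relevant_ids)
    if not relevant:
        return []  # no judgements, so no curve can be drawn

    points = []
    hits = 0
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            hits += 1
            recall = hits / len(relevant)
            precision = hits / rank
            points.append((recall, precision))
    return points


# Example: four results returned, three items judged relevant,
# of which two were found (at ranks 1 and 3).
points = precision_recall_points(["a", "b", "c", "d"], {"a", "c", "x"})
```

The resulting pairs could be handed directly to whatever plotting tool the testers use, so that a curve appears as soon as the experiment finishes.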
That way one would be able to see pretty soon how one's new algorithm works against classical music, rock, hip-hop, dance, jazz, etc., etc. I don't think I've ever seen that kind of comparison at ISMIR, for example.
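A per-genre comparison of this kind might be sketched as below. The sketch assumes each query in the test suite is labelled with a genre, and it uses mean average precision as the summary figure; both the labelling scheme and the choice of measure are assumptions made for illustration, and all names are hypothetical.

```python
from collections import defaultdict


def average_precision(ranked_ids, relevant_ids):
    """Average precision for one query (0.0 if nothing is judged relevant)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant item
    return total / len(relevant)


def mean_ap_by_genre(results):
    """Group queries by genre and average their scores.

    results -- iterable of (genre, ranked_ids, relevant_ids) triples;
               this layout is an assumption, not an O2 format.
    """
    scores = defaultdict(list)
    for genre, ranked_ids, relevant_ids in results:
        scores[genre].append(average_precision(ranked_ids, relevant_ids))
    return {genre: sum(vals) / len(vals) for genre, vals in scores.items()}


# Example with made-up ids: two jazz queries and one rock query.
summary = mean_ap_by_genre([
    ("jazz", ["a", "b"], {"a"}),
    ("jazz", ["a", "b"], {"b"}),
    ("rock", ["x", "y"], {"x", "y"}),
])
```

Running the same grouping for two algorithms over the same query sets would give a side-by-side table, one row per genre.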
Obviously, certain low-level tools will need their own kind of testing at an earlier stage as well, but having such an "evaluation suite" available early on will help keep the focus on real-world (?) music retrieval. The proposed O2 architecture should make this possible.
--
TimCrawford - 18 Jan 2007