The different experiments show how starkly the scores contrast depending on how one defines "relevant." Experiment one called a link good if it at least technically satisfied the search expression. Here, the overall median was a healthy 0.81, with the best service scoring 0.94. If one's definition of relevant is stricter, dealing with only potentially useful pages, the overall median drops to 0.40, with the top scorer making an estimated median of only 0.51. If one's criterion is a page that is very likely to be useful, the median falls to 0.06, with the top scorer rising only to 0.10. (Indeed, if one computes the mean and standard deviation of experiment three's data, the standard deviation is higher than the mean.)
Neutralizing duplicates raised the overall estimated medians only slightly, and that rise was almost entirely accounted for by Hotbot's improved score. See Table 2 for a detailed breakdown of each experiment: the Friedman estimated median and sum of ranks for each service, the overall median, and the least significant difference in sum of ranks.
Table 2: The Estimated Medians and Sums of Ranks for the Experiments

Service           Med. 1    Sum 1    Med. 2    Sum 2    Med. 3     Sum 3
-------------------------------------------------------------------------
Alta Vista        0.9032    59.0     0.4523    53.0     0.06022    46.0
Excite            0.9362    66.0     0.4717    55.0     0.07168    50.0
Hotbot            0.7197    30.0     0.2925    33.5     0.03871    35.5
Infoseek          0.8659    46.5     0.5054    57.0     0.09892    59.5
Lycos             0.6072    23.5     0.2746    26.5     0.03154    34.0
-------------------------------------------------------------------------
GrandMedian       0.8065    -        0.3993    -        0.06022    -
Sum of Ranks LSD  -         10.91    -         17.98    -          9.28
Fried. p-value              0.000              0.000               0.006

Service           Med. 4    Sum 4    Med. 5    Sum 5
----------------------------------------------------
Alta Vista        0.9032    56.5     0.4741    52.5
Excite            0.9321    63.5     0.4789    55.0
Hotbot            0.8246    40.5     0.3514    35.0
Infoseek          0.8600    43.0     0.5347    56.0
Lycos             0.6009    21.5     0.3039    26.5
----------------------------------------------------
GrandMedian       0.8242    -        0.4286    -
Sum of Ranks LSD  -         12.24    -         14.68
Fried. p-value              0.000              0.001
Table 2: The estimated population median and the Friedman sum of ranks are given for each service. Below the individual medians are the overall medians for each experiment. Below the individual sums of ranks are the least significant differences between sums of ranks. If the difference between the sums of ranks for two services is greater than this number, the difference is significant. Finally, the p-value (adjusted for ties) is reported for each Friedman test.
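To make the "sum of ranks" columns concrete, the following is a minimal Python sketch of how such sums can be formed: within each query the services are ranked from lowest score (rank 1) to highest, ties share the average of the ranks they span, and the ranks are then totalled per service. The function names and the score matrix are hypothetical and serve only as illustration; they are not the study's data or software.

```python
def average_ranks(scores):
    """Rank one query's scores across services (1 = lowest), averaging ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score ties with the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of the 1-based positions in the run
        for i in order[pos:end + 1]:
            ranks[i] = shared
        pos = end + 1
    return ranks


def rank_sums(score_rows):
    """Sum the within-query ranks for each service (one column per service)."""
    sums = [0.0] * len(score_rows[0])
    for row in score_rows:
        for j, rank in enumerate(average_ranks(row)):
            sums[j] += rank
    return sums


# Hypothetical precision scores: rows are queries, columns are services.
hypothetical_scores = [
    [0.90, 0.95, 0.70, 0.85, 0.60],
    [0.80, 0.80, 0.75, 0.90, 0.50],
    [1.00, 0.95, 0.65, 0.70, 0.65],
]
print(rank_sums(hypothetical_scores))
```

With five services, each query contributes 1 + 2 + 3 + 4 + 5 = 15 rank points, so the sums in any one experiment's column of Table 2 add up to 15 times the number of queries.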
Table 3: The rankings and significant differences among the services

                      Rankings
Experiment   Lowest              Highest
----------------------------------------
1            Lyc   Hot   Inf   Alt   Exc
             ---------   ---   ---------
2            Lyc   Hot   Alt   Exc   Inf
             ---------   ---------------
3            Lyc   Hot   Alt   Exc   Inf
             ---------   ---------   ---
4            Lyc   Hot   Inf   Alt   Exc
             ---   ---------   ---------
5            Lyc   Hot   Alt   Exc   Inf
             ---------   ---------------

Table 3: Each service is ranked by experiment. Underlining indicates no significant difference. Alt = Alta Vista, Exc = Excite, Hot = Hotbot, Inf = Infoseek, Lyc = Lycos.
Within these ranges a definite pattern emerges. Alta Vista, Excite and Infoseek are always the services with the three highest estimated median scores. Their scores are not always significantly higher than Hotbot, but they are always significantly higher than Lycos. Only in experiment four, a version of experiment one where duplicates were neutralized rather than penalized, does Hotbot have a significantly higher estimated median than Lycos.
Table 3 shows the rankings of the services for each experiment. If an underline connects two services, their estimated medians are not significantly different from one another. For instance, in experiment one, Excite and Alta Vista have significantly higher estimated medians than Infoseek, Hotbot and Lycos. So in terms of delivering non-duplicate, active links that at least technically satisfy the search query, they are better. Infoseek is in the middle, statistically worse than the top two, but better than Hotbot and Lycos.
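The underlining rule can be checked directly against Table 2. The short Python sketch below takes the experiment one sums of ranks and least significant difference from Table 2 and lists the pairs of services whose sums differ by no more than that amount; the function and variable names are our own illustrative choices.

```python
from itertools import combinations

# Sums of ranks and least significant difference for experiment one (Table 2).
RANK_SUMS_EXP1 = {
    "Alta Vista": 59.0,
    "Excite": 66.0,
    "Hotbot": 30.0,
    "Infoseek": 46.5,
    "Lycos": 23.5,
}
LSD_EXP1 = 10.91


def not_significant_pairs(rank_sums, lsd):
    """Return the pairs of services whose rank sums differ by at most lsd."""
    return [
        (a, b)
        for a, b in combinations(rank_sums, 2)
        if abs(rank_sums[a] - rank_sums[b]) <= lsd
    ]


print(not_significant_pairs(RANK_SUMS_EXP1, LSD_EXP1))
```

For experiment one this leaves only Alta Vista with Excite and Hotbot with Lycos, which are the two underlined groups in the first row of Table 3; Infoseek differs significantly from every other service.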
Figure 1: The Sum of Ranks and the Least Significant Difference Interval for Services in Experiment Two
Experiment two is closest in its criteria to what we ourselves would define relevance to be. In Table 3, one can see that for experiment two, Infoseek, Excite and Alta Vista are not significantly different from one another, but are all significantly higher than Hotbot and Lycos. Figure 1 shows how the Friedman sum of ranks for each service in experiment two compares with the least significant difference in sum of ranks.
We have also conducted a correspondence analysis of the queries by the services, using the scores from experiment two as weights. Lycos's performance corresponds best for query 8, followed by queries 9 and 5, all of which are shorter and unstructured. Hotbot's performance matches queries 12, 6 and 13 best. Queries 12 and 13 are structured, and 6 is a proper name. While Lycos and Hotbot have comparable scores, Lycos does best with shorter, unstructured queries, while Hotbot does better with structured queries. This result is not surprising, since Lycos lacks many operators, and Hotbot has many of them. See Figure 2 for a graph of the correspondence analysis that contains in two dimensions 78.7% of the information from the analysis.
Figure 2: A Correspondence Analysis of Service by Query using the Scores in Experiment Two as Weights
When one examines quantile box charts of the scores, one can see that for experiment two, while the medians for Alta Vista and Excite are similar, the variability of scores is much wider for Alta Vista and much narrower for Excite. Hotbot is more consistent, but scores consistently lower. See Figure 3 for a box chart for experiment two.
Figure 3: Quantile Box Plot of Median and Range for each Service in Experiment Two
One can compare this to the quantile box chart from experiment one in Figure 4. Here, Alta Vista is not variable at all, indicating that it is very good at delivering pages that at least technically satisfy the expression. Only Lycos has serious variability. This finding matches well with Venditto's observation that it has very good recall (Venditto, 1996).
Figure 4: Quantile Box Plot of Median and Range for each Service in Experiment One
In experiment three, the score for the best pages, Infoseek is the clear winner, being significantly higher than all other services. Alta Vista and Excite land in the middle, being significantly better than Hotbot and Lycos, which share the bottom. See Figure 5 for a display of the least significant difference in the Friedman sum of ranks for experiment three. Any excitement on the part of Infoseek's developers that this victory might generate should be tempered by the realization that on the Web, even the best search services deliver only an estimated median of 10% first twenty precision for very useful pages.
Figure 5: The Sum of Ranks and the Least Significant Difference Interval for Services in Experiment Three
Experiment four altered experiment one by having duplicates neutralized. Here, Hotbot succeeded in being significantly higher in estimated median than Lycos and not significantly lower than Infoseek. But in experiment five, neutralized duplicates helped Hotbot little. Alta Vista, Excite and Infoseek shared the upper cluster, while Hotbot was not significantly higher than Lycos.
VII. Summary and Future Work
It is clear looking at the data that, in general, Alta Vista, Excite, and Infoseek did a superior job delivering quality relevance. Alta Vista did its best in experiment one, then slipped relative to Excite and Infoseek in the higher categories. In retrospect, for experiments two and three, it is entirely possible that Alta Vista might have done better in the higher categories if we had employed its proximity operators to explicitly order a closer relation. This point is well worth considering when searching Alta Vista. Further research needs to be done in that area. Infoseek suffered from a low but naggingly persistent level of inactive links, which may be why it did significantly worse in experiment one, but it managed to score well anyway when pages that were judged relevant but not useful were penalized. Indeed, it ran away with the clear first place in the arid environment of experiment three. See Table B6b in Appendix B for the medians, means and modes of the inactive links, the duplicate links and the category zero links for each service.
Hotbot suffered from duplicates. Yet, even when they were neutralized, it was still significantly lower than the top services. Lycos had the largest number of zeros, and it was evident to the evaluator that this can be attributed in part to Lycos's inability to require that a specific word be present in the results (the plus sign operator), which all of the other services had available in some form. Lycos also had the largest number of inactive links, but it did not have duplicates. After having evaluated the pages for query 14, prozac, we went back and entered first the capitalized version and then the lower case version of the word in each service to find out whether the results were different. We discovered that Hotbot (unstructured) and Lycos did not return different results, indicating that they do not implement the fuzzy match operation on unstructured expressions. (The fuzzy match will match the lower case expression to both lower and upper case words, but will match the upper case expression only to the upper case.) The fact that the two lower scoring services do not implement this feature may have been a factor in their performance as well.
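The fuzzy match rule described in the parenthesis above can be written as a small predicate. This is only a sketch of the rule as stated here, in Python, with a hypothetical function name; it makes no claim about how any of the services actually implements case handling.

```python
def fuzzy_case_match(query_term, page_word):
    """Apply the case rule described above to one query term and one page word.

    A query term typed entirely in lower case matches the word in any
    capitalization; a term containing upper case letters matches only the
    identically capitalized word.
    """
    if query_term == query_term.lower():
        return query_term == page_word.lower()
    return query_term == page_word


# The check performed for query 14: a service that implements the rule
# should treat "prozac" and "Prozac" differently.
for term in ("prozac", "Prozac"):
    matches = [word for word in ("prozac", "Prozac") if fuzzy_case_match(term, word)]
    print(term, "matches", matches)
```

Under this rule the lower case query can retrieve more pages than the capitalized one, so a service that returns identical results for both forms is presumably not applying it.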
Because Lycos had scored high in Leighton's 1995 study (Leighton, 1995), it is good to ask why it has compared so poorly with the other services in this study. The first factor is that the competition has gotten better. Only Infoseek was also in the previous study. Also, Lycos's score in experiment two, the experiment most comparable to the first ten precision experiment in the 1995 study, is not as bad as those of the low scorers in 1995. Finally, with the many unstructured text queries in this study, we believe that Lycos did not add enough of its own Boolean and proximity ranking to the results that it returned. All of the queries in the 1995 study were structured for Lycos. In the previous study, we had not discovered and did not use the Infoseek plus operator, which now every service other than Lycos uses to good advantage. In preliminary queries preparing for this study, we used structured queries in Lycos which had tighter constraints (requiring all the words or three of the words to be present). When we required all words, we often got no results. When we required three words, Lycos often picked the wrong three and still got poor precision.
To analyze these services more thoroughly, future research should aim to conduct a study where the test suite is large enough to compare structured search expressions with unstructured ones. Ideally, one would also compare all major services, including Open Text and WebCrawler. One might also investigate under what conditions it would be "fair" to compare search services with review services such as Magellan and Excite Reviews, or Yahoo. Naturally, over time, the results of any precision study will become dated, and a study like this will need to be repeated anyway.
18. Indeed, the rankings for these five will no doubt change soon, as their operators continue to improve the software.
Chu, Heting and Marilyn Rosenthal. (1996). "Search engines for the World Wide Web: A comparative study and evaluation methodology," ASIS 1996 Annual Conference Proceedings, Baltimore, MD, October 19-24, 1996, 127-135. Also available: http://www.asis.org/annual-96/ElectronicProceedings/chu.html [28 January 1997].
Conover, W. J. (1980). Practical Nonparametric Statistics. 2nd Ed. New York: John Wiley and Sons.
Ding, Wei and Gary Marchionini. (1996). "A comparative study of web search service performance," ASIS 1996 Annual Conference Proceedings, Baltimore, MD, October 19-24, 1996, 136-142.
Gauch, Susan and Guijun Wang. (1996). "Information Fusion with ProFusion," Webnet 96 Conference, San Francisco, CA, October 15-19, 1996. [online]. Available: http://www.csbs.utsa.edu:80/info/webnet96/html/155.htm [22 February 1997].
Harman, Donna. (1995). "Overview of the Second Text Retrieval Conference (TREC-2)," Information Processing and Management, v31 n3, 271-289.
Infoseek. (1997). Infoseek: Precision vs. Recall. [online]. Available: http://www.infoseek.com/doc?pg=prec_rec.html [7 February 1997].
Leighton, H. Vernon. (1995). Performance of four World Wide Web (WWW) Index Services: Infoseek, Lycos, WebCrawler, and WWWWorm. [online]. Available: http://www.winona.edu/is-f/library-f/webind.htm [1 July 1996].
Magellan Internet Guide. (1997). Real-time Magellan Searches. [online]. Available: http://voyeur.mckinley.com/voyeur.cgi [24 January 1997].
Munro, Jay and David Lidsky. (1996). "Web search sites," PC Magazine, v15 n21 (December 3, 1996), 232.
Singh, Amarendra and David Lidsky. (1996). "All-Out Search." PC Magazine, v15 n21 (December 3, 1996), p. 213 (17).
Scoville, Richard. (1996). "Special Report: Find it on the Net!" PC World, v14 n1 (January 1996), p. 125 (6). Also available: http://www.pcworld.com/reprints/lycos.htm [1 February 1997].
Tomaiuolo, Nicholas G. and Joan G. Packer. (1996). "An analysis of Internet search engines: assessment of over 200 search queries." Computers in Libraries. v16 n6 (June 1996), p58 (5). The list of queries used is in: Quantitative Analysis of Five WWW "Search Engines". [online]. Available: http://neal.ctstateu.edu:2001/htdocs/websearch.html [7 February 1997].
Venditto, Gus. (1996). "Search Engine Showdown." Internet World, v7 n5 (May 1, 1996), 78-86.
Venditto, Gus. (1997). "Critic's Choice." Internet World, v8 n1 (January 1, 1997), 83-96.
Westera, Gillian. (25 November 1996). Search Engine Comparison: Testing Retrieval and Accuracy. [online]. Available: http://www.curtin.edu.au/curtin/library/staffpages/gwpersonal/senginestudy/results.htm [7 February 1997].