The Future of Web Search

Chris Sherman. Online. Volume 23, Issue 3. May/Jun 1999.

“The trouble with our times is that the future is not what it used to be.” ~ Paul Valery

Only a divinely touched visionary, or a lucky fool, could have predicted in 1989 that the Web as we know it today would emerge from the Mesh proposal, originally a funding pitch to CERN management for a project that would use hypertext to manage general information about particle accelerators and physics experiments at the lab (proposed and authored by Tim Berners-Lee, creator of the Web). Even less predictable was that a humble link directory created by a couple of graduate students to help people find their way around the nascent Web, or a few experimental indexing services quaintly dubbed “search engines,” would ultimately become media superstars, or that they would attract billions of dollars from investors as they transformed their creators from software jockeys into internationally regarded capitalist icons.

The Web may very well be the most elegant real-world manifestation of the central metaphor of chaos theory. You’ve probably heard the idea: the ephemeral flapping of a butterfly’s wing in southern Africa triggers a chain reaction of events leading to a hurricane of vast elemental strength, unleashing torrential forces on the far side of the world. From its unassuming beginnings, the Web has picked up so much momentum that it is now “cracking the barrier of the largest information collection ever assembled by humans,” according to Brewster Kahle, president of Alexa Internet. It follows that the Web has also become the largest, most complex search space ever.

As the Web continues its relentless growth, search tools must evolve and adapt to remain useful. Trying to make exact predictions about the state of Web search in, say, five years or so is a fool’s game. But there are some clear trends and emerging standards that will play a definite role in shaping the future of Web search. In this article, we’ll provide a “big picture” overview of the future of Web search, and share some of the thinking of a number of key players who will play a major role in the evolution of Web search tools.

A Fundamental Megatrend: Convergence

“Currently, search is simply bad.” This isn’t a flame from a disgruntled Web searcher, but rather a mea culpa from Joel Truher, vice president of technology for HotBot, an architect of one of the most highly respected Web search engines. “It’s like interacting with a snotty French waiter. The service is bad, you get served things you didn’t ask for, you often have to order again and again, and you can’t get things that are listed on the menu. People have learned to cope with it; they’ve internalized their frustration.”

In the early months of 1999, the phrase “Web search” generally means a keyword search on some kind of index or directory of textual documents. When we search, we still most commonly think of finding information on Web “pages,” each having discrete URLs (Uniform Resource Locators, or “addresses”). Our primary metaphors to represent the Web are books, magazines, or even filing cabinets stuffed with documents.

As we move into the next century, improvements in computer processing and storage capabilities, together with the stunning increase in available bandwidth for data communications, mean we’ll be seeing far more types of media on the Web. And increasingly, information will be kept in databases that serve content dynamically, rather than stored as static Web pages. This trend will be accelerated by the “convergence” we’re already starting to see in a variety of areas.

Convergence is occurring through the:

Mergers between telephone and cable television companies. From a network standpoint, the distinctions between data, voice communications, video, and audio are being obliterated. Bottlenecks and technological impediments will no longer restrict the types of information that can be put on the Web.

Hybridization of network-aware devices. Computers are gaining television and telephony features. Mobile telephones will soon be able to receive email or even browse the Web. All manner of formerly unconnected devices, from cars to spy satellites to microwave ovens, will be connected to the Web, and will have a certain degree of search capability, as well as being “searchable” themselves in the context of their specific function.

Integration of Web services into virtually every kind of software application and, perhaps controversially, even into the core of operating systems.

Though we’ve already seen explosive growth on the Web, it’s just beginning. Web publishing continues to get easier, and the continued influx of new users who all want to “homestead” their own piece of the Web will likely push the number of documents into the billions during the next decade. And with so many new types of information that can be put online, available through so many different devices, the whole concept of “Web search” will necessarily change dramatically. “The world of classic search is in a serious problem state,” according to Alexa’s Kahle. “Clutter on the Web is going to put the final nail in the coffin for traditional search.”

Search Engine… Portal… Destination

This doesn’t mean the general-purpose search services will disappear; far from it. However, the major search engines, cum portals, are dealing with a seemingly curious paradox. Though search remains an important part of the service they provide to users, the competitive realities of the Web reward sites that succeed in persuading users to linger, rather than clicking off to another site. Increasingly, rather than serving as a portal to other sites on the Web, the major search sites are striving to create “stickiness,” to become “destinations” themselves.

They’re doing this in part by trying to better anticipate the needs of their users. “For many people, search isn’t about finding things; it’s about getting things done,” said HotBot’s Truher. Rather than providing comprehensive search results for every query, the search service will attempt to understand user intent, and classify and arrange search results accordingly.

We’re already seeing this trend emerge. Most of the major search services now segment results lists into “recommended” links of some sort, followed by traditional results generated by keyword matching. “I suspect we’ll continue to have full-text search engines, but we can expect that results that are returned for some popular topics may have nothing to do with words on a page,” said Danny Sullivan, editor of Search Engine Watch. “Instead, I think the full-text retrieval option will remain a backup to either preprogrammed results, or when other retrieval options, such as link popularity, fail to provide good matches.”

The search services are not relying solely on their own indexes or directories to provide optimal results, either. Some, most notably Lycos, are purchasing other search services and incorporating them into “networks,” in an attempt to become one-stop information resources. Others are creating partnerships to provide specialized search results: AltaVista’s collection of search “gadgets,” for example, or Infoseek’s new incarnation as the search service for Go.com. “Go is almost like a cable network, a best of breed for Web sites,” said Jennifer Mullin, director of search and directory for Infoseek.

Other emerging trends include:

Smarter spiders, the software that retrieves Web pages for indexing. “We’re working on significant improvements in intelligent spidering,” said Infoseek’s Mullin. Rather than blindly retrieving documents, smart spiders will be able to make some judgements about the relevance of a document and the quality of links that it contains. Poor quality or inappropriate documents will not be spidered, and therefore will not clutter up the index, ultimately meaning that search results will be “purer.”

The emergence of the “invisible Web.” There will be an explosion of “vertical” search sites, providing access to deep, tightly focused databases. “We’re going to have more specialty services for particular topics, either produced by newcomers, or by the major search services themselves,” said Sullivan of Search Engine Watch. “It’s a natural way to manage the rapidly growing Web.”

In many respects, these specialized search sites will resemble the offerings of the traditional database vendors such as Dialog, LEXIS-NEXIS, and others. However, given the economics of the Web, it’s unlikely they’ll be able to charge significant user fees to support themselves. One needs only look at the wealth of free information about equities and investing to see how this trend will play out. Only a few years ago, this information was scarce and very expensive. Now even general-purpose search sites offer high-quality investment research and free quotes through partnerships with content providers.

Improved query processing. “Excite is doing focused research on how to better interpret queries,” said Kris Carpenter, director, search services, Excite, Inc. At the basic level, this means determining if a user is looking for a quick answer or robust results, and analyzing the “spectrum of meaning” implied in a query. “The classic example is when a user searches for ‘bond,’” said Carpenter. “We have to ask ourselves, are they looking for information about financial bonds, chemical bonds, or even James Bond?” Understanding the complete spectrum of meaning implied in a query allows the search engine to offer refinement capabilities that will narrow the range of results to a manageable number that will all be relevant and, in theory, appropriate.

“We’re trying to interpret your query and guess what you’re trying to do versus what you’re trying to find,” said Robert Frasca, general manager of Lycos.com. “We’re working to establish a dialog to lead people down the road to an answer, almost like helping them navigate the branches of a huge tree of information.”

More advanced “personalization” capabilities. You can already change the layout and, to a certain degree, the contents of the interface you use for most major search services.

Personalization options will be extended into search preferences as well. For example, you will be able to select “preferred” sources of information, request specific types of content aggregation from a variety of favorite sources, and even set your own Spam tolerance. “Why not allow the user to decide what level of Spam is appropriate?” asked HotBot’s Truher. “You can establish a range from relatively unfiltered results, optimizing recall at the expense of precision, or go for the terse option, returning only authoritative results.”
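Truher's adjustable Spam tolerance can be pictured as a single user-controlled threshold. The sketch below is only an illustration: the Result fields, the 0-to-1 scores, and the filter_results function are hypothetical names invented here, not anything HotBot has described.

```python
from dataclasses import dataclass


@dataclass
class Result:
    url: str
    relevance: float  # keyword-match score, 0.0 to 1.0 (hypothetical)
    authority: float  # trust score, 0.0 to 1.0 (hypothetical)


def filter_results(results, spam_tolerance):
    """Keep results whose authority clears a bar set by the user.

    spam_tolerance = 1.0 keeps everything (favoring recall);
    lower values keep only more authoritative results (the terse option).
    """
    if not 0.0 <= spam_tolerance <= 1.0:
        raise ValueError("spam_tolerance must be between 0.0 and 1.0")
    min_authority = 1.0 - spam_tolerance
    kept = [r for r in results if r.authority >= min_authority]
    # Within the surviving set, order by keyword relevance, best first.
    return sorted(kept, key=lambda r: r.relevance, reverse=True)


results = [
    Result("http://example.com/official", relevance=0.70, authority=0.95),
    Result("http://example.com/fan-page", relevance=0.80, authority=0.50),
    Result("http://example.com/keyword-stuffed", relevance=0.90, authority=0.10),
]

loose = filter_results(results, spam_tolerance=1.0)  # all three results
terse = filter_results(results, spam_tolerance=0.1)  # only the official page
```

Raising the tolerance trades precision for recall: the keyword-stuffed page returns to the list, and because it has the highest raw relevance score it lands at the top.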

Personal Web Assistants

Setting basic preferences is a good first step in tailoring searches to the needs of individuals. Another emerging trend is the ability of software to observe the behavior of a user and begin to adapt itself to demonstrated preferences and needs. “Searchers are individuals and differ greatly in a variety of ways,” said Gary Culliss, chairman and co-founder of Direct Hit. “A woman searcher in Boston is different than a teenage boy in Miami. When a Boston searcher enters the request ‘Italian Restaurants,’ the results should be Boston listings. Similarly, if a teenage searcher enters ‘shoes,’ the search engine should present shoe sites chosen by other teenagers.”

“Software programs that understand a person’s needs and do the work while they sleep will become very popular,” according to Mahendra Vora, president and CEO of IntelliSeek. “Just as we get better at searching the more we do it, our search tools should get smarter, too. Intelligent agents will learn to understand your style of work, and let us teach them to be better searchers on our behalf.”

Indeed, some of the promises for machine intelligence considered unachievable only a few years ago are now within reach, thanks to Moore’s Law. With computer processing power doubling roughly every eighteen months, software employing computing-intensive Artificial Intelligence algorithms can now be deployed on desktop computers. Natural language interfaces will become more prevalent. Ask Jeeves has one of the best natural language parsers tied to a back-end database of “answers” operating today. It has also licensed this technology to other vendors. Just look at “Ask AltaVista a Question,” or make a query to Dell’s “Ask Dudley” technical support agent to see this in action.

“We can see an interface evolving into something more like an assistant,” said HotBot’s Truher. “An assistant will be smarter than conventional search engines, and provide more valuable services to the user.”

Harnessing the Collective Brain

Software that observes user activity will also capture what Truher describes as “collective intent,” measuring how groups of people actually use the Web. Beyond a personalized interface like “My HotBot,” Truher envisions something that would tap into the expertise and experience of groups of people, creating “Our HotBot,” for example.

To an extent, this is already being done. Lycos was a pioneer in the arena of “collaborative filtering” with its WiseWire technology, now used to create Lycos Guides. Guides are essentially topic-specific directories, with sites ranked based on user evaluation. Sites that get many votes move to the top of the directory, while less popular sites move lower in the listing. Alexa incorporates a similar voting mechanism in its interface.

Another approach is taken by Direct Hit, which measures which sites are most frequently selected from search results, making it a sort of “popularity engine.” According to Direct Hit’s Culliss, “Search engines will have to ‘learn’ what searchers want to see. The word matching of conventional search engines can’t distinguish between similar sites and doesn’t consider non-textual elements like branding, reputation, usability, etc. Editorially created indexes are better, but there are simply too many sites competing for top listings for an editorial staff to keep up. The best editorial staff is actually comprised of the searchers themselves. A search engine which is automatically organized by its searchers is the most scalable model for keeping up with the growth of the Web.”
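A “popularity engine” of this kind can be sketched in a few lines. This is a guess at the general idea only, not Direct Hit’s actual method; the ClickPopularityRanker class and its method names are hypothetical.

```python
from collections import Counter


class ClickPopularityRanker:
    """Re-rank keyword matches by how often searchers picked each one."""

    def __init__(self):
        # (query, url) -> number of times the url was chosen for that query
        self.clicks = Counter()

    def record_click(self, query, url):
        self.clicks[(query.lower(), url)] += 1

    def rank(self, query, candidate_urls):
        q = query.lower()
        # sorted() is stable, so urls with equal click counts keep the
        # order the underlying keyword engine gave them.
        return sorted(candidate_urls, key=lambda u: -self.clicks[(q, u)])


ranker = ClickPopularityRanker()
for chosen in ["http://example.com/b", "http://example.com/b", "http://example.com/c"]:
    ranker.record_click("Italian Restaurants", chosen)

keyword_order = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
ranked = ranker.rank("italian restaurants", keyword_order)
# ranked is now b, c, a: the sites searchers actually chose rise to the top.
```

Keying the counts by query as well as by URL matters: a site that is popular for one search should not automatically outrank better matches for an unrelated one.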

Yet another approach strives to locate authoritative sources (“authorities”) on the Web, and use the information to compile relevance rankings. Google, a new search engine developed at Stanford, is on the forefront of this work. “Google measures link importance,” said Larry Page, co-founder of Google. A link to another Web page is almost like a citation in a book. Web page authors generally only create links to other pages they think are important. Web authors have created thousands of links to Yahoo!, for example, so Yahoo! would be considered an important site. Yahoo! becomes even more of an important site if lots of other pages with high importance point to it. “It’s almost like having a peer review process for Web pages,” according to Page.

The beauty of this method is that it’s virtually impossible for a Web page author to mislead the ranking system. It’s also highly scalable: as the Web grows, more links to important pages are created, providing a greater degree of certainty that an important page is in fact an authoritative source.
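The circular definition Page describes, in which a page is important if important pages link to it, can be resolved by repeated approximation. The sketch below follows the general shape of the published PageRank formulation; the damping value of 0.85, the handling of pages with no outgoing links, and the function name link_importance are assumptions made for illustration, not details given by Page in this article.

```python
def link_importance(links, damping=0.85, iterations=50):
    """Score pages by the importance of the pages linking to them.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    score = {p: 1.0 / n for p in pages}  # start with equal importance

    for _ in range(iterations):
        # Every page receives a small baseline share of importance.
        new_score = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page divides its importance among the pages it cites.
                share = damping * score[page] / len(targets)
                for t in targets:
                    new_score[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * score[page] / n
                for p in pages:
                    new_score[p] += share
        score = new_score
    return score


web = {
    "yahoo": ["news"],
    "news": ["yahoo"],
    "blog-a": ["yahoo"],
    "blog-b": ["yahoo", "news"],
    "homepage": ["blog-a"],
}
scores = link_importance(web)
most_important = max(scores, key=scores.get)  # "yahoo" in this toy graph
```

In this toy graph "yahoo" collects the most citations and ends up with the highest score, and "blog-a" outranks "homepage" because it is cited at least once.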

IBM’s “Clever” project, using some of Cornell professor Jon Kleinberg’s HITS (Hypertext-Induced Topic Search) research, is pursuing a similar approach.

All of these approaches will help improve the relevance of search results. Nonetheless, serious researchers will still want even more information about the validity and authority of Web resources. As Tim Berners-Lee puts it, we need an “Oh, Yeah?” button that answers the question, “why should I trust this?”

New Standards Will Help… Maybe

One of the most serious impediments to realizing quality search results is that the Web is largely unstructured and chaotic. The HTML (Hypertext Markup Language) used to author Web documents provides few mechanisms to create “metadata,” or “data about data.” Though HTML metatags can contain descriptive information about Web pages, they can also be used spuriously to “spamdex” the search indexes in an effort to achieve higher relevance in search queries. Additionally, “only 30 to 40 percent of Web publishers take advantage of metatags,” according to Excite’s Carpenter.

Proposals for metadata standards are abundant (see Jessica Milstead and Susan Feldman’s excellent overview in the January 1999 issue of ONLINE). The standard that seems most likely to achieve something close to universal adoption is RDF (Resource Description Framework), being developed by the W3C. Like many of the proposed standards, RDF uses the syntax of XML (Extensible Markup Language).

RDF holds promise in a number of areas. It will be useful for search engines in resource discovery and cataloging. Web authors will be able to use it to describe collections of pages that might make up a single logical document (much like pages make up chapters that comprise a complete book). RDF can also be used to express intellectual property rights and privacy preferences for Web sites.
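To make the idea concrete, here is a small sketch that builds an RDF description of a single page in XML syntax. The choice of Dublin Core property names (title, creator, rights), the example URL, and the describe_resource function are assumptions for illustration; the RDF work itself does not prescribe this particular vocabulary.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core, chosen for illustration

ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def describe_resource(url, title, creator, rights):
    """Build an RDF/XML description of one Web resource."""
    root = ET.Element(f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(
        root, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": url}
    )
    # Each property is a statement about the resource named in rdf:about.
    for name, value in (("title", title), ("creator", creator), ("rights", rights)):
        ET.SubElement(desc, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = describe_resource(
    "http://example.com/report",
    "Annual Report",
    "Example Lab",
    "Copyright 1999 Example Lab",
)
print(xml_text)
```

Because the description is ordinary XML, a search engine can read the title, author, and rights statement without guessing at them from the visible text of the page.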

“XML could turn out to be a Holy Grail, in a lot of ways,” said Lycos’ Frasca. “It will certainly help with searching and classifying Web sites.” The goal of all metadata standards proposals is to go beyond machine-readable data and create machine-understandable data on the Web. This is a major step toward developing what Tim Berners-Lee calls a “semantic Web,” where assertions about Web resources can be expressed in a way that will allow “reasoning engines” to answer questions about them. Beyond keyword or phrase matching, reasoning engines will essentially be able to find logical proofs, reducing ambiguity, and leading to much more dependable search results.

Though the standard will provide a structure for describing, classifying, and managing Web documents, it has its own set of vulnerabilities, and not everyone is sanguine about its prospects.

“We’re not too optimistic about XML,” said Google’s Page. “Any information in a Web page that’s not visible to users is hard to trust. XML gives you more opportunities to lie.” Page believes that a better approach is to improve data extraction techniques from the visible portions of the page itself. “We can do a pretty good job of extracting meaning without the need for tags,” he said, noting that this approach would also eliminate the need for Web authors to learn how to use markup languages.

RDF will provide the most benefit to sites that can maintain control over the quality and integrity of the metadata authoring process. Specialized search services will be able to impose rules on Web authors that will require a greater level of internal consistency and validity in documents than exists today.

For general search services, the benefits of metadata are more problematic. “While there’s much hope held out for metadata provided through XML, I think the promise will fall short unless a trusted third party begins evaluating and tagging sites,” said Search Engine Watch’s Sullivan. “Search engines don’t trust user-provided metadata, and for good reason: many site owners will lie or twist facts if they think it will help them rank better.”

Alternatively, metadata could realize its potential if a widely accepted certification authority gains popularity. To be certified, Web authors would be required to adhere to specific guidelines, similar to the privacy policy guidelines administered by TRUSTe.

The Browser is Dead, Long Live the Browser

Today, most people search the Web using their favorite browser. A searcher calls up the home page of a search service, and enters query information into a form, then submits the search. Results are presented in the browser window, which facilitates easy retrieval with a simple mouse click.

There are several problems with this approach. First, if a discovered resource is valuable, the searcher will probably want to revisit it more than once. Primitive “bookmark” or “favorites” schemes help, but rapidly become unmanageable due to the sheer quantity of favorites that accumulate over time, and the maddening habit of Web page authors to unthinkingly change the URL (address) of their pages, rendering a bookmark worthless. Browsers do nothing to help automate the process of revisiting sites on a regular basis. And finally, browsers require additional software “plug-ins” to retrieve rich media resources that aren’t HTML text or simple Web graphics.

Over the next few years, the browser-based monopoly on search will gradually erode. We’re already seeing an explosion in offline agents that work independently of browsers. These agents, such as BullsEye Pro from IntelliSeek and Mata Hari from The WebTools Company, offer numerous advantages over a browser/search form-based approach. They are typically metasearch agents with the ability to query hundreds of indexes, even including parts of the “invisible Web.” They can download pages to a computer, which not only makes subsequent access faster, but also makes pages available without an Internet connection or if the page URL has changed.
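The core of a metasearch agent is simple to sketch: send one query to several engines, then merge and deduplicate the ranked lists. The reciprocal-rank scoring used below is one common merging strategy chosen for illustration, not a description of how BullsEye Pro or Mata Hari work, and the engine functions are stand-ins that return canned results.

```python
from collections import defaultdict


def metasearch(query, engines):
    """Send one query to several engines and merge the ranked lists.

    engines maps an engine name to a function that takes a query string
    and returns a ranked list of URLs (best first).
    """
    scores = defaultdict(float)
    for search in engines.values():
        for position, url in enumerate(search(query), start=1):
            # Reciprocal-rank scoring: a URL ranked high by several
            # engines accumulates the largest total.
            scores[url] += 1.0 / position
    # Highest total first; ties are broken alphabetically by URL.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Stand-in engines returning canned results; a real agent would make
# network requests here.
def engine_a(query):
    return ["http://example.com/a", "http://example.com/b"]


def engine_b(query):
    return ["http://example.com/b", "http://example.com/c"]


merged = metasearch("particle accelerators", {"A": engine_a, "B": engine_b})
# merged is b, a, c: the page both engines returned comes first.
```

Each URL appears once in the merged list no matter how many engines returned it, which is the main convenience a metasearch agent offers over visiting each search form in turn.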

With offline search agents, the browser is still an important accessory to view documents, but no longer plays the primary role in searching. Despite the introduction of various “smart searching” features by both Netscape and Microsoft, other trends are even more threatening to the primacy of the browser.

For example, BullsEye Pro and Enfish Technologies’ Tracker Pro are including built-in viewers for multiple content types. They are also obliterating the somewhat artificial boundaries that currently separate Web search from a search of your desktop or a search of an internal or external network. All are valid repositories of information, and being able to search them simultaneously with your own criteria lends a valuable framework of context to the results. “What’s related” gains personal significance.

Beyond that, both Microsoft and Apple are “webifying” their operating systems, enabling virtually any application to access, search, and interact with the Web. Microsoft, with its dominance in both operating system and application software, is adding Web retrieval and publishing capabilities to most of its products, including the entire Office suite of products. It’s Microsoft’s classic “embrace and extend” strategy, treating the Web as simply another network.

Apple is taking a different, somewhat more cautious approach. Apple’s Sherlock is the first step in this direction. Sherlock is a metasearch agent that has replaced the “finder” in the most recent releases of the Mac OS. Sherlock searches either a local hard disk or network, or the Web using a variety of major search services. Results are integrated, sorted by relevance rather than location.

Sherlock is quite adept at handling natural language queries. It is also expandable. Developers can create plug-ins tailored for specific Web sites. “Rather than simply webifying everything, arbitrarily putting Web-based interfaces into our products, we’re trying to provide seamless, intuitive access to information services,” said Peter Lowe, Apple’s director of worldwide product marketing, Mac OS. “Sherlock provides easy access to all search resources at a user’s disposal.”

The trend toward convergence noted earlier will also give rise to more varied delivery methods for search results. Based on personal profiles you create, a single search process might deliver results to multiple sources. Telephone numbers would be delivered to your cell phone. Maps, corporate background data, schedules, or other information of use to professionals could be delivered to a palm-held computer. Detailed research data could be delivered by email to a computer network, automatically massaged, formatted, and packaged for distribution, and routed automatically to a recipient list. Music or radio broadcasts could be delivered directly to a digital playback device.

In many areas, especially for repetitive tasks, search will become more of an automated, proactive process. We’ll likely see a resurgence of push technologies, but they’ll be smarter. Content will indeed be pushed out to your various Web-connected devices, but it will be content that matches highly personalized requirements.

Nonetheless, on a fundamental level, search will still be about locating and retrieving textual documents. Like its entrenched forebears, the word processor and the spreadsheet, the browser has proven its utility as a satisfactory window to the Web, and will likely remain an important tool for searchers, in spite of the innovations described here.

And Looking Into the Blue Sky…

If you look really far out into the future, you’ll begin to see fundamental changes in the way human beings interact with data itself. Beyond what we know today as search “results,” it’s likely that representations of information will become much richer, multisensory “whole brain” experiences. For example, a search result may display a visual map of a Web “environment” that looks like a topographic map. Documents or other resources that are similar to one another will be clustered into “hills,” with contour lines indicating the degree of conceptual proximity to other content hills. Cartia’s ThemeScape is an example of an application that does this type of conceptual mapping of a body of documents.

Object relationship mapping, illustrating relationships and dependencies among and between data, will likely become more commonplace. A great example of this is Plumb Design’s Visual Thesaurus. Rather than showing a simple list of synonyms and antonyms, the Visual Thesaurus creates a map that displays the strength of relationships between words. Users can also interactively choose to seek words based on their parts of speech by simply clicking on an image of a dial.

Languages like VRML (Virtual Reality Modeling Language) will make it possible for Web authors to cluster “universes” of information into a three-dimensional search space. This would give the searcher an experience similar to wandering around in a real-world library, without the physical constraints, and with the ability to manipulate, extract, and combine different types of data without regard to type, structure, or medium. This sort of experience will be somewhat akin to using the “holodeck” of the “Star Trek” television series, except that it won’t be science fiction.

The Future Isn’t What It Used To Be

“We’re very excited about the future,” said Excite’s Carpenter. “Search was once a very isolated experience. It has now crossed into a zone of an information service that can be applied across multiple subjects. It’s opening up the Web to a more mainstream audience. Search used to be for geeks; now it’s becoming broadly universal.”

Alexa’s Kahle agrees. “The Web is becoming a day-to-day resource for many people. With twenty Web searches happening for every one person that enters a library in a day, the Web is starting to bypass the importance of all libraries put together.”

Enthusiasm for the future is broadly universal among the architects and engineers who are building next-generation Web search tools. Their collective ideas, plans, and projects epitomize the most prescient vision of the future possible. In the words of Apple Fellow Alan Kay, “The best way to predict the future is to invent it.”