News Release

Search engines biased, out-of-date, and index no more than 16% of the web

Peer-Reviewed Publication

NEC Research Institute

A new NEC Research Institute study analyzes the accessibility and distribution of information on the web. The study was conducted by Dr. Steve Lawrence and Dr. C. Lee Giles and will appear in the July 8 issue of the journal Nature.

-- LOW COVERAGE -- Search engine coverage has decreased substantially since December 1997, with no engine indexing more than about 16% of the publicly indexable web.

-- UNEQUAL ACCESS -- Search engines are more likely to index sites that have more links to them (more 'popular' sites). They are also typically more likely to index US sites than non-US sites, and more likely to index commercial sites than educational sites.

-- OUT-OF-DATE -- Indexing of new or modified pages by just one of the major search engines can take months.

-- AMOUNT OF INFORMATION -- The publicly indexable web contains about 800 million pages encompassing about 15 terabytes of data (about 6 terabytes of textual content after removing HTML tags, comments, and extra whitespace); it also contains about 180 million images.

-- TYPE OF INFORMATION -- 83% of sites contain commercial content and 6% contain scientific/educational content. Only 1.5% of sites contain pornographic content.
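
As a rough illustration only, the rounded estimates above can be combined into a few implied figures. The sketch below is not part of the study: the function name `implied_figures` is hypothetical, and it assumes decimal terabytes (10**12 bytes), which the release does not specify.

```python
# Back-of-the-envelope figures implied by the estimates quoted above.
# Inputs are the release's rounded numbers; treating a terabyte as
# 10**12 bytes is an assumption, since the release does not say.

TOTAL_PAGES = 800_000_000      # publicly indexable pages (approximate)
MAX_COVERAGE = 0.16            # best single-engine coverage (approximate)
TOTAL_BYTES = 15 * 10**12      # all page data
TEXT_BYTES = 6 * 10**12        # text after removing tags, comments, whitespace


def implied_figures():
    """Return derived estimates as a dict (hypothetical helper)."""
    return {
        "max_pages_indexed": round(TOTAL_PAGES * MAX_COVERAGE),
        "avg_bytes_per_page": TOTAL_BYTES / TOTAL_PAGES,
        "avg_text_bytes_per_page": TEXT_BYTES / TOTAL_PAGES,
    }


if __name__ == "__main__":
    for name, value in implied_figures().items():
        print(f"{name}: {value:,.0f}")
```

Under these assumptions, the best-covered engine would index roughly 128 million pages, and the average page would hold roughly 18.75 kilobytes of data, of which about 7.5 kilobytes is text. Because the inputs are rounded, these figures are only indicative.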

The web is transforming society, and search engines are an important part of that process. For example, consumers use search engines to locate and buy goods and to research many decisions, such as choosing a vacation destination, selecting a medical treatment, or deciding how to vote in an election.

Search engine indexing and ranking may have economic, social, political, and scientific effects. For example, indexing and ranking of online stores can substantially affect their economic viability; delayed indexing of scientific research can lead to the duplication of work or slower progress; and delayed or biased indexing may affect social or political decisions.

One of the great promises of the web is to equalize access to information. As the web fast becomes a major communications medium, attention should be paid to the accessibility of information on the web in order to minimize unequal access and maximize the benefits of the web for society.

For more information see http://wwwmetrics.com.

###

The NEC Research Institute conducts long-term, fundamental research in computer and physical sciences. The mission of the Institute is to contribute significant new understanding of computer and communication (C&C) technologies for the future. Institute research activities have a long-term goal of significant advances in the understanding of intelligence and information processing in biological and machine systems, and in the physical and system aspects of future computer architectures.


