Web Search and OAI Protocol (reading notes)

Web Search Engines Pt.1


Once again, advances in computer software have enabled companies like Microsoft, Yahoo, and Google to traverse the vastness of the internet and improve the efficiency of their services. I was particularly impressed by the capabilities of Web-scale crawlers. If what the article explained is just a basic description of average crawler technology, I can only imagine how powerful Google’s web crawlers are. It was also interesting to see how spam sites keep adapting to the changing ways in which crawlers detect them. The thought of all this reminds me of the Matrix movies.
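Just to make the basic loop concrete for myself, here is a minimal sketch of the fetch-parse-enqueue cycle the article describes, in Python. Everything here (the breadth-first frontier, the politeness delay, skipping robots.txt handling) is my own simplification; a production crawler distributes this work across thousands of machines.

```python
import time
import urllib.parse
import urllib.request
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects the href attribute of every anchor tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10, delay=1.0):
    """Breadth-first crawl starting from `seed`; returns the URLs discovered."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # URLs already queued, to avoid duplicates
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue           # a real crawler would log and retry
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urllib.parse.urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        time.sleep(delay)      # politeness delay between requests
    return seen
```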

Web Search Engines Pt.2

The second part of the series examines how algorithms and data structures index 400 terabytes of Web page text and deliver the best results in response to hundreds of millions of queries each day. The indexers are just as impressive as the crawlers they work with in terms of the sheer magnitude of work they do. One issue the article mentioned that struck me was how impractical some of the tasks are at face value, and how solutions were found anyway. For instance, the author describes the PageRank computation and the impracticality of directly processing matrices with on the order of 20 billion rows and columns, one for every crawled page. I wonder, though, whether the solutions researchers arrived at are genuine alternative ways of solving the problem or short-term workarounds that sidestep the computational impracticality. Do these special indexing tricks sacrifice a complete answer for the sake of a faster response when a query is submitted? This leads to the question of whether search engine companies such as Google, Yahoo, and Microsoft are really delivering on their promise of reliable results in the fastest possible time.
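From what I understand, the workaround is not to materialize the matrix at all: the link graph is kept in sparse adjacency-list form, and PageRank is computed by power iteration, repeatedly passing rank along the links until the vector settles. Here is a toy sketch in Python; the damping factor of 0.85 is the conventional choice, while the example graph and iteration count are just my illustration, not anything from the article.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over an adjacency-list graph.

    `links` maps each page to the list of pages it links to. This
    sparse representation is what makes web-scale computation
    feasible: no N x N matrix is ever built.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}        # uniform starting vector
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                  # dangling page: spread rank evenly
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
            else:                             # split rank among outgoing links
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Toy example: A and B link to C, C links back to A.
print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))
```

If that is right, it may answer my own question: power iteration is a genuine solution rather than a shortcut, since it converges to the same ranking the full matrix computation would give, just without ever building the matrix.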

The Deep Web: Surfacing Hidden Value

This author discusses the immeasurable importance of the “Deep Web” and the information it stores. Being able to crawl the “Deep Web” and answer queries with information from it seems to be the next ‘gold mine’ for search engine companies. However, I wonder why so many of these websites sit in the “Deep Web” in the first place. Do the administrators of these websites even want the information in their databases to be broadly accessible? I was surprised at some of the names on the “Deep Web” list, such as Amazon.com and eBay.com. Considering the article was written in 2001, though, it is apparent these sites became part of the “surface web” sometime after. I do agree with the author’s conclusion that search engines will become more specialized in where they focus their search technology: on the “surface web” or the “deep web.” If, as the previous articles suggest, the major search engines are still grappling with indexing the “surface web,” it seems unlikely that they will also be able to index the “deep web” on a massive scale.
