A search engine is software that displays relevant webpage results for a search query. To gather the appropriate data, it relies on web crawling and web indexing, along with some hefty formulas and intelligent algorithms.
A few thousand searches were made in the time it took this webpage to load on your computer. But has it ever made you wonder how a search engine works?
How does Google serve you the best results in the blink of an eye? Honestly, it doesn’t matter as long as Google and Bing are around. The scenario would have been very different if there were no Google, Bing, or Yahoo. Let us dive into the world of search engines and see how a search engine works.
Peeping into the history
The search engine fairy tale began in the early 1990s, when Tim Berners-Lee used to add every new web server that came online to a list maintained on the CERN web server. Until September 1993, no search engines existed on the internet, only a few tools capable of maintaining a database of file names. Archie, Veronica, and Jughead were the very first entrants in this category.
Oscar Nierstrasz from the University of Geneva is credited with the very first search engine that came into existence, named W3Catalog. He did some serious Perl scripting and finally came out with the world’s first search engine on September 2, 1993. The year 1993 also saw the advent of many other search engines, including JumpStation by Jonathon Fletcher, AliWeb, and WWW Worm. Yahoo! was launched in 1995 as a web directory, but it started using Inktomi’s search engine in 2000 and then shifted to Microsoft’s Bing in 2009.
Now, on to the name that has become the prime synonym for the term search engine. Google Search began as a research project by two Stanford graduate students, Larry Page and Sergey Brin, with its first footprints dating to March 1995. Google’s working was initially inspired by Page’s back-linking method, which counted how many backlinks pointed to a webpage in order to measure the importance of that page in the World Wide Web. “The best advice I ever got,” Page said, as he recalled how his supervisor Terry Winograd supported his idea. And since then, Google has never looked back.
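To make the backlink idea concrete, here is a minimal sketch that simply counts how many distinct pages link to each page. The link graph and the function name `count_backlinks` are made up for illustration, and this plain count is only the starting intuition, not Google's actual ranking formula.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example"],
}

def count_backlinks(graph):
    """Return a Counter of how many distinct pages link to each target."""
    counts = Counter()
    for source, targets in graph.items():
        for target in set(targets):  # count each source at most once per target
            if target != source:     # ignore links from a page to itself
                counts[target] += 1
    return counts

backlinks = count_backlinks(links)
# Pages sorted from most to fewest backlinks.
ranking = [page for page, _ in backlinks.most_common()]
```

In this toy graph, `c.example` collects the most backlinks, so a purely count-based measure would treat it as the most important page.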
It all begins with a crawl
A baby search engine in its nascent stage begins exploring the World Wide Web on its small hands and knees. It follows every link it finds on a webpage and stores them in its database.
Now, let’s focus on the technical side behind the scenes. A search engine incorporates web crawler software, which is basically an internet bot assigned the task of opening all the hyperlinks present on a webpage and creating a database of text and metadata from all those links. It begins with an initial set of links to visit, called seeds. As it visits those links, it adds the new links it finds to the existing list of URLs to visit, known as the crawl frontier.
As the crawler traverses the links, it downloads information from those web pages in the form of snapshots to be viewed later, as downloading the whole webpage would require a whole lot of data, and that comes at a pocket-burning price, at least in countries like India. And I can bet that if Google had been founded in India, all their money would have gone to paying the internet bills. Hopefully, that’s not a topic of concern as of now.
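The seeds and crawl frontier described above can be sketched in a few lines. This is only an illustration under simple assumptions: the "web" is a hypothetical in-memory dictionary (`FAKE_WEB`), and `fetch_links` stands in for the real work of downloading a page and pulling out its links. The frontier is a first-in, first-out queue, which gives a breadth-first crawl; real crawlers may order the frontier differently.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> list of URLs linked from that page.
FAKE_WEB = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seeds.

    fetch_links is a stand-in for downloading a page and extracting its links.
    Returns the URLs in the order they were visited.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep their order
    frontier = deque(seeds)             # URLs waiting to be visited (crawl frontier)
    seen = set(seeds)                   # URLs already queued, so none is queued twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited

order = crawl(["https://example.com/"], lambda url: FAKE_WEB.get(url, []))
```

Starting from a single seed, the crawler reaches all four pages and visits each of them exactly once, even though some pages link back to ones it has already seen.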
The Web crawler explores the web pages based on some policies:
Selection Policy: The crawler decides which pages it should download and which it shouldn’t. The selection policy focuses on downloading the most relevant content rather than unimportant data.
Re-Visit Policy: The crawler schedules when it should re-open web pages and update its database with any changes. The dynamic nature of the internet makes it very hard for crawlers to stay up to date with the latest versions of webpages.
Parallelization Policy: Crawlers use multiple processes at once to explore links, an approach known as distributed crawling. Since different processes may end up downloading the same web page, the crawler coordinates all of its processes to eliminate any chance of duplication.
Politeness Policy: When a crawler traverses a website, it downloads many web pages from it at once, increasing the load on the web server hosting the website. Hence, a “Crawl-Delay” is enforced: the crawler has to wait a few seconds after it downloads some data from a web server before contacting it again. This behaviour is governed by the politeness policy.
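As one example of these policies, the crawl delay from the politeness policy could look roughly like the sketch below. The class name `PolitenessGate` and the two-second delay are assumptions made for illustration. The demonstration at the bottom plugs in a fake clock so that nothing actually sleeps.

```python
import time
from urllib.parse import urlparse

class PolitenessGate:
    """Tracks the last request time per host and enforces a minimum gap."""

    def __init__(self, crawl_delay=2.0, clock=time.monotonic, sleep=time.sleep):
        self.crawl_delay = crawl_delay
        self._clock = clock
        self._sleep = sleep
        self._last_request = {}  # host -> time of the most recent request

    def wait_time(self, url):
        """Seconds still to wait before this URL's host may be contacted."""
        last = self._last_request.get(urlparse(url).netloc)
        if last is None:
            return 0.0  # this host has not been contacted yet
        return max(0.0, self.crawl_delay - (self._clock() - last))

    def acquire(self, url):
        """Wait until the host may be contacted, then record the request."""
        delay = self.wait_time(url)
        if delay > 0:
            self._sleep(delay)
        self._last_request[urlparse(url).netloc] = self._clock()

# Demonstration with a fake clock: "sleeping" just moves the clock forward.
now = [0.0]

def fake_sleep(seconds):
    now[0] += seconds

gate = PolitenessGate(crawl_delay=2.0, clock=lambda: now[0], sleep=fake_sleep)
gate.acquire("https://example.com/a")                        # first request, no wait
same_host_wait = gate.wait_time("https://example.com/b")     # same host, must wait
other_host_wait = gate.wait_time("https://other.example/")   # different host, free to go
gate.acquire("https://example.com/b")                        # waits out the delay
```

The delay is tracked per host rather than globally, so the crawler can keep busy with other websites while it waits its turn on one of them.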
High-level Architecture of a standard Web Crawler:
The above illustration depicts how a web crawler works. It opens the initial list of links, then the links inside those links, and so on.
According to Wikipedia, computer science researchers Vladislav Shkapenyuk and Torsten Suel noted that:
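The step of finding "links inside those links" comes down to pulling hyperlinks out of a downloaded page. Here is a small sketch using only Python's standard library. The sample HTML, the `LinkExtractor` class, and the `extract_links` helper are made up for illustration, and a production crawler would need to cope with far messier pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag

class LinkExtractor(HTMLParser):
    """Collects absolute http(s) URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                absolute = urljoin(self.base_url, value)  # resolve relative links
                absolute, _fragment = urldefrag(absolute)  # drop any "#section" part
                is_web_link = absolute.startswith(("http://", "https://"))
                if is_web_link and absolute not in self.links:
                    self.links.append(absolute)

def extract_links(html, base_url):
    """Return the unique web links found in html, in order of appearance."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

SAMPLE_HTML = """
<html><body>
  <a href="/about">About</a>
  <a href="news.html#top">News</a>
  <a href="https://other.example/page">Elsewhere</a>
  <a href="mailto:someone@example.com">Mail</a>
  <a href="/about">About again</a>
</body></html>
"""

found = extract_links(SAMPLE_HTML, "https://example.com/blog/")
```

Relative links are resolved against the page's own address, the `mailto:` link is skipped, and the repeated `/about` link is kept only once, leaving three URLs to hand over to the crawl frontier.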
While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability.