
June 06, 2005

Search Engines Explained (Part 1)

We have all heard of them and we use them every day, but do you really know how a Search Engine (SE) works?  Understanding how a SE finds and processes web pages, what happens when someone types in a search query, and how the results are displayed is crucial information for your business.

This week I am going to explain each of these important processes and how you can use this information to your advantage.

The Crawling Search Engine

This is the type of SE that we normally think of when we hear the term Search Engine.  Other types include Directories (Yahoo and DMOZ) and Metacrawlers (Dogpile and Mamma), which I will discuss in future posts.

The crawling Search Engine has to perform three basic tasks:

  1. Finding web pages and storing their content
  2. Analyzing the page content
  3. Processing searchers' queries

1. Finding and Storing Web Page Content

Search Engines are going to find your new web page in one of three ways: a link from an existing indexed web page, manual submission to the SE, or a relatively new method called an XML data feed.

Having a link to your new web page from a page that the SE already knows about is the best method for most businesses.  When the SE comes across this link, the link implies that someone else has found your web page important enough to link to.

Pro Tip: When I create a new web site for a client, I immediately place a link to the web site on my portfolio page.  Since my site is crawled often by the SE, this new link will be found and can be scheduled for indexing very quickly.  If you developed your web site yourself, then have someone whose site is already in the SE link to your new site.

As for manual submission of your web pages, most SE professionals doubt the reliability of this method.  In the early days of SE, this was a great way for them to find and index new sites.  But as spammers have flooded the SE with free submissions, the SE have either ignored this method or turned to a paid submission process.

The XML data feed method, such as Yahoo!'s Site Match system, allows web sites to submit new content for crawling and indexing in a special XML-based format.  This is a great method of getting your new content known to the SE, and will be the subject of a future post.

Once a SE's crawler (or spider) has found your web page, a copy is made and stored in the SE's database.  If the page is already in HTML format, then the storage happens immediately.  If the page is in some other format, such as PDF or Microsoft Word, then the SE will convert the page into an HTML equivalent.

2. Analyzing the Page Content

Once your web page has been found and stored, the SE will inspect every word and tag and translate it into a mathematical representation in its database.  This process is different for each SE and is strictly confidential.  If the details of this translation were ever released, then SE marketers could change the structure of their web pages in order to artificially inflate the pages' relevance and ranking.

The complexity of the analysis, as well as the different processes used by the different SE, makes a detailed discussion of this topic difficult.  What is important to know is that at this point the SE do not look at the real web pages; they look at these mathematical representations when they match the searcher's keywords to the documents in the database.
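Since the real translation is confidential, the following is only a minimal sketch of the general idea, assuming the simplest possible representation: a table of word counts per page (often called an inverted index). The function names and the example pages are hypothetical, and real SE use far richer signals than raw word counts.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each word to {url: count}. `pages` maps a URL to its stored text."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][url] = count
    return index

def search(index, query):
    """Score pages by summing the counts of the query words, best first."""
    scores = Counter()
    for word in tokenize(query):
        for url, count in index.get(word, {}).items():
            scores[url] += count
    return [url for url, _ in scores.most_common()]

# Hypothetical stored copies of two pages.
pages = {
    "example.com/golf": "Public golf courses in New Jersey. Golf tee times.",
    "example.com/tennis": "Public tennis courts in New Jersey.",
}
index = build_index(pages)
results = search(index, "public golf courses")
```

Notice that search() never touches the page text itself. It only consults the index, which is the point made above: the query is matched against the representation, not against the real web pages.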

One more point I'd like to make.  When the SE analyzer comes across a link, the SE will feed that link back into its scheduler program to have its crawler visit that page. 
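Here is a rough sketch of that feedback loop, assuming a simple first-in, first-out queue as the scheduler. The crawl() function, the fetch parameter, and the example URLs are all hypothetical; a real crawler would also handle politeness delays, robots.txt, and revisit schedules, none of which are shown here.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch):
    """Visit pages in discovery order, scheduling each new link once.

    `fetch` is a caller-supplied function that returns the HTML for a URL,
    or None if the page cannot be retrieved.
    """
    queue = deque([start_url])   # the "scheduler": URLs waiting for a visit
    seen = {start_url}
    stored = {}                  # url -> stored copy of the page
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        stored[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return stored

# A tiny in-memory "web" so the sketch runs without network access.
fake_web = {
    "http://portfolio.example/": '<a href="http://newsite.example/">New client site</a>',
    "http://newsite.example/": '<a href="/about.html">About</a>',
    "http://newsite.example/about.html": "<p>About us</p>",
}
stored = crawl("http://portfolio.example/", fake_web.get)
```

Starting from a single page that is already known, the crawler ends up with copies of all three pages. This is the same effect as the portfolio link in the earlier Pro Tip: one link from a known page is enough to get a new site scheduled.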

Pro Tip: To find out which of your pages have been found by Google, type the following into Google's search field: site:YourDomainName.com

3. Processing Searchers' Queries

A discussion of Information Retrieval Theory is well beyond the scope of this blog.  But there are some very important issues that you need to be aware of.

Everyone is familiar with the plain text-based search feature of SE.  Here you would type into the search field something like "public golf courses in New Jersey".  But SE are advancing into many different kinds of search results.  For example, type in your telephone number and see what comes back.  Even FedEx tracking numbers are interpreted by the SE, giving you just the information you are searching for.

This concept, called semantic analysis, tries to determine the searcher's intention when typing in search keywords, and it is an area in which all major SE are competing.  Each SE wants to deliver the best results to its visitors or risk losing that business to a competitor.
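How each SE actually interprets a query is not public, so the following is only a toy sketch of the idea using pattern matching. The classify_query() function is hypothetical, and its two rules are simplifying assumptions: a phone number is taken to be exactly 10 digits with common US punctuation, and a tracking number is taken to be a bare 12-digit string.

```python
import re

def classify_query(query):
    """Guess what kind of thing the searcher typed (toy rules only)."""
    text = query.strip()
    digits = re.sub(r"\D", "", text)          # keep only the digits
    has_letters = re.search(r"[A-Za-z]", text) is not None

    # Assumption: a 12-digit string with nothing else is a tracking number.
    if re.fullmatch(r"\d{12}", text):
        return "tracking_number"

    # Assumption: 10 digits plus spaces, dots, dashes, plus signs, or
    # parentheses is a US-style phone number.
    if (not has_letters and len(digits) == 10
            and re.fullmatch(r"[\d\s().+-]+", text)):
        return "phone_number"

    # Everything else is handled as an ordinary keyword search.
    return "text"

kinds = [classify_query(q) for q in (
    "public golf courses in New Jersey",
    "(609) 555-0100",
    "123456789012",
)]
```

Once the SE has a guess about the kind of query, it can route the query to a specialized lookup instead of the ordinary keyword search, which is how a tracking number can return a package status rather than a list of web pages.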

But the SE have to be careful.  Some SE track searching habits and use that information to improve their service.  This comes very close to privacy infringement.  Some users won't mind being tracked if their search results improve, but a lot of other people think that it's none of the SE's business to track their search history.  For more information, read this article that was posted June 3rd on CNN.

In the next blog post, we'll look at how the SE return the results in the way that they do.

Posted by Mark Beck on June 6, 2005 | Permalink

Comments

Mark, I enjoyed the article. Very informative, and didn't even leave me scratching my head :-) ... The "Pro tips" included were nice reminders of things I'd heard of in the past, but let slip my mind. again thanks for the informative read.

p.s. where you mention xml datafeed, would that be what google is doing with their "Google Sitemaps"? Or is that a whole other thing all together?

Posted by: Josh Hinds | Jun 6, 2005 11:41:09 PM

Hi Josh,

Thank you for the kind words, I appreciate them.

You are exactly right with the new Google Sitemaps program. Google is now allowing web site owners to create a custom XML-based file that resides on their server. This will be great for sites that have hard-to-find content, such as dynamically created pages from a Content Management System (CMS) or an e-commerce shopping cart.

For more information on these data feeds, including how to create and submit them to Google, visit this page: https://www.google.com/webmasters/sitemaps/login

Posted by: Mark Beck | Jun 7, 2005 8:54:45 AM
