Since the site’s inception, we’ve been amassing large amounts of content on which millions of people have come to depend. We have numerous ways of getting to the content, but the quickest and easiest way to find specific information is to search for it.
AnandTech Search 1.0 (ColdFusion Verity)
The first version of the site used a search server included with ColdFusion named “Verity”. Most people have heard of Verity; it is one of the industry leaders in enterprise search software. The version of Verity that was included with ColdFusion back then was a light version of the full-blown Verity search server. Although it did quite well at locating content via Boolean searches, it lacked flexibility and wasn’t all that performant.
AnandTech Search 2.0 (Microsoft FullText Search)
After we migrated to Microsoft SQL Server, we decided to use the Full Text search that is built into SQL Server. SQL Server Full Text came to be in version 7.0, and allows you to create catalogs that can contain multiple indexes on text column types. You can then configure Full Text to index the data in the background, or perform one-time or scheduled indexing of the data.
There are, however, a couple of caveats with Microsoft Full Text search. The first is that it throws errors when your search criteria contain “noise words”. By default, Full Text search is configured with a list of “noise words”. Microsoft (and many other search engines) consider words like “because, been, before, being, between, both, but, by” to be common words that should not be contained in an index. Of course, you can trap this error easily in your application, but realistically, the search engine should just filter the words out of the search phrase itself.
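As a rough illustration of filtering noise words in the application before the phrase ever reaches Full Text, here is a minimal Python sketch. It is not the code we ran on the site: the names `NOISE_WORDS`, `strip_noise_words` and `build_search_condition` are hypothetical, and the word list holds only the sample words quoted above rather than SQL Server's full default list.

```python
from typing import List, Optional

# Assumption: only the sample noise words quoted in the text, not the
# complete default list that ships with SQL Server.
NOISE_WORDS = {"because", "been", "before", "being", "between", "both", "but", "by"}


def strip_noise_words(phrase: str) -> List[str]:
    """Return the terms of `phrase` that are not noise words."""
    terms = []
    for raw in phrase.split():
        # Trim punctuation stuck to the edges of a word, e.g. "between,".
        word = raw.strip(".,;:!?\"'()")
        if word and word.lower() not in NOISE_WORDS:
            terms.append(word)
    return terms


def build_search_condition(phrase: str) -> Optional[str]:
    """Join the remaining terms as quoted words separated by AND.

    Returns None when nothing is left, so the caller can skip the query
    instead of sending an empty search condition.
    """
    # Drop any embedded double quotes so each term stays one quoted word.
    terms = [t.replace('"', "") for t in strip_noise_words(phrase)]
    terms = [t for t in terms if t]
    if not terms:
        return None
    return " AND ".join('"' + t + '"' for t in terms)


if __name__ == "__main__":
    print(build_search_condition("memory by Corsair"))
    print(build_search_condition("but by"))
```

The resulting string is meant to be passed as a query parameter to a `CONTAINS` predicate rather than concatenated into the SQL text. Returning `None` for a phrase made up entirely of noise words lets the application show a "no searchable terms" message instead of trapping an error.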
The second and more important issue is how Full Text handles acronyms and numerical values in search strings. We never really did get to the bottom of the problem, but even with all of the noise words removed from Full Text, certain search phrases that contained acronyms and numerical data wouldn’t return results. Since our data is full of technical acronyms and numerical model numbers, this was a major issue for us.
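We never confirmed the cause, but one plausible contributor is how a word breaker treats punctuation inside version and model numbers. The Python sketch below is a toy tokenizer, not SQL Server's actual word breaker. It assumes two rules, split on any non-alphanumeric character and discard single-character tokens, and the function name `toy_word_breaker` is hypothetical.

```python
import re
from typing import List


def toy_word_breaker(text: str) -> List[str]:
    """Split on punctuation and whitespace, then drop one-character tokens.

    Assumption: these two rules only approximate the behaviour of a real
    full-text word breaker; they are not taken from SQL Server itself.
    """
    tokens = re.split(r"[^A-Za-z0-9]+", text)
    # Empty strings (from leading or trailing separators) and
    # single-character tokens are both discarded here.
    return [t for t in tokens if len(t) > 1]


if __name__ == "__main__":
    for sample in ("2.0", "2.82.1", "Athlon 64 X2"):
        print(sample, "->", toy_word_breaker(sample))
```

Under these assumed rules, a search for "2.0" has nothing left to match against, and "2.82.1" is reduced to the single token "82", which would explain why some phrases with numerical data return no results.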
48 Comments
bellwether - Thursday, November 29, 2007 - link
This is a great starting point for search for small businesses. Google's algorithm is effective, but the problem is that the result page sends the user to the Google Mini itself (so they leave your website), and it is in Google's format. XSLT is supposed to help you modify this, but doesn't do that good of a job.
This custom Google Mini website search page (http://www.components4asp.net/GoogleMini/) has something for ASP.NET that lets you add image thumbnails to the search results and integrate the search into a regular ASPX page that's part of your website. Plus, there's a 30 day free trial. Definitely worth taking a look at.
fzkl - Saturday, September 10, 2005 - link
The Dell memory is probably used because it has a lifetime warranty.
mini - Friday, September 9, 2005 - link
What is the OS used in the Google Mini?
Could you please post more administration snapshots?
Tks
Eirikur - Friday, September 9, 2005 - link
I suspect some of your problems with the Full Text Search feature of SQL Server might be related to how it breaks text into words and sentences. The word breaker will break by punctuation, which is horrendous when it comes to version numbers. The word breaker will look at a version number like "2.0" and decide that "2" and "0" are two separate words in different sentences. Then it will throw both away since it ignores single-character words. In a version number like "2.82.1" only "82" will get indexed.
jberry - Wednesday, September 7, 2005 - link
Does anyone know how the Google Mini counts the 100K page limit with dynamic websites??
fishy - Wednesday, September 7, 2005 - link
So... when are you going to overclock this thing?
ok, just k/d....
PassMark - Tuesday, September 6, 2005 - link
There are much cheaper solutions around that you can run on your existing hardware and have similar performance without a limit of 100,000 pages, e.g.
The Zoom Search Engine for $99:
http://www.wrensoft.com/zoom/
Brickster - Tuesday, September 6, 2005 - link
I imagine there are certain documents that you would want only certain users to have access to. How do you control access to the documents that Google has indexed? Does it just return everything despite any document-specific access security policy?
Verdant - Tuesday, September 6, 2005 - link
hence why the pricetag on the big brother is next to useless imho...i can't see anyone using anything besides the mini for indexing something like a website, for knowledgebases and the like you need a lot more than just a way to search.
Brickster - Wednesday, September 7, 2005 - link
Dude, you should see our company. A full-featured search engine alone based on Google would do wonders for the cesspool of organization that is our intranet and file servers. For some, that is enough.