May 2007 — News
Review: Google Mini 2.2
- Specify URLs and URL patterns to crawl (or restrict);
- Set a crawl schedule (continuous or fixed-duration), crawler access (user names and passwords for sites), proxy servers, and HTTP headers/agent name;
- Prevent recrawling on duplicate hosts;
- Create rules for identifying document dates;
- Set a host load schedule (for concurrent crawls), including downtimes;
- Roll back the index (reverting to automatically generated index snapshots);
- Force reindexing ("freshness tuning");
- Create collections (setting and restricting URL patterns for each collection);
- Create custom front ends;
- Create OneBox modules (described above);
- Crawl status reports, diagnostics, and queues;
- Statistics on MIME types crawled;
- Serving and system status;
- Search reports;
- Search and event logs;
- E-mail notifications; and
- Miscellaneous administration features, such as LDAP configuration, SSL settings, certificate authorities, SNMP configuration, etc.


Hardware
Of course, the Google Mini isn't all software. It's a complete hardware/software appliance. But you won't read too much about the system's hardware features. For one thing, they're almost irrelevant. This box provides all the inputs you need to get the job done (including a monitor port, two RJ-45 jacks, and various other types of data connections, seen below).

As far as the guts are concerned, that information is just plain unavailable. Google says the machine runs on "standard" PC hardware and won't say anything else about it. I suppose I could eventually take a blowtorch to my unit to find out what's inside it, but, for now, I prefer to leave it in pristine condition.
The real question about the hardware, though, is whether it has the muscle to do what it's supposed to do. And the answer is yes. It pulls up results in a fraction of a second; it generates reports quickly (although you might need to refresh the browser manually to see the finished reports in a reasonable amount of time); and it can crawl with multiple concurrent connections. Using four concurrent connections, I saw between two and 16 pages crawled per second, with an average of about seven pages per second.
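
To put those crawl rates in perspective, here is a rough back-of-the-envelope sketch in Python. The 50,000-page site size is a purely hypothetical figure, not something I tested, and the function names are my own. Real crawl times will also depend on host load settings, page sizes, and how the target servers respond.

```python
def crawl_time_seconds(pages: int, pages_per_second: float) -> float:
    """Estimate how long a crawl takes at a steady rate."""
    if pages_per_second <= 0:
        raise ValueError("pages_per_second must be positive")
    return pages / pages_per_second


def format_duration(seconds: float) -> str:
    """Render a number of seconds as hours, minutes, and seconds."""
    hours, remainder = divmod(int(round(seconds)), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


# Rates observed in this review with four concurrent connections.
OBSERVED_RATES = {"slowest": 2, "average": 7, "fastest": 16}

# Hypothetical site size, chosen only for illustration.
HYPOTHETICAL_PAGES = 50_000

for label, rate in OBSERVED_RATES.items():
    estimate = crawl_time_seconds(HYPOTHETICAL_PAGES, rate)
    print(f"{label:>8} ({rate:>2} pages/s): {format_duration(estimate)}")
```

At the average rate of about seven pages per second, a site of that assumed size would take roughly two hours to crawl; at the slowest observed rate, closer to seven.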