Open Source Enterprise Search With Arch Search Engine
"Put the two words 'intranet search' in the Google search box and what do you get? The very first link is titled, 'Why intranet search fails: Gerry McGovern'."
This is how our first article on Arch, "Corporate Search: Can We Just Get Google?", starts. That statement is no longer quite true. At the time of writing, at least in Australia, the first link is titled "Arch Intranet Search Engine". We hope this is an indication that Arch is making a difference in this area. Here we discuss some of the key features of Arch and show how they allow efficient and effective intranet search in enterprise environments.
In the first article, we explained why searching intranets is a difficult problem, and offered a solution. Briefly, the method used by Google, based on web link statistics, gives excellent results on the global web, but this approach does not work for intranets, since intranet web links do not give enough statistical information to estimate the "quality" of a document. To find out which web pages are most relevant to the searcher, Arch uses a different source of statistical information that is available on intranets: it estimates relative document quality based on access frequency, which it gets from web server logs.
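The idea of deriving a quality signal from access frequency can be illustrated with a short Python sketch. It is not Arch's actual scoring method, which is not described here: it assumes logs in Common Log Format, counts successful requests per page, and scales the counts relative to the most visited page. The function name and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line.
LOG_LINE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def access_scores(log_lines):
    """Return {path: score}; score is the hit count relative to the top page."""
    hits = Counter()
    for line in log_lines:
        match = LOG_LINE.search(line)
        # Count only successful requests; errors say little about quality.
        if match and match.group("status") == "200":
            hits[match.group("path")] += 1
    if not hits:
        return {}
    top = max(hits.values())
    return {path: count / top for path, count in hits.items()}


# Illustrative log lines (invented for this example).
sample = [
    '10.0.0.1 - - [01/Mar/2012:10:00:00 +1100] "GET /hr/leave.html HTTP/1.1" 200 5120',
    '10.0.0.2 - - [01/Mar/2012:10:01:00 +1100] "GET /hr/leave.html HTTP/1.1" 200 5120',
    '10.0.0.3 - - [01/Mar/2012:10:02:00 +1100] "GET /old/memo.html HTTP/1.1" 200 900',
    '10.0.0.3 - - [01/Mar/2012:10:03:00 +1100] "GET /missing.html HTTP/1.1" 404 0',
]
scores = access_scores(sample)
```

A real deployment would probably also filter out crawler traffic and weight recent visits more heavily; both are left out to keep the sketch short.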
Enterprise environments have complex and substantial intranets. For such environments, the challenge of providing search services becomes non-trivial and there are many requirements that must be met, in addition to search precision and quality. The challenges are:
1. Large scale: an enterprise intranet can have multiple web servers, with millions of documents residing on them. An enterprise search engine has to be able to efficiently index and search huge volumes of information.
2. Access control: it must be possible to control who can find what. People not authorised to see restricted documents must not see entries for those documents in any search results.
3. Organisational complexity and decentralisation: enterprises may have organisational units that function relatively autonomously. For example, a unit can have its own web server or intranet managed by an IT team. An enterprise search engine should allow decentralised control of data by the curators.
4. Topological complexity and distribution: in terms of networks, enterprise space can be very complex. It can consist of multiple clusters located remotely from each other and separated by firewalls. An enterprise search engine must be able to function in these conditions.
5. Data heterogeneity: in enterprise environments, search engines must be able to read a large range of data formats. It is also essential to be able to retrieve data that are stored in a range of locations, such as databases and data portals, as well as directly on web servers.
We now discuss how Arch provides solutions to all of these requirements.
Scalability
Arch performs indexing using the open source package Apache Nutch, which was designed to crawl and index the whole web. On the search side, Arch uses Apache Solr, which is built for efficient and scalable search. Based on these packages, Arch is able to efficiently index and search an intranet of any size. Arch also supports partitioning for more efficient crawling: multiple areas can be configured and crawled at different frequencies, depending on requirements such as their size and how often they are updated.
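To make the partitioning idea concrete, here is a minimal Python sketch of areas that are crawled at different frequencies. It does not reflect Arch's configuration format; the `Area` class, its fields, and the example URLs are hypothetical and serve only to show how a scheduler might decide which areas are due for a crawl.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Area:
    """A hypothetical crawl partition with its own crawl interval."""
    name: str
    root_url: str
    interval: timedelta
    last_crawled: datetime


def areas_due(areas, now):
    """Return the names of areas whose crawl interval has elapsed."""
    return [a.name for a in areas if now - a.last_crawled >= a.interval]


now = datetime(2012, 3, 8, 2, 0)
areas = [
    # Small, frequently updated area: crawl daily.
    Area("news", "http://intranet.example/news/",
         timedelta(days=1), datetime(2012, 3, 7, 2, 0)),
    # Large, rarely updated area: crawl monthly.
    Area("archive", "http://intranet.example/archive/",
         timedelta(days=30), datetime(2012, 3, 1, 2, 0)),
]
due = areas_due(areas, now)
```

With these values only the "news" area is due, so the large archive is not crawled again until its own interval has passed.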
Access control
Arch supports document-level access control, so that it is possible to precisely define the access to a particular document. In the simplest case, this can remove the need to run two separate search engines: a public one and an intranet one. Arch can index everything in a single index and then present different views to public and staff. More generally, Arch can easily define what group of users can see a set of documents residing in a given folder and its subfolders.
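Folder-based access rules of this kind can be sketched in a few lines of Python. This illustrates the concept only, not Arch's implementation: the rule table, the group names, and the "longest matching folder wins" policy are all assumptions made for the example.

```python
# Hypothetical rules: folder prefix -> groups allowed to see documents under it.
RULES = {
    "/": {"public", "staff"},
    "/staff/": {"staff"},
    "/staff/payroll/": {"finance"},
}


def allowed_groups(path):
    """Return the groups for the most specific folder rule covering path."""
    best = max((p for p in RULES if path.startswith(p)), key=len, default=None)
    return RULES[best] if best is not None else set()


def visible(results, user_groups):
    """Keep only the results that at least one of the user's groups may see."""
    return [p for p in results if allowed_groups(p) & set(user_groups)]


results = [
    "/news/a.html",
    "/staff/handbook.html",
    "/staff/payroll/march.html",
]
public_view = visible(results, {"public"})
staff_view = visible(results, {"staff"})
```

Here a single result list yields two different views: the public view contains only the news page, while the staff view also includes the handbook but not the payroll document. In a production system the filter would more likely be applied inside the index query than on the result list.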
Organisational complexity and decentralisation
Arch was designed with search hosting in mind: it can be used to host search services, with clients managing their partitions completely independently and transparently, unaware of each other. It supports an unlimited number of light-weight configurable gateways that can narrow search to a particular area and search criteria, and present custom views of information, as well as enforce custom access control.
Topological complexity and distribution
The Arch crawler supports common authentication schemes, and can crawl password-protected remote areas. Accessing the logs of remote web servers used to be a problem, but this was solved in Arch version 1.42. Our solution is to deploy a log processor at the remote location. It processes the locally available logs and produces its results in the form of a Sitemap file, which is compressed and encrypted. This file is then accessed by the Arch crawler.
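A rough Python sketch of what such a remote log processor might emit is shown below. It is based on assumptions rather than on Arch's actual file layout: it takes per-page scores already derived from the logs, stores each score in the standard Sitemap `priority` element, and compresses the result with gzip. The encryption step is omitted because the Python standard library has no suitable cipher; a real deployment would encrypt the compressed bytes before publishing them.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, scores):
    """Return a gzip-compressed Sitemap with one <url> entry per scored page."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for path, score in sorted(scores.items()):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url + path
        # Assumption: the access-based score is carried in <priority> (0.0 to 1.0).
        ET.SubElement(url, "priority").text = f"{score:.2f}"
    xml_bytes = ET.tostring(urlset, encoding="utf-8", xml_declaration=True)
    return gzip.compress(xml_bytes)


# Illustrative scores, as a log processor might compute them locally.
blob = build_sitemap(
    "http://intranet.example",
    {"/hr/leave.html": 1.0, "/old/memo.html": 0.5},
)
```

The crawler on the other side would decompress the file and read the URLs and scores without needing direct access to the remote server's raw logs.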
Data heterogeneity
Using Apache Solr as the index server, Arch can index practically anything that can be presented as attribute-value pairs encoded in XML. It comes with a few pre-built modules that handle almost all types of data formats, and new modules are not hard to write. Thus, Arch is not limited to indexing web documents.
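As an illustration of "attribute-value pairs encoded in XML", the following Python sketch converts an arbitrary record, for example a database row, into a document in Solr's XML update format (`<add><doc><field name="...">`). The record and its field names are invented for the example; in practice the field names must match the Solr schema in use, and this is not one of Arch's own modules.

```python
import xml.etree.ElementTree as ET


def to_solr_add(record):
    """Encode a dict of attribute-value pairs as a Solr XML <add> message."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in record.items():
        # A list or tuple becomes a repeated (multi-valued) field.
        values = value if isinstance(value, (list, tuple)) else [value]
        for v in values:
            ET.SubElement(doc, "field", name=name).text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical record taken from a database rather than a web page.
row = {"id": "db:42", "title": "Leave policy", "keywords": ["hr", "leave"]}
message = to_solr_add(row)
```

The resulting string can be posted to a Solr update handler, which is how data that never existed as a web page can end up in the same index as crawled documents.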
Conclusions
Arch provides a powerful and efficient enterprise search engine that more than meets all of the critical enterprise search service requirements. In addition to this, Arch and its main components, Nutch and Solr, are highly modular and extensible, allowing for easy implementation of custom solutions. Arch is provided as free open source software, giving you and your organisation the full power of modification and customisation to best suit your requirements.