Open Source Enterprise Search With Arch Search Engine

By Author: Arkadi Kosmynin

"Put the two words 'intranet search' in the Google search box and what do you get? The very first link is titled, 'Why intranet search fails: Gerry McGovern'."

This is how our first article on Arch, "Corporate Search: Can We Just Get Google?", starts. This statement is no longer quite true. At the time of writing, at least in Australia, the first link is titled "Arch Intranet Search Engine". We hope this is an indication that Arch is making a difference in this area. Here we discuss some of the key features of Arch and show how they allow efficient and effective intranet search in enterprise environments.

In the first article, we explained why searching intranets is a difficult problem, and offered a solution. Briefly, the method used by Google, based on web link statistics, gives excellent results on the global web, but this approach does not work for intranets, since intranet web links do not give enough statistical information to estimate the "quality" of a document. To find out which web pages are most relevant to the searcher, Arch uses a different source of statistical information that is available on intranets: it estimates relative document quality from access frequency, which it obtains from web server logs.
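The idea of deriving a relative quality score from access frequency can be illustrated with a minimal Python sketch. It is not Arch's actual algorithm: the Common Log Format input, the function names `access_counts` and `relative_quality`, and the simple "divide by the most accessed page" scaling are all assumptions made for illustration.

```python
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line,
# e.g. "GET /path HTTP/1.1" 200 (the log format is an assumption)
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3})')


def access_counts(log_lines):
    """Count successful (2xx) requests per URL path, ignoring query strings."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2).startswith("2"):
            path = match.group(1).split("?", 1)[0]
            counts[path] += 1
    return counts


def relative_quality(counts):
    """Scale hit counts to the range 0..1 relative to the most accessed page."""
    if not counts:
        return {}
    top = max(counts.values())
    return {path: hits / top for path, hits in counts.items()}


# Hypothetical log lines, for illustration only
sample_log = [
    '10.0.0.1 - - [01/Jan/2012:10:00:00 +1000] "GET /hr/leave-policy.html HTTP/1.1" 200 5120',
    '10.0.0.2 - - [01/Jan/2012:10:01:00 +1000] "GET /hr/leave-policy.html?print=1 HTTP/1.1" 200 5120',
    '10.0.0.3 - - [01/Jan/2012:10:02:00 +1000] "GET /it/old-page.html HTTP/1.1" 404 312',
    '10.0.0.4 - - [01/Jan/2012:10:03:00 +1000] "GET /it/vpn-setup.html HTTP/1.1" 200 2048',
]

scores = relative_quality(access_counts(sample_log))
```

A real deployment would also need to discount crawler traffic and repeated hits from the same client, which this sketch does not attempt.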

Enterprise environments have complex and substantial intranets. For such environments, the challenge of providing search services becomes non-trivial and there are many requirements that must be met, in addition to search precision and quality. The challenges are:

1. Large scale: an enterprise intranet can have multiple web servers, with millions of documents residing on them. An enterprise search engine has to be able to efficiently index and search huge volumes of information.

2. Access control: it must be possible to control who can find what. People not authorised to see restricted documents must not see entries for those documents in any search results.

3. Organisational complexity and decentralisation: enterprises may have organisational units that function relatively autonomously. For example, a unit can have its own web server or intranet managed by an IT team. An enterprise search engine should allow decentralised control of data by the curators.

4. Topological complexity and distribution: in terms of networks, enterprise space can be very complex. It can consist of multiple clusters located remotely from each other and separated by firewalls. An enterprise search engine must be able to function in these conditions.

5. Data heterogeneity: in enterprise environments, search engines must be able to read a large range of data formats. It is also essential to be able to retrieve data that are stored in a range of locations, such as databases and data portals, as well as directly on web servers.

We now discuss how Arch provides solutions to all of these requirements.

Scalability

Arch performs indexing using the open source package Apache Nutch, which was designed to crawl and index the whole web. On the search side, Arch uses Apache Solr, which excels in efficiency and scalability. Built on these packages, Arch can efficiently index and search an intranet of any size. Arch also supports partitioning for more efficient crawling: multiple areas can be configured and crawled at different frequencies, depending on requirements such as their size and how often they are updated.
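To make the partitioning idea concrete, here is a small Python sketch of areas crawled at different frequencies. It does not reflect Arch's real configuration format: the `Area` class, its fields, the example URLs and the `due_areas` helper are hypothetical names used only to show the scheduling logic.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Area:
    """A hypothetical crawl area: a part of the intranet with its own schedule."""
    name: str
    root_url: str
    interval: timedelta


def due_areas(areas, last_crawled, today):
    """Return the names of areas never crawled or whose interval has elapsed."""
    due = []
    for area in areas:
        last = last_crawled.get(area.name)
        if last is None or today - last >= area.interval:
            due.append(area.name)
    return due


# Small, frequently updated areas are crawled often; large static ones rarely.
areas = [
    Area("news", "http://intranet.example.com/news/", timedelta(days=1)),
    Area("archive", "http://intranet.example.com/archive/", timedelta(days=30)),
]
last_crawled = {"news": date(2012, 1, 1), "archive": date(2012, 1, 1)}
to_crawl = due_areas(areas, last_crawled, today=date(2012, 1, 3))
```

With these sample dates only the `news` area is due, so the large archive is not re-crawled needlessly.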

Access control

Arch supports document-level access control, so that it is possible to precisely define the access to a particular document. In the simplest case, this can remove the need to run two separate search engines: a public one and an intranet one. Arch can index everything in a single index and then present different views to public and staff. More generally, Arch can easily define what group of users can see a set of documents residing in a given folder and its subfolders.
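Folder-based access rules of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Arch's implementation: the rule table, the group names, the "most specific folder wins" policy and the "deny when no rule matches" default are all choices made for the example.

```python
# Groups allowed when no folder rule matches: nobody (an assumption)
DEFAULT_GROUPS = frozenset()


def allowed_groups(path, folder_rules):
    """Return the groups allowed to see `path`, using the most specific folder rule."""
    best_prefix, best_groups = "", DEFAULT_GROUPS
    for folder, groups in folder_rules.items():
        prefix = folder.rstrip("/") + "/"
        # A rule on a folder also covers its subfolders; the longest match wins.
        if path.startswith(prefix) and len(prefix) > len(best_prefix):
            best_prefix, best_groups = prefix, groups
    return best_groups


def visible_results(results, user_groups, folder_rules):
    """Keep only the results that at least one of the user's groups may see."""
    return [p for p in results if allowed_groups(p, folder_rules) & user_groups]


# Hypothetical rules: one index, different views for public, staff and HR
folder_rules = {
    "/": {"public", "staff"},
    "/staff": {"staff"},
    "/staff/hr": {"hr"},
}
results = ["/about.html", "/staff/news.html", "/staff/hr/salaries.html"]

public_view = visible_results(results, {"public"}, folder_rules)
staff_view = visible_results(results, {"staff"}, folder_rules)
```

Here `public_view` contains only `/about.html`, while `staff_view` also includes `/staff/news.html`; neither sees the HR document.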

Organisational complexity and decentralisation

Arch was designed with search hosting in mind: it can be used to host search services, with clients managing their partitions completely independently and transparently, unaware of each other. It supports an unlimited number of light-weight configurable gateways that can narrow search to a particular area and search criteria, and present custom views of information, as well as enforce custom access control.

Topological complexity and distribution

The Arch crawler supports common authentication schemes and can crawl password-protected remote areas. Accessing the logs of remote web servers used to be a problem, but this has been solved in Arch version 1.42. The solution is a log processor deployed at the remote location. It processes the locally available logs and produces its results in the form of a Sitemap file, which is compressed and encrypted. The Arch crawler then retrieves this file.
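A remote log processor of this kind might look roughly like the Python sketch below. The way Arch actually encodes access statistics in its Sitemap file is not described here, so carrying the score in the standard `<priority>` element is an assumption, as are the function names. Encryption is left out because the Python standard library has no suitable cipher; a real deployment would encrypt the compressed bytes before publishing them.

```python
import gzip
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap protocol
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(base_url, counts):
    """Build Sitemap XML whose <priority> reflects relative access frequency."""
    top = max(counts.values(), default=1)
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path, hits in sorted(counts.items()):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url + path
        # Assumption: the score travels in <priority>, scaled to 0.00..1.00
        ET.SubElement(url, "priority").text = f"{hits / top:.2f}"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def pack(sitemap_xml):
    """Compress the sitemap; encryption would be applied to the result."""
    return gzip.compress(sitemap_xml)


# Hypothetical hit counts produced from the locally available logs
hit_counts = {"/hr/leave-policy.html": 2, "/it/vpn-setup.html": 1}
sitemap_xml = build_sitemap("http://remote.example.com", hit_counts)
packed = pack(sitemap_xml)
```

Only the small packed file has to cross the firewall, so the raw logs never leave the remote site.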

Data heterogeneity

Using Apache Solr as the index server, Arch can index practically anything that can be presented as attribute-value pairs encoded in XML. It comes with a few pre-built modules that handle almost all common data formats, and new modules are not hard to write. Arch is therefore not limited to indexing web documents.
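As an illustration of attribute-value pairs encoded in XML, the following Python sketch turns plain records, such as rows from a database, into a Solr XML `<add>` update message. The field names and the sample record are hypothetical and would have to match the Solr schema in use; the helper `to_solr_add_xml` is not part of Arch or Solr.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records):
    """Encode dict records as a Solr XML <add> message of <doc>/<field> elements."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            # A list or tuple becomes a repeated (multi-valued) field.
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                ET.SubElement(doc, "field", name=name).text = str(item)
    return ET.tostring(add, encoding="unicode")


# A hypothetical staff directory row, not a web document
message = to_solr_add_xml([
    {"id": "emp-42", "title": "Jane Doe", "keywords": ["finance", "payroll"]},
])
```

The resulting string can be posted to a Solr update handler, which is how data that never lived on a web server can end up in the same index.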

Conclusions

Arch provides a powerful and efficient enterprise search engine that more than meets all of the critical enterprise search service requirements. In addition to this, Arch and its main components, Nutch and Solr, are highly modular and extensible, allowing for easy implementation of custom solutions. Arch is provided as free open source software, giving you and your organisation the full power of modification and customisation to best suit your requirements.
