
Open Source Enterprise Search With Arch Search Engine

By Author: Arkadi Kosmynin

"Put the two words 'intranet search' in the Google search box and what do you get? The very first link is titled 'Why intranet search fails: Gerry McGovern'."


This is how our first article on Arch, "Corporate Search: Can We Just Get Google?", starts. This statement is no longer quite true. At the time of writing, at least in Australia, the first link is titled "Arch Intranet Search Engine". We hope this is an indication that Arch is making a difference in this area. Here we discuss some of the key features of Arch and show how they allow efficient and effective intranet search in enterprise environments.



In the first article, we explained why searching intranets is a difficult problem, and offered a solution. Briefly, the method used by Google, based on web link statistics, gives excellent results on the global web, but this approach does not work for intranets, since intranet web links do not give enough statistical information to estimate the "quality" of a document. To find out which web pages are most relevant to the searcher, Arch uses a different source of statistical information that is available on intranets: it estimates relative document quality from access frequency, which it obtains from web server logs.
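To make the idea concrete, here is a minimal sketch of deriving a relative quality score from access frequency. It is not Arch's actual algorithm: the log format (Common Log Format), the restriction to successful GET/HEAD requests, the log scaling, and the names `count_hits` and `quality_scores` are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Assumption: logs are in Common Log Format; we only look at the request
# and status fields, e.g. "GET /path HTTP/1.1" 200
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) [^"]*" (\d{3})')


def count_hits(log_lines):
    """Count successful (2xx) requests per URL path."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2).startswith("2"):
            hits[match.group(1)] += 1
    return hits


def quality_scores(hits):
    """Scale hit counts to (0, 1] with a logarithm (an illustrative choice),
    so that a few very popular pages do not dwarf everything else."""
    if not hits:
        return {}
    top = max(hits.values())
    return {url: math.log1p(n) / math.log1p(top) for url, n in hits.items()}


# Hypothetical sample log lines
sample_log = [
    '10.0.0.1 - - [01/Jan/2012:10:00:00 +1000] "GET /hr/leave.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jan/2012:10:01:00 +1000] "GET /hr/leave.html HTTP/1.1" 200 512',
    '10.0.0.3 - - [01/Jan/2012:10:02:00 +1000] "GET /news.html HTTP/1.1" 200 900',
    '10.0.0.4 - - [01/Jan/2012:10:03:00 +1000] "GET /missing.html HTTP/1.1" 404 0',
]

scores = quality_scores(count_hits(sample_log))
for url, score in sorted(scores.items()):
    print(f"{url}: {score:.2f}")
```

A score like this could then be combined with the text relevance score at query time. How Arch weighs the two is not covered here.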



Enterprise environments have complex and substantial intranets. For such environments, the challenge of providing search services becomes non-trivial and there are many requirements that must be met, in addition to search precision and quality. The challenges are:

1. Large scale: an enterprise intranet can have multiple web servers, with millions of documents residing on them. An enterprise search engine has to be able to efficiently index and search huge volumes of information.

2. Access control: it must be possible to control who can find what. People not authorised to see restricted documents must not see entries for those documents in any search results.

3. Organisational complexity and decentralisation: enterprises may have organisational units that function relatively autonomously. For example, a unit can have its own web server or intranet managed by an IT team. An enterprise search engine should allow decentralised control of data by the curators.

4. Topological complexity and distribution: in terms of networks, enterprise space can be very complex. It can consist of multiple clusters located remotely from each other and separated by firewalls. An enterprise search engine must be able to function in these conditions.

5. Data heterogeneity: in enterprise environments, search engines must be able to read a large range of data formats. It is also essential to be able to retrieve data that are stored in a range of locations, such as databases and data portals, as well as directly on web servers.

We now discuss how Arch provides solutions to all of these requirements.



Scalability



Arch performs indexing using the open source package Apache Nutch, which was designed to crawl and index the whole web. On the search side, Arch uses Apache Solr, which excels in efficiency and scalability. Based on these packages, Arch is able to efficiently index and search an intranet of any size. Arch also allows the use of partitioning for more efficient crawling: multiple areas can be configured and crawled at different frequencies, depending on requirements such as their size and how often they are updated.
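The partitioning idea can be sketched as follows. This is an illustration only, not Arch's configuration format or scheduler: the `CrawlArea` class, its fields, and the `areas_due` function are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class CrawlArea:
    """A hypothetical crawl partition with its own re-crawl interval."""
    name: str
    root_url: str
    interval: timedelta
    last_crawled: datetime


def areas_due(areas, now):
    """Return the names of areas whose re-crawl interval has elapsed."""
    return [a.name for a in areas if now - a.last_crawled >= a.interval]


# Hypothetical areas: a small, frequently updated news section and a
# large, rarely changing archive.
areas = [
    CrawlArea("news", "http://intranet.example/news/",
              timedelta(days=1), datetime(2012, 1, 9)),
    CrawlArea("archive", "http://intranet.example/archive/",
              timedelta(days=30), datetime(2012, 1, 1)),
]

print(areas_due(areas, datetime(2012, 1, 10)))
```

Keeping the large, slow-changing area on a long interval is what saves crawl time: only the small area is revisited daily.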



Access control



Arch supports document-level access control, so that it is possible to precisely define the access to a particular document. In the simplest case, this can remove the need to run two separate search engines: a public one and an intranet one. Arch can index everything in a single index and then present different views to public and staff. More generally, Arch makes it easy to define which groups of users can see the documents residing in a given folder and its subfolders.
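Folder-based access rules of this kind can be sketched as below. This is a simplified illustration, not Arch's implementation: the rule table, the group names, and the "most specific folder wins" policy are assumptions.

```python
# Hypothetical rules: URL path prefix -> groups allowed to see documents
# under that folder and its subfolders.
ACCESS_RULES = {
    "/": {"public", "staff"},
    "/intranet/": {"staff"},
    "/intranet/hr/": {"hr"},
}


def allowed_groups(path, rules=ACCESS_RULES):
    """Return the groups for the longest matching prefix
    (assumption: the most specific folder rule wins)."""
    matches = [prefix for prefix in rules if path.startswith(prefix)]
    if not matches:
        return set()
    return rules[max(matches, key=len)]


def visible_results(results, user_groups):
    """Drop every result the user's groups are not allowed to see."""
    groups = set(user_groups)
    return [path for path in results if allowed_groups(path) & groups]


results = ["/index.html", "/intranet/news.html", "/intranet/hr/pay.html"]
print(visible_results(results, {"public"}))
print(visible_results(results, {"staff"}))
```

With a single index and a filter like this, a public visitor and a staff member can run the same query and see different result lists.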



Organisational complexity and decentralisation



Arch was designed with search hosting in mind: it can be used to host search services, with clients managing their partitions completely independently and transparently, unaware of each other. It supports an unlimited number of light-weight configurable gateways that can narrow search to a particular area and search criteria, and present custom views of information, as well as enforce custom access control.
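One way to picture a gateway that narrows search to an area is as a fixed filter added to every user query. The sketch below builds Solr request parameters using Solr's standard `q` and `fq` (filter query) parameters. The field name `url_prefix` and the function `gateway_query` are hypothetical, and this is not necessarily how Arch gateways work internally.

```python
from urllib.parse import urlencode


def gateway_query(user_query, area_prefix, extra_filters=()):
    """Build a Solr query string that restricts results to one area.

    Assumption: the index has a field (here called `url_prefix`) that
    identifies the area a document belongs to.
    """
    params = [("q", user_query), ("fq", f'url_prefix:"{area_prefix}"')]
    params += [("fq", f) for f in extra_filters]
    return urlencode(params)


print(gateway_query("leave policy", "/hr/"))
```

Each gateway would carry its own prefix and filters, so clients searching through different gateways never see each other's partitions.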



Topological complexity and distribution



The Arch crawler supports common authentication schemes, and can crawl password-protected remote areas. Accessing the logs of remote web servers used to be a problem, but this has been solved in Arch version 1.42. Our solution is to deploy a log processor at the remote location. It processes the locally available logs and produces its results in the form of a Sitemap file, which is compressed and encrypted. This file is then accessed by the Arch crawler.
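A remote log processor of this kind could look roughly like the sketch below, which writes per-URL scores into a Sitemap file and compresses it with gzip. The details are assumptions: carrying the score in the Sitemap `priority` element, the function names, and the sample data are illustrative, and the encryption step is deliberately left out because the source does not say which scheme Arch uses.

```python
import gzip
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(scores, base_url):
    """Serialise {path: score} as Sitemap XML bytes.

    Assumption: the score (0.0 to 1.0) is stored in <priority>.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for path, score in sorted(scores.items()):
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = base_url + path
        ET.SubElement(url, "priority").text = f"{score:.2f}"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def compressed_sitemap(scores, base_url):
    """Gzip the Sitemap; encryption would follow here (omitted)."""
    return gzip.compress(build_sitemap(scores, base_url))


# Hypothetical scores produced from local web server logs
sample_scores = {"/hr/leave.html": 1.0, "/news.html": 0.63}
payload = compressed_sitemap(sample_scores, "http://remote.example")
print(len(payload), "bytes")
```

Because only this summary file crosses the firewall, the raw logs never need to leave the remote site.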



Data heterogeneity



Using Apache Solr as the index server, Arch can index practically anything that can be presented as attribute-value pairs encoded in XML. It comes with a few pre-built modules that can handle almost all types of data formats, and new modules are not hard to write. Thus, Arch is not limited to indexing web documents: it can index practically anything.
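As an illustration of "attribute-value pairs encoded in XML", the sketch below turns plain records (for example, database rows) into a Solr XML update message of the form `<add><doc><field name="...">...</field></doc></add>`. The field names and the function `to_solr_add_xml` are hypothetical, and the fields actually accepted depend on the Solr schema in use. This is not one of Arch's own modules.

```python
import xml.etree.ElementTree as ET


def to_solr_add_xml(records):
    """Encode a list of dicts as a Solr <add> update message.

    List or tuple values become repeated fields (multi-valued fields).
    """
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                ET.SubElement(doc, "field", name=name).text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical record taken from a database rather than a web page
rows = [{"id": "db-row-42", "title": "Leave policy", "tag": ["hr", "policy"]}]
print(to_solr_add_xml(rows))
```

A message like this would be posted to the Solr update handler. Anything that can be flattened into such fields can, in principle, be made searchable.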



Conclusions



Arch provides a powerful and efficient enterprise search engine that more than meets all of the critical enterprise search service requirements. In addition to this, Arch and its main components, Nutch and Solr, are highly modular and extensible, allowing for easy implementation of custom solutions. Arch is provided as free open source software, giving you and your organisation the full power of modification and customisation to best suit your requirements.
