  A large-scale search engine

Custom Project Name

A large-scale full-text indexing search engine

Customer (country)

A leading American provider of Internet services

Business Case

With the current scale and growth of the World Wide Web, being able to search for and locate Web pages effectively and accurately is of prime importance. Currently, the only feasible way a searcher can locate a particular Web-based source is by using a Web search engine. Generic large-scale search engines return results in the thousands, many of which lack relevance to the query; and since searchers tend to look only at the first few results, accurate ranking is critically important.

The customer decided to build a Web-scale search engine free of the problems of existing search systems. The goal of the project is to address many issues, in both quality and scalability, by scaling search engine technology to the web's extraordinary growth. Creating a search engine that scales even to today's web presents many challenges: fast crawling technology is needed to gather the web documents and keep them up to date, and storage space must be used efficiently to store the indices and the documents themselves.

The system has to keep local copies of documents retrieved from the Internet and has to have access to fast data storage. The full size of the document repository, which contains all information about web pages (document header, archived document body, etc.), is estimated at dozens of terabytes.

Solution

Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Novosoft spent over 1000 man-hours investigating the issues of calculating relevance for large volumes of data. As a result, Novosoft has designed an architecture that can support novel research activities on large-scale web data. The system is based on the following major concepts:

Distributed data processing
Due to the very fast growth of the Internet, it is almost impossible to keep an up-to-date index and perform thousands of query operations per second with one centralized server. To solve this problem, Novosoft developed a distributed data processing technology. Each service can run on a dedicated computer or share a computer with another service, so all components are isolated from each other and can be reused. This isolation also improves the maintainability of the project.
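As a rough illustration of this isolation, each component (URL Server, Crawler, Indexer, and so on) can be addressed by host and port and deployed either on its own machine or next to another service. The sketch below is hypothetical; the project's actual wiring is not documented here.

    // Hypothetical sketch: every component is looked up by name and reached
    // via host/port, so it can run on a dedicated machine or share one with
    // another service. Names and interfaces are assumptions.
    import java.net.InetSocketAddress;
    import java.util.HashMap;
    import java.util.Map;

    public final class ServiceRegistry {
        private final Map<String, InetSocketAddress> endpoints = new HashMap<>();

        public void register(String service, String host, int port) {
            endpoints.put(service, new InetSocketAddress(host, port));
        }

        public InetSocketAddress lookup(String service) {
            InetSocketAddress address = endpoints.get(service);
            if (address == null) {
                throw new IllegalStateException("unknown service: " + service);
            }
            return address;
        }
    }

Under this arrangement, co-locating two services just means registering them on the same host; moving one to a dedicated machine changes a registration, not the callers.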

Search by meaning feature
Generally, this means the following search scheme. First, the user enters a word to search for. The search engine tries to find an entry for this word in a dictionary and, along with the standard search result, generates a list of possible meanings of the entered word. If the user then selects a particular meaning, the system generates a refined search request consisting of the word entered by the user and the synonyms obtained from the dictionary. This scheme guarantees that when performing the refined search, the system will select the desired URLs first.
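A minimal sketch of that refinement step follows; the Thesaurus interface here is invented for illustration (the real Thesaurus component is described later in this document):

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of the search-by-meaning refinement step.
    public final class QueryRefiner {
        // Stand-in for the real Thesaurus component: maps a word to its
        // possible meanings and to the synonyms of a chosen meaning.
        public interface Thesaurus {
            List<String> meanings(String word);
            List<String> synonyms(String word, String meaning);
        }

        private final Thesaurus thesaurus;

        public QueryRefiner(Thesaurus thesaurus) {
            this.thesaurus = thesaurus;
        }

        // Builds the refined request: the original word plus the synonyms
        // of the meaning the user selected.
        public List<String> refine(String word, String selectedMeaning) {
            List<String> terms = new ArrayList<>();
            terms.add(word);
            terms.addAll(thesaurus.synonyms(word, selectedMeaning));
            return terms;
        }
    }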

Multi-core server
This feature allows one system server to use several independent indexes. It can be useful for a wide range of applications, for example indexing many different independent sites and querying them through a single search.
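One way to picture the multi-index idea is a single server process routing each query to one of several named, independent indexes. The interfaces below are invented for illustration:

    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: one server hosting several independent indexes,
    // each covering its own set of sites.
    public final class MultiIndexServer {
        public interface Index {
            List<String> search(String query); // returns matching URLs
        }

        private final Map<String, Index> indexes; // e.g. "site-a" -> its index

        public MultiIndexServer(Map<String, Index> indexes) {
            this.indexes = indexes;
        }

        // Routes a query to one named index without touching the others.
        public List<String> search(String indexName, String query) {
            Index index = indexes.get(indexName);
            if (index == null) {
                throw new IllegalArgumentException("no such index: " + indexName);
            }
            return index.search(query);
        }
    }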

Scalability
The search engine can be scaled to any target system, from desktops to high-end computers. This is provided by the distributed data processing architecture shown in the architecture screenshot below.

The solution consists of the following components:

  • URL Server
  • Crawler
  • Search engine
    • Indexer
    • Search Engine core
  • WEB front-end
  • Thesaurus
  • Thesaurus Editor

The URL Server handles information about all documents, or rather the URLs of documents, in the system. It manages a simple but very efficient URL database (a component called the URL Repository, which is part of the URL Server implementation) based on hash tables with a high-performance rehashing algorithm.
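A toy, in-memory version of such a hash-based URL store might look as follows; the growth policy of java.util.HashMap stands in for the project's own rehashing algorithm, which is not documented here:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the URL Repository: URLs are keyed by hash and
    // mapped to per-document metadata. The record fields are assumptions.
    public final class UrlRepository {
        public static final class UrlRecord {
            final String url;
            long lastCrawledMillis; // 0 = never crawled

            UrlRecord(String url) {
                this.url = url;
            }
        }

        private final Map<String, UrlRecord> records = new HashMap<>(1 << 20);

        // Registers a URL if unseen; returns its record either way.
        public UrlRecord addOrGet(String url) {
            return records.computeIfAbsent(url, UrlRecord::new);
        }

        public void markCrawled(String url, long timestampMillis) {
            addOrGet(url).lastCrawledMillis = timestampMillis;
        }
    }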

The purpose of the Crawler is to retrieve documents from the Internet and feed them to the Indexer. Each Crawler keeps up to 20 connections open in each of its 20 threads at the same time; this is necessary to retrieve web pages at high speed. At peak speed, the system can crawl over 100 web pages per second. To increase Crawler performance, a thread-pool technique is used.
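The thread-pool arrangement could be sketched roughly as below. The 20-thread figure comes from the text above; the queue, class names, and elided fetch logic are assumptions:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch of the Crawler's thread pool: a fixed pool of 20
    // worker threads pulls URLs from a shared frontier queue.
    public final class CrawlerPool {
        private static final int WORKER_THREADS = 20;

        private final ExecutorService pool = Executors.newFixedThreadPool(WORKER_THREADS);
        private final BlockingQueue<String> frontier = new LinkedBlockingQueue<>();

        public void submit(String url) {
            frontier.add(url);
        }

        public void start() {
            for (int i = 0; i < WORKER_THREADS; i++) {
                pool.execute(() -> {
                    try {
                        while (true) {
                            String url = frontier.take(); // blocks until a URL is queued
                            fetchAndStore(url);
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
        }

        private void fetchAndStore(String url) {
            // Real crawler: open the HTTP connection (up to 20 per worker),
            // download the document, hand it to the Indexer. Elided here.
        }
    }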

The Search engine consists of two logical modules: the Indexer and the Search engine core (see the architecture screenshot below). The first manages the Indexer database; the second processes search requests. The Indexer is also responsible for re-indexing new documents retrieved by the Crawler.

The Indexer database is a simple, high-performance database intended to hold close to one million records.
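For intuition, the core data structure behind such an indexer is typically an inverted index, mapping each term to the documents that contain it. A minimal in-memory sketch follows; it is not the project's actual on-disk format:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    // Hypothetical in-memory sketch of an inverted index: each term maps to
    // the set of document IDs that contain it.
    public final class InvertedIndex {
        private final Map<String, Set<Integer>> postings = new HashMap<>();

        public void index(int docId, String[] terms) {
            for (String term : terms) {
                postings.computeIfAbsent(term.toLowerCase(), t -> new HashSet<>())
                        .add(docId);
            }
        }

        public Set<Integer> lookup(String term) {
            return postings.getOrDefault(term.toLowerCase(), Collections.emptySet());
        }
    }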

The WEB front-end for the system is implemented in JSP. This module communicates with two others: the Search engine core and the Thesaurus. An HTTP server is embedded in the WEB front-end component, so no separate HTTP daemon needs to be started to use the system.
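Embedding the HTTP listener could look roughly like the following; the JDK's built-in com.sun.net.httpserver is used here only as a stand-in, since the project's actual embedded server is not named:

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;

    // Hypothetical sketch: the front-end starts its own HTTP listener, so no
    // external HTTP daemon is required.
    public final class EmbeddedFrontEnd {
        public static void main(String[] args) throws Exception {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/search", exchange -> {
                String query = exchange.getRequestURI().getQuery(); // e.g. "q=word"
                byte[] body = ("results for: " + query).getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start(); // serves http://localhost:8080/search?q=word
        }
    }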

The purpose of the Thesaurus is to provide meanings and synonyms for a given word and to store relations between words. This information is used by the front-end application to provide a search-refining capability, which drastically increases the quality and relevance of search results. Two dictionaries were implemented: a gateway to WordNet® and the project's own Custom Dictionary. WordNet® is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory.
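The two dictionaries could plug in behind one lookup interface, with the Thesaurus merging their answers. The class names below are invented for illustration:

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch: the WordNet® gateway and the Custom Dictionary sit
    // behind one interface, and the Thesaurus combines their results.
    public final class CompositeThesaurus {
        public interface Dictionary {
            List<String> synonyms(String word);
        }

        private final Dictionary wordNetGateway;   // bridges to the WordNet® database
        private final Dictionary customDictionary; // project-specific entries

        public CompositeThesaurus(Dictionary wordNet, Dictionary custom) {
            this.wordNetGateway = wordNet;
            this.customDictionary = custom;
        }

        public List<String> synonyms(String word) {
            List<String> result = new ArrayList<>(wordNetGateway.synonyms(word));
            result.addAll(customDictionary.synonyms(word));
            return result;
        }
    }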

The Thesaurus Editor provides a web front-end for managing the Thesaurus and editing the Custom Dictionary.

Screenshot(s): High-level System Architecture
Tools and Technologies

JAVA, JSP, TCP/IP
WordNet® lexical database (by Cognitive Science Laboratory)
Linux

Benefits
The customer received a system that meets the highest market requirements. The functionality of the designed system is on a par with the world's leading search engines, as the following facts show:

  • The search engine can index up to 50,000,000 web pages
  • Each Crawler instance processes 20-30 web pages per second
  • Each Indexer instance processes 10-20 web pages per second
  • The search system processes a query in under 1 second on an index of 1 billion documents

