Offshore Software Development from Russia  








© 2000-2016 Novosoft LLC.



  A large-scale search engine

Custom Project Name

A large-scale full-text indexing search engine

Customer (country)

A leading American provider of Internet services

Business Case

With the current scale and growth of the World Wide Web, being able to locate Web pages effectively and accurately is of prime importance. Currently, the only feasible way for a searcher to locate a particular Web-based source is to use a Web search engine. Generic large-scale search engines return thousands of results, many of which lack relevance to the query, and searchers tend to look only at the first few results, so accurate ranking is critical.

The customer decided to build a Web-scale search engine free of the problems of existing search systems. The goal of the project is to address issues of both quality and scalability by scaling search engine technology to the extraordinary growth of the web. Creating a search engine that scales even to today's web presents many challenges. Fast crawling technology is needed to gather web documents and keep them up to date. Storage space must be used efficiently to store the indices and the documents themselves.

The system has to keep local copies of documents retrieved from the Internet and needs access to fast data storage. The full size of the document repository, which contains all information about web pages, including the document header, the archived document body, etc., is estimated at dozens of terabytes.

Solution

Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Novosoft spent over 1000 man-hours investigating the issues of calculating relevance for large volumes of data. As a result, Novosoft designed an architecture that can support novel research activities on large-scale web data. The system is based on the following major concepts:

Distributed data processing
Because the Internet grows so fast, it is almost impossible to keep an index up to date and perform thousands of query operations per second with one centralized server. To solve this problem, Novosoft developed a distributed data processing technology. Each service can run on a dedicated computer or share a computer with another service, so all components are isolated from each other and can be reused. This isolation also improves the maintainability of the project.

Search by meaning feature
Generally, this means the following search scheme. First, the user enters a word to search for. Then the search engine looks up this word in a dictionary and, along with the standard search results, generates a list of possible meanings of the word. If the user selects a particular meaning, the system generates a refined search request consisting of the word entered by the user and the synonyms obtained from the dictionary. This scheme ensures that a refined search returns the desired URLs first.
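The scheme above can be sketched in a few lines of Java. This is only an illustrative sketch, not Novosoft's code: the class name MeaningSearch, its methods, and the dictionary layout (word, then meaning label, then synonyms) are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "search by meaning" flow; names and data layout are assumed.
final class MeaningSearch {
    // Assumed dictionary shape: word -> (meaning label -> synonyms for that meaning).
    private final Map<String, Map<String, List<String>>> dictionary = new LinkedHashMap<>();

    void addMeaning(String word, String meaning, List<String> synonyms) {
        dictionary.computeIfAbsent(word, w -> new LinkedHashMap<>())
                  .put(meaning, List.copyOf(synonyms));
    }

    // Possible meanings to show next to the standard search results.
    List<String> meaningsOf(String word) {
        Map<String, List<String>> entry = dictionary.get(word);
        return entry == null ? List.of() : new ArrayList<>(entry.keySet());
    }

    // Refined request: the original word plus the synonyms of the chosen meaning.
    List<String> refine(String word, String meaning) {
        List<String> terms = new ArrayList<>();
        terms.add(word);
        Map<String, List<String>> entry = dictionary.get(word);
        if (entry != null) {
            terms.addAll(entry.getOrDefault(meaning, List.of()));
        }
        return terms;
    }
}
```

A word that is missing from the dictionary simply yields a refined request containing only the word itself, so the standard search still works.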

Multi-core server
This feature allows one system server to use several independent indexes. It can be useful for a wide range of applications, for example when many different independent sites are indexed and searched through a single server.

Scalability
The search engine can be scaled to any target system, from desktops to high-end computers. This is made possible by the distributed data processing architecture shown in the screenshot below.

The solution consists of the following components:

  • URL Server
  • Crawler
  • Search engine
    • Indexer
    • Search Engine core
  • WEB front-end
  • Thesaurus
  • Thesaurus Editor

The URL Server handles information about all documents or, rather, the URLs of documents in the system. It manages a simple but very efficient URL database based on hash tables with a high-performance rehashing algorithm. (This component is called the URL Repository, but it is part of the URL Server implementation.)
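The text does not describe the actual rehashing algorithm, so the following is only a minimal sketch of the general idea: a hash table of URLs that doubles its capacity and re-inserts every entry once it becomes half full. The class name UrlRepository, the linear probing strategy, and the load factor are assumptions chosen for the example.

```java
// Hypothetical sketch of a hash-table URL store; the real rehashing algorithm is not documented.
final class UrlRepository {
    private String[] slots = new String[16]; // capacity is always a power of two
    private int size;

    // Returns true if the URL was not yet known to the repository.
    boolean add(String url) {
        if ((size + 1) * 2 > slots.length) {
            rehash(); // keep the table at most half full so probing stays short
        }
        int i = indexFor(url, slots);
        if (slots[i] != null) {
            return false;
        }
        slots[i] = url;
        size++;
        return true;
    }

    boolean contains(String url) {
        return slots[indexFor(url, slots)] != null;
    }

    int size() {
        return size;
    }

    // Linear probing: stop at the slot holding the URL or at the first free slot.
    private static int indexFor(String url, String[] table) {
        int mask = table.length - 1;
        int h = url.hashCode();
        int i = (h ^ (h >>> 16)) & mask;
        while (table[i] != null && !table[i].equals(url)) {
            i = (i + 1) & mask;
        }
        return i;
    }

    // Rehashing: double the table and re-insert every stored URL.
    private void rehash() {
        String[] bigger = new String[slots.length * 2];
        for (String url : slots) {
            if (url != null) {
                bigger[indexFor(url, bigger)] = url;
            }
        }
        slots = bigger;
    }
}
```

Because the table never gets more than half full, a probe always reaches a free slot, and a duplicate check costs roughly one lookup on average.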

The purpose of the Crawler is to retrieve documents from the Internet and pass them to the Indexer. To retrieve web pages at high speed, each Crawler runs 20 threads at the same time, and each thread keeps up to 20 connections open. At peak speed, the system can crawl over 100 web pages per second. A thread-pool technique is used to increase Crawler performance.
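As a rough illustration of the thread-pool technique, the sketch below downloads a batch of URLs on a fixed pool of 20 worker threads using the standard java.util.concurrent API. It is not the original Crawler: the class name CrawlerPool and the fetcher parameter (a stand-in for the real HTTP download) are assumptions, and the sketch does not model the 20 connections per thread, which would give up to 20 × 20 = 400 open connections per Crawler.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch of a thread-pool crawler; the fetcher stands in for a real HTTP client.
final class CrawlerPool {
    static final int THREADS = 20; // thread count quoted in the text

    // Fetches all URLs on a fixed pool and returns the documents in input order.
    static List<String> crawl(List<String> urls, Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            List<String> documents = new ArrayList<>();
            // invokeAll blocks until every task is done and keeps the task order.
            for (Future<String> result : pool.invokeAll(tasks)) {
                documents.add(result.get());
            }
            return documents;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("fetch failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Reusing a fixed set of threads avoids the cost of creating a thread per page, which matters when a Crawler instance is expected to handle tens of pages per second.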

The search engine consists of two logical modules: the Indexer and the Search engine core (see the screenshot below). The first manages the Indexer database; the second processes search requests. The Indexer is also responsible for re-indexing new documents retrieved by the Crawler.

The Indexer database is a simple, high-performance database designed to hold close to one million records.
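The internal format of the Indexer database is not described here. As a hedged illustration of how an Indexer and a search core typically divide the work in a full-text engine, the sketch below uses an in-memory inverted index: index() plays the Indexer role and search() plays the search core role. The class name TinyIndex, the tokenization rule, and the AND semantics of a query are all assumptions for the example, and re-indexing of changed documents is not covered.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory inverted index; not the actual Indexer database.
final class TinyIndex {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Indexer role: split the text into terms and record the document id for each term.
    void index(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Search core role: return the documents that contain every query term.
    Set<Integer> search(String query) {
        Set<Integer> hits = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<Integer> docs = postings.getOrDefault(term, Set.of());
            if (hits == null) {
                hits = new TreeSet<>(docs);
            } else {
                hits.retainAll(docs);
            }
        }
        return hits == null ? new TreeSet<>() : hits;
    }
}
```

Keeping the write path (index) separate from the read path (search) mirrors the two-module split described above and lets each side run as its own service.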

The WEB front-end for the system is implemented in JSP. This module communicates with two others: the Search engine core and the Thesaurus. An HTTP server is embedded in the WEB front-end component, so no separate HTTP daemon has to be started to use the system.

The purpose of the Thesaurus is to provide meanings and synonyms for a given word and to store relations between words. The front-end application uses this information to provide a search-refining capability, which significantly increases the quality and relevance of search results. Two dictionaries were implemented: a gateway to WordNet® and the system's own Custom Dictionary. WordNet® is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory.

The Thesaurus Editor provides a web front-end for managing the Thesaurus and editing the Custom Dictionary.

Screenshot: High-level System Architecture

Tools and Technologies

JAVA, JSP, TCP/IP
WordNet® lexical database (by Cognitive Science Laboratory)
Linux

Benefits

The customer received a system that meets the highest market requirements. Its functionality is on a par with the world's leading search engines, as the following figures show:

  • The search engine can index up to 50,000,000 web pages
  • Each Crawler instance processes 20-30 web pages per second
  • Each Indexer instance processes 10-20 web pages per second
  • The search system processes a query in less than 1 second on an index of 1 billion documents



