A Crash Course in OpenFTS
-------------------------

MOTIVATION:

  This document is for newcomers who want to install OpenFTS quickly,
  test it, and play around to get a feel for it. Assuming all
  prerequisites are already installed, the whole process should take
  about 2 minutes.

  OpenFTS is based on fairly complex algorithms from information retrieval
  and database theory, and it is intended to be flexible. The OpenFTS Primer
  describes installation, running, and the API, and is the document to use
  when writing your own search applications.

  After completing the tests, you're welcome to read README.INSIDE for
  comments on the example scripts.

PREREQUISITES:

 PostgreSQL 7.4 or above + contrib/tsearch2 module (7.3.X also works)
 OpenFTS v.0.40 - available from http://openfts.sourceforge.net
 Perl modules: DBI, DBD::Pg, Time::HiRes - available from CPAN (http://www.cpan.org)
   DBI         - http://search.cpan.org/search?dist=DBI
   DBD::Pg     - http://search.cpan.org/search?dist=DBD-Pg
   Time::HiRes - http://search.cpan.org/search?dist=Time-HiRes 
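
 Before going further it may help to confirm that Perl can actually load
 the three modules. A minimal sketch, assuming perl is on your PATH; the
 helper name have_perl_module is made up for this example:

```shell
# Succeeds if Perl can load the named module, fails otherwise.
# (have_perl_module is a hypothetical helper, not part of OpenFTS.)
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Report the status of each module listed above.
for m in DBI DBD::Pg Time::HiRes; do
    if have_perl_module "$m"; then
        echo "$m: ok"
    else
        echo "$m: MISSING"
    fi
done
```

 Any module reported as MISSING needs to be installed from CPAN first.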

 A test collection of documents is available for download from
 http://openfts.sourceforge.net/test-collections/apod-en.tar.gz
 Download the collection and unpack it somewhere:
   cd /path/to/test-collection/
   tar xzvf apod-en.tar.gz
 Now you should have the test documents in the
 /path/to/test-collection/apod directory.
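
 A quick way to check that the archive unpacked completely is to count
 the files in that directory. A minimal sketch; count_docs is a made-up
 helper name, and the path is the same placeholder used above:

```shell
# Count regular files under a directory, using the same selection
# ("find ... -type f") that is later piped into index.pl.
# (count_docs is a hypothetical helper, not part of OpenFTS.)
count_docs() {
    find "$1" -type f | wc -l | tr -d ' '
}

# Example (placeholder path):
#   count_docs /path/to/test-collection/apod
```

 The collection is described as having 1757 articles, so a count close to
 that is a good sign. This assumes the archive holds little besides the
 articles themselves, which this document does not state.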

   APOD stands for Astronomy Picture of the Day
   ( http://antwrp.gsfc.nasa.gov/apod/ ). The authors have kindly
   granted permission to use the texts for testing and non-commercial
   purposes within the framework of the OpenFTS project.
 
   The APOD collection consists of 1757 articles (about 7 MB) and is ideally
   suited for OpenFTS. Indexing took about 29 seconds on my IBM ThinkPad T21
   notebook (Linux 2.4.17, 256 MB RAM, 20 GB IDE HD). The total number of
   lexemes is 131,310, while the number of unique lexemes is only 8,806
   (using Porter's stemmer).

   A demo is available at http://xware.astronet.ru/db/apod.html

Make sure you have sufficient rights to create a database.
Now you may note the time!

RUNNING:

1. createdb openfts

   Create the test database.

2. psql openfts < /path/to/share/contrib/tsearch2.sql

   Load the tsearch2 functions. If PostgreSQL is installed in the
   /usr/local/pgsql directory, the contrib SQL files are usually in the
   /usr/local/pgsql/share/contrib directory.

4. ./init.pl openfts drop

   Drop the previous openfts instance, if any.

5. ./init.pl openfts

   Create the openfts instance (tables) in the database.

6. find /path/to/test-collection/apod -type f | ./index.pl openfts

   Index the APOD collection.
   The resulting database occupies about 21 MB on my notebook.

7. ./search.pl -p openfts supernovae stars
   
   The output should look like a string of document identifiers
   separated by semicolons:

   Found documents:118
   573;1241;1419;828;879;1629;553;795;740;1533;....
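
   The semicolon-separated list is awkward to feed into other tools.
   A minimal sketch that splits it into one identifier per line;
   split_ids is a made-up helper name, and the sample input is the start
   of the list shown above:

```shell
# Turn "573;1241;1419" into one identifier per line.
# (split_ids is a hypothetical helper, not part of OpenFTS.)
split_ids() {
    tr ';' '\n'
}

printf '%s\n' '573;1241;1419;828;879' | split_ids

# With the real script, assuming the identifier list is the last
# line of its output (an assumption, not checked against search.pl):
#   ./search.pl -p openfts supernovae stars | tail -n 1 | split_ids
```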

8. ./search.pl -p openfts -h5   supernovae stars
   
    Show text fragments of the first 5 matched files with highlighted
    query terms.
    ( It's possible to specify an offset and limit in the form
      -h offset-limit, e.g., -h 5-10 )

9. Benchmarking. 

   ./search.pl -p openfts -b 100 supernovae stars
   Found documents:118, total time (100 runs):4.19, average time: 0.042 sec

   (Keep in mind these numbers are for my notebook; your mileage may vary.)
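
   The average is simply the total time divided by the number of runs.
   A minimal sketch of that arithmetic with awk, useful when comparing
   your own totals; average_time is a made-up helper name:

```shell
# average_time TOTAL_SECONDS RUNS
# Prints the average seconds per run, rounded to 3 decimals.
# (average_time is a hypothetical helper, not part of OpenFTS.)
average_time() {
    awk -v total="$1" -v runs="$2" 'BEGIN { printf "%.3f\n", total / runs }'
}

average_time 4.19 100    # prints 0.042
```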

--------------------------------------------------------------------

PS.

 1) A list of the unique lexemes indexed, with their frequencies, can be
    obtained using the following command:

     psql -d openfts -qt -c "select * from stat('select fts_index from txt')\
                             order by ndoc desc, nentry  desc,word"

    Number of unique lexemes:

     psql -d openfts -qt -c "select count(*) from stat('select fts_index from txt')"
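
    stat() returns one row per unique lexeme, so counting rows does not
    give the number of lexeme occurrences. That figure can be obtained by
    summing the nentry column instead. A minimal sketch, assuming the
    columns come out in the order word, ndoc, nentry and that psql is run
    with unaligned output (-A); sum_nentry is a made-up helper name and
    the sample rows are invented for illustration:

```shell
# Sum the third '|'-separated column (nentry) of the lines on stdin.
# (sum_nentry is a hypothetical helper, not part of OpenFTS.)
sum_nentry() {
    awk -F'|' '{ total += $3 } END { print total + 0 }'
}

# Invented sample rows in word|ndoc|nentry form:
printf '%s\n' 'star|120|340' 'galaxi|95|210' 'supernova|40|75' | sum_nentry

# With the real database (not run here):
#   psql -d openfts -qAt -c "select * from stat('select fts_index from txt')" | sum_nentry
```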
   
 2) We use Porter's stemming algorithm in this example, but I highly
    recommend the Snowball stemmers (http://snowball.tartarus.org).
    You'll need to install our Perl interface to Snowball, which can be
    downloaded from http://openfts.sourceforge.net/contributions.shtml
    Snowball provides high-quality stemmers and is available for many
    languages.
 

--------------------------------------------------------------------
Sat Aug  2 23:08:10 MSD 2003
Comments to Oleg Bartunov <oleg@sai.msu.su>
