glimpse



       glimpse 4.1 - search quickly through entire file systems


OVERVIEW

       Glimpse  (which  stands  for  GLobal IMPlicit SEarch) is a
       very popular UNIX indexing and query  system  that  allows
       you  to  search through a large set of files very quickly.
       Glimpse supports most of agrep's  options  (agrep  is  our
       powerful  version  of grep) including approximate matching
       (e.g., finding misspelled  words),  Boolean  queries,  and
       even  some  limited  forms  of regular expressions.  It is
       used in the same way, except that you don't have to  spec-
       ify  file names.  So, if you are looking for a needle any-
       where in your file system, all  you  have  to  do  is  say
       glimpse needle and all lines containing needle will appear
       preceded by the file name.

       To use glimpse you first need to  index  your  files  with
       glimpseindex.   For example, glimpseindex -o ~  will index
       everything at or  below  your  home  directory.   See  man
       glimpseindex for more details.

       Glimpse is also available for web sites, as a set of tools
       called WebGlimpse.  (The old glimpseHTTP is no longer sup-
       ported      and      is     not     recommended.)      See
       http://glimpse.cs.arizona.edu/webglimpse/ for more  infor-
       mation.

       Glimpse includes all of agrep and can be used instead of
       agrep by giving one or more file names at the end of the
       command.  This causes glimpse to ignore the index and run
       agrep as usual.  For example, glimpse -1 pattern file is
       the same as agrep -1 pattern file.  Agrep is distributed as
       a self-contained package within glimpse, and can be used
       separately.  We added a new option to agrep: -r recursively
       searches the directory and everything below it (see agrep
       options below); it is used only when glimpse reverts to
       agrep.

       Mail glimpse-request@cs.arizona.edu to  be  added  to  the
       glimpse  mailing  list.   Mail  glimpse@cs.arizona.edu  to
       report bugs,  ask  questions,  discuss  tricks  for  using
       glimpse,  etc. (this is a moderated mailing list with very
       little traffic, mostly announcements).  An HTML version of
       these manual pages can be found at
       http://glimpse.cs.arizona.edu/glimpsehelp.html.  Also, see
       the glimpse home pages at http://glimpse.cs.arizona.edu/


SYNOPSIS

       glimpse - [almost all letters] pattern


INTRODUCTION

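       We start with simple ways to use glimpse and describe all
       the options in detail later on.  Once an index is built
       with glimpseindex, searching for a pattern is as easy as
       saying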

       glimpse pattern

       The output of glimpse is similar to that of agrep (or any
       other grep).  The pattern can be any legal agrep pattern,
       including a regular expression or a Boolean query (e.g.,
       searching for Tucson AND Arizona is done by
       glimpse 'Tucson;Arizona').

       The speed of glimpse depends  mainly  on  the  number  and
       sizes of the files that contain a match and only to a sec-
       ond degree on the total size of all indexed files.  If the
       pattern  is  reasonably uncommon, then all matches will be
       reported in a few seconds even if the indexed files  total
       500MB  or more.  Some information on how glimpse works and
       a reference to a detailed article are given below.

       Most options of agrep (and of other greps) are supported,
       including approximate matching.  For example,

       glimpse -1 'Tuson;Arezona'

       will  output  all  lines containing both patterns allowing
       one spelling error in any of the patterns  (either  inser-
       tion,  deletion,  or  substitution), which in this case is
       definitely needed.
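
       As an illustration only, the error count can be pictured as
       an edit distance.  The Python sketch below is not the
       algorithm that agrep or glimpse actually uses, and the
       function name edit_distance is hypothetical; it simply
       counts the fewest single-character insertions, deletions,
       and substitutions that turn one string into another, each
       at unit cost.

```python
def edit_distance(pattern: str, word: str) -> int:
    """Fewest insertions, deletions, and substitutions (unit cost each)
    needed to turn pattern into word."""
    prev = list(range(len(word) + 1))          # distances from the empty prefix
    for i, p in enumerate(pattern, start=1):
        curr = [i]
        for j, w in enumerate(word, start=1):
            cost = 0 if p == w else 1
            curr.append(min(prev[j] + 1,          # drop p from the pattern
                            curr[j - 1] + 1,      # add w to the pattern
                            prev[j - 1] + cost))  # substitute p by w, or match
        prev = curr
    return prev[-1]


print(edit_distance("Tuson", "Tucson"))
print(edit_distance("Arezona", "Arizona"))
```

       Each misspelling above is one edit away from the intended
       word (one missing letter, one wrong letter), so a budget of
       one error per pattern is enough.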

       glimpse -w -i 'parent'

       specifies a case-insensitive search (-i) that matches only
       complete words (-w).  So 'Parent' and 'PARENT' will match,
       'parent/child' will match, but 'parenthesis' or 'parents'
       will not match.  (Starting at version 3.0, glimpse can be
       much faster when these two options are specified,
       especially for very large indexes.  You may want to set an
       alias for "glimpse -w -i".)

       The -F option provides a pattern that must match the  file
       name.  For example,

       glimpse -F '\.c$' needle

       will  find the pattern needle in all files whose name ends
       with .c.  (Glimpse will first check its index to determine
       which  files may contain the pattern and then run agrep on
       the file names to  further  limit  the  search.)   The  -F
       option should not be put at the end after the main pattern
       (e.g., "glimpse needle -F hay" is incorrect).


A Detailed Description of All the Options of Glimpse

       -#     # is an integer between 1 and 8 specifying the maximum
              number of errors permitted in finding the approximate
              matches.  Generally, each insertion, deletion, or
              substitution counts as one error.  It is possible to
              adjust the relative cost of insertions, deletions, and
              substitutions (see the -I, -D, and -S options).  Since
              the index stores only lower case characters, errors of
              substituting upper case with lower case may be missed
              (see LIMITATIONS).  Allowing errors in the match
              requires more time and can slow down the match by a
              factor of 2-4.  Be very careful when specifying more
              than one error, as the number of matches tends to grow
              very quickly.

       -a     prints  attribute  names.  This option applies only
              to Harvest SOIF structured data (used  with  glimp-
              seindex  -s).  (See http://harvest.transarc.com for
              more information about the Harvest project.)

       -A     used for glimpse internals.

       -b     prints the byte offset (from the beginning  of  the
              file)  of the end of each match.  The first charac-
              ter in a file has offset 0.

       -B     Best match mode.   (Warning:  -B  sometimes  misses
              matches.   It  is  safer  to  specify the number of
              errors explicitly.)  When -B is  specified  and  no
              exact  matches  are found, glimpse will continue to
              search until the closest matches  (i.e.,  the  ones
              with  minimum number of errors) are found, at which
              point the following message  will  be  shown:  "the
              best  match contains x errors, there are y matches,
              output them? (y/n)" This message refers to the num-
              ber  of  matches  found in the index.  There may be
              many more matches in the actual text (or there  may
              be  none  if -F is used to filter files).  When the
              -#, -c, or -l options are specified, the -B  option
              is  ignored.  In general, -B may be slower than -#,
              but not by very much.  Since the index stores  only
              lower case characters, errors of substituting upper
              case with lower case may  be  missed  (see  LIMITA-
              TIONS).

       -c     Display  only  the count of matching records.  Only
              files with count > 0 are displayed.

       -C     tells glimpse to send its queries to glimpseserver.

       -d 'delim'
              Define delim to be the separator between two
              records.  The default value is '$', namely a record
              is by default a line.  delim can be a string of
              size at most 8 (with possible use of ^ and $), but
              not a regular expression.  Text between two delims,
              before the first delim, and after the last delim is
              considered as one record.  For example, -d
              '$$' defines paragraphs as records and -d  '^From '
              defines  mail messages as records.  glimpse matches
              each record separately.  This option does not  cur-
              rently  work  with  regular  expressions.   The  -d
              option  is  especially  useful  for   Boolean   AND
              queries,  because  the  patterns need not appear in
              the same line but in the same record.  For example,

              glimpse -F mail -d '^From ' 'glimpse;arizona;announcement'

              will output all mail messages (in their entirety)
              that have the 3 patterns anywhere in the message (or
              the header), assuming that files with 'mail' in
              their name contain mail messages.  If you want the
              scope of the record to be the whole file, use the
              -W option.  Glimpse warning: Use this option with
              care.  If the delimiter
              is  set  to  match  mail messages, for example, and
              glimpse finds the pattern in a regular file, it may
              not  find  the  delimiter and will therefore output
              the whole file.  (The -t option - see below  -  can
              be used to put the delim at the end of the record.)
              Performance Note: Agrep (and  glimpse)  resorts  to
              more  complex  search  when  the -d option is used.
              The search is slower and unfortunately no more than
              32 characters can be used in the pattern.
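
              The effect of a record delimiter on Boolean AND
              queries can be pictured with the following Python
              sketch.  It is an illustration under stated
              assumptions, not glimpse's implementation: the
              helper names split_records and and_query are
              hypothetical, the delimiter is interpreted as a
              Python regular expression, and each delimiter is
              kept at the start of its record.

```python
import re


def split_records(text: str, delim: str = "^From ") -> list[str]:
    """Split text into records, each beginning where delim matches."""
    starts = [m.start() for m in re.finditer(delim, text, flags=re.MULTILINE)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)        # text before the first delimiter is a record too
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def and_query(records: list[str], words: list[str]) -> list[str]:
    """Keep the records that contain every word, on any line of the record."""
    return [r for r in records if all(w in r for w in words)]


mailbox = (
    "From alice\nglimpse announcement\nsent from arizona\n"
    "From bob\nglimpse only\n"
)
records = split_records(mailbox)
hits = and_query(records, ["glimpse", "arizona", "announcement"])
print(len(records), len(hits))
```

              The first message is a hit even though the three
              words sit on different lines, because the whole
              message is one record.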

       -Dk    Set  the  cost  of a deletion to k (k is a positive
              integer).  This option does not currently work with
              regular expressions.

       -e pattern
              Same  as a simple pattern argument, but useful when
              the pattern begins with a `-'.

       -E     prints the lines in the index (as  they  appear  in
              the  index)  which  match the pattern.  Used mostly
              for debugging and maintenance of the  index.   This
              is not an option that a user needs to know about.

       -f file_name
              this  option has a different meaning for agrep than
              for glimpse: In glimpse, only the files whose names
              are  listed  in  file_name  are matched.  (The file
              names have to appear as in .glimpse_filenames.)  In
              agrep,  the file_name contains the list of the pat-
              terns that are searched.  (Starting at version 3.6,
              this  option  for  glimpse is much faster for large
              files.)

       -F file_pattern
              limits the search to those files whose name
              (including the whole path) matches file_pattern.
              This option can be used in a variety of applications
              to provide limited search even for one large index.
              If file_pattern matches a directory,
              then  all  files  with this directory on their path
              will be considered.  To limit the search to  actual
              file  names,  use  $  at  the  end  of the pattern.
              file_pattern can be a regular expression and even a
              Boolean  pattern.   This  option  is implemented by
              running agrep file_pattern  on  the  list  of  file
              names  obtained from the index.  Therefore, search-
              ing the index itself takes the same amount of time,
              but limiting the second phase of the search to only
              a few files can speed up the search  significantly.
              For example,

              glimpse -F 'src#\.c$' needle

              will  search  for  needle  in all .c files with src
              somewhere along the path.  The -F file_pattern must
              appear  before  the  search  pattern (e.g., glimpse
              needle -F '\.c$' will not work).  It is possible to
              use  some  of  agrep's  options  when matching file
              names.  In this case all options  as  well  as  the
              file_pattern  should  be  in quotes.  (-B and -v do
              not work very well as part of a file_pattern.)  For
              example,

              glimpse -F '-1 \.html' pattern

              will  allow  one spelling error when matching .html
              to the file names  (so  ".htm"  and  ".shtml"  will
              match as well).

              glimpse -F '-v \.c$' counter

              will  search  for 'counter' in all files except for
              .c files.
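
              The file-name filtering step can be sketched in
              Python as follows.  This is only an approximation
              for illustration: glimpse runs agrep on the file
              list, whereas the sketch uses Python's re module,
              so the agrep wild card # is written as .* here.
              The function filter_files and its invert parameter
              (standing in for -v) are hypothetical names.

```python
import re


def filter_files(file_names, file_pattern, invert=False):
    """Return the names whose full path matches file_pattern
    (or, with invert=True, the names that do not match)."""
    rx = re.compile(file_pattern)
    return [f for f in file_names if bool(rx.search(f)) != invert]


names = [
    "/home/u/src/main.c",
    "/home/u/src/README",
    "/home/u/doc/notes.c",
    "/home/u/csrc.d/x.txt",
]
print(filter_files(names, r"src.*\.c$"))          # like -F 'src#\.c$'
print(filter_files(names, r"\.c$", invert=True))  # like -F '-v \.c$'
```

              Only the files that survive this step are opened in
              the second phase of the search.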

       -g     prints  the  file  number  (its  position  in   the
              .glimpse_filenames file) rather than its name.

       -G     Output the (whole) files that contain a match.

       -h     Do not display filenames.

       -H directory_name
              searches for the index and the other .glimpse files
              in directory_name.  The default is the home  direc-
              tory.   This option is useful, for example, if sev-
              eral different indexes are maintained for different
              archives  (e.g.,  one  for  mail  messages, one for
              source code, one for articles).

       -i     Case-insensitive search -- e.g., "A" and "a" are
              considered equivalent.  Performance Note: When -i is
              used together with the -w option, the search may
              become much faster.  It is recommended to have -i
              and -w as defaults, for example, through an alias.
              We use the following alias in our .cshrc file:
              alias glwi 'glimpse -w -i'

       -Ik    Set  the cost of an insertion to k (k is a positive
              integer).  This option does not currently work with
              regular expressions.

       -j     If the index was constructed with the -t option,
              then -j will output the files' last modification
              dates in addition to everything else.  There are no
              major performance penalties for this option.

       -J host_name
              used in conjunction with glimpseserver (-C) to con-
              nect to one particular server.

       -k     No symbol in the pattern is treated as a meta char-
              acter.  For example,  glimpse  -k  'a(b|c)*d'  will
              find  the  occurrences  of a(b|c)*d whereas glimpse
              'a(b|c)*d' will find substrings that match the reg-
              ular expression 'a(b|c)*d'.  (The only exception is
              ^ at the beginning of the pattern and $ at the  end
              of  the pattern, which are still interpreted in the
              usual way.  Use \^ or \$ if you  need  them  verba-
              tim.)

       -K port_number
              used in conjunction with glimpseserver (-C) to con-
              nect to one particular server at the specified  TCP
              port number.

       -l     Output only the names of files that contain a match.
              This option differs from the -N option in that the
              files themselves are searched, but the matching
              lines are not shown.

       -L x | x:y | x:y:z
              if one number is given, it is a limit on the  total
              number  of matches.  Glimpse outputs only the first
              x matches.  If -l is used (i.e.,  only  file  names
              are  sought),  then  the  limit is on the number of
              files; otherwise, the limit is  on  the  number  of
              records.  If two numbers are given (x:y), then y is
              an added limit on the total number  of  files.   If
              three numbers are given (x:y:z), then z is an added
              limit on the number of matches per file.  If any of
              x, y, or z is set to 0, it is ignored (in other
              words, 0 = infinity in this case); for example,
              -L 0:10 will output all matches to the first 10
              files that contain a match.  This option is
              particularly useful for servers that need to limit
              the amount of output provided to clients.
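
              One way to read the x:y:z limits is sketched below
              in Python.  The names parse_limits and apply_limits
              are hypothetical, and the order in which glimpse
              itself applies the limits is an assumption; the
              sketch only mirrors the description above, with 0
              meaning no limit.  The -l case, where x limits the
              number of files, is not modeled.

```python
import math


def parse_limits(spec: str):
    """Parse 'x', 'x:y', or 'x:y:z' into (total, files, per_file) limits."""
    parts = [int(p) for p in spec.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected x, x:y, or x:y:z")
    parts += [0] * (3 - len(parts))               # missing fields mean no limit
    return tuple(math.inf if p == 0 else p for p in parts)


def apply_limits(matches_by_file: dict, spec: str):
    """Return (file, record) pairs that fit within the limits."""
    max_total, max_files, max_per_file = parse_limits(spec)
    out = []
    for n_files, (name, records) in enumerate(matches_by_file.items(), start=1):
        if n_files > max_files:
            break
        for k, record in enumerate(records):
            if k >= max_per_file or len(out) >= max_total:
                break
            out.append((name, record))
    return out


matches = {"a": ["1", "2", "3"], "b": ["4"], "c": ["5"]}
print(apply_limits(matches, "0:2"))    # every match, first two files only
print(apply_limits(matches, "0:0:1"))  # at most one match per file
```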

       -m     used for glimpse internals.

       -M     used for glimpse internals.

       -n     Each matching record  (line)  is  prefixed  by  its
              record  (line)  number  in  the  file.  Performance
              Note: To  compute  the  record/line  number,  agrep
              needs  to search for all record delimiters (or line
              breaks), which can slow down the search.

       -N     searches only the index (so the search is faster).
              If the index was built with -o or -b, then the
              result is the number of files that have a potential
              match plus a prompt to ask if you want to see the
              file names.  (If -y is used, then there is no prompt
              and the names of the files will be shown.)  This
              could be a way to
              get the matching file  names  without  even  having
              access  to  the files themselves.  However, because
              only the index is searched, some potential  matches
              may  not  be real matches.  In other words, with -N
              you will not miss any file but you  may  get  extra
              files.   For example, since the index stores every-
              thing in lower case,  a  case-sensitive  query  may
              match  a  file  that  has  only  a case-insensitive
              match.  Boolean queries may match a file  that  has
              all the keywords but not in the same line (indexing
              with -b allows glimpse to figure  out  whether  the
              keywords  are  close, but it cannot figure out from
              the index whether they are exactly on the same line
              or in the same record without looking at the file).
              If the index was not built with -o or -b, then this
              option  outputs  the  number of blocks matching the
              pattern.  This is useful as an  indication  of  how
              long  the  search  will take.  All files are parti-
              tioned  into  usually  200-250  blocks.   The  file
              .glimpse_statistics  contains  the  total number of
              blocks (or glimpse -N a will  give  a  pretty  good
              estimate;  only  blocks  with no occurrences of 'a'
              will be missed).

       -o     the opposite of -t: the delimiter is not output  at
              the  tail,  but  at  the  beginning  of the matched
              record.

       -O     the file names are not printed before every matched
              record;  instead,  each  filename  is  printed just
              once, and all the matched  records  within  it  are
              printed after it.

       -p     Allows you to utilize compressed `neighborhoods'
              (sets of filenames) to limit your search, without
              uncompressing them.  Added mostly for WebGlimpse.
              The usage is:
              "-p filename:X:Y:Z" where "filename" is the file
              with compressed neighborhoods, X is an offset into
              that file (usually 0, must be a multiple of
              sizeof(int)), Y is the length glimpse must access
              from that file (if 0, then whole file; must be a
              multiple of sizeof(int)), and Z must be 2 (it
              indicates that "filename" has the sparse-set
              representation of compressed neighborhoods; the
              other values are for internal use only).  Note that
              any colon ":" in filename must be escaped using a
              backslash (\).

       -P     used for glimpse internals.

       -q     prints the offsets of the beginning and end of each
              matched  record.   The difference between -q and -b
              is that -b prints the offsets of the actual matched
              string,  while  -q  prints the offsets of the whole
              record where the match occurred.  The output format
              is  @x{y}, where x is the beginning offset and y is
              the end offset.

       -Q     when used together with -N, glimpse displays not
              only the filename where the match occurs, but also
              the exact occurrences (offsets) as seen in the
              index.  This option is relevant only if the index
              was built with -b; otherwise, the offsets are not
              available in the index.  This option is ignored
              when used without -N.

       -r     This option is an  agrep  option  and  it  will  be
              ignored  in  glimpse, unless glimpse is used with a
              file name at the end which makes it run  as  agrep.
              If the file name is a directory name, the -r option
              will search (recursively) the whole  directory  and
              everything  below  it.  (The glimpse index will not
              be used.)

       -R k   defines the maximum size (in bytes) of a record.
              The maximum value (which is the default) is 48K.
              Defining the maximum to be lower than the default
              may speed up some searches.

       -s     Work  silently,  that  is,  display  nothing except
              error messages.  This is useful  for  checking  the
              error status.

       -Sk    Set the cost of a substitution to k (k is a
              positive integer).  This option does not currently
              work with regular expressions.

       -t     Similar to the -d option, except that the delimiter
              is assumed to appear at the end of the record.
              Glimpse will output the record starting from the
              end of delim to (and including) the next delim.
              (See warning for the -d option.)

       -T directory
              Use  directory as a place where temporary files are
              built.   (Glimpse  produces  some  small  temporary
              files  usually  in  /tmp.)   This  option is useful
              mainly in the context of structured queries for the
              Harvest  project,  where the temporary files may be
              non-trivial, and the /tmp directory  may  not  have
              enough space for them.

       -U     (starting  at  version  4.0B1)  Interprets an index
              created with the -X or the -U option in  glimpsein-
              dex.   Useful  mostly for WebGlimpse or similar web
              applications.  When  glimpse  outputs  matches,  it
              will  display  the filename, the URL, and the title
              automatically.

       -v     (This option is an agrep  option  and  it  will  be
              ignored  in  glimpse, unless glimpse is used with a
              file name at the end which makes it run as  agrep.)
              Output  all  records/lines  that  do  not contain a
              match.  (Glimpse does not support the NOT  operator
              yet.)

       -V     prints the current version of glimpse.

       -w     Search for the pattern as a word -- i.e.,
              surrounded by non-alphanumeric characters.  For
              example, glimpse -w car will match car, but not
              characters and not car10.  The non-alphanumeric
              characters must surround the match; they cannot be
              counted as errors.  This option does not work with
              regular expressions.  The -w option will not work
              with $, ^, and _ (see BUGS below).  Performance
              Note: When -w is used together with the -i option,
              the search may become much faster.  It is
              recommended to have -i and -w as defaults, for
              example, through an alias.  We use the following
              alias in our .cshrc file:
              alias glwi 'glimpse -w -i'
       -W     The default for Boolean AND queries  is  that  they
              cover  one  record (the default for a record is one
              line) at a time.  For example,  glimpse  'good;bad'
              will  output  all  lines containing both 'good' and
              'bad'.  The -W option changes the scope of Booleans
              to be the whole file.  Within a file glimpse will
              output all matches to any of the patterns.  So,
              glimpse -W 'good;bad' will output all lines
              containing 'good' or 'bad', but only in files that
              contain both patterns.  The NOT operator `~' can be
              used only with -W.  It is described later on.  The
              OR operator is essentially unaffected (unless it is
              in combination with the other Boolean  operations).
              For  structured  queries,  the  scope is always the
              whole attribute or file.
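
              The difference between the two scopes can be
              sketched in Python.  The helper names and_by_line
              and and_by_file are hypothetical, and plain
              substring tests stand in for real pattern matching;
              the sketch assumes that with file scope a file
              qualifies only if it contains every pattern, after
              which every line matching any pattern is output.

```python
def and_by_line(lines, words):
    """Default scope: a line must contain every word."""
    return [line for line in lines if all(w in line for w in words)]


def and_by_file(lines, words):
    """File scope (as with -W): the file must contain every word;
    then every line containing any of the words is output."""
    text = "\n".join(lines)
    if not all(w in text for w in words):
        return []
    return [line for line in lines if any(w in line for w in words)]


lines = ["good start", "neutral", "bad end", "good and bad"]
print(and_by_line(lines, ["good", "bad"]))
print(and_by_file(lines, ["good", "bad"]))
```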

       -x     The pattern  must  match  the  whole  line.   (This
              option  is  translated  to  -w  when  the  index is
              searched and it is used only when the  actual  text
              is searched.  It is of limited use in glimpse.)

       -X     (from version 4.0B1 only) Output the names of files
              that contain a match even if these files have  been
              deleted  since  the  index was built.  Without this
              option glimpse will simply ignore these files.

       -y     Do not prompt.  Proceed with the match  as  if  the
              answer  to  any prompt is y.  Servers (or any other
              scripts) using glimpse will probably  want  to  use
              this option.

       -Y k   If the index was constructed with the -t option,
              then -Y k will output only matches to files that
              were created or modified within the last k days.
              There are no major performance penalties for this
              option.

       -z     Allow customizable filtering, using the file
              .glimpse_filters to run the programs listed there
              for each match.  The best example is
              compress/decompress.  If .glimpse_filters includes
              the line
              *.Z   uncompress <
              (separated  by  tabs) then before indexing any file
              that matches the pattern "*.Z" (same syntax as  the
              one  for  .glimpse_exclude)  the  command listed is
              executed first (assuming input is from stdin, which
              is why uncompress needs <) and its output (assuming
              it goes to stdout) is indexed.  The file itself  is
              not  changed  (i.e., it stays compressed).  Then if
              glimpse -z is used, the same  program  is  used  on
              these  files  on  the fly.  Any program can be used
              (we run 'exec').  For example, one can  filter  out
              parts  of files that should not be indexed.  Glimp-
              seindex tries to apply all filters in .glimpse_fil-
              ters  in the order they are given.  For example, if
              you want to uncompress a file and then extract some
              part of it, put the compression command (the
              example above) first and then another line that
              specifies the extraction.  Note that this can slow
              down the search because the filters need to be run
              before files are searched.  (See also
              glimpseindex.)

       The characters `$', `^', `*', `[', `]', `|', `(', `)',
       `!', and `\' can cause unexpected results when included in
       the pattern, as these characters are also meaningful to
       the shell.  To avoid these problems, enclose the entire
       pattern in single quotes, i.e., 'pattern'.  Do not use
       double quotes (").


PATTERNS

       glimpse supports a large variety  of  patterns,  including
       simple  strings,  strings with classes of characters, sets
       of strings, wild cards, and regular expressions (see LIMI-
       TATIONS).

       Strings
              Strings  are  any sequence of characters, including
              the special symbols `^' for beginning of  line  and
              `$' for end of line.  The following special charac-
              ters ( `$', `^', `*', `[', `^', `|', `(', `)', `!',
              and  `\' ) as well as the following meta characters
              special to glimpse (and agrep): `;', `,', `#', `<',
              `>',  `-',  and  `.',  should be preceded by `\' if
              they are to be matched as regular characters.   For
              example,  \^abc\\  corresponds to the string ^abc\,
              whereas ^abc corresponds to the string abc  at  the
              beginning of a line.

       Classes of characters
              A list of characters inside [] (in order)
              corresponds to any character from the list.  For
              example, [a-ho-z] is any character between a and h
              or between o and z.  The symbol `^' inside []
              complements the list.  For example, [^i-n] denotes
              any character in the character set except the
              characters 'i' to 'n'.  The symbol `^' thus has two
              meanings, but this is consistent with egrep.  The
              symbol `.' (don't care) stands for any symbol
              (except for the newline symbol).

       Boolean operations
              Glimpse supports an `AND' operation denoted by  the
              symbol  `;' an `OR' operation denoted by the symbol
              `,', a limited version of a 'NOT' operation (start-
              ing at version 4.0B1) denoted by the symbol `~', or
              any    combination.     For    example,     glimpse
              'pizza;cheeseburger' will output all lines contain-
              ing   both   patterns.    glimpse   -F   'gnu;\.c$'
              'define;DEFAULT'  will  output all lines containing
              both 'define' and 'DEFAULT' (anywhere in the  line,
              not  necessarily in order) in files whose name con-
              tains 'gnu' and ends with  .c.   glimpse  '{politi-
              The NOT operation works only together with the -W
              option and generally applies only to the whole file
              rather than to individual records.  Its output may
              sometimes seem counterintuitive.  Use with care.
              glimpse -W 'fame;~glory' will output all lines
              containing 'fame' in all files that contain 'fame'
              but do not contain 'glory'.  This is the most
              common use of NOT, and in this case it works as
              expected.
              glimpse -W '~{fame;glory}' will be limited to files
              that do not contain both words, and will output all
              lines containing one of them.
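
              The common 'fame;~glory' case can be pictured with
              the Python sketch below.  It is an illustration
              only: the function and_not_by_file is a
              hypothetical name, substring tests stand in for
              pattern matching, and the whole-file scope of -W is
              assumed.

```python
def and_not_by_file(files: dict, wanted: str, unwanted: str):
    """Lines containing `wanted`, taken only from files that contain
    `wanted` and do not contain `unwanted` anywhere."""
    out = []
    for name, lines in files.items():
        text = "\n".join(lines)
        if wanted in text and unwanted not in text:
            out.extend((name, line) for line in lines if wanted in line)
    return out


files = {
    "a.txt": ["fame at last", "nothing here"],
    "b.txt": ["fame first", "glory later"],
    "c.txt": ["glory alone"],
}
print(and_not_by_file(files, "fame", "glory"))
```

              Note that b.txt contributes nothing, even though
              its first line contains 'fame' and not 'glory',
              because the NOT is evaluated against the whole
              file.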

       Wild cards
              The  symbol '#' is used to denote a sequence of any
              number (including 0) of arbitrary  characters  (see
              LIMITATIONS).   The symbol # is equivalent to .* in
              egrep.  In fact, .* will work too, because it is  a
              valid  regular  expression  (see below), but unless
              this is part of an  actual  regular  expression,  #
              will work faster.  (Currently glimpse is experienc-
              ing some problems with #.)

       Combination of exact and approximate matching
              Any pattern inside angle brackets <> must match the
              text exactly even if the match is with errors.  For
              example, <mathemat>ics  matches  mathematical  with
              one  error  (replacing  the  last s with an a), but
              mathe<matics> does not match mathematical no matter
              how many errors are allowed.  (This option is buggy
              at the moment.)

       Regular expressions
              Since the index is word based, a regular expression
              must  match  words  that  appear  in  the index for
              glimpse to find it.  Glimpse first strips the regu-
              lar  expression from all non-alphabetic characters,
              and searches the index for all remaining words.  It
              then  applies the regular expression matching algo-
              rithm to the files found in the index.   For  exam-
              ple,  glimpse  'abc.*xyz' will search the index for
              all files that contain both 'abc'  and  'xyz',  and
              then search directly for 'abc.*xyz' in those files.
              (If you use glimpse -w  'abc.*xyz',  then  'abcxyz'
              will  not be found, because glimpse will require
              abc and xyz to  match  whole  words.)
              The  syntax of regular expressions in glimpse is in
              general the same as  that  for  agrep.   The  union
              operation  `|', Kleene closure `*', and parentheses
              () are all supported.  Currently '+'  is  not  sup-
              ported.   Regular expressions are currently limited
              to approximately 30 characters (generally excluding
              meta  characters).   Some  options (-d, -w, -t, -x,
              ular expressions that use '*' or  '|'  is  4.  (See
              LIMITATIONS.)
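
              The two-step lookup described above can be sketched
              as follows.  This is a toy model in Python, not
              glimpse's implementation: the in-memory index, the
              sample files, and all function names are assumptions
              made for illustration, and Python's regular expression
              syntax differs from agrep's in places.

```python
import re

# Hypothetical sample data: file name -> contents.
FILES = {
    "a.txt": "abc and then xyz on one line\n",
    "b.txt": "abc here\nxyz on another line\n",
    "c.txt": "only abc appears\n",
    "d.txt": "abcxyz joined\n",
}


def build_index(files):
    """Map each lower-case alphabetic word to the set of files containing it."""
    index = {}
    for name, text in files.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def files_for_word(word, index):
    """Files with an indexed word that contains `word` (i.e. without -w)."""
    found = set()
    for indexed_word, names in index.items():
        if word in indexed_word:
            found |= names
    return found


def search(pattern, files, index):
    """Step 1: narrow the file set via the index. Step 2: run the regex."""
    # Strip non-alphabetic characters from the pattern, keeping the words.
    words = re.findall(r"[a-z]+", pattern.lower())
    candidates = set(files)
    for word in words:
        candidates &= files_for_word(word, index)
    regex = re.compile(pattern)
    hits = []
    for name in sorted(candidates):
        for line in files[name].splitlines():
            if regex.search(line):
                hits.append((name, line))
    return hits


INDEX = build_index(FILES)
RESULT = search("abc.*xyz", FILES, INDEX)
```

              In this sketch b.txt passes the index step (it
              contains both words) but is rejected by the direct
              search, because the two words are on different lines.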

       Structured queries
              Glimpse  supports  some  form of structured queries
              using  Harvest's  SOIF  format.    See   STRUCTURED
              QUERIES below for details.


EXAMPLES

       (Run  "glimpse  '^glimpse' this-file" to get a list of all
       examples, some of which were given earlier.)

       glimpse -F 'haystack.h$' needle
              finds all needles in all haystack.h files.

       glimpse -2 -F html Anestesiology
              outputs all occurrences of Anestesiology  with  two
              errors  in  files with html somewhere in their full
              name.

       glimpse -l -F '\.c$' variablename
              lists the names of all .c files that contain
              variablename (the -l option lists file names rather
              than outputting the matched lines).

       glimpse -F 'mail;1993' 'windsurfing;Arizona'
              finds all lines containing windsurfing and  Arizona
              in  all files having 'mail' and '1993' somewhere in
              their full name.

       glimpse -F mail 't.j@#uk'
              finds all mail addresses (search  only  files  with
              mail  somewhere  in  their name) from the uk, where
              the login name ends with t.j, where  the  .  stands
              for  any  one  character.   (This is very useful to
              find a login name of someone whose middle name  you
              don't know.)

       glimpse -F mbox -h -G  . > MBOX
              concatenates  all  files  whose name matches `mbox'
              into one big one.


SEARCHING IN COMPRESSED FILES

       Glimpse includes  an  optional  new  compression  program,
       called  cast,  which  allows glimpse (and agrep) to search
       the compressed files without having  to  decompress  them.
       The search is actually significantly faster when the files
       are compressed.  However, we have not tested cast as
       thoroughly as we would have liked, and a mishap in a
       compression algorithm can cause loss of data, so for now
       we recommend using cast very carefully.  We do not support
       or maintain cast.  (Unless you specifically use cast,  the
       All  files  used  by  glimpse  are  located  at the direc-
       tory(ies) where the index(es) is  (are)  stored  and  have
       .glimpse_    as   a   prefix.    The   first   two   files
       (.glimpse_exclude  and  .glimpse_include)  are  optionally
       supplied  by the user.  The other files are built and read
       by glimpse.

       .glimpse_exclude
              contains a  list  of  files  that  glimpseindex  is
              explicitly  told to ignore.  In general, the syntax
              of .glimpse_exclude/include is the same as that  of
              agrep  (or  any  other  grep).   The  lines  in the
              .glimpse_exclude  file  are  matched  to  the  file
              names,  and  if they match, the files are excluded.
              Notice that agrep matches to parts of  the  string!
              e.g.,  agrep  /ftp/pub will match /home/ftp/pub and
              /ftp/pub/whatever.  So,  if  you  want  to  exclude
              /ftp/pub/core,  you  just  list  it,  as is, in the
              .glimpse_exclude     file.      If     you      put
              "/home/ftp/pub/cdrom"  in  .glimpse_exclude,  every
              file  name  that  matches  that  string   will   be
              excluded,  meaning all files below it.  You can use
              ^ to indicate the beginning of a file name,  and  $
              to indicate the end of one, and you can use * and ?
              in the usual  way.   For  example  /ftp/*html  will
              exclude  /ftp/pub/foo.html,  but  will also exclude
              /home/ftp/pub/html/whatever;   if   you   want   to
              exclude  files  that  start  with /ftp and end with
              html use ^/ftp*html$.  Notice that putting a * at the
              beginning  or  at the end is redundant (in fact, in
              this case glimpseindex will remove the  *  when  it
              does  the  indexing).  No other meta characters are
              allowed in .glimpse_exclude (e.g., don't use .*  or
              #  or |).  Lines with * or ? must have no more than
              30 characters.  Notice  that,  although  the  index
              itself  will not be indexed, the list of file names
              (.glimpse_filenames) will be indexed unless  it  is
              explicitly listed in .glimpse_exclude.
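
              The matching rules above can be approximated with
              the following Python sketch.  It is an illustration
              of the documented behavior only, not the code used
              by glimpseindex; the function names are hypothetical,
              and the translation of * and ? into a Python regular
              expression is an assumption based on the examples
              given here.

```python
import re


def exclude_pattern_to_regex(line):
    """Translate one .glimpse_exclude-style line into a Python regex.

    Assumed rules, taken from the description above: a leading ^ and a
    trailing $ anchor the file name, * matches any run of characters,
    ? matches one character, and everything else is literal.
    Unanchored lines may match anywhere in the file name.
    """
    anchored_start = line.startswith("^")
    anchored_end = line.endswith("$")
    core = line[1:] if anchored_start else line
    core = core[:-1] if anchored_end else core
    parts = []
    for ch in core:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    prefix = "^" if anchored_start else ""
    suffix = "$" if anchored_end else ""
    return re.compile(prefix + "".join(parts) + suffix)


def is_excluded(path, exclude_lines):
    """True if any exclude line matches some part of the path."""
    return any(exclude_pattern_to_regex(l).search(path) for l in exclude_lines)
```

              For example, is_excluded("/home/ftp/pub/html/whatever",
              ["/ftp/*html"]) is true in this sketch, while the
              anchored line "^/ftp*html$" does not match that path.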

       .glimpse_filters
              See the description above for the -z option.

       .glimpse_include
              contains  a  list  of  files  that  glimpseindex is
              explicitly told to include in the index even though
              they  may look like non-text files.  Symbolic links
              are followed  by  glimpseindex  only  if  they  are
              specifically  included  here.  If a file is in both
              .glimpse_exclude and .glimpse_include  it  will  be
              excluded.

       .glimpse_filenames
              contains  the  list  of all indexed file names, one
              per line.  This is an ASCII file that can  also  be
              used  with  agrep to search for a file name leading
              to a fast find command.  For example,
              glimpse 'count#\.c$' ~/.glimpse_filenames
              will output the names of  all  (indexed)  .c  files
              that have 'count' in their name (including anywhere
              on the path from the index).  Setting the following
              alias in the .login file may be useful:
              alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'

       .glimpse_index
              contains the index.  The index consists  of  lines,
              each  starting  with  a  word followed by a list of
              block numbers (unless the  -o  or  -b  options  are
              used,  in  which  case  each word is followed by an
              offset into the file .glimpse_partitions where  all
              pointers  are  kept).   The  block/file numbers are
              stored in binary form, so  this  is  not  an  ASCII
              file.

       .glimpse_messages
              contains the output of the -w option (see above).

       .glimpse_partitions
              contains  the  partition  of the indexed space into
              blocks and, when the index is built with the -o  or
              -b  options,  some part of the index.  This file is
              used internally by glimpse and it  is  a  non-ASCII
              file.

       .glimpse_statistics
              contains  some  statistics  about the makeup of the
              index.  Useful for some advanced  applications  and
              customization of glimpse.

       .glimpse_turbo
              An added data structure (used under glimpseindex -o
              or -b only) that helps to speed up queries signifi-
              cantly  for  large  indexes.   Its  size is 0.25MB.
              Glimpse will work without it if needed.


STRUCTURED QUERIES

       Glimpse   can   search   for   Boolean   combinations   of
       "attribute=value"  terms  by using the Harvest SOIF parser
       library (in glimpse/libtemplate).  To search this way, the
       index  must be made by using the -s option of glimpseindex
       (this can be used in conjunction with  other  glimpseindex
       options).   For  glimpse  and  glimpseindex  to  recognize
       "structured" files, they must be in SOIF format.  In  this
       format,  each  value is prefixed by an attribute-name with
       the size of the value (in bytes) present in "{}" after the
       type{17}:       Directory-Listing
       md5{32}:        3858c73d68616df0ed58a44d306b12ba
       Any string can serve as an attribute name.  The query
       glimpse "pattern;type=Directory-Listing" will search for
       "pattern" only in files whose type is "Directory-Listing".
       The file itself is considered to be one "object" and its
       name/url appears as the first attribute with an "@" prefix,
       e.g., @FILE { http://xxx... }.  The scope of Boolean operations
       changes from records (lines) to whole  files  when  struc-
       tured  queries are used in glimpse (since individual query
       terms can look at different attributes and they may not be
       "covered" by the record/line).  Note that glimpse can only
       search for patterns in the value parts of the  SOIF  file:
       there  are  some attributes (like the TTL, MD5, etc.) that
       are  interpreted  by  Harvest's  internal  routines.   See
       http://harvest.cs.colorado.edu/harvest/user-manual/    for
       more detailed information on the SOIF format.
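
       To make the "attribute{size}: value" layout concrete, here
       is a small Python sketch that reads such entries.  It is not
       the Harvest SOIF parser library; the function name and the
       exact separator handling (optional spaces or tabs after the
       colon) are assumptions, and the sample data reuses the two
       attribute lines shown above.

```python
import re

# name{size}: followed by optional blanks (assumed separator handling).
ATTR_HEADER = re.compile(rb"([^\s{}]+)\{(\d+)\}:[ \t]*")

SAMPLE = (
    b"type{17}:\tDirectory-Listing\n"
    b"md5{32}:\t3858c73d68616df0ed58a44d306b12ba\n"
)


def parse_soif_attributes(data):
    """Return {attribute: value} for 'name{size}: value' entries in data.

    The size is a byte count, so a value may span several lines.
    """
    attrs = {}
    pos = 0
    while pos < len(data):
        header = ATTR_HEADER.search(data, pos)
        if header is None:
            break
        size = int(header.group(2))
        start = header.end()
        attrs[header.group(1).decode()] = data[start:start + size].decode()
        pos = start + size
    return attrs


def has_attribute(attrs, name, value):
    """Toy check for one 'name=value' term of a structured query."""
    return attrs.get(name) == value


ATTRS = parse_soif_attributes(SAMPLE)
```

       With the sample above, has_attribute(ATTRS, "type",
       "Directory-Listing") is true, which is the kind of test a
       term such as type=Directory-Listing expresses.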


REFERENCES

       1.     U. Manber and S. Wu, "GLIMPSE:  A  Tool  to  Search
              Through  Entire  File  Systems," Usenix Winter 1994
              Technical Conference (best paper award), San  Fran-
              cisco  (January  1994), pp. 23-32.  Also, Technical
              Report #TR 93-34, Dept. of Computer  Science,
              University of Arizona, October 1993 (a postscript
              file is available by anonymous ftp at
              ftp://ftp.cs.arizona.edu/reports/1993/TR93-34.ps).

       2.     S.  Wu and U. Manber, "Fast Text Searching Allowing
              Errors," Communications  of  the  ACM  35  (October
              1992), pp. 83-91.


SEE ALSO

       agrep(1), ed(1), ex(1), glimpseindex(1), glimpseserver(1),
       grep(1), sh(1), csh(1).


LIMITATIONS

       The index of glimpse is word based.  A pattern  that  con-
       tains  more  than  one  word cannot be found in the index.
       The way glimpse overcomes this weakness  is  by  splitting
       any  multi-word  pattern into its set of words and looking
       for all of them in the index.  For example,
       glimpse 'linear programming' will first consult the index
       to find all files containing both linear and programming,
       and then
       apply agrep to find the combined pattern.  This is usually
       an effective solution, but it can be slow for cases  where
       both  words are very common, but their combination is not.

       As was mentioned in the section on  PATTERNS  above,  some
       characters  serve  as meta characters for glimpse and need
       to be preceded by '\' to search for them.  The most common
       examples  are  the characters '.' (which stands for a wild
       "glimpse ab*de" will not match ab*de, but "glimpse ab\*de"
       will.  The meta character - is translated automatically to
       a hyphen unless it appears between [] (in  which  case  it
       denotes a range of characters).

       The  index  of  glimpse stores all patterns in lower case.
       When glimpse searches the index it first converts all pat-
       terns to lower case, finds the appropriate files, and then
       searches the actual files  using  the  original  patterns.
       So,  for example, glimpse ABCXYZ will first find all files
       containing abcxyz in any combination of  lower  and  upper
       cases, and then search these files directly, so only the
       right cases will be found.  One problem with this approach
       is  discovering  misspellings  that  are  caused  by wrong
       cases.  For example, glimpse -B abcXYZ will  first  search
       the  index  for the best match to abcxyz (because the pat-
       tern is converted to lower case); it will find that  there
       are  matches with no errors, and will go to those files to
       search them directly, this time with  the  original  upper
       cases.   If  the closest match is, say, AbcXYZ, glimpse may
       miss it, because it  doesn't  expect  an  error.   Another
       problem  is  speed.  If you search for "ATT", it will look
       at the index for "att".  Unless you use -w  to  match  the
       whole  word, glimpse may have to search all files contain-
       ing, for example, "Seattle" which has "att" in it.
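
       The "ATT" example can be illustrated with a toy model in
       Python.  This is only a sketch of the behavior described
       above, not glimpse's code: the sample files, the dict-based
       index, and the whole_word flag (standing in for -w) are
       assumptions.

```python
import re

# Hypothetical sample data: file name -> contents.
FILES = {
    "phone.txt": "ATT raised its rates\n",
    "travel.txt": "Flights to Seattle are cheap\n",
    "notes.txt": "please attend the meeting\n",
}


def build_index(files):
    """Index every word in lower case, as the text above describes."""
    index = {}
    for name, text in files.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def candidate_files(pattern, index, whole_word=False):
    """Files the index step would hand over to the direct search."""
    key = pattern.lower()
    found = set()
    for word, names in index.items():
        matched = (word == key) if whole_word else (key in word)
        if matched:
            found |= names
    return found


def search(pattern, files, index, whole_word=False):
    """Search the candidates with the original, case-sensitive pattern."""
    body = re.escape(pattern)
    regex = re.compile(r"\b" + body + r"\b" if whole_word else body)
    names = candidate_files(pattern, index, whole_word)
    return sorted(name for name in names if regex.search(files[name]))


INDEX = build_index(FILES)
```

       Here candidate_files("ATT", INDEX) returns all three files,
       although only phone.txt survives the direct search; with
       whole_word=True the index step already narrows the set to
       phone.txt.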

       There is no size limit for simple patterns and simple pat-
       terns  within  Boolean expressions.  More complicated pat-
       terns, such as regular expressions, are currently  limited
       to approximately 30 characters.  Lines are limited to 1024
       characters.  Records are limited to 48K, and may be  trun-
       cated  if  they are larger than that.  The limit of record
       length  can  be  changed  by   modifying   the   parameter
       Max_record in agrep.h.

       Glimpseindex does not index words longer than 64 characters.


BUGS

       In  some  rare cases, regular expressions using * or # may
       not match correctly.

       A query that contains no alphanumeric  characters  is  not
       recommended  (unless glimpse is used as agrep and the file
       names are provided).  This is an understatement.

       The notion of "match to the whole word"  (the  -w  option)
       can  be tricky sometimes.  For example, glimpse -w 'word$'
       will not match 'word' appearing at  the  end  of  a  line,
       because the extra '$' makes the pattern more than just one
       simple word.  The same thing can happen with ^ and with _.
       To  be  on  the safe side, use the -w option only when the
       patterns are actual words.
       zona.edu.


DIAGNOSTICS

       Exit  status  is  0 if any matches are found, 1 if none, 2
       for syntax errors or inaccessible files.
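
       A script that calls glimpse can branch on these values.  The
       Python sketch below is one possible wrapper, not part of the
       glimpse distribution; the function names are hypothetical,
       and run_glimpse assumes that glimpse is installed and that an
       index has already been built.

```python
import subprocess

# Exit status values as listed in DIAGNOSTICS above.
EXIT_MEANINGS = {0: "match", 1: "no match", 2: "error"}


def classify_exit_status(code):
    """Map a glimpse exit status to a short label."""
    return EXIT_MEANINGS.get(code, "unknown")


def run_glimpse(pattern):
    """Run 'glimpse pattern' and return (label, output lines).

    Assumes glimpse is on PATH and an index exists.
    """
    proc = subprocess.run(
        ["glimpse", pattern], capture_output=True, text=True
    )
    return classify_exit_status(proc.returncode), proc.stdout.splitlines()
```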


AUTHORS

       Udi Manber and Burra Gopal, Department  of  Computer  Sci-
       ence,  University  of  Arizona,  and  Sun Wu, the National
       Chung-Cheng University, Taiwan.
       (Email: glimpse@cs.arizona.edu)

