glimpse
glimpse 4.1 - search quickly through entire file systems
OVERVIEW
Glimpse (which stands for GLobal IMPlicit SEarch) is a
very popular UNIX indexing and query system that allows
you to search through a large set of files very quickly.
Glimpse supports most of agrep's options (agrep is our
powerful version of grep) including approximate matching
(e.g., finding misspelled words), Boolean queries, and
even some limited forms of regular expressions. It is
used in the same way, except that you don't have to spec-
ify file names. So, if you are looking for a needle any-
where in your file system, all you have to do is say
glimpse needle and all lines containing needle will appear
preceded by the file name.
To use glimpse you first need to index your files with
glimpseindex. For example, glimpseindex -o ~ will index
everything at or below your home directory. See man
glimpseindex for more details.
Glimpse is also available for web sites, as a set of tools
called WebGlimpse. (The old glimpseHTTP is no longer sup-
ported and is not recommended.) See
http://glimpse.cs.arizona.edu/webglimpse/ for more infor-
mation.
Glimpse includes all of agrep and can be used instead of
agrep by giving a file name(s) at the end of the command.
This will cause glimpse to ignore the index and run agrep
as usual. For example, glimpse -1 pattern file is the
same as agrep -1 pattern file. Agrep is distributed as a
self-contained package within glimpse, and can be used
separately. We added a new option to agrep: -r searches
recursively the directory and everything below it (see
agrep options below); it is used only when glimpse reverts
to agrep.
Mail glimpse-request@cs.arizona.edu to be added to the
glimpse mailing list. Mail glimpse@cs.arizona.edu to
report bugs, ask questions, discuss tricks for using
glimpse, etc. (this is a moderated mailing list with very
little traffic, mostly announcements). An HTML version of
these manual pages can be found at
http://glimpse.cs.arizona.edu/glimpsehelp.html. Also, see
the glimpse home pages at http://glimpse.cs.arizona.edu/
SYNOPSIS
glimpse - [almost all letters] pattern
INTRODUCTION
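We start with simple ways to use glimpse and describe all
the options in detail later on. Once an index has been
built with glimpseindex, searching for a pattern is as
easy as saying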
glimpse pattern
The output of glimpse is similar to that of agrep (or any
other grep). The pattern can be any agrep legal pattern
including a regular expression or a Boolean query (e.g.,
searching for Tucson AND Arizona is done by
glimpse 'Tucson;Arizona').
The speed of glimpse depends mainly on the number and
sizes of the files that contain a match and only to a sec-
ond degree on the total size of all indexed files. If the
pattern is reasonably uncommon, then all matches will be
reported in a few seconds even if the indexed files total
500MB or more. Some information on how glimpse works and
a reference to a detailed article are given below.
Most of the options of agrep (and other greps) are supported,
including approximate matching. For example,
glimpse -1 'Tuson;Arezona'
will output all lines containing both patterns allowing
one spelling error in any of the patterns (either inser-
tion, deletion, or substitution), which in this case is
definitely needed.
glimpse -w -i 'parent'
specifies case insensitive (-i) and match on complete
words (-w). So 'Parent' and 'PARENT' will match,
'parent/child' will match, but 'parenthesis' or 'parents' will
not match. (Starting at version 3.0, glimpse can be much
faster when these two options are specified, especially
for very large indexes. You may want to set an alias
especially for "glimpse -w -i".)
The -F option provides a pattern that must match the file
name. For example,
glimpse -F '\.c$' needle
will find the pattern needle in all files whose name ends
with .c. (Glimpse will first check its index to determine
which files may contain the pattern and then run agrep on
the file names to further limit the search.) The -F
option should not be put at the end after the main pattern
(e.g., "glimpse needle -F hay" is incorrect).
A Detailed Description of All the Options of Glimpse
-# # is an integer between 1 and 8 specifying the maximum
number of errors permitted in an approximate match.
Generally, each insertion, deletion, or substitution
counts as one error. It is possible to adjust the
relative cost of insertions, deletions and substi-
tutions (see -I -D and -S options). Since the
index stores only lower case characters, errors of
substituting upper case with lower case may be
missed (see LIMITATIONS). Allowing errors in the
match requires more time and can slow down the
match by a factor of 2-4. Be very careful when
specifying more than one error, as the number of
matches tends to grow very quickly.
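The error count used by -# can be pictured as a weighted edit distance between the pattern and the closest substring of the text. The following Python sketch is an illustration only and is not the algorithm agrep actually uses (see REFERENCES for that). The function name min_errors and its parameters ins, dele, and sub are hypothetical; they stand in for the costs set by -I, -D, and -S, and which direction counts as an insertion or a deletion is an assumption made here.

```python
def min_errors(pattern, text, ins=1, dele=1, sub=1):
    """Smallest total edit cost between `pattern` and any substring of `text`.

    Assumption: an "insertion" is an extra character in the text and a
    "deletion" is a pattern character that is missing from the text.
    """
    # prev[j] = cheapest cost of matching pattern[:j] against a substring
    # of the text that ends at the current text position.
    prev = [j * dele for j in range(len(pattern) + 1)]
    best = prev[-1]
    for ch in text:
        cur = [0]  # a match may start at any position in the text
        for j, p in enumerate(pattern, start=1):
            cur.append(min(
                prev[j - 1] + (0 if p == ch else sub),  # match or substitution
                prev[j] + ins,                          # extra character in text
                cur[j - 1] + dele,                      # pattern character missing
            ))
        best = min(best, cur[-1])
        prev = cur
    return best


def matches(pattern, line, max_errors=0, **costs):
    """True if `line` contains `pattern` within `max_errors` total cost."""
    return min_errors(pattern, line, **costs) <= max_errors
```

Under these assumptions min_errors('Tuson', 'Tucson') is 1, so the line is reported with -1 but not by an exact search. Raising the substitution cost to 3 makes 'Arezona' against 'Arizona' cost 2, because one deletion plus one insertion becomes cheaper than the substitution.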
-a prints attribute names. This option applies only
to Harvest SOIF structured data (used with
glimpseindex -s). (See http://harvest.transarc.com for
more information about the Harvest project.)
-A used for glimpse internals.
-b prints the byte offset (from the beginning of the
file) of the end of each match. The first charac-
ter in a file has offset 0.
-B Best match mode. (Warning: -B sometimes misses
matches. It is safer to specify the number of
errors explicitly.) When -B is specified and no
exact matches are found, glimpse will continue to
search until the closest matches (i.e., the ones
with minimum number of errors) are found, at which
point the following message will be shown: "the
best match contains x errors, there are y matches,
output them? (y/n)" This message refers to the num-
ber of matches found in the index. There may be
many more matches in the actual text (or there may
be none if -F is used to filter files). When the
-#, -c, or -l options are specified, the -B option
is ignored. In general, -B may be slower than -#,
but not by very much. Since the index stores only
lower case characters, errors of substituting upper
case with lower case may be missed (see LIMITA-
TIONS).
-c Display only the count of matching records. Only
files with count > 0 are displayed.
-C tells glimpse to send its queries to glimpseserver.
-d 'delim'
Define delim to be the separator between two
records. The default value is '$', namely a record
is by default a line. delim can be a string of
size at most 8 (with possible use of ^ and $), but
not a regular expression. Text between two delims,
before the first delim, and after the last delim is
considered as one record. For example, -d
'$$' defines paragraphs as records and -d '^From '
defines mail messages as records. glimpse matches
each record separately. This option does not cur-
rently work with regular expressions. The -d
option is especially useful for Boolean AND
queries, because the patterns need not appear in
the same line but in the same record. For example,
glimpse -F mail -d '^From ' 'glimpse;arizona;announcement'
will output all mail messages
(in their entirety) that have the 3 patterns any-
where in the message (or the header), assuming that
files with 'mail' in their name contain mail mes-
sages. If you want the scope of the record to be
the whole file, use the -W option. Glimpse warn-
ing: Use this option with care. If the delimiter
is set to match mail messages, for example, and
glimpse finds the pattern in a regular file, it may
not find the delimiter and will therefore output
the whole file. (The -t option - see below - can
be used to put the delim at the end of the record.)
Performance Note: Agrep (and glimpse) resorts to
more complex search when the -d option is used.
The search is slower and unfortunately no more than
32 characters can be used in the pattern.
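How -d changes the scope of a Boolean AND query can be pictured with a short Python sketch. It is only an illustration of the record semantics described above, not glimpse's implementation; the function names split_records and and_query are hypothetical, and the delimiter is treated as a Python regular expression purely for convenience.

```python
import re


def split_records(text, start_delim=r"^From "):
    """Split text into records, each beginning at a line matching start_delim."""
    starts = [m.start() for m in re.finditer(start_delim, text, flags=re.MULTILINE)]
    if not starts:
        return [text]          # delimiter never found: the whole text is one record
    if starts[0] != 0:
        starts.insert(0, 0)    # text before the first delimiter is its own record
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def and_query(records, terms):
    """Keep only the records that contain every term (Boolean AND)."""
    return [r for r in records if all(t in r for t in terms)]


mbox = (
    "From a@x\nglimpse announcement\nsent from arizona\n"
    "From b@y\nglimpse only\n"
)
terms = ["glimpse", "arizona", "announcement"]
by_message = and_query(split_records(mbox), terms)   # one message matches
by_line = and_query(mbox.splitlines(), terms)        # no single line matches
```

The same three terms match one whole message when the record is a mail message but match nothing when the record is a line. The first branch of split_records also shows why a file without the delimiter is output in its entirety.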
-Dk Set the cost of a deletion to k (k is a positive
integer). This option does not currently work with
regular expressions.
-e pattern
Same as a simple pattern argument, but useful when
the pattern begins with a `-'.
-E prints the lines in the index (as they appear in
the index) which match the pattern. Used mostly
for debugging and maintenance of the index. This
is not an option that a user needs to know about.
-f file_name
this option has a different meaning for agrep than
for glimpse: In glimpse, only the files whose names
are listed in file_name are matched. (The file
names have to appear as in .glimpse_filenames.) In
agrep, the file_name contains the list of the pat-
terns that are searched. (Starting at version 3.6,
this option for glimpse is much faster for large
files.)
-F file_pattern
limits the search to those files whose name
(including the whole path) matches file_pattern.
This option can be used in a variety of applications
to provide limited search even for one
large index. If file_pattern matches a directory,
then all files with this directory on their path
will be considered. To limit the search to actual
file names, use $ at the end of the pattern.
file_pattern can be a regular expression and even a
Boolean pattern. This option is implemented by
running agrep file_pattern on the list of file
names obtained from the index. Therefore, search-
ing the index itself takes the same amount of time,
but limiting the second phase of the search to only
a few files can speed up the search significantly.
For example,
glimpse -F 'src#\.c$' needle
will search for needle in all .c files with src
somewhere along the path. The -F file_pattern must
appear before the search pattern (e.g., glimpse
needle -F '\.c$' will not work). It is possible to
use some of agrep's options when matching file
names. In this case all options as well as the
file_pattern should be in quotes. (-B and -v do
not work very well as part of a file_pattern.) For
example,
glimpse -F '-1 \.html' pattern
will allow one spelling error when matching .html
to the file names (so ".htm" and ".shtml" will
match as well).
glimpse -F '-v \.c$' counter
will search for 'counter' in all files except for
.c files.
-g prints the file number (its position in the
.glimpse_filenames file) rather than its name.
-G Output the (whole) files that contain a match.
-h Do not display filenames.
-H directory_name
searches for the index and the other .glimpse files
in directory_name. The default is the home direc-
tory. This option is useful, for example, if sev-
eral different indexes are maintained for different
archives (e.g., one for mail messages, one for
source code, one for articles).
-i Case-insensitive search -- e.g., "A" and "a" are
considered equivalent.
Performance Note: When -i is used together with the
-w option, the search may become much faster. It
is recommended to have -i and -w as defaults, for
example, through an alias. We use the following
alias in our .cshrc file
alias glwi 'glimpse -w -i'
-Ik Set the cost of an insertion to k (k is a positive
integer). This option does not currently work with
regular expressions.
-j If the index was constructed with the -t option,
then -j will output the files' last modification
dates in addition to everything else. There are no
major performance penalties for this option.
-J host_name
used in conjunction with glimpseserver (-C) to con-
nect to one particular server.
-k No symbol in the pattern is treated as a meta char-
acter. For example, glimpse -k 'a(b|c)*d' will
find the occurrences of a(b|c)*d whereas glimpse
'a(b|c)*d' will find substrings that match the reg-
ular expression 'a(b|c)*d'. (The only exception is
^ at the beginning of the pattern and $ at the end
of the pattern, which are still interpreted in the
usual way. Use \^ or \$ if you need them verba-
tim.)
-K port_number
used in conjunction with glimpseserver (-C) to con-
nect to one particular server at the specified TCP
port number.
-l Output only the names of files that contain a match.
This option differs from the -N option in that the
files themselves are searched, but the matching
lines are not shown.
-L x | x:y | x:y:z
if one number is given, it is a limit on the total
number of matches. Glimpse outputs only the first
x matches. If -l is used (i.e., only file names
are sought), then the limit is on the number of
files; otherwise, the limit is on the number of
records. If two numbers are given (x:y), then y is
an added limit on the total number of files. If
three numbers are given (x:y:z), then z is an added
limit on the number of matches per file. If any of
the x, y, or z is set to 0, it means to ignore it
(in other words 0 = infinity in this case); for
example, -L 0:10 will output all matches in the first
10 files that contain a match. This option
is particularly useful for servers that need to
limit the amount of output provided to clients.
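The three fields of -L can be read as independent caps on total records, on files, and on records per file. The Python sketch below illustrates that reading only; parse_limits and apply_limits are hypothetical names, the order in which glimpse itself applies the limits is an assumption, and the -l case (where x limits files rather than records) is not modelled.

```python
def parse_limits(spec):
    """Parse 'x', 'x:y' or 'x:y:z' into (max_total, max_files, max_per_file).

    A value of 0 means "no limit" and is returned as None.
    """
    parts = [int(p) for p in spec.split(":")]
    if not 1 <= len(parts) <= 3:
        raise ValueError("expected x, x:y or x:y:z")
    parts += [0] * (3 - len(parts))
    return tuple(p or None for p in parts)


def apply_limits(hits, spec):
    """Filter (file_name, record) pairs according to an -L style spec."""
    max_total, max_files, max_per_file = parse_limits(spec)
    per_file = {}   # file name -> number of records already output
    out = []
    for name, record in hits:
        if max_total is not None and len(out) >= max_total:
            break
        if name not in per_file:
            if max_files is not None and len(per_file) >= max_files:
                continue            # file limit reached: skip new files
            per_file[name] = 0
        if max_per_file is not None and per_file[name] >= max_per_file:
            continue                # this file already produced enough records
        per_file[name] += 1
        out.append((name, record))
    return out
```

For example, with hits from files a, a, b, and c (in that order), the spec 0:2:1 keeps one record from a and one from b, and drops everything from c.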
-m used for glimpse internals.
-M used for glimpse internals.
-n Each matching record (line) is prefixed by its
record (line) number in the file. Performance
Note: To compute the record/line number, agrep
needs to search for all record delimiters (or line
breaks), which can slow down the search.
-N searches only the index (so the search is faster).
If the index was built with -o or -b, then the result is the number
of files that have a potential match plus a prompt
to ask if you want to see the file names. (If -y
is used, then there is no prompt and the names of
the files will be shown.) This could be a way to
get the matching file names without even having
access to the files themselves. However, because
only the index is searched, some potential matches
may not be real matches. In other words, with -N
you will not miss any file but you may get extra
files. For example, since the index stores every-
thing in lower case, a case-sensitive query may
match a file that has only a case-insensitive
match. Boolean queries may match a file that has
all the keywords but not in the same line (indexing
with -b allows glimpse to figure out whether the
keywords are close, but it cannot figure out from
the index whether they are exactly on the same line
or in the same record without looking at the file).
If the index was not built with -o or -b, then this
option outputs the number of blocks matching the
pattern. This is useful as an indication of how
long the search will take. All files are parti-
tioned into usually 200-250 blocks. The file
.glimpse_statistics contains the total number of
blocks (or glimpse -N a will give a pretty good
estimate; only blocks with no occurrences of 'a'
will be missed).
-o the opposite of -t: the delimiter is not output at
the tail, but at the beginning of the matched
record.
-O the file names are not printed before every matched
record; instead, each filename is printed just
once, and all the matched records within it are
printed after it.
-p allows you to utilize compressed `neighborhoods' (sets of
filenames) to limit your search, without uncom-
pressing them. Added mostly for WebGlimpse. The
usage is:
"-p filename:X:Y:Z" where "filename" is the file
with compressed neighborhoods, X is an offset into
that file (usually 0, must be a multiple of
sizeof(int)), Y is the length glimpse must access
from that file (if 0, then whole file; must be a
multiple of sizeof(int)), and Z must be 2 (it indi-
cates that "filename" has the sparse-set represen-
tation of compressed neighborhoods: the other val-
ues are for internal use only). Note that any colon
":" in filename must be escaped using a backslash (\).
-P used for glimpse internals.
-q prints the offsets of the beginning and end of each
matched record. The difference between -q and -b
is that -b prints the offsets of the actual matched
string, while -q prints the offsets of the whole
record where the match occurred. The output format
is @x{y}, where x is the beginning offset and y is
the end offset.
-Q when used together with -N glimpse not only dis-
plays the filename where the match occurs, but the
exact occurrences (offsets) as seen in the index.
This option is relevant only if the index was built
with -b; otherwise, the offsets are not available
in the index. This option is ignored when not used
with -N.
-r This option is an agrep option and it will be
ignored in glimpse, unless glimpse is used with a
file name at the end which makes it run as agrep.
If the file name is a directory name, the -r option
will search (recursively) the whole directory and
everything below it. (The glimpse index will not
be used.)
-R k defines the maximum size (in bytes) of a record.
The maximum value (which is the default) is 48K.
Defining the maximum to be lower than the default
may speed up some searches.
-s Work silently, that is, display nothing except
error messages. This is useful for checking the
error status.
-Sk Set the cost of a substitution to k (k is a positive
integer). This option does not currently work
with regular expressions.
-t Similar to the -d option, except that the delimiter
is assumed to appear at the end of the record.
Glimpse will output the record starting from the
end of delim to (and including) the next delim.
(See warning for the -d option.)
-T directory
Use directory as a place where temporary files are
built. (Glimpse produces some small temporary
files usually in /tmp.) This option is useful
mainly in the context of structured queries for the
Harvest project, where the temporary files may be
non-trivial, and the /tmp directory may not have
enough space for them.
-U (starting at version 4.0B1) Interprets an index
created with the -X or the -U option in glimpsein-
dex. Useful mostly for WebGlimpse or similar web
applications. When glimpse outputs matches, it
will display the filename, the URL, and the title
automatically.
-v (This option is an agrep option and it will be
ignored in glimpse, unless glimpse is used with a
file name at the end which makes it run as agrep.)
Output all records/lines that do not contain a
match. (Glimpse does not support the NOT operator
yet.)
-V prints the current version of glimpse.
-w Search for the pattern as a word -- i.e., surrounded
by non-alphanumeric characters. For example,
glimpse -w car will match car, but not characters
and not car10. The non-alphanumeric characters must
surround the match; they cannot be counted as errors.
This option does not work with regular expressions.
Performance Note: When -w is used together with the
-i option, the search may become much faster. The
-w will not work with $, ^, and _ (see BUGS below).
It is recommended to have -i and -w as defaults,
for example, through an alias. We use the follow-
ing alias in our .cshrc file
alias glwi 'glimpse -w -i'
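The word rule used by -w (a match must be surrounded by non-alphanumeric characters) can be sketched in a few lines of Python. This only illustrates the rule as stated above; word_match is a hypothetical helper, not part of glimpse, and the special cases with $, ^, and _ mentioned under BUGS are not modelled.

```python
import re


def word_match(word, line, ignore_case=False):
    """True if `word` occurs in `line` with no letter or digit on either side."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern = r"(?<![A-Za-z0-9])" + re.escape(word) + r"(?![A-Za-z0-9])"
    return re.search(pattern, line, flags) is not None


# Mirrors the examples in the text: -w car, and -w -i parent.
assert word_match("car", "my car is red")
assert not word_match("car", "characters")
assert not word_match("car", "car10")
assert word_match("parent", "PARENT/child", ignore_case=True)
assert not word_match("parent", "parenthesis", ignore_case=True)
```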
-W The default for Boolean AND queries is that they
cover one record (the default for a record is one
line) at a time. For example, glimpse 'good;bad'
will output all lines containing both 'good' and
'bad'. The -W option changes the scope of Booleans
to be the whole file. Within a file glimpse will
output all matches to any of the patterns. So,
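glimpse -W 'good;bad' will output all lines containing
'good' or 'bad', but only in those files that contain
both patterns. The NOT operator '~' can be
used only with -W. It is described later on. The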
OR operator is essentially unaffected (unless it is
in combination with the other Boolean operations).
For structured queries, the scope is always the
whole attribute or file.
-x The pattern must match the whole line. (This
option is translated to -w when the index is
searched and it is used only when the actual text
is searched. It is of limited use in glimpse.)
-X (from version 4.0B1 only) Output the names of files
that contain a match even if these files have been
deleted since the index was built. Without this
option glimpse will simply ignore these files.
-y Do not prompt. Proceed with the match as if the
answer to any prompt is y. Servers (or any other
scripts) using glimpse will probably want to use
this option.
-Y k If the index was constructed with the -t option,
then -Y k will output only matches to files that
were created or modified within the last k days.
There are no major performance penalties for this
option.
-z Allow customizable filtering, using the file
.glimpse_filters to perform the programs listed
there for each match. The best example is
compress/decompress. If .glimpse_filters includes the
line
*.Z uncompress <
(separated by tabs) then before indexing any file
that matches the pattern "*.Z" (same syntax as the
one for .glimpse_exclude) the command listed is
executed first (assuming input is from stdin, which
is why uncompress needs <) and its output (assuming
it goes to stdout) is indexed. The file itself is
not changed (i.e., it stays compressed). Then if
glimpse -z is used, the same program is used on
these files on the fly. Any program can be used
(we run 'exec'). For example, one can filter out
parts of files that should not be indexed. Glimp-
seindex tries to apply all filters in .glimpse_fil-
ters in the order they are given. For example, if
you want to uncompress a file and then extract some
part of it, put the compression command (the exam-
ple above) first and then another line that speci-
fies the extraction. Note that this can slow down
the search because the filters need to be run
before files are searched. (See also glimpseindex.)
The characters `$', `^', `*', `[', `]', `|', `(',
`)', `!', and `\' can cause unexpected results when
included in the pattern, as these characters are also
meaningful to the shell. To avoid these problems, enclose
the entire pattern in single quotes, i.e., 'pattern'. Do
not use double quotes (").
PATTERNS
glimpse supports a large variety of patterns, including
simple strings, strings with classes of characters, sets
of strings, wild cards, and regular expressions (see LIMI-
TATIONS).
Strings
Strings are any sequence of characters, including
the special symbols `^' for beginning of line and
`$' for end of line. The following special characters
( `$', `^', `*', `[', `]', `|', `(', `)', `!',
and `\' ) as well as the following meta characters
special to glimpse (and agrep): `;', `,', `#', `<',
`>', `-', and `.', should be preceded by `\' if
they are to be matched as regular characters. For
example, \^abc\\ corresponds to the string ^abc\,
whereas ^abc corresponds to the string abc at the
beginning of a line.
Classes of characters
a list of characters inside [] (in order) corre-
sponds to any character from the list. For exam-
ple, [a-ho-z] is any character between a and h or
between o and z. The symbol `^' inside [] complements
the list. For example, [^i-n] denotes any
character in the character set except the characters 'i'
through 'n'. The symbol `^' thus has two meanings, but
this is consistent with egrep. The symbol `.'
(don't care) stands for any symbol (except for the
newline symbol).
Boolean operations
Glimpse supports an `AND' operation denoted by the
symbol `;', an `OR' operation denoted by the symbol
`,', a limited version of a 'NOT' operation (start-
ing at version 4.0B1) denoted by the symbol `~', or
any combination. For example, glimpse
'pizza;cheeseburger' will output all lines contain-
ing both patterns. glimpse -F 'gnu;\.c$'
'define;DEFAULT' will output all lines containing
both 'define' and 'DEFAULT' (anywhere in the line,
not necessarily in order) in files whose name contains
'gnu' and ends with .c. The NOT operation
works only together with the -W option and it
generally applies only to the whole file rather
than to individual records. Its output may sometimes
seem counterintuitive. Use with care. glimpse -W
'fame;~glory' will output all lines containing
'fame' in all files that contain 'fame' but do not
contain 'glory'. This is the most common use of
NOT, and in this case it works as expected.
glimpse -W '~{fame;glory}' will be limited to files
that do not contain both words, and will output all
lines containing one of them.
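The whole-file scope of NOT under -W can be made concrete with a small Python sketch. It models only the 'fame;~glory' case described above, using plain substring tests; the function whole_file_and_not and the sample data are hypothetical and say nothing about how glimpse evaluates the query internally.

```python
def whole_file_and_not(files, wanted, unwanted):
    """Lines containing `wanted`, taken only from files that contain
    `wanted` somewhere and do not contain `unwanted` anywhere."""
    hits = []
    for name in sorted(files):
        text = files[name]
        if wanted in text and unwanted not in text:
            hits.extend(
                (name, line) for line in text.splitlines() if wanted in line
            )
    return hits


files = {
    "a.txt": "fame is fleeting\nnothing else\n",
    "b.txt": "fame and fortune\nglory days\n",   # excluded: contains 'glory'
    "c.txt": "no match here\n",
}
result = whole_file_and_not(files, "fame", "glory")
```

Note that b.txt is dropped even though its 'fame' line does not mention 'glory': the exclusion is decided per file, not per line.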
Wild cards
The symbol '#' is used to denote a sequence of any
number (including 0) of arbitrary characters (see
LIMITATIONS). The symbol # is equivalent to .* in
egrep. In fact, .* will work too, because it is a
valid regular expression (see below), but unless
this is part of an actual regular expression, #
will work faster. (Currently glimpse is experienc-
ing some problems with #.)
Combination of exact and approximate matching
Any pattern inside angle brackets <> must match the
text exactly even if the match is with errors. For
example, <mathemat>ics matches mathematical with
one error (replacing the last s with an a), but
mathe<matics> does not match mathematical no matter
how many errors are allowed. (This option is buggy
at the moment.)
Regular expressions
Since the index is word based, a regular expression
must match words that appear in the index for
glimpse to find it. Glimpse first strips all
non-alphabetic characters from the regular expression,
and searches the index for all remaining words. It
then applies the regular expression matching algo-
rithm to the files found in the index. For exam-
ple, glimpse 'abc.*xyz' will search the index for
all files that contain both 'abc' and 'xyz', and
then search directly for 'abc.*xyz' in those files.
(If you use glimpse -w 'abc.*xyz', then 'abcxyz'
will not be found, because glimpse will think that
abc and xyz need to match whole words.)
The syntax of regular expressions in glimpse is in
general the same as that for agrep. The union
operation `|', Kleene closure `*', and parentheses
() are all supported. Currently '+' is not sup-
ported. Regular expressions are currently limited
to approximately 30 characters (generally excluding
meta characters). Some options (-d, -w, -t, -x,
-D, -I, -S) do not currently work with regular
expressions. The maximal number of errors for regular
expressions that use '*' or '|' is 4. (See
LIMITATIONS.)
Structured queries
Glimpse supports some form of structured queries
using Harvest's SOIF format. See STRUCTURED
QUERIES below for details.
EXAMPLES
(Run "glimpse '^glimpse' this-file" to get a list of all
examples, some of which were given earlier.)
glimpse -F 'haystack.h$' needle
finds all needles in all haystack.h files.
glimpse -2 -F html Anestesiology
outputs all occurrences of Anestesiology with up to two
errors in files with html somewhere in their full
name.
glimpse -l -F '\.c$' variablename
lists the names of all .c files that contain vari-
ablename (the -l option lists file names rather
than output the matched lines).
glimpse -F 'mail;1993' 'windsurfing;Arizona'
finds all lines containing windsurfing and Arizona
in all files having `mail' and '1993' somewhere in
their full name.
glimpse -F mail 't.j@#uk'
finds all mail addresses (search only files with
mail somewhere in their name) from the uk, where
the login name ends with t.j, where the . stands
for any one character. (This is very useful to
find a login name of someone whose middle name you
don't know.)
glimpse -F mbox -h -G . > MBOX
concatenates all files whose name matches `mbox'
into one big one.
SEARCHING IN COMPRESSED FILES
Glimpse includes an optional new compression program,
called cast, which allows glimpse (and agrep) to search
the compressed files without having to decompress them.
The search is actually significantly faster when the files
are compressed. However, we have not tested cast as thor-
oughly as we would have liked, and a mishap in a compres-
sion algorithm can cause loss of data, so we recommend at
this point to use cast very carefully. We do not support
or maintain cast. (Unless you specifically use cast, the
default is to ignore it.)
GLIMPSEINDEX FILES
All files used by glimpse are located at the direc-
tory(ies) where the index(es) is (are) stored and have
.glimpse_ as a prefix. The first two files
(.glimpse_exclude and .glimpse_include) are optionally
supplied by the user. The other files are built and read
by glimpse.
.glimpse_exclude
contains a list of files that glimpseindex is
explicitly told to ignore. In general, the syntax
of .glimpse_exclude/include is the same as that of
agrep (or any other grep). The lines in the
.glimpse_exclude file are matched to the file
names, and if they match, the files are excluded.
Notice that agrep matches to parts of the string!
e.g., agrep /ftp/pub will match /home/ftp/pub and
/ftp/pub/whatever. So, if you want to exclude
/ftp/pub/core, you just list it, as is, in the
.glimpse_exclude file. If you put
"/home/ftp/pub/cdrom" in .glimpse_exclude, every
file name that matches that string will be
excluded, meaning all files below it. You can use
^ to indicate the beginning of a file name, and $
to indicate the end of one, and you can use * and ?
in the usual way. For example /ftp/*html will
exclude /ftp/pub/foo.html, but will also exclude
/home/ftp/pub/html/whatever; if you want to
exclude files that start with /ftp and end with
html use ^/ftp*html$. Notice that putting a * at the
beginning or at the end is redundant (in fact, in
this case glimpseindex will remove the * when it
does the indexing). No other meta characters are
allowed in .glimpse_exclude (e.g., don't use .* or
# or |). Lines with * or ? must have no more than
30 characters. Notice that, although the index
itself will not be indexed, the list of file names
(.glimpse_filenames) will be indexed unless it is
explicitly listed in .glimpse_exclude.
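The matching rules for .glimpse_exclude lines (substring match, optional ^ and $ anchors, * and ? wild cards) can be approximated in Python as follows. This is a sketch of the rules as described above, not of glimpseindex itself; exclude_regex and is_excluded are hypothetical names, ? is assumed to stand for exactly one character, and the 30-character limit is not enforced.

```python
import re


def exclude_regex(pattern):
    """Translate one .glimpse_exclude style line into a Python regex."""
    out = []
    last = len(pattern) - 1
    for i, ch in enumerate(pattern):
        if ch == "^" and i == 0:
            out.append("^")          # anchor at the start of the file name
        elif ch == "$" and i == last:
            out.append("$")          # anchor at the end of the file name
        elif ch == "*":
            out.append(".*")         # any run of characters
        elif ch == "?":
            out.append(".")          # assumed: exactly one character
        else:
            out.append(re.escape(ch))
    return re.compile("".join(out))


def is_excluded(path, patterns):
    """True if any pattern matches somewhere in `path` (substring semantics)."""
    return any(exclude_regex(p).search(path) for p in patterns)
```

With this translation /ftp/*html excludes both /ftp/pub/foo.html and /home/ftp/pub/html/whatever, while ^/ftp*html$ excludes only the first of the two.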
.glimpse_filters
See the description above for the -z option.
.glimpse_include
contains a list of files that glimpseindex is
explicitly told to include in the index even though
they may look like non-text files. Symbolic links
are followed by glimpseindex only if they are
specifically included here. If a file is in both
.glimpse_exclude and .glimpse_include it will be
excluded.
.glimpse_filenames
contains the list of all indexed file names, one
per line. This is an ASCII file that can also be
used with agrep to search for a file name leading
to a fast find command. For example,
glimpse 'count#\.c$' ~/.glimpse_filenames
will output the names of all (indexed) .c files
that have 'count' in their name (including anywhere
on the path from the index). Setting the following
alias in the .login file may be useful:
alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'
.glimpse_index
contains the index. The index consists of lines,
each starting with a word followed by a list of
block numbers (unless the -o or -b options are
used, in which case each word is followed by an
offset into the file .glimpse_partitions where all
pointers are kept). The block/file numbers are
stored in binary form, so this is not an ASCII
file.
.glimpse_messages
contains the output of the -w option of glimpseindex.
.glimpse_partitions
contains the partition of the indexed space into
blocks and, when the index is built with the -o or
-b options, some part of the index. This file is
used internally by glimpse and it is a non-ASCII
file.
.glimpse_statistics
contains some statistics about the makeup of the
index. Useful for some advanced applications and
customization of glimpse.
.glimpse_turbo
An added data structure (used under glimpseindex -o
or -b only) that helps to speed up queries signifi-
cantly for large indexes. Its size is 0.25MB.
Glimpse will work without it if needed.
STRUCTURED QUERIES
Glimpse can search for Boolean combinations of
"attribute=value" terms by using the Harvest SOIF parser
library (in glimpse/libtemplate). To search this way, the
index must be made by using the -s option of glimpseindex
(this can be used in conjunction with other glimpseindex
options). For glimpse and glimpseindex to recognize
"structured" files, they must be in SOIF format. In this
format, each value is prefixed by an attribute-name with
the size of the value (in bytes) present in "{}" after the
attribute name. For example:
type{17}: Directory-Listing
md5{32}: 3858c73d68616df0ed58a44d306b12ba
Any string can serve as an attribute name.
Glimpse "pattern;type=Directory-Listing" will search for "pattern"
only in files whose type is "Directory-Listing". The file
itself is considered to be one "object" and its name/url
appears as the first attribute with an "@" prefix; e.g.,
@FILE { http://xxx... } The scope of Boolean operations
changes from records (lines) to whole files when struc-
tured queries are used in glimpse (since individual query
terms can look at different attributes and they may not be
"covered" by the record/line). Note that glimpse can only
search for patterns in the value parts of the SOIF file:
there are some attributes (like the TTL, MD5, etc.) that
are interpreted by Harvest's internal routines. See
http://harvest.cs.colorado.edu/harvest/user-manual/ for
more detailed information of the SOIF format.
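The attribute lines shown above follow the shape name{size}: value, where size is the length of the value in bytes. The Python sketch below parses such lines and evaluates a simple attribute=value term. It is a deliberately simplified reader, not the Harvest SOIF parser library: it handles only single-line values, and the names parse_soif_lines and attribute_matches are hypothetical.

```python
import re

# name{size}: value   (single-line values only; a simplification)
ATTR_LINE = re.compile(r"^([^{\s]+)\{(\d+)\}:[ \t](.*)$")


def parse_soif_lines(lines):
    """Return {attribute: value}, checking each declared byte size."""
    attrs = {}
    for line in lines:
        m = ATTR_LINE.match(line)
        if m is None:
            continue                     # not an attribute line
        name, size, value = m.group(1), int(m.group(2)), m.group(3)
        if len(value.encode("utf-8")) != size:
            raise ValueError(f"size mismatch for attribute {name!r}")
        attrs[name] = value
    return attrs


def attribute_matches(attrs, name, value):
    """Evaluate one attribute=value term against a parsed object."""
    return value in attrs.get(name, "")


sample = [
    "type{17}: Directory-Listing",
    "md5{32}: 3858c73d68616df0ed58a44d306b12ba",
]
attrs = parse_soif_lines(sample)
```

The size check passes for both sample lines because "Directory-Listing" is 17 bytes long and the md5 value is 32 bytes long.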
REFERENCES
1. U. Manber and S. Wu, "GLIMPSE: A Tool to Search
Through Entire File Systems," Usenix Winter 1994
Technical Conference (best paper award), San Fran-
cisco (January 1994), pp. 23-32. Also, Technical
Report #TR 93-34, Dept. of Computer Science, Uni-
versity of Arizona, October 1993 (a postscript file
is available by anonymous ftp at
ftp://ftp.cs.arizona.edu/reports/1993/TR93-34.ps).
2. S. Wu and U. Manber, "Fast Text Searching Allowing
Errors," Communications of the ACM 35 (October
1992), pp. 83-91.
SEE ALSO
agrep(1), ed(1), ex(1), glimpseindex(1), glimpseserver(1),
grep(1), sh(1), csh(1).
LIMITATIONS
The index of glimpse is word based. A pattern that con-
tains more than one word cannot be found in the index.
The way glimpse overcomes this weakness is by splitting
any multi-word pattern into its set of words and looking
for all of them in the index. For example,
glimpse 'linear programming' will first consult the index to find all
files containing both linear and programming, and then
apply agrep to find the combined pattern. This is usually
an effective solution, but it can be slow for cases where
both words are very common, but their combination is not.
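The two-phase search described here (word lookup in a lower-cased index, then a direct scan of the candidate files) can be sketched in Python. This is a toy model under stated assumptions: the index is an in-memory dictionary from words to file names rather than glimpse's block-based index, and build_index, candidate_files, and search are hypothetical names.

```python
import re

WORD = re.compile(r"[a-z0-9]+")


def build_index(files):
    """Map each lower-cased word to the set of file names containing it."""
    index = {}
    for name, text in files.items():
        for word in WORD.findall(text.lower()):
            index.setdefault(word, set()).add(name)
    return index


def candidate_files(files, index, phrase):
    """Phase 1: files whose index entries cover every word of the phrase."""
    candidates = set(files)
    for word in WORD.findall(phrase.lower()):
        candidates &= index.get(word, set())
    return candidates


def search(files, index, phrase):
    """Phase 2: scan only the candidates for the original, exact phrase."""
    hits = []
    for name in sorted(candidate_files(files, index, phrase)):
        for line in files[name].splitlines():
            if phrase in line:
                hits.append((name, line))
    return hits


files = {
    "a.txt": "linear programming is fun\n",
    "b.txt": "programming a linear model\n",
    "c.txt": "nothing relevant\n",
}
index = build_index(files)
```

Here b.txt is a candidate for 'linear programming' because it contains both words, but the second phase rejects it. That extra scan is the cost the paragraph above refers to when both words are common and their combination is not.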
As was mentioned in the section on PATTERNS above, some
characters serve as meta characters for glimpse and need
to be preceded by '\' to search for them. The most common
examples are the characters '.' (which stands for a wild
card) and '*' (the Kleene closure). So,
"glimpse ab*de" will not match ab*de, but "glimpse ab\*de"
will. The meta character - is translated automatically to
a hyphen unless it appears between [] (in which case it
denotes a range of characters).
The index of glimpse stores all patterns in lower case.
When glimpse searches the index it first converts all pat-
terns to lower case, finds the appropriate files, and then
searches the actual files using the original patterns.
So, for example, glimpse ABCXYZ will first find all files
containing abcxyz in any combination of lower and upper
cases, and then searches these files directly, so only the
right cases will be found. One problem with this approach
is discovering misspellings that are caused by wrong
cases. For example, glimpse -B abcXYZ will first search
the index for the best match to abcxyz (because the pat-
tern is converted to lower case); it will find that there
are matches with no errors, and will go to those files to
search them directly, this time with the original upper
cases. If the closest match is, say AbcXYZ, glimpse may
miss it, because it doesn't expect an error. Another
problem is speed. If you search for "ATT", it will look
at the index for "att". Unless you use -w to match the
whole word, glimpse may have to search all files contain-
ing, for example, "Seattle" which has "att" in it.
There is no size limit for simple patterns and simple pat-
terns within Boolean expressions. More complicated pat-
terns, such as regular expressions, are currently limited
to approximately 30 characters. Lines are limited to 1024
characters. Records are limited to 48K, and may be trun-
cated if they are larger than that. The limit of record
length can be changed by modifying the parameter
Max_record in agrep.h.
Glimpseindex does not index words of size > 64.
BUGS
In some rare cases, regular expressions using * or # may
not match correctly.
A query that contains no alphanumeric characters is not
recommended (unless glimpse is used as agrep and the file
names are provided). This is an understatement.
The notion of "match to the whole word" (the -w option)
can be tricky sometimes. For example, glimpse -w 'word$'
will not match 'word' appearing at the end of a line,
because the extra '$' makes the pattern more than just one
simple word. The same thing can happen with ^ and with _.
To be on the safe side, use the -w option only when the
patterns are actual words.
Please send bug reports to glimpse@cs.arizona.edu.
DIAGNOSTICS
Exit status is 0 if any matches are found, 1 if none, 2
for syntax errors or inaccessible files.
AUTHORS
Udi Manber and Burra Gopal, Department of Computer Science,
University of Arizona, and Sun Wu, the National
Chung-Cheng University, Taiwan.
(Email: glimpse@cs.arizona.edu)