Chapter 5

Searching

by David Harlan



The first part of this book spent a great deal of time looking at ways to handle user input: parsing it, viewing it, and returning it to the user for editing. This chapter and Chapter 6, "Using Dynamic Pages," focus more on what CGI can do for pages that don't contain user data.

Web creators are always striving to put maximum content on a site. You know that the best sites out there are full of useful (or at least interesting) information. You also know that those sites make that information easy to find. This chapter closely examines a key feature that can make the data on your site more accessible: searching.

Searching the Full Text of Your Site

If you've used the Web much, you've seen any number of Web-site search features. You've also discovered that some of these features are useful and that others aren't. But what makes the difference? From a user's perspective, the two main factors are speed and accuracy. A Web programmer has to weigh these two factors against the available resources to determine what kind of search feature is best for the site.

In many cases, simple search functionality is all that's required. If you have a small site and limited storage space on your server, for example, you can't afford to add several large files to your file structure just to support a search feature. But you might decide that you can afford to have a program search every file every time a request comes in, which is what you'll be doing in this first example.

Scanning Directories Using a Recursive Subroutine

When an inexperienced programmer first examines the problem of scanning every file in a directory tree, he might be tempted to hard-code all those directories into the search script. You can probably guess that this is a bad idea: it would not only make for ugly code, but also play havoc with the maintainability of your site. Fortunately, as with eval() and dynamic code in the preceding chapter, other options are available. The most obvious place to start is the search form, which at its simplest looks like figure 5.1.

Figure 5.1 : The user fills in and submits this form to perform a search of the site.

When a user enters a word or phrase into this form and presses Return, he hopes that the script will return the information that he's looking for. The only way to be sure of that is to check every file on your site for the term, and fortunately, Perl provides some great tools that do just that. Listing 5.1 shows the first part of a script that processes the form shown in figure 5.1.


Listing 5.1  Part I of a Relatively Simple Search Script (search.pl)

#!/usr/bin/perl



require "process_cgi.pl";



&parse_input(*fields);

$search=$fields{'word'};

&print_header;



if (length($search) < 4) {

 print "Search terms of less than four letters are not allowed.";

 exit 0;

}



$found=0;



@tag=split(/\0/,$fields{'tag'});

$tags=join ("::",@tag);



$directory='/usr/local/etc/httpd/htdocs/whitehorse';



print "<pre>";

&scan_files($directory,$search);

print "</pre>";


The code shown in Listing 5.1 is the entire main program for our search function. The first thing that this script does is read in the ubiquitous process_cgi library; it then reads the data from the form into variables and prints the page header. If the search term is shorter than four characters, the script tells the user a polite version of "Sorry, Bud" and exits; otherwise, it sets $directory to the desired starting point in the directory structure. Finally, the script calls &scan_files, and you can probably guess what goes on there. But instead of guessing, look at this subroutine in Listing 5.2.


Listing 5.2  Part I of the &scan_files Subroutine from search.pl

sub  scan_files {

 my $dir=$_[0];

 my $st=$_[1];

 my (@dirs,@files,@results,$filename,$newdir,$list);

 opendir(dir,$dir);

 @dirs=grep {!(/^\./) && -d "$dir/$_"} readdir(dir);

 rewinddir(dir);

 @files=grep {!(/^\./) && /html$/ && -T "$dir/$_"} readdir(dir);

 closedir (dir);

 for $list(0..$#dirs) {

  if (!($dirs[$list]=~/temp/ || $dirs[$list]=~/images/)) {

   $newdir=$dir."/".$dirs[$list];

   &scan_files ($newdir,$st);

  }

 }


Right off the bat in Listing 5.2, you see a new piece of Perl. The first three lines of the subroutine begin with my. These lines aren't really being selfish; my is actually a call to the Perl 5 function of that name. The my function tells Perl that the listed variables should be treated as though they exist only in the current program block. This function allows you to set a variable at the same time (for example, my $dir=$_[0];) or to simply declare a list of local variables. The importance of these declarations will become clear to you soon.

After the variables are declared, you see a call to the opendir() function. This function creates a directory handle (similar to a file handle in the open() function) pointing to the specified directory. Directory handles are used by a set of Perl functions that allow the programmer to process directories in various ways.

One of these functions is readdir(), which appears in the next line. If you have never seen the grep function, however, you probably aren't clear about what's going on there, so the following paragraphs examine that line in detail.

grep is named after a standard function on most UNIX-type operating systems that scans files for certain specified text. Perl's grep() function performs a similar but perhaps broader role. The function takes an expression and a list as arguments; it evaluates the expression for each element in the list and returns all the items from the list for which the expression evaluated as true.

The expression in this case, !(/^\./) && -d "$dir/$_", has two parts, connected by Perl's and operator (&&). You should be able to decipher the first part; it says that you want the items from the list that don't start with a period. You may not recognize the second part of the expression, however. That part uses one of Perl's file-test operators: -d. This operator evaluates as true if the argument to its right is a directory. In this case, the argument to the right is a string that, when the two variables are dereferenced, will contain each successive item in the list that you are checking, prepended with (the first time that the subroutine is executed) /usr/local/etc/httpd/htdocs/whitehorse and a slash.
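
To see how this grep call behaves on its own, here is a minimal sketch (the /tmp directory is just an example):

opendir(dir, "/tmp");
@dirs = grep {!(/^\./) && -d "/tmp/$_"} readdir(dir);
closedir(dir);
print join("\n", @dirs), "\n";   # each subdirectory of /tmp that doesn't start with a period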

So here's what happens: readdir() returns an array that contains each item in the directory as a single element of that array. Because you are giving readdir() as an argument to the grep statement, the entire statement sets the array @dirs to the list of all the directories within the original directory that do not start with a period.

The statement that follows, rewinddir(dir), resets the pointer on the specified directory handle to the beginning of the directory so that you can scan the listing again in the next line. This time, you are looking for files that do not begin with a period, that end with the string html, and that are text files. The first two tests are accomplished by fairly obvious regular expressions. The third test is accomplished by another file-test operator: -T. This operator works just like its sibling -d described earlier in this section, except that it returns true for text files.

With the two arrays set, you now close the directory handle and move on to processing the information in the arrays.

The processing starts with the @dirs array, scanning through it by using a foreach loop. Although I could have accessed the data directly by using foreach $listitem(@dirs){, as you've seen before, in this case I used a different but equally valid syntax. This time, I told Perl to iterate from zero to the variable $#dirs. This variable is a special Perl variable that contains the index of the last item of the specified array. Because the script is iterating over the indices instead of the data, you'll notice that the code goes through a little more work within the loop to reference the data in the array itself. This syntax might be easier for some programmers to read and understand, however.
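
If the 0..$#dirs range is unfamiliar, this short sketch shows the two equivalent loop styles side by side:

@dirs = ('docs', 'news', 'archive');
print $#dirs, "\n";                             # prints 2, the index of the last element
for $i (0 .. $#dirs) { print "$dirs[$i]\n"; }   # iterate over the indices...
foreach $item (@dirs) { print "$item\n"; }      # ...or over the data directly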

Within the loop itself, the first thing that the script does is determine whether the current item contains the string temp or the string images; if it does, the script skips that item. Why? Well, I know that in this site's directory structure, any directory called IMAGES contains only graphics files, and I don't want to waste processor time on those directories. I also know that any directories whose names contain the string temp do not have any documents that should be considered in the search. Any time I adapt this search script to a new site, I change this conditional accordingly. The new site may call its image directories by a different name; it might have other places where temporary documents are stored.

TIP
Searching is one good reason why Webmasters and CGI programmers want to pay close attention to their directory structures and what is contained in them. In my sites, I try to be as careful as possible in putting documents that are not intended for public consumption in places that search functions skip. I also like to place images in separate directories, mostly to keep them out of the way, but also so that I can cut (if even slightly) the processor time that a search may require.

After an element passes these tests, I set the variable $newdir to that item, combined with the current directory and a slash. This means that $newdir contains the full directory path to the item. I then call &scan_files with $newdir and the same search string.

Using Recursive Subroutines

This action of a subroutine's calling itself is known as recursion. Recursion is a common programming concept that is taught in almost every basic-to-intermediate programming class. If you aren't an experienced programmer, the concept may seem to be a bit odd, but trust me: it works. Recursion is most commonly used when a programmer needs some kind of looping, but each successive iteration of the loop depends on some processing in the preceding iteration. In this example, you're scanning a directory tree. Obviously, you want to scan every subdirectory in that tree, but to do that, you have to scan every subdirectory's subdirectories. You get the idea. So you need to write a subroutine that keeps calling itself until it reaches a point at which there are no more subdirectories to scan.

When the subroutine is called the second time, it starts back at the top, just as described at the beginning of this section. This is where the importance of the my declarations comes in. If I hadn't made those variables local, each successive time the subroutine was called, the newer version of the subroutine would overwrite the data in those variables, and the results would be at best unpredictable and at worst garbage.
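
A tiny example, using a hypothetical countdown subroutine, shows what my buys you in a recursive call:

sub countdown {
 my $n = $_[0];        # lexical: each call gets its own copy of $n
 return if ($n == 0);
 &countdown($n - 1);
 print "$n\n";         # still sees this call's $n after the recursion returns
}
&countdown(3);         # prints 1, 2, 3 as the calls unwind;
                       # with a global $n, every line would print 0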

As it is, however, each successive call to the subroutine delves one branch deeper in the directory tree, creating a new list of directories and HTML files, and calling itself again for the first item in each new directory list. When a particular incarnation of the &scan_files subroutine calls itself, nothing further happens until the call to the subroutine "returns." This call can't occur until somewhere down the line the script finds a directory with no subdirectories. Then the latest incarnation of the &scan_files script moves on to process the @files array, as described later in this chapter. When that processing is finished, the subroutine returns successfully, and the preceding incarnation of &scan_files can move on to the next item in its directory list, calling itself again.

The processing moves on like this until all the results essentially cascade up from the bottom of the tree. When all the directories in @dirs from any incarnation of &scan_files are processed, the files from that particular incarnation can be processed. Finally, the script returns to the preceding &scan_files, and so it goes, back up the tree. When all the subdirectories in the original directory are finally processed, the files in that directory are processed-last. Again, it may seem to be odd at first, but if you take a simple directory tree and a paper and pencil, and trace out how it works, you'll see what I mean.

Processing the Files in Each Directory

The actual processing of the files is relatively straightforward. As you can see in Listing 5.3, I use a foreach loop to iterate over the indices of the @files array. For each item in that array, I open the file that it points to and scan for the desired text by means of the while (<file>) { loop. You'll remember that this syntax places each successive line of the file in the Perl special variable $_. The first thing that this loop does is find the title of the document.


Listing 5.3  Part II of &scan_files

 for $list(0..$#files) {

  $filename=$dir."/".$files[$list];

  $title=$files[$list];

  open file, $filename;

  while (<file>) {

   if (/<title>([^<]+)<\/title>/i) {

    $title=$1;

   }

   if (/$st/i) {

    s/<[^>]*(>|$)//ig;

    s/^[^>]*>//i;

    if (/$st/i) {

     my $urlsearch=$st;

     $urlsearch=~s/ /+/g;

     print "<a href=\"/cgi-bin/showfoundfile/$filename"."::", 

      $urlsearch."::"."$tags\">$title</a><br>\n";

     last;

    }

   }

  }

 }

return 1;

}


The conditional that accomplishes this task uses a match operator to check each line. This expression matches on any line that contains any text surrounded by <title> and </title>. One important thing to notice is that the expression is followed by i. This i tells Perl that I want it to perform a case-insensitive match. So whether the tags look like <TITLE>, <Title>, or <title>, the expression returns true. The characters between the title tags in the conditional ([^<]+) mean that I want to match one or more characters that are not less-than signs. Also, because this expression is enclosed in parentheses, this text is placed in the Perl variable $1 when the expression matches. This fact explains the line $title=$1;, which is executed when the expression returns true. Thus, $title contains the text between the title tags in each scanned file.
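
You can convince yourself that the capture works by running the match against a sample line:

$_ = '<TITLE>Price List</TITLE>';
if (/<title>([^<]+)<\/title>/i) {
 print "$1\n";   # prints "Price List" despite the uppercase tags
}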

The second conditional in the loop checks to see whether the current line contains the search text. If it does, the script does a little further checking. The two lines following this conditional remove any HTML tags from the line, using the s/// substitution operator. The first of these lines removes any complete tags and any tag that begins but does not end on this line. The second line removes a tag that starts on the preceding line and ends in the current line.

Again, these expressions are followed by the i option to indicate case-insensitivity. The first expression also invokes the g option, which tells Perl to perform the substitution on the line as many times as possible. Notice also in the first substitution that the final element in the pattern that is being matched is (>|$). The vertical-bar character (|) in a regular expression indicates alternation. This means that the pattern matches if it finds a > character or the end of the line in this position. This syntax allows you to match both complete tags and tags that start on this line with one expression.
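
Here is the first substitution at work on a hypothetical line where one tag is complete and another runs off the end:

$_ = 'plain <b>bold</b> text <a href="next.html';
s/<[^>]*(>|$)//ig;   # removes <b>, </b>, and the unterminated <a ...
print "$_\n";        # prints: plain bold text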

After the script removes the tags from the current line, it checks again to see whether the line contains the search text. If so, you want the script to return a reference to this document to the user. Obviously, you could just return the title of the document, surrounded by an anchor tag and with an href attribute that points directly to the document. But I have a little something up my sleeve, so I don't do that. I want to refer the user to a script that processes and prints the document in a special way.

To do this, first I encode the search string so that any spaces in it become plus signs. Then I print the title of the document, with an anchor tag that refers the browser to a script with the search text and some extra information tacked onto the URL, so that this information ends up in the PATH_INFO variable. The final section of this chapter explains the method to this seeming madness.

The final command in this block, last;, simply saves a little processing time. After the first match in any file, this last; command ends the processing of that file completely. At this point you only need to know that the string was matched once in a file, so there's no need to go on after the first match. Efficient, eh?

Each file in the @files array is processed as described earlier in this section. A reference to any file that contains the search text is printed back to the user. When the processing of the @files array is complete, the subroutine returns, going back to the preceding incarnation of &scan_files to either process the next item in @dirs or (if this happened to be the last item in @dirs) to process its own @files array. When all is said and done, the user is presented with a complete list of files that contain the requested text.

Using an Index Search

I hope you can see how effective the search that I just explained is. The search looks at every file on your site every time someone calls up the search function. You can probably guess one of the problems with this method: processing power. On a small site, a Web server running on "average" hardware can run through this search fairly quickly. But as a site gets bigger, the time that it would take to go through each file and directory for every search request would become intolerable for the user (and, of course, would also bog down your server). For a bigger site, you want an alternative. One of the best alternatives is to index your site. To index means that you build a file that relates each significant word on your site to a list of the pages on which the word occurs.

Instead of scanning each and every file, this search method simply looks at the index file for its references. With the right structure for the index file, this process can be lightning-quick, taking very little processing power. The major processing takes place in the building of the index file, which would have to occur only once a day (or less frequently on a less dynamic site).
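
Conceptually, a finished index behaves like this sketch; the word and file names are invented for illustration:

dbmopen(%index, "index", 0666);
print $index{'anchovy'};   # might print "#/menu.html#/recipes/pizza.html"
dbmclose(%index);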

Indexing Your Web Site into a DBM File

The process of indexing a Web site is accomplished with a script that is not technically a CGI script, because it will never be run by your Web server at the request of some remote user. The script is run from the command line by a Webmaster or is made to run automatically at specified times. Many parts of this script look quite similar to the CGI script described earlier in this chapter. Listing 5.4 shows how the script works.


Listing 5.4  Part I of the Script to Index a Web Site (indexsite.pl)

#!/usr/bin/perl



$directory='/usr/local/etc/httpd/htdocs/whitehorse';



dbmopen (%final, "index", 0666);

@time=localtime(time);

$time="$time[2]:$time[1]";

print "Scan started: $time\n";

scan_files($directory);

@time=localtime(time);

$time="$time[2]:$time[1]";

print "Scan complete: $time\n";


The first part of this script should be fairly familiar. Like search.pl in Listing 5.1, this script simply initializes some variables and then launches into its processing by calling a subroutine. Because you're not searching for any specific text here, you don't need to set a search string variable, and you don't need many of the CGI preliminaries. But as in the search script in Listing 5.1, you do set the starting directory. You also open a DBM file, which will be used to store the index information. With that done, the script does some basically unnecessary, but somewhat useful, printing before calling &scan_files. First, the script calls the localtime function, putting the results in the array @time. Then the script prints the time when the scan started, using two of the values from that array. This step is not strictly necessary, but it makes me feel better to see some output from the program as it's running.

Just like the first part of the script, the &scan_files subroutine in Listing 5.5 should look somewhat familiar.


Listing 5.5  Part I of the &scan_files Subroutine of indexsite

sub  scan_files {

 my $dir=$_[0];

 my (@dirs,@files,@results,$filename,$shortfilename,$newdir,$list, %words);

 print "Scanning: $dir \n";

 opendir(dir,$dir);

 @dirs=grep {!(/^\./) && -d "$dir/$_"} readdir(dir);

 rewinddir(dir);

 @files=grep {!(/^\./) && /html/ && -T "$dir/$_"} readdir(dir);

 closedir (dir);

 for $list(0..$#dirs) {

  if (!($dirs[$list]=~/temp/ || $dirs[$list]=~/images/)) {

   $newdir=$dir."/".$dirs[$list];

   &scan_files ($newdir);

  }

 }


I start the script by setting up my local variables, using the my() function just as I did in search.pl. Then, in an effort to give myself some peace of mind while the script is running, I print the name of the directory that the script is currently scanning. Why do this? By nature, I'm a cynic, and if I can't prove that something is working right, I assume that it's broken. Without the periodic update from the script, I would assume that the script is malfunctioning and that any moment, my server will go up in a ball of flames. Printing each directory name as I process it is an easy way to keep myself from worrying. In a more practical vein, if something does go wrong, the output can help point me to the problem.

The next few lines initialize the @files and @dirs arrays, just as search.pl did. With the arrays populated, the script iterates through @dirs, calling &scan_files for each directory found. Again, the only difference from search.pl is the fact that I don't need to look for any specific text, so I make the recursive call with the directory as the only argument. After all the directories are processed (refer to the description of recursion in "Using Recursive Subroutines" earlier in this chapter, if you haven't read it already), the script turns its attention to the files listed in the @files array.

Listing 5.6 shows where this script varies significantly from search.pl. The reason for this variation should be obvious. Whereas previously I was interested in finding one specific word or phrase, now I want to find every significant word in every document on the site.


Listing 5.6  Part II of &scan_files

 for $list(0..$#files) {

  undef(%words);

  undef(@results);

  $filename=$dir."/".$files[$list];

  $shortfilename=$filename;

  $shortfilename=~s/$directory//;

  open file, $filename;

  @file=<file>;

  $file=join(" ",@file);

  $file=~s/<[^>]*>/ /gs;

  $file=~tr/A-Z/a-z/;

  @results=split (/[^\w-']+/,$file);

  foreach (@results){

   s/^'//;

   s/'$//;

   s/^-//;

   s/-$//;

   if (length($_) > 3) {

    $words{$_}=1;

   }

  }

  foreach (keys(%words)) {

   $final{$_} .= "#$shortfilename";

  }

 }

return 1;

}


To do this, I iterate over the @files array. Each time through the loop, I initialize a few variables. First, I use the undef() function to make sure that the array @results and the hash %words are empty before I go any further. With that task accomplished, I prepend the directory name to the file name so that I can tell Perl exactly where to find the file. Then I set $shortfilename to contain the path from the original directory to the current file. I do this to keep the data that I will be storing in the DBM file as short as possible. Because I always know that I started with the directory in $directory, I don't need to put that information into the DBM file.

With all the necessary variables initialized, I open the specified file. Instead of using a while loop to look at each line, this time, I dump the entire file into an array. The line @file=<file> accomplishes this feat, putting each line of the current file (from the file handle file) into an element of the array @file. This method is an easy shortcut; use it so that you don't have to use the more common while loop syntax. From this new array, I then create one big string, using the join command.

Because I don't want to index any words that occur inside the actual HTML tags, I want to remove all the tags from the newly created string. The line $file=~s/<[^>]*>/ /gs; performs this task for me. The regular expression in this substitution matches anything that looks like an HTML tag. I tell Perl to substitute a space for each match. The g option that follows means that I want to perform this substitution as many times as possible. The s option tells Perl to treat the string as a single line; because the negated class [^>] matches any character except >, including newlines, tags that span lines are removed as well. Finally, I use $file=~tr/A-Z/a-z/ to translate all uppercase characters in the string to lowercase.
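
This sketch shows the substitution removing a tag that spans a line boundary:

$file = "See the <a\nhref=\"map.html\">map</a> for DETAILS.";
$file =~ s/<[^>]*>/ /gs;
$file =~ tr/A-Z/a-z/;
print "$file\n";   # prints: see the  map  for details.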

The $file variable now contains only the text of the document, with all words in lowercase. I now use the split function to put the words into the array @results. Although this is not the first time that you've seen split(), this instance is quite different from what I've shown you before. In previous uses, I split a string on some known character or simple string. In this case, I'm splitting based on a regular expression: /[^\w-']+/.

What's going on here? I know that at this point in the processing, $file will contain only the text of the file, but I don't know exactly what that means. I might be tempted to split just on spaces, but that wouldn't take punctuation into account. So my next thought might be to split on any nonword character, using /\W/. That method would be a good option. I chose to go a little further, though. I wanted to include hyphenated words and words with apostrophes in my index; thus, I used the expression in the preceding paragraph.

In English, the split translates roughly to this: "Split the string $file on any series of one or more characters that do not belong to the set of word characters plus apostrophe and hyphen." What I end up with, then, is an array that contains (mostly) words. Of course, this result isn't perfect; I would end up indexing "words" such as 24-30 if I happened to have a document that referred to a week at the end of a month. But I can live with that result if someone can search for Mason-Dixon and find what she's expecting.
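
Run against a sample string, the split produces exactly the kind of word list just described:

$file = "the mason-dixon line isn't june 24-30";
@results = split(/[^\w-']+/, $file);
print join("|", @results), "\n";
# prints: the|mason-dixon|line|isn't|june|24-30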

After building the array, I need to process it a little more before putting it in the DBM file. Processing each item in a foreach loop, I begin by deleting any leading or trailing single quotes or hyphens. With that task accomplished, I check to see whether the length of the item meets my criterion. If so, I make that item a key in the %words hash, with an arbitrary value of 1. I perform this processing until I've looked at all the members of @results.

When I'm done with this loop, %words contains keys for each significant word in the current file. Then I can iterate over these keys to put the appropriate information in the DBM file. For each word that occurs in %words, I append a delimiter and the current value of $shortfilename to the value of %final (which is the hash that points to the index DBM file), with the current word as the key. So if the current word is that, $_ equals that, and I will append # and $shortfilename to $final{'that'}.

All this happens for each file that occurs in @files, for each instance of @files, and in each instance of &scan_files. The procedure sounds like a great deal of work, and it is. But when all is said and done, the result is a DBM file that has a list of words as its keys. Each of these keys points to a string that contains one or more file names delimited by the pound-sign character (#). Each of these files contains one or more occurrences of that key. Does that explanation make sense? I hope so.

Performing a Search Using the Index File

All the work explained in the preceding section will be useless unless you find some use for this newly created DBM file, so I'd better show you the search function that goes along with the script. Assume that you're using the search form shown in figure 5.1 (refer to "Scanning Directories Using a Recursive Subroutine" earlier in this chapter). When the user enters some text and presses Return, he wants to see a list of documents that contain those words. Listing 5.7 shows the first part of a script that is intended to do just that.


Listing 5.7  Script to Perform a Search Using a DBM Index File (indexsearch.pl)

#!/usr/bin/perl

require "process_cgi.pl";

&parse_input(*fields);

$search=$fields{'word'};

&print_header;



@tag=split(/\0/,$fields{'tag'});

$tags=join ("::",@tag);



$urlsearch=$search;

$urlsearch=~s/ /\+/g;

$words=$search;

$words=~s/[^\w-' ]//g;

@words=split(/ +/, $words);



$directory='/usr/local/etc/httpd/htdocs';

dbmopen (%index,"index",0666);

$i=0;


The first section of this script performs all the preliminary steps needed before the bulk of the processing takes place. The script begins by parsing the input from the form and placing the search string in $search. The script then creates an array called @tag from the check boxes below the search-text box and then uses join() to create a single string that contains all the selected tags. Next, the script creates a string called $urlsearch. This string will be part of the URL that links to the returned pages. Then the script removes any characters from the search string that don't fit the criteria listed earlier in this chapter: any character that isn't a letter, a number, a hyphen, or a single quotation mark. After that, the script splits the search string into individual words in an array called (not surprisingly, I hope) @words.

TIP
You may notice that I silently ignore any illegal characters in the script shown in Listing 5.7. Some users may find this silence confusing or even annoying. Many Web users-particularly longtime Web users-expect to have complete control and knowledge of what's going on; they don't like things to go on behind the scenes (or, as they might say, behind their backs). If your site's target audience includes this type of user, you want to keep this attitude in mind. You probably will want to warn the user that he entered illegal characters in the search form instead of ignoring those characters.

Finally, the script sets the directory variable and opens the DBM file, ready to look for the search words. This process begins in Listing 5.8.


Listing 5.8  Part II of the DBM Search Script

foreach $word(@words) {

 undef(%mark);

 undef(@files);

 $files=$index{$word};

 $files=~s/^#//;

 @files=split(/#/,$files);

 grep ($mark{$_}++,@files);

 if ($i > 0) {

  @combined=grep($mark{$_},@oldfiles);

  @oldfiles=@combined;

 }

 else {

  @combined=@files;

  @oldfiles=@files;

 }

$i++;

}

dbmclose (%index);


The processing of the @words array takes place in a foreach loop. (You should have been able to guess that by now.) The script begins by undefining two arrays to make sure that they are empty at the beginning of each iteration of the loop. Then the script sets $files to equal $index{$word}, which puts the list of files from the index DBM file for the current value of $word in $files. Next, the script removes the leading # before splitting $files into the array @files.

The following syntax is the key to this entire process. The intention of this script is to find all the documents that contain all the words supplied by the user in the form. To do this, the script needs to compare the @files array from each iteration of the foreach $word(@words){ loop with all the other @files arrays from all iterations of the loop. Unfortunately, Perl has no built-in logical-and function for arrays, so we use magic in this listing. Unlike most magicians, though, I'll explain the trick.

The first grep() creates a key in the hash %mark for each item in @files, assigning an arbitrary value to that key. How? Recall exactly what grep does: It evaluates the first argument for each item in the list in the second argument, returning an array that contains all the items for which the first argument evaluated true. In this instance, you really don't care whether the argument evaluates true; you're just using the grep as a quick way of giving %mark the appropriate set of key-value pairs, so you don't even assign the result anywhere. Notice that you could just as easily have set values for %mark by using a foreach loop.

As you can see by the conditional if ($i > 0) {, the first time through the loop, this hash doesn't get used at all. The script simply sets @oldfiles and @combined to equal the current value of @files, and moves on to the next word in @words. If the user happened to type only one word, the script is done. The array @combined contains the list of files that you want to return to the user.

But if the user typed more than one word, you go through the loop again. Again, you set the %mark as described earlier in this section. This time, though, you use it. You set @combined to equal the result of grep($mark{$_},@oldfiles). Don't blink, or you'll miss the slick part of this. Remember that %mark has a key associated with a 1 for each item in the current @files array. The second time through this loop, @oldfiles contains the @files array from the preceding iteration of the loop. So when grep evaluates $mark{$_}, for each item in @oldfiles, the expression is true only if the current $_ exists as a key in %mark. The value $_ exists as a key in %mark if that value was in @files this time through the loop. The result is that @combined contains only those values that existed in both @files and @oldfiles. Get it?

Each successive time through the loop, @oldfiles becomes @combined, so that when you do the grep($mark{$_}, @oldfiles), you're anding to the proper array. This process ands together arrays until the cows come home. In the end, when you run out of words in @words, @combined contains the list of files that contain all the words in @words.
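
Stripped of the surrounding loop, the intersection trick looks like this:

@files    = ('a.html', 'b.html', 'c.html');    # files containing the current word
@oldfiles = ('b.html', 'c.html', 'd.html');    # files containing all previous words
grep($mark{$_}++, @files);                     # mark everything in @files
@combined = grep($mark{$_}, @oldfiles);        # keep only what was marked
print join(" ", @combined), "\n";              # prints: b.html c.html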

Now all that is left to do is run through @combined so that the script can return a page to the user. Listing 5.9 shows this process.


Listing 5.9  Final Section of indexsearch.pl

if ($#combined > -1) {

 foreach $list(@combined) {

  $title="${list}: No Title";

  $filename=$directory.$list;

  open file, $filename;

  while (<file>) {

   if (/<title>([^<]+)<\/title>/io) {	

    $title=$1;

    last;

   }  

  }

  close file;

  print "<a href=\"/cgi-bin/showfoundfile/$filename"."::".$urlsearch."::", 

   "$tags\">$title</a><br>\n";

 }

}

else {

 print "No matching documents found.";

}


The script begins by checking to see whether there are any elements at all in @combined. If $#combined is greater than -1, there is at least one element, so the script starts a foreach loop over @combined. Initially in this loop, I set a default title for the document, in case one does not exist in the document itself. Then I set $filename equal to the current element of @combined appended to the value of $directory. Remember that the index script shown earlier in the chapter shortened the file names that it stored so as to save space. I have to add back here what I removed earlier so that Perl can find the file.

When the script has the full path to the file in $filename, it opens that file and searches for the title. This code is identical to the code that found the title of the documents in the preceding search script. Now, with the title of the document in hand, the script prints that title hyperlinked to a URL that sends the user to that document. This process occurs for each file found in @combined, so the result is a page that lists all the titles for the documents that met the user's search criteria.

There is one major functional difference between this search and the one that you saw earlier in this chapter. In this search, the text that the user enters is treated as a list of words. This search returns all documents that contain all the listed words anywhere in the document. By contrast, the first search treated the entered text as a string; it returned only those documents that contained the entire string, spaces, punctuation, and all. You will want to take this fact into consideration when you decide which search to implement on your site.

Printing the Resulting Pages

You may wonder why printing the resulting pages deserves its own section. As I said earlier, I have a bit of a trick up my sleeve that has the potential to enhance any search function significantly. Many times, when I have searched a site, I ended up going to pages where I couldn't even find the text that I searched for. I realize that most modern browsers have a command that allows the user to find text. But as a programmer, I wondered whether I could build this functionality into the search. The answer, of course, was yes. The following sections explain how.

Returning Pages from the Nonindex Search

The first search function described in this chapter searched for a single word or phrase. A list of pages returned from that search might look like the page shown in figure 5.2.

Figure 5.2 : This page results from a user-initiated search.

Figure 5.2 shows what is obviously a simple page-just a list of files. Select one of those links, however, and see what happens.

Figure 5.3 clearly shows the additional feature of this search function. The number of occurrences of the search term is indicated at the top of the page, and there is a link to each occurrence. Notice, also, that each occurrence in the text is marked with the tags that the user chose on the form shown in figure 5.1. Finally, a back link takes the user from the current occurrence back to the top of the page, should she want to go back. With this feature, the user gets not only the pages that contain her search term, but also direct links to every occurrence of that search term on every returned page. Listing 5.10 shows the first part of the script that accomplishes this minor miracle.

Figure 5.3 : This page displays the information returned from the search.


Listing 5.10  Part I of a Script to Return a File from a Simple Search (showfoundfile.pl)

#!/usr/bin/perl



require "process_cgi.pl";

&print_header;

$temp=&path_info;

($file,$st,$tag1,$tag2,$tag3)=split ("::",$temp);



$file=~m#/usr/local/etc/httpd/htdocs/whitehorse(/.*)/#;

$urlstart=$1;



$starttag="<$tag1><$tag2><$tag3>";

$endtag="</$tag3></$tag2></$tag1>";

$starttag=~s/<>//g;

$endtag=~s/<\/>//g;


You shouldn't have too much trouble figuring out what's going on in Listing 5.10. The script begins by calling in the process_cgi library, printing the header, and grabbing the PATH_INFO variable; it then splits that variable into its components. Next, the script sets the variable $urlstart to contain the path to this file from the document root of the Web server. You may not recognize the m## pattern-match operator that I use for this purpose, but you have used this feature many times before. In those previous uses, you employed the standard delimiter, /. In such cases, you can (and normally do) leave off the m. I chose to perform this match in this fashion so that I wouldn't have to put backslashes in front of all the slashes in the string that I'm matching.
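
These two lines perform the same match; the second simply spares you the backslashes:

$file =~ /^\/usr\/local\/etc\/httpd/;    # standard delimiter, slashes escaped
$file =~ m#^/usr/local/etc/httpd#;       # alternative delimiter, no escaping needed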

The final bit of processing in this section of the script sets up $starttag and $endtag. These variables do just what their names suggest: they contain the beginning and ending tags that mark the found text. The two substitutions that follow the initial assignments are necessary because I don't know whether all the tag variables will contain any text; if any or all of them are empty, I have to remove the resulting empty tags from the strings.
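
If the user checked only one tag on the form, say b, the cleanup works like this:

$starttag = "<b><><>";       # $tag2 and $tag3 were empty
$endtag   = "</></></b>";
$starttag =~ s/<>//g;        # leaves "<b>"
$endtag   =~ s/<\/>//g;      # leaves "</b>"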

The code shown in Listing 5.11 begins by initializing $page for later use and decoding the search text. Next, the script opens the file that the user selected and initializes a couple more variables. Then the script walks through the file, using the standard while loop. The first thing that this loop does is check to see whether the line contains the flag text EEENNNDDD. If so, the script sets $dontmark equal to y. You'll notice in the line that follows that if $dontmark equals anything but n, the script doesn't do any processing on the current line. This flag feature allows you to mark the ending portions of certain files as being off-limits to marking. This feature can be helpful if, for example, you have a footer on a page that might get mangled if you start pasting new anchors into it. To activate this feature, I simply put EEENNNDDD in an HTML comment at the point in the document where I wanted the marking to stop.


Listing 5.11  Part II of showfoundfile

$page=' ';

$st =~ s/\+/ /g;

open (f,$file) || print "couldn't open file ${file}:$!";

$iteration=1;

$dontmark='n';

while (<f>) {

 $dontmark='y' if /EEENNNDDD/;

 if ((!(/<[^>]*$st/i)) && (!(/[^><]*$st[^>]*>/i)) &&

  (!(/<title>[^<]+<\/title>/i)) && ($dontmark eq 'n')) {

  if (/$st/i) {

   s/($st)/<a name=search${iteration}>${1}<\/a>(<a href=#searchtop>back<\/a>)/gio;

   s/($st)/${starttag}${1}${endtag}/gio;

   $iteration++;

  }

 }

 s#(<[^>]+)(href|src)(= *"*)(../)#$1$2$3$urlstart/$4#gi;

 s#(<[^>]+)(href|src)(= *"*)(http:)#$1$2$3/$4#gi;

 s#(<[^>]+)(href|src)(= *"*)([\w+])#$1$2$3$urlstart/$4#gi;

 s#(<[^>]+)(href|src)(= *"*)(/)(http:)#$1$2$3$5#gi;

 push (@page,$_);

}


In addition to passing the $dontmark flag, the line has to pass three other tests before going on. The first two tests make sure that the line doesn't contain HTML tags that contain the search text. I choose to skip those lines that contain HTML tags, because they can cause significant confusion. I also don't want to do any additional marking inside the title tag, so I skip the line that contains the title.

After the line passes these tests, the script checks to see whether it contains the search text. If so, the script marks each occurrence of that text with a named anchor; it also adds a link that will take the user back to the top of the page. Finally, the script marks the found text with the tags that the user selected in the form. When the additional markup is complete, the script increments the $iteration variable, so that the next time the script finds the search text, the named anchor will have a unique identifier.
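
To make the two substitutions concrete, here is a hypothetical line passing through them, with a search for perl and <b> as the chosen tag:

$st = 'perl'; $iteration = 1;
$starttag = '<b>'; $endtag = '</b>';
$_ = 'Learn perl here.';
s/($st)/<a name=search${iteration}>${1}<\/a>(<a href=#searchtop>back<\/a>)/gi;
s/($st)/${starttag}${1}${endtag}/gi;
print "$_\n";
# prints: Learn <a name=search1><b>perl</b></a>(<a href=#searchtop>back</a>) here.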

The next four lines are needed to change any URLs on the original page that are relative to that page's location into absolute URLs. If an image on the requested page is contained in the same directory as the page, for example, and is referred to by only its file name, without these substitutions, that picture would show up as a broken image on the page that this script outputs.

The first substitution takes care of any references to documents that are higher up in the directory structure. The second substitution temporarily puts a slash before any URLs that start with http:, so they are not affected by the line that follows. This line adds $urlstart before any references that begin with alphanumeric characters. (This substitution solves the broken-image problem described in the preceding paragraph.) The fourth substitution removes any slashes that were put before http: two lines earlier.
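
Here is the effect of the third substitution on a hypothetical relative image reference, with $urlstart already set:

$urlstart = '/whitehorse/news';
$_ = '<img src="logo.gif">';
s#(<[^>]+)(href|src)(= *"*)([\w+])#$1$2$3$urlstart/$4#gi;
print "$_\n";   # prints: <img src="/whitehorse/news/logo.gif">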

When all the appropriate substitutions are complete, the script adds the current line to the @page array, using the push function. When all the lines of the file have been processed, the script moves on to the code shown in Listing 5.12.


Listing 5.12  Final Section of the showfoundfile Script

$header="<p>";

for ($i=1;$i < $iteration;$i++) {

 push (@header,"<a href=#search$i>$i</a>");

}

$iteration--;

$header=join(" ",@header);

$header="<a name=searchtop><hr>\"$st\" was found in $iteration lines. 

Click on the numbers below to go to each occurrence.<p>".$header."<hr>" if $iteration > 1;

$header="<a name=searchtop><hr>\"$st\" was found once.

Click on the 1 below to find it.<p>".$header."<hr>" if $iteration == 1;

foreach $page(@page) {

 $page=~s/(<body[^>]*>)/$1$header/i;

 print $page;

}


The final section of showfoundfile begins by creating a header for the document. This header contains the links to each occurrence of the search text on the page. This process begins with a for loop that creates the list of numbered links, pushing each one into the @header array. When that task is finished, the script joins @header into a single string with join() and then adds the explanatory text and formatting. You'll notice that if the search text occurs only once in the document, I use different text. Although this step isn't strictly necessary, it didn't cost much effort, and the result is better-looking output.

When it's done with the header, the script runs through the @page array created earlier and prints each line back to the user. When the script runs into the <body> tag for the document, it adds the $header text immediately after. The result is a page like the one shown in figure 5.3 earlier in this section.

As I described this script, you may have noticed that it is not perfect. The script requires all the HTML files to have a body tag, for example; it needs img and a tags to start on the same line as their src and href attributes. The script also may behave poorly if the search text was an HTML tag or an attribute to a tag. In the applications in which I've employed this system, these facts didn't matter, because the sites had solid HTML code to begin with-and also were sites whose users were extremely unlikely to search for HTML tags and attributes. You will want to take these limitations into consideration before you implement a search of this kind; you may need to modify it to fit your needs.

Returning Pages from the Index Search

As in the nonindex search described earlier in this chapter, I want to be able to show the user where her search terms occurred in the returned documents. Unfortunately, I can't use the same showfoundfile script, because in the index search, the search string is interpreted as a list of words, rather than a single phrase. When a user executes a search and selects a file from the resulting list, figure 5.4 shows what she would see.

Figure 5.4 : This page shows the information returned from the index search through showfoundindexfile.

The first part of the showfoundindexfile script is identical to the code shown in Listing 5.10. The next section varies, as you might expect, because this script looks at the search text as a list of words. You can see the differences in Listing 5.13.


Listing 5.13  Partial Listing of showfoundindexfile.pl

$page=' ';

$search =~ s/\+/ /g;

@words=split(/ +/,$search);

foreach (@words) {$iteration{$_}=1;};

open f,$file;

$dontmark='n';

while (<f>) {

 $dontmark='y' if /EEENNNDDD/;

 $i=1;

 foreach $st(@words) {

  if ((!(/<[^>]*$st/i)) && (!(/[^><]*$st[^>]*>/i)) && (!(/<title>[^<]+<\/title>/i)) && 

  ($dontmark eq 'n')) {

   if (/$st/i) {

    s/($st)/<a name="search${i}$iteration{$st}">${1}<\/a>(<a href=#searchtop>back<\/a>)/gi;

    s/($st)/${starttag}${1}${endtag}/gi;

    $iteration{$st}++;

   }

  }

 $i++;

 }

 s#(<[^>]+)(href|src)(= *"*)(../)#$1$2$3$urlstart/$4#gi;

 s#(<[^>]+)(href|src)(= *"*)(http:)#$1$2$3/$4#gi;

 s#(<[^>]+)(href|src)(= *"*)([\w+])#$1$2$3$urlstart/$4#gi;

 s#(<[^>]+)(href|src)(= *"*)(/)(http:)#$1$2$3$5#gi;

 push (@page,$_);

}


The first change that you'll notice is the splitting of the search text into the array @words. Immediately thereafter, the script initializes the hash %iteration. This hash performs the same function in this script that $iteration does in showfoundfile. After opening the file and setting $dontmark, the script runs through each line of the file with a while loop, just as the preceding script did.

Now, for each line, the script has to check for any occurrences of each word in @words. The code that checks for and marks the words is identical to that in showfoundfile; it's just embedded in a foreach loop that iterates over @words. The only change in the marking is the addition of the variable $i to the anchor name for each reference. This variable will contain one for the first word in @words, two for the second word, and so on. This makes each anchor name unique.

When the markup is complete, the script again has to resolve any partial URLs with the same four lines of code from showfoundfile. When that task is accomplished, the script pushes the line into the @page array and moves on. Listing 5.14 shows what goes on after the entire file has been processed.


Listing 5.14  End of the Partial Listing of showfoundindexfile

$j=1;

$header="<a name=searchtop>";

foreach $st(@words) {

 undef(@header);

 for ($i=1;$i < $iteration{$st};$i++) {

  push (@header,"<a href=#search$j$i>$i</a>");

 }

 $newheader = join(" ",@header);

 $iteration{$st}--;

 $newheader="<hr>\"$st\" was found $iteration{$st} times. 

 Click on the numbers below to go to each occurrence.<p>".$newheader."<hr>"

 if $iteration{$st} > 1;

 $newheader="<hr>\"$st\" was found once. 

 Click on the 1 below to find it.<p>".$newheader."<hr>" if $iteration{$st} == 1;

 $j++;

 $header .= $newheader;

}

foreach $page(@page) {

 $page=~s/(<body[^>]*>)/$1$header/i;

 print $page;

}

# print $header;   # leftover debug output

# print @words;


After initializing $header, the script scans through @words. The processing that takes place in the loop builds the header for each word in @words. This process should look familiar. The major differences between this code and the same section in showfoundfile are the use of the hash %iteration in place of the scalar $iteration and the fact that I had to put the $j in front of $iteration{$st} in the anchor href for each occurrence of the current word. As soon as the header for each word is complete, I append it to the variable $header. So when this loop is finished, $header will contain a header for each word that the user searched for.

The end of this script is identical to showfoundfile. The script loops through the @page array, printing each line and adding the header text where appropriate.

From Here...

This chapter exposed you to two effective methods for searching a Web site. For smaller sites, I demonstrated a full-text search that looks for a user-provided word or phrase in every file on a site. As an alternative for larger sites, you saw a method to index every significant word in the server's document structure. I also demonstrated a search method to go along with this indexing. Finally, I showed you an alternative method for returning pages from these searches to the user.

With these tools, you can implement your own search feature on your site. You may want to branch out in a different direction. Following are some suggested destinations: