[ale] sanity check article...

Mon Feb 4 20:50:25 EST 2002

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Here is my $0.02 USD on the actual perl script:

* Always add 'use strict' at the top of your scripts, 
especially if you are introducing perl to new people.

* Check the results of the functions open and opendir. 
If they fail, you want to know about it right away, 
and not go any further:

opendir (DIR,"$directory") 
  or die "Could not open directory $irectory: $!\n";

open(FILE,"$directory/$file")
  or die "Could not open $firectory/$file: $!\n";

* This line:

| close DIR;  # close the directory handle (usually automatic)

should be "closedir" not "close." Also, if you are specifically 
closing the dir handle, you may as well close the file handle 
as well at the end of the while loop.

* The regex needs a conditional to work properly (see below)

That's the biggies. I'm away from home with some time to kill :), 
so I'll go over it some more in detail below. Some are definite 
gotchas, others are merely Another Way To Do It.

* Try to avoid hard-coding machine-specific info whenever 
possible (and if you must, put it at the top of your 
script, as you've aready done). For example, instead of:

| my $url = 'http://myhost.mydomain.com'; # baseurl here

try this:

my $url = "http://$ENV{'SERVER_NAME'}";

or even:

my $url = sprintf 
  "http://$ENV{'SERVER_NAME'}%s",
  $ENV{'SERVER_PORT'}==80 ? "" : ":$ENV{'SERVER_PORT'}";

(only if you know this is going to be run over the web and 
your web server sets these environment variables (Apache does) 
You could wrap it in some additional logic for commmand-line 
testing.)

* This loop:

| if ($file =~ /\.html$/) {       # is it an html file?
|  ...
| }

has no matching else, so we can save some brackets and make 
things a little clearer by turning it into a quick test:

next unless $file =~ /\.html$/;

* There is no need to put parens around all three of the regex 
sections, if all we care about is the second one:

| ($title = $line) =~ s#(<title>)(.*)(</title>)#$2#i;

becomes:

($title = $line) =~ s#<title>(.*)</title>#$1#i;

* The "line" and "title" variables are not really needed, 
but do help clarify for newbies. A close call, but I would 
eliminate at least one of them.

* The inner loop, which checks for a line with "title", 
does not exit once a title is found. Simply add a "last" 
to the end of the if ($line =~ /<title>/i) loop.

* We do not need to chomp, since the regex is only looking 
for a closing title tag, and does not care about line endings.

* The print line after the regex match should be wrapped inside 
a conditional: we only want to print if we managed to 
actually make the substitution. In other words, the substitution 
will fail if the closing title tag is not found, in which case 
we will end up printing the entire line that containts the 
title tag! We should also account for cases in which no title 
tag was found (using, perhaps, the file name, un-bolded to 
differentiate it). We can also skip the substitution, and 
simply use a match:

if ($line =~ m#<title>(.*)</title>#i) {
  # print the line with fresh hyperlinks
  print p( a({href=>"$url/$dir/$file"}, "<b>$1</b>"));
}

later on as an "else":
{
  print p( a({href=>"$url/$dir/$file"}, "$file"));
}

* Parsing html is a very advanced skill: it is far better to 
use the modules already available to do the job. However, the 
above probably suffices in this case. For more information:

http://search.cpan.org/search?dist=HTML-Parser

I would at the very least warn people of this: this particular 
wheel tends to get re-invented over and over. There are a few 
things that regexs are not good for. :)

* We can use the handy perl "grep" function to combine the 
readdir and the "html extension" check into one. We could 
also store the results into an array, which allows us to 
close the directory handle quicker, and have some cleaner 
code (at the expense of a small amount of memory). While 
we're there, we also make the final "l" in "html" optional 
for those that might still play by the silly DOS 8.3 rules:

| while ($file = readdir(DIR)) {   # read a file at a time
|  if ($file =~ /\.html$/) {       # is it an html file?

my @files = grep {/\.html?$/) readdir DIR;
for my $file (@files) {

or even combine the two to regain that memory:

for my $file (grep {/\.html?$/) readdir DIR) {

* As an added bonus to putting all the entries into an array 
first, we also know how many files we read in, and can display 
that at the top of the page:

print h2("Files found: ". at files);

* The CGI printing functions should be prefixed with a print:

| print h1("Welcome to my Recipe Collection");

* As a final garnish, let's sort the entries by name by passing 
the result of the grep into a normal, default sort:

my @files = sort grep {/\.html?$/} readdir DIR;

The final version:

#!/usr/bin/perl -w

use CGI qw(:standard);     # Here's what makes it all possible
use strict;

my $directory = '/var/www/myfiles';     # web files located here
my $url = sprintf                       # baseurl here
  "http://$ENV{'SERVER_NAME'}%s",
  $ENV{'SERVER_PORT'}==80 ? "" : ":$ENV{'SERVER_PORT'}";

my $dir = 'myfiles';                    # basename of dir where
                                        # web files are

print header(), start_html("My Recipes"), "\n";

print h1("Welcome to my Recipe Collection"),"\n";

print p("I was inspired by " .
a({href=>"http://www.computoredge.com"}, "this website") . " to
create a dynamically generated web page."),"\n";

print p("The following code will generate a list of all HTML
files in the current directory:"),"\n";

## Open a directory handle:
opendir (DIR,"$directory")
  or die "Could not opendir $directory: $!\n";

## Read in files ending in htm or html, sort, and put in an array
my @files = sort grep {/\.html?$/} readdir DIR;

print h2("Files found: ". at files), "\n";

## Open each file in turn and try to find the title
for my $file (@files) {
  open(FILE,"$directory/$file")
    or die "Could not open $directory/$file: $!\n";
  my $titlefound=0;
  while (<FILE>) {
    if (m#<title>(.*)</title>#i) {
      print p( a({href=>"$url/$dir/$file"}, "\n <b>$1</b>"));
      ++$titlefound and last; ## Ignore the rest of the file
    }
  }
  close(FILE); # close the file handle (usually automatic)
  if (!$titlefound) {
    print p( a({href=>"$url/$dir/$file"}, "\n $file"));
  }
  print "\n";
}
closedir(DIR);  # close the directory handle (usually automatic)
print end_html();  # print </body></html>

exit;

Yes, I know some of the things (e.g. the sort/grep/readdir line) 
may be a bit much for beginners, but it was fun rewriting this, 
and hopefully someone learned something along the way. I'd 
personally remove the close and closedir : perl handles this 
so well automatically it is rarely needed expicitly and the 
open/opendir success tests are much more important.

Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200202042039
Just Another Perl Hacker

-----BEGIN PGP SIGNATURE-----
Comment: http://www.turnstep.com/pgp.html

iQA/AwUBPF86ULybkGcUlkrIEQKjZACfcamZ+rftRFE2qbPYT4Kb3Fcoyx4An0/D
jzS5+D5uKW824olPsMXE23zH
=Hyl3
-----END PGP SIGNATURE-----

---
This message has been sent through the ALE general discussion list.
See http://www.ale.org/mailing-lists.shtml for more info. Problems should be 
sent to listmaster at ale dot org.