[ale] sed regexp question

Joseph A. Knapka jknapka at earthlink.net
Tue Jul 10 15:58:17 EDT 2001


Christopher Bergeron wrote:
> 
> That would only get websites that start with www;  I can't predict all the
> possible names that might arise.  i do know that the url is always encoded
> in a page as:
> 
> <A HREF="http://xxx.pornsite.com/pictures1.html/">
> 
> so, all I need to do is take everything between the "http:// and the ">
> 
> any suggestions?

Here's a briefish Tcl script that will do it:

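#!/usr/local/bin/tclsh
# Print every http URL found in an href attribute of the file named on the command line.
set chan [open [lindex $argv 0] r]
while {[gets $chan line] >= 0} {
        # match1 receives the {start end} indices of the captured URL.
        # (The regexp command must stay on one line; mailers like to wrap it.)
        while {[regexp -nocase -indices {href="*(http:[^">]*)[">]} $line match match1]} {
                puts [eval string range {$line} $match1]
                # Continue scanning from the end of the URL just printed.
                set line [string range $line [lindex $match1 1] end]
        }
}
close $chan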


Save as "geturls", chmod u+x, and invoke as "geturls <filename>".
You may need to adjust the #! line to point to your tclsh. This
needs Tcl 8.3 or better, due to the "-indices" option to regexp.
Here is a simpler version that catches only the first URL on
each line (not multiple URLs on the same line), but will run
under any version of Tcl:

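#!/usr/local/bin/tclsh
# Print the first http URL found in an href attribute on each line of the file.
set chan [open [lindex $argv 0] r]
while {[gets $chan line] >= 0} {
        # "if", not "while": $line never changes here, so a loop would never end.
        if {[regexp -nocase {href="*(http:[^">]*)[">]} $line match match1]} {
                puts $match1
        }
}
close $chan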

Of course you could do it in Perl with a one-liner, but you'd
have to wash your hands afterwards >-). The regexp is the important
thing: anything that understands extended regexps and lets you
capture submatches within the regexp will work. If you use plain
grep, you then need to postprocess the output to trim away the
parts of each line outside the URL proper, which is essentially
as hard as finding the right lines in the first place, so
it makes more sense to use Perl or awk or Tcl or...
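
For comparison, here is a rough sketch of the grep-plus-postprocessing
route. It assumes a grep that supports the -o option (GNU and BSD grep
do; it is not in POSIX), and the function name extract_urls is made up
for illustration. grep -o prints each match on its own line, so several
links on one input line are handled, and sed then trims the leading
href= and optional quote:

```shell
#!/bin/sh
# extract_urls FILE: print each http URL found in an href attribute, one per line.
# Assumption: grep supports -o (a GNU/BSD extension, not POSIX).
extract_urls() {
    # grep -i ignores case; -o prints only the matched text, one match per line.
    # The pattern mirrors the Tcl regexp: href=, optional quotes, then the URL
    # up to (but not including) the next quote or >.
    grep -io 'href="*http:[^">]*' "$1" |
        # Strip everything through the first "=" and any quotes that follow it.
        sed 's/^[^=]*="*//'
}
```

Source the file and invoke it as "extract_urls <filename>". It still takes
two tools to do what one regexp with a capture does in Tcl.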

-- Joe

> would SED or GREP be better suited for this, and even better, what is the
> way to do it?!
> 
> thanks again for all the leads...
> 
> Christopher Bergeron
> Systems Administrator
> Full Line Distributors
> (770) 416-4237
> mis at fullline.com
> 
> > -----Original Message-----
> > From: I. Herman [mailto:izzmo at mediaone.net]
> > Sent: Tuesday, July 10, 2001 1:41 PM
> > To: Christopher Bergeron
> > Subject: Re: [ale] sed regexp question
> >
> >
> > what's the html file?  You can try:
> >
> > cat whatever.html | grep http | grep www
> >
> > or something like that...not sure what you are trying to do...i'm not
> > familiar w/ sed
> >
> >
> >
> 
> --
> To unsubscribe: mail majordomo at ale.org with "unsubscribe ale" in message body.

-- 
"You know how many remote castles there are along the gorges? You
 can't MOVE for remote castles!" -- Lu Tze re. Uberwald
// Linux MM Documentation in progress:
// http://home.earthlink.net/~jknapka/linux-mm/vmoutline.html
* Evolution is an "unproven theory" in the same sense that gravity is. *