Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages


home | help
URI::Find(3)	      User Contributed Perl Documentation	  URI::Find(3)

       URI::Find - Find	URIs in	arbitrary text

	 require URI::Find;

	 my $finder = URI::Find->new(\&callback);

	 $how_many_found = $finder->find(\$text);

       This module does	one thing: Finds URIs and URLs in plain	text.  It
       finds them quickly and it finds them all	(or what	considers a
       URI to be.)  It only finds URIs which include a scheme (http:// or the
       like), for something a bit less strict have a look at

       For a command-line interface, urifind is	provided.

   Public Methods
	     my	$finder	= URI::Find->new(\&callback);

	   Creates a new URI::Find object.

	   &callback is	a function which is called on each URI found.  It is
	   passed two arguments, the first is a	URI object representing	the
	   URI found.  The second is the original text of the URI found.  The
	   return value	of the callback	will replace the original URI in the

	     my	$how_many_found	= $finder->find(\$text);

	   $text is a string to	search and possibly modify with	your callback.

	   Alternatively, "find" can be	called with a replacement function for
	   the rest of the text:

	     use CGI qw(escapeHTML);
	     # ...
	     my	$how_many_found	= $finder->find(\$text,	\&escapeHTML);

	   will	not only call the callback function for	every URL found	(and
	   perform the replacement instructions	therein), but also run the
	   rest	of the text through "escapeHTML()". This makes it easier to
	   turn	plain text which contains URLs into HTML (see example below).

   Protected Methods
       I got a bunch of	mail from people asking	if I'd add certain features to
       URI::Find.  Most	wanted the search to be	less restrictive, do more
       heuristics, etc...  Since many of the requests were contradictory, I'm
       letting people create their own custom subclasses to do what they want.

       The following are methods internal to URI::Find which a subclass	can
       override	to change the way URI::Find acts.  They	are only to be called
       inside a	URI::Find subclass.  Users of this module are NOT to use these

	     my	$uri_re	= $self->uri_re;

	   Returns the regex for finding absolute, schemed URIs
	   ( and such).  This, combined with
	   schemeless_uri_re() is what finds candidate URIs.

	   Usually this	method does not	have to	be overridden.

	     my	$schemeless_re = $self->schemeless_uri_re;

	   Returns the regex for finding schemeless URIs ( and
	   such) and other things which	might be URIs.	By default this	will
	   match nothing (though it used to try	to find	schemeless URIs	which
	   started with	"www" and "ftp").

	   Many	people will want to override this method.  See
	   URI::Find::Schemeless for a subclass	does a reasonable job of
	   finding URIs	which might be missing the scheme.

	     my	$uric_set = $self->uric_set;

	   Returns a set matching the 'uric' set defined in RFC	2396 suitable
	   for putting into a character	set ([]) in a regex.

	   You almost never have to override this.

	     my	$cruft_set = $self->cruft_set;

	   Returns a set of characters which are considered garbage.  Used by

	     my	$uri = $self->decruft($uri);

	   Sometimes garbage characters	like periods and parenthesis get
	   accidentally	matched	along with the URI.  In	order for the URI to
	   be properly identified, it must sometimes be	"decrufted", the
	   garbage characters stripped.

	   This	method takes a candidate URI and strips	off any	cruft it

	     my	$uri = $self->recruft($uri);

	   This	method puts back the cruft taken off with decruft().  This is
	   necessary because the cruft is destructively	removed	from the
	   string before invoking the user's callback, so it has to be put
	   back	afterwards.

	     my	$schemed_uri = $self->schemeless_to_schemed($schemeless_uri);

	   This	takes a	schemeless URI and returns an absolute,	schemed	URI.
	   The standard	implementation supplies	ftp:// for URIs	which start
	   with	ftp., and http:// otherwise.


	   Returns whether or not the given URI	is schemed or schemeless.
	   True	for schemed, false for schemeless.

	     __PACKAGE__->badinvo($extra_levels, $msg)

	   This	is used	to complain about bogus	subroutine/method invocations.
	   The args are	optional.

   Old Functions
       The old find_uri() function is still around and it works, but its

       Store a list of all URIs	(normalized) in	the document.

	 my @uris;
	 my $finder = URI::Find->new(sub {
	     my($uri) =	shift;
	     push @uris, $uri;

       Print the original URI text found and the normalized representation.

	 my $finder = URI::Find->new(sub {
	     my($uri, $orig_uri) = @_;
	     print "The	text '$orig_uri' represents '$uri'\n";
	     return $orig_uri;

       Check each URI in document to see if it exists.

	 use LWP::Simple;

	 my $finder = URI::Find->new(sub {
	     my($uri, $orig_uri) = @_;
	     if( head $uri ) {
		 print "$orig_uri is okay\n";
	     else {
		 print "$orig_uri cannot be found\n";
	     return $orig_uri;

       Turn plain text into HTML, with each URI	found wrapped in an HTML

	 use CGI qw(escapeHTML);
	 use URI::Find;

	 my $finder = URI::Find->new(sub {
	     my($uri, $orig_uri) = @_;
	     return qq|<a href="$uri">$orig_uri</a>|;
	 $finder->find(\$text, \&escapeHTML);
	 print "<pre>$text</pre>";

       Will not	find URLs with Internationalized Domain	Names or pretty	much
       any non-ascii stuff in them.  See

       Michael G Schwern <> with insight from Uri Gutman,
       Greg Bacon, Jeff	Pinyan,	Roderick Schertler and others.

       Roderick	Schertler <> maintained versions 0.11	to

       Darren Chamberlain wrote	urifind.

       Copyright 2000, 2009-2010, 2014,	2016 by	Michael	G Schwern

       This program is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.


       urifind,	URI::Find::Schemeless, URI, RFC	3986 Appendix C

perl v5.32.0			  2020-08-08			  URI::Find(3)


Want to link to this manual page? Use this URL:

home | help