MCE::Grep(3)	      User Contributed Perl Documentation	  MCE::Grep(3)

NAME
       MCE::Grep - Parallel grep model similar to the native grep function

VERSION
       This document describes MCE::Grep version 1.874

SYNOPSIS
	## Exports mce_grep, mce_grep_f, and mce_grep_s
	use MCE::Grep;

	## Array or array_ref
	my @a =	mce_grep { $_ %	5 == 0 } 1..10000;
	my @b =	mce_grep { $_ %	5 == 0 } \@list;

	## Important: pass an array_ref for deeply nested input data
	my @c =	mce_grep { $_->[1] % 2 == 0 } [	[ 0, 1 ], [ 0, 2 ], ...	];
	my @d =	mce_grep { $_->[1] % 2 == 0 } \@deeply_list;

	## File	path, glob ref,	IO::All::{ File, Pipe, STDIO } obj, or scalar ref
	## Workers read directly and do not involve the manager process
	my @e =	mce_grep_f { /pattern/ } "/path/to/file"; # efficient

	## Involves the	manager	process, therefore slower
	my @f =	mce_grep_f { /pattern/ } $file_handle;
	my @g =	mce_grep_f { /pattern/ } $io;
	my @h =	mce_grep_f { /pattern/ } \$scalar;

	## Sequence of numbers (begin, end [, step, format])
	my @i = mce_grep_s { $_ % 3 == 0 } 1, 10000, 5;
	my @j = mce_grep_s { $_ % 3 == 0 } [ 1, 10000, 5 ];

	my @k = mce_grep_s { $_ % 3 == 0 } {
	   begin => 1, end => 10000, step => 5,	format => undef
	};

DESCRIPTION
       This module provides a parallel grep implementation via Many-Core
       Engine (MCE). MCE incurs a small overhead from passing data between
       processes, so a fast code block will run faster with the native grep.
       The overhead matters less as the code block becomes more expensive.

	my @m1 =     grep { $_ % 5 == 0	} 1..1000000;	       ## 0.065	secs
	my @m2 = mce_grep { $_ % 5 == 0	} 1..1000000;	       ## 0.194	secs

       Chunking, enabled by default, greatly reduces the overhead behind the
       scenes. The time for mce_grep below also includes the time for data
       exchanges between the manager and worker processes. The benefit of
       parallelization grows as the code block consumes more CPU time.

	my @m1 =     grep { /[2357][1468][9]/ }	1..1000000;    ## 0.353	secs
	my @m2 = mce_grep { /[2357][1468][9]/ }	1..1000000;    ## 0.218	secs
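
       The idea behind chunking can be sketched in plain Perl. This is only
       an illustration, not MCE's implementation: chunk_list is a
       hypothetical helper, and the chunks are processed serially here,
       whereas MCE hands them to worker processes. The point is that each
       hand-off carries several items at once, and that concatenating the
       per-chunk results in chunk order reproduces the native grep result.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MCE): split a list into chunks
# holding at most $size items each.
sub chunk_list {
    my ($size, @items) = @_;
    my @chunks;
    push @chunks, [ splice @items, 0, $size ] while @items;
    return @chunks;
}

# Ten items with a chunk size of four give chunks of 4, 4, and 2 items.
my @chunks = chunk_list(4, 1 .. 10);

# Grep each chunk separately, then append the results in chunk order.
my @matches;
for my $chunk (@chunks) {
    push @matches, grep { $_ % 5 == 0 } @$chunk;
}

print scalar(@chunks), " chunks, matches: @matches\n";
```

       With a larger chunk size, fewer hand-offs are needed for the same
       input, which is why the per-item overhead shrinks.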

       Even faster is mce_grep_s, which is useful when the input data is a
       range of numbers. Workers generate sequences mathematically among
       themselves without any interaction from the manager process. Two
       arguments are required for mce_grep_s (begin, end). Step defaults to 1
       if begin is smaller than end, otherwise -1.

	my @m3 = mce_grep_s { /[2357][1468][9]/	} 1, 1000000;  ## 0.165	secs
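
       The default-step rule can be illustrated with a small plain-Perl
       sketch. The helper expand_sequence is hypothetical and exists only
       to show which numbers a (begin, end [, step]) specification
       describes; it follows the rule as stated above and says nothing
       about how MCE workers divide the sequence among themselves.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MCE): expand a sequence specification
# into the list of numbers it describes.
sub expand_sequence {
    my ($beg, $end, $step) = @_;

    # Rule from the text: step is 1 if begin is smaller than end, else -1.
    $step = ($beg < $end) ? 1 : -1 unless defined $step;
    die "step must not be zero\n" if $step == 0;

    my @seq;
    if ($step > 0) {
        for (my $n = $beg; $n <= $end; $n += $step) { push @seq, $n }
    }
    else {
        for (my $n = $beg; $n >= $end; $n += $step) { push @seq, $n }
    }
    return @seq;
}

my @up   = expand_sequence(1, 5);       # ascending, default step  1
my @down = expand_sequence(5, 1);       # descending, default step -1
my @by5  = expand_sequence(1, 20, 5);   # explicit step

print "@up\n@down\n@by5\n";
```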

       Although this document is about MCE::Grep, the MCE::Stream module can
       write results immediately without waiting for all chunks to complete.
       This is made possible by passing a reference to the output array (in
       this case @m4 and @m5).

	use MCE::Stream	default_mode =>	'grep';

	my @m4;	mce_stream \@m4, sub { /[2357][1468][9]/ }, 1..1000000;

	   ## Completed	in 0.203 secs. This is amazing considering the
	   ## overhead for passing data	between	the manager and	workers.

	my @m5;	mce_stream_s \@m5, sub { /[2357][1468][9]/ }, 1, 1000000;

	   ## Completed in 0.120 secs. As with mce_grep_s, specifying a
	   ## sequence turns out to be faster due to less overhead for
	   ## the manager process.

       A common scenario is grepping for patterns inside a massive log file.
       Notice how the benefit of parallelism grows as the pattern becomes
       more complex. Testing was done against a 300 MB file containing 250k
       lines.

	use MCE::Grep;

	my @m; open my $LOG, "<", "/path/to/log/file" or die "$!\n";

	@m = grep { /pattern/ }	<$LOG>;			     ##	 0.756 secs
	@m = grep { /foobar|[2357][1468][9]/ } <$LOG>;	     ##	24.681 secs

	## Parallelism with mce_grep. This involves the	manager	process
	## due to processing a file handle.

	@m = mce_grep {	/pattern/ } <$LOG>;		     ##	 0.997 secs
	@m = mce_grep {	/foobar|[2357][1468][9]/ } <$LOG>;   ##	 7.439 secs

	## Even	faster with mce_grep_f.	Workers	access the file	directly
	## with	zero interaction from the manager process.

	my $LOG_PATH = "/path/to/log/file";   # a path, not a file handle
	@m = mce_grep_f { /pattern/ } $LOG_PATH;                 ##  0.112 secs
	@m = mce_grep_f { /foobar|[2357][1468][9]/ } $LOG_PATH;  ##  6.840 secs

PARSING	HUGE FILES
       MCE::Grep cannot quickly determine whether a chunk contains a match,
       because it does not know the pattern inside the code block. Use the
       following snippet as a template to achieve better performance. Also,
       take a look at examples/egrep.pl, included with the distribution.

	use MCE::Loop;

	MCE::Loop->init(
	   max_workers => 8, use_slurpio => 1
	);

	my $pattern  = 'karl';
	my $hugefile = 'very_huge.file';

	my @result = mce_loop_f	{
	   my ($mce, $slurp_ref, $chunk_id) = @_;

	   ## Quickly determine	if a match is found.
	   ## Process slurped chunk only if true.

	   if ($$slurp_ref =~ /$pattern/m) {
	      my @matches;

	      if ($^O ne 'MSWin32') {
	         ## Fast on Unix: read the slurped chunk line by line
	         ## through an in-memory file handle.

	         open my $MEM_FH, '<', $slurp_ref or die "open error: $!\n";
	         binmode $MEM_FH, ':raw';
	         while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
	         close   $MEM_FH;
	      }
	      else {
	         ## On Windows, the in-memory handle degrades drastically
	         ## beyond 4 workers. Scan the chunk with a regex instead.

	         while ( $$slurp_ref =~ /([^\n]+\n)/mg ) {
	            my $line = $1; # save $1 to not lose the value
	            push @matches, $line if ($line =~ /$pattern/);
	         }
	      }

	      ## Gather	matched	lines.

	      MCE->gather(@matches);
	   }

	} $hugefile;

	print join('', @result);

OVERRIDING DEFAULTS
       The following lists the options that may be overridden when loading
       the module.

	use Sereal qw( encode_sereal decode_sereal );
	use CBOR::XS qw( encode_cbor decode_cbor );
	use JSON::XS qw( encode_json decode_json );

	use MCE::Grep
	    max_workers	=> 4,		     # Default 'auto'
	    chunk_size => 100,		     # Default 'auto'
	    tmp_dir => "/path/to/app/tmp",   # $MCE::Signal::tmp_dir
	    freeze => \&encode_sereal,	     # \&Storable::freeze
	    thaw => \&decode_sereal	     # \&Storable::thaw
	;

       From MCE	1.8 onwards, Sereal 3.015+ is loaded automatically if
       available.  Specify "Sereal => 0" to use	Storable instead.

	use MCE::Grep Sereal =>	0;

CUSTOMIZING MCE
       MCE::Grep->init ( options )
       MCE::Grep::init { options }

       The init	function accepts a hash	of MCE options.	The gather option, if
       specified, is ignored due to being used internally by the module.

	use MCE::Grep;

	MCE::Grep->init(
	   chunk_size => 1, max_workers	=> 4,

	   user_begin => sub {
	      print "##	", MCE->wid, " started\n";
	   },

	   user_end => sub {
	      print "##	", MCE->wid, " completed\n";
	   }
	);

	my @a =	mce_grep { $_ %	5 == 0 } 1..100;

	print "\n", "@a", "\n";

	-- Output

	## 2 started
	## 3 started
	## 1 started
	## 4 started
	## 3 completed
	## 4 completed
	## 1 completed
	## 2 completed

	5 10 15	20 25 30 35 40 45 50 55	60 65 70 75 80 85 90 95	100

API DOCUMENTATION
       MCE::Grep->run (	sub { code }, list )
       mce_grep	{ code } list

       Input data may be defined using a list or an array reference. Unlike
       MCE::Loop, MCE::Flow, and MCE::Step, a hash reference is not allowed
       as input data.

	## Array or array_ref
	my @a =	mce_grep { /[2357]/ } 1..1000;
	my @b =	mce_grep { /[2357]/ } \@list;

	## Important: pass an array_ref for deeply nested input data
	my @c =	mce_grep { $_->[1] =~ /[2357]/ } [ [ 0,	1 ], [ 0, 2 ], ... ];
	my @d =	mce_grep { $_->[1] =~ /[2357]/ } \@deeply_list;

	## Not supported
	my @z =	mce_grep { ... } \%hash;
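
       One possible workaround, sketched here under assumptions, is to
       flatten the hash into an array reference of [ key, value ] pairs,
       which has the same shape as the deeply nested input shown above. The
       hash %hits and its contents are made up for illustration, and the
       native grep stands in for mce_grep so that the sketch runs without
       MCE installed.

```perl
use strict;
use warnings;

# Illustrative data only.
my %hits = ( alpha => 3, beta => 8, gamma => 12 );

# Flatten the hash into [ key, value ] pairs. Sorting the keys makes
# the input order deterministic.
my @pairs = map { [ $_, $hits{$_} ] } sort keys %hits;

# Native grep as a stand-in. The same block reads $_->[1], just as in
# the nested array_ref examples.
my @even  = grep { $_->[1] % 2 == 0 } @pairs;
my @names = map  { $_->[0] } @even;

print "@names\n";
```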

       MCE::Grep->run_file ( sub { code	}, file	)
       mce_grep_f { code } file

       The fastest of these is passing a file path. Workers communicate the
       next offset position among themselves with zero interaction by the
       manager process.

       "IO::All" { File, Pipe, STDIO } is supported since MCE 1.845.

	my @c =	mce_grep_f { /pattern/ } "/path/to/file";  # faster
	my @d =	mce_grep_f { /pattern/ } $file_handle;
	my @e =	mce_grep_f { /pattern/ } $io;		   # IO::All
	my @f =	mce_grep_f { /pattern/ } \$scalar;

       MCE::Grep->run_seq ( sub	{ code }, $beg,	$end [,	$step, $fmt ] )
       mce_grep_s { code } $beg, $end [, $step,	$fmt ]

       Sequence	may be defined as a list, an array reference, or a hash
       reference.  The functions require both begin and	end values to run.
       Step and	format are optional. The format	is passed to sprintf (%	may be
       omitted below).

	my ($beg, $end,	$step, $fmt) = (10, 20,	0.1, "%4.1f");

	my @f =	mce_grep_s { /[1234]\.[5678]/ }	$beg, $end, $step, $fmt;
	my @g =	mce_grep_s { /[1234]\.[5678]/ }	[ $beg,	$end, $step, $fmt ];

	my @h =	mce_grep_s { /[1234]\.[5678]/ }	{
	   begin => $beg, end => $end,
	   step	=> $step, format => $fmt
	};
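
       To see what the code block matches against when a format is given,
       the formatted values can be reproduced in plain Perl. This is a
       sketch only and does not show how MCE generates the sequence
       internally; it assumes each value is passed through sprintf with the
       given format, as stated above, and it keeps the leading % because it
       calls sprintf directly.

```perl
use strict;
use warnings;

my ($beg, $end, $step, $fmt) = (10, 20, 0.1, "%4.1f");

# Multiply by the index rather than adding 0.1 repeatedly, so that
# floating-point rounding errors do not accumulate.
my $count = int( ($end - $beg) / $step + 0.5 );
my @seq   = map { sprintf($fmt, $beg + $_ * $step) } 0 .. $count;

# The pattern needs a digit 1-4 before the dot and 5-8 after it,
# so it selects 11.5 .. 11.8, 12.5 .. 12.8, and so on up to 14.8.
my @hits = grep { /[1234]\.[5678]/ } @seq;

print scalar(@hits), " matches: $hits[0] .. $hits[-1]\n";
```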

       MCE::Grep->run (	sub { code }, iterator )
       mce_grep	{ code } iterator

       An iterator reference may be specified for input_data. Iterators	are
       described under section "SYNTAX for INPUT_DATA" at MCE::Core.

	my @a =	mce_grep { $_ %	3 == 0 } make_iterator(10, 30, 2);
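
       The make_iterator function above is not defined in this document. A
       minimal sketch of one possible definition follows. It assumes the
       simple iterator protocol in which the closure returns the next item
       on each call and nothing once the input is exhausted; see MCE::Core
       for the authoritative description. The argument names first, last,
       and step are assumptions. The closure is driven by hand here, with
       the native grep as a stand-in, so that the sketch runs without MCE.

```perl
use strict;
use warnings;

# Hypothetical definition: returns a closure yielding $first,
# $first + $step, ... up to and including $last.
sub make_iterator {
    my ($first, $last, $step) = @_;
    my $next = $first;

    return sub {
        return if $next > $last;   # exhausted: return nothing
        my $current = $next;
        $next += $step;
        return $current;
    };
}

# Drain the iterator by hand, then filter the collected values.
my $iter = make_iterator(10, 30, 2);
my @seen;
while (defined(my $n = $iter->())) {
    push @seen, $n;
}

my @multiples = grep { $_ % 3 == 0 } @seen;
print "@multiples\n";
```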

MANUAL SHUTDOWN
       MCE::Grep->finish
       MCE::Grep::finish

       Workers remain persistent as much as possible after running. Shutdown
       occurs automatically when the script terminates.	Call finish when
       workers are no longer needed.

	use MCE::Grep;

	MCE::Grep->init(
	   chunk_size => 20, max_workers => 'auto'
	);

	my @a =	mce_grep { ... } 1..100;

	MCE::Grep->finish;

INDEX
       MCE, MCE::Core

AUTHOR
       Mario E. Roy, <marioeroy AT gmail DOT com>

perl v5.32.0			  2020-08-18			  MCE::Grep(3)
