rwdedupe(1)			SiLK Tool Suite			   rwdedupe(1)

NAME
       rwdedupe	- Eliminate duplicate SiLK Flow	records

SYNOPSIS
	 rwdedupe [--ignore-fields=FIELDS] [--packets-delta=NUM]
	       [--bytes-delta=NUM] [--stime-delta=NUM] [--duration-delta=NUM]
	       [--temp-directory=DIR_PATH] [--buffer-size=SIZE]
	       [--note-add=TEXT] [--note-file-add=FILE]
	       [--compression-method=COMP_METHOD] [--print-filenames]
	       [--output-path=PATH] [--site-config-file=FILENAME]
	       {[--xargs] | [--xargs=FILENAME] | [FILE [FILE ...]]}

	 rwdedupe --help

	 rwdedupe --help-fields

	 rwdedupe --version

DESCRIPTION
       rwdedupe	reads SiLK Flow	records	from one or more input sources.
       Records that appear in the input	file(s)	multiple times will only
       appear in the output stream once; that is, duplicate records are	not
       written to the output.  The SiLK	Flows are written to the file
       specified by the	--output-path switch or	to the standard	output when
       the --output-path switch	is not provided	and the	standard output	is not
       connected to a terminal.

       Note: As	part of	its processing,	rwdedupe re-orders the records before
       writing them.

       rwdedupe	reads SiLK Flow	records	from the files named on	the command
       line or from the	standard input when no file names are specified	and
       --xargs is not present.	To read	the standard input in addition to the
       named files, use	"-" or "stdin" as a file name.	If an input file name
       ends in ".gz", the file is uncompressed as it is	read.  When the
       --xargs switch is provided, rwdedupe reads the names of the files to
       process from the	named text file	or from	the standard input if no file
       name argument is	provided to the	switch.	 The input to --xargs must
       contain one file	name per line.

       By default, rwdedupe will consider one record to be a duplicate of
       another when all the fields in the records match exactly.  From another
       point of view, any difference between two records results in both
       records appearing in the output.  Note that "all" means every field
       that exists on a SiLK Flow record.  The complete list of fields is
       specified in the description of --ignore-fields in the "OPTIONS"
       section below.

       To have rwdedupe	ignore fields in the comparison, specify those fields
       in the --ignore-fields switch.  When --ignore-fields=FIELDS is
       specified, a record is considered a duplicate of	another	if all fields
       except those in FIELDS match exactly.  rwdedupe will treat FIELDS as
       being identical across all records.  Put	another	way, if	the only
       difference between two records is in the	FIELDS fields, only one	of
       those records will be written to	the output.

       The --packets-delta, --bytes-delta, --stime-delta and --duration-delta
       switches	allow for "fuzziness" in the input.  For example, if
       --stime-delta=NUM is specified and the only difference between two
       records is in the sTime fields, and the fields are within NUM
       milliseconds of each other, only	one record will	be written to the
       output.
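
       The comparison rules above can be modelled with a short sketch.  It is
       an illustration only, not rwdedupe's code: the function name
       is_duplicate, the dictionary record layout, and the sample values are
       assumptions made for this example, and it models only the pairwise
       test between two records.

```python
# Hypothetical model of the duplicate test; not taken from the SiLK sources.
DELTA_FIELDS = ("packets", "bytes", "sTime", "duration")

def is_duplicate(a, b, ignore=(), deltas=None):
    """Return True when records a and b would count as duplicates.

    a, b:   dicts that are assumed to carry the same set of field names.
    ignore: field names left out of the comparison (--ignore-fields).
    deltas: allowed difference per field, e.g. {"sTime": 500} models
            --stime-delta=500 (milliseconds).
    """
    deltas = deltas or {}
    for field in a:
        if field in ignore:
            continue
        if field in DELTA_FIELDS:
            # "Fuzzy" fields match when they differ by NUM or less.
            if abs(a[field] - b[field]) > deltas.get(field, 0):
                return False
        elif a[field] != b[field]:
            return False
    return True

first = {"sIP": "10.0.0.1", "dIP": "10.0.0.2", "packets": 4, "bytes": 240,
         "sTime": 1202083200000, "duration": 1500}
second = dict(first, sTime=first["sTime"] + 300)  # starts 300 ms later

print(is_duplicate(first, second))                         # exact match required
print(is_duplicate(first, second, deltas={"sTime": 500}))  # within 500 ms
print(is_duplicate(first, second, ignore={"sTime"}))       # sTime ignored
```

       With the default delta of 0 the two sample records are distinct; with a
       500 millisecond start-time delta, or with sTime ignored, they are
       treated as the same record.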

       During its processing, rwdedupe will try	to allocate a large (near 2GB)
       in-memory array to hold the records.  (You may use the --buffer-size
       switch to change	this maximum buffer size.)  If more records are	read
       than will fit into memory, the in-core records are temporarily stored
       on disk as described by the --temp-directory switch.  When all records
       have been read, the on-disk files are merged to produce the output.
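
       One common way to implement this kind of bounded-memory processing is
       an external sort followed by a merge.  The sketch below is a minimal
       illustration of that general technique under stated assumptions; the
       text above does not say that rwdedupe sorts its records, and the names
       dedupe_external and max_in_memory, the tuple records, and the use of
       pickle are all hypothetical.

```python
import heapq
import pickle
import tempfile

def dedupe_external(records, max_in_memory, temp_dir=None):
    """Yield each distinct record once, in sorted order.

    records:       iterable of comparable, picklable records.
    max_in_memory: number of records buffered before a sorted run is
                   written to a temporary file (stands in for --buffer-size).
    temp_dir:      directory for the temporary files (--temp-directory).
    """
    runs, buffer = [], []

    def spill():
        # Write the buffered records to disk as one sorted run.
        run = tempfile.TemporaryFile(dir=temp_dir)
        for rec in sorted(buffer):
            pickle.dump(rec, run)
        run.seek(0)
        runs.append(run)
        buffer.clear()

    def read_run(run):
        while True:
            try:
                yield pickle.load(run)
            except EOFError:
                return

    for rec in records:
        buffer.append(rec)
        if len(buffer) >= max_in_memory:
            spill()

    # Merge the sorted runs with the records still in memory; equal
    # records become adjacent, so only the first of each group is kept.
    streams = [read_run(run) for run in runs] + [iter(sorted(buffer))]
    previous = object()  # sentinel that compares unequal to every record
    for rec in heapq.merge(*streams):
        if rec != previous:
            yield rec
            previous = rec
    for run in runs:
        run.close()

flows = [("10.0.0.1", 80), ("10.0.0.2", 22), ("10.0.0.1", 80),
         ("10.0.0.3", 53), ("10.0.0.2", 22)]
print(list(dedupe_external(flows, max_in_memory=2)))
```

       A design like this also shows why the output order can differ from the
       input order: records are emitted in merge order, not arrival order.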

       By default, the temporary files are stored in the /tmp directory.
       Because of the sizes of the temporary files, it is strongly recommended
       that /tmp not be	used as	the temporary directory, and rwdedupe will
       print a warning when /tmp is used.  To modify the temporary directory
       used by rwdedupe, provide the --temp-directory switch, set the
       SILK_TMPDIR environment variable, or set	the TMPDIR environment
       variable.
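
       The precedence among these settings (switch, then SILK_TMPDIR, then
       TMPDIR, then /tmp) can be written as a small lookup.  This is a sketch
       of the documented order only; the function name
       resolve_temp_directory is hypothetical, and treating an empty
       variable as unset is an assumption.

```python
import os

def resolve_temp_directory(switch_value=None, environ=None):
    """Return the directory that would hold the temporary files.

    switch_value: value of --temp-directory, or None when not given.
    environ:      mapping of environment variables (defaults to os.environ).
    """
    env = os.environ if environ is None else environ
    if switch_value:
        return switch_value
    # SILK_TMPDIR overrides TMPDIR; empty values are treated as unset
    # (an assumption of this sketch).
    for name in ("SILK_TMPDIR", "TMPDIR"):
        if env.get(name):
            return env[name]
    return "/tmp"

print(resolve_temp_directory(environ={"TMPDIR": "/var/tmp"}))
```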

OPTIONS
       Option names may	be abbreviated if the abbreviation is unique or	is an
       exact match for an option.  A parameter to an option may	be specified
       as --arg=param or --arg param, though the first form is required	for
       options that take optional parameters.

       --ignore-fields=FIELDS
	   Ignore the fields listed in FIELDS when determining if two flow
	   records are identical; that is, treat FIELDS	as being identical
	   across all flows.  By default, all fields are treated as
	   significant.

	   FIELDS is a comma-separated list of field-names, field-integers,
	   and ranges of field-integers; a range is specified by separating
	   the start and end of	the range with a hyphen	(-).  Field-names are
	   case-insensitive.  Example:

	    --ignore-fields=stime,12-15

	   The supported fields are:

	   sIP,1
	       source IP address

	   dIP,2
	       destination IP address

	   sPort,3
	       source port for TCP and UDP, or equivalent

	   dPort,4
	       destination port	for TCP	and UDP, or equivalent

	   protocol,5
	       IP protocol

	   packets,pkts,6
	       packet count

	   bytes,7
	       byte count

	   flags,8
	       bit-wise	OR of TCP flags	over all packets

	   sTime,9
	       starting	time of	flow (milliseconds resolution)

	   duration,10
	       duration	of flow	(milliseconds resolution)

	   sensor,12
	       name or ID of sensor at the collection point

	   in,13
	       router SNMP input interface or vlanId if	packing	tools were
	       configured to capture it	(see sensor.conf(5))

	   out,14
	       router SNMP output interface or postVlanId

	   nhIP,15
	       router next hop IP

	   class,20,type,21
	       class and type of sensor	at the collection point	(represented
	       internally by a single value)

	   initialFlags,26
	       TCP flags on first packet in the	flow

	   sessionFlags,27
	       bit-wise	OR of TCP flags	over all packets except	the first in
	       the flow

	   attributes,28
	       flow attributes set by flow generator

	   application,29
	       guess as	to the content of the flow.  Some software that
	       generates flow records from packet data,	such as	yaf(1),	will
	       inspect the contents of the packets that	make up	a flow and use
	       traffic signatures to label the content of the flow.  SiLK
	       calls this label	the application; yaf refers to it as the
	       appLabel.  The application is the port number that is
	       traditionally used for that type	of traffic (see	the
	       /etc/services file on most UNIX systems).  For example, traffic
	       that the	flow generator recognizes as FTP will have a value of
	       21, even	if that	traffic	is being routed	through	the standard
	       HTTP/web	port (80).
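
           As an illustration of the FIELDS syntax, the sketch below expands
           a FIELDS string into the field-integers listed above.  The
           function name parse_fields is hypothetical, and rejecting
           integers that are not in the list is an assumption; this page
           does not say how rwdedupe handles them.

```python
# Field names and integers copied from the list above (names lower-cased
# because field-names are case-insensitive).
FIELD_IDS = {
    "sip": 1, "dip": 2, "sport": 3, "dport": 4, "protocol": 5,
    "packets": 6, "pkts": 6, "bytes": 7, "flags": 8, "stime": 9,
    "duration": 10, "sensor": 12, "in": 13, "out": 14, "nhip": 15,
    "class": 20, "type": 21, "initialflags": 26, "sessionflags": 27,
    "attributes": 28, "application": 29,
}
VALID_IDS = set(FIELD_IDS.values())

def parse_fields(spec):
    """Expand a FIELDS string such as "stime,12-15" into sorted integers."""
    ids = set()
    for item in spec.split(","):
        item = item.strip()
        if item.lower() in FIELD_IDS:
            ids.add(FIELD_IDS[item.lower()])
        elif "-" in item:
            # A range of field-integers, start and end inclusive.
            start, end = (int(part) for part in item.split("-", 1))
            ids.update(range(start, end + 1))
        else:
            ids.add(int(item))
    unknown = ids - VALID_IDS
    if unknown:
        # Assumption of this sketch: unlisted integers are an error.
        raise ValueError(f"unsupported field id(s): {sorted(unknown)}")
    return sorted(ids)

print(parse_fields("stime,12-15"))  # [9, 12, 13, 14, 15]
```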

       --packets-delta=NUM
	   Treat the packets field on two records as being the same if the
	   values differ by NUM	packets	or less.  If not specified, the
	   default is 0.

       --bytes-delta=NUM
	   Treat the bytes field on two	records	as being the same if the
	   values differ by NUM	bytes or less.	If not specified, the default
	   is 0.

       --stime-delta=NUM
	   Treat the start-time	field on two records as	being the same if the
	   values differ by NUM	milliseconds or	less.  If not specified, the
	   default is 0.

       --duration-delta=NUM
	   Treat the duration field on two records as being the	same if	the
	   values differ by NUM	milliseconds or	less.  If not specified, the
	   default is 0.

       --temp-directory=DIR_PATH
	   Specify the name of the directory in	which to store data files
	   temporarily when more records have been read than will fit into
	   RAM.	 This switch overrides the directory specified in the
	   SILK_TMPDIR environment variable, which overrides the directory
	   specified in	the TMPDIR variable, which overrides the default,
	   /tmp.

       --buffer-size=SIZE
	   Set the maximum size	of the buffer to use for holding the records,
	   in bytes.  A	larger buffer means fewer temporary files need to be
	   created, reducing the I/O wait times.  The default maximum for this
	   buffer is near 2GB.	The SIZE may be	given as an ordinary integer,
	   or as a real	number followed	by a suffix "K", "M" or	"G", which
	   represents the numerical value multiplied by	1,024 (kilo),
	   1,048,576 (mega), and 1,073,741,824 (giga), respectively.  For
	   example, 1.5K represents 1,536 bytes, or one	and one-half
	   kilobytes.  (This value does	not represent the absolute maximum
	   amount of RAM that rwdedupe will allocate, since additional buffers
	   will	be allocated for reading the input and writing the output.)
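
           The suffix arithmetic can be checked with a few lines of code.
           This is a sketch of the rule as described above, not the parser
           used by rwdedupe; the function name parse_buffer_size is
           hypothetical, and only the upper-case suffixes named in the text
           are accepted.

```python
def parse_buffer_size(text):
    """Convert a SIZE string such as "1.5K" or "2G" to a byte count."""
    multipliers = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = text[-1:]
    if suffix in multipliers:
        # A real number followed by a suffix.
        return int(float(text[:-1]) * multipliers[suffix])
    # An ordinary integer number of bytes.
    return int(text)

print(parse_buffer_size("1.5K"))  # 1536
print(parse_buffer_size("2G"))    # 2147483648
```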

       --output-path=PATH
	   Write the binary SiLK Flow records to PATH, where PATH is a
	   filename, a named pipe, the keyword "stderr"	to write the output to
	   the standard	error, or the keyword "stdout" or "-" to write the
	   output to the standard output.  If PATH names an existing file,
	   rwdedupe exits with an error	unless the SILK_CLOBBER	environment
	   variable is set, in which case PATH is overwritten.	If this	switch
	   is not given, the output is written to the standard output.
	   Attempting to write the binary output to a terminal causes rwdedupe
	   to exit with	an error.

       --note-add=TEXT
	   Add the specified TEXT to the header	of the output file as an
	   annotation.	This switch may	be repeated to add multiple
	   annotations to a file.  To view the annotations, use	the
	   rwfileinfo(1) tool.

       --note-file-add=FILENAME
	   Open	FILENAME and add the contents of that file to the header of
	   the output file as an annotation.	This switch may	be repeated to
	   add multiple	annotations.  Currently	the application	makes no
	   effort to ensure that FILENAME contains text; be careful that you
	   do not attempt to add a SiLK	data file as an	annotation.

       --compression-method=COMP_METHOD
	   Specify the compression library to use when writing output files.
	   If this switch is not given,	the value in the
	   SILK_COMPRESSION_METHOD environment variable	is used	if the value
	   names an available compression method.  When	no compression method
	   is specified, output	to the standard	output or to named pipes is
	   not compressed, and output to files is compressed using the default
	   chosen when SiLK was	compiled.  The valid values for	COMP_METHOD
	   are determined by which external libraries were found when SiLK was
	   compiled.  To see the available compression methods and the default
	   method, use the --help or --version switch.	SiLK can support the
	   following COMP_METHOD values	when the required libraries are
	   available.

	   none
	       Do not compress the output using	an external library.

	   zlib
	       Use the zlib(3) library for compressing the output, and always
	       compress	the output regardless of the destination.  Using zlib
	       produces	the smallest output files at the cost of speed.

	   lzo1x
	       Use the lzo1x algorithm from the	LZO real time compression
	       library for compression,	and always compress the	output
	       regardless of the destination.  This compression	provides good
	       compression with	less memory and	CPU overhead.

	   snappy
	       Use the snappy library for compression, and always compress the
	       output regardless of the	destination.  This compression
	       provides	good compression with less memory and CPU overhead.
	       Since SiLK 3.13.0.

	   best
	       Use lzo1x if available, otherwise use snappy if available,
	       otherwise use zlib if available.	 Only compress the output when
	       writing to a file.

       --print-filenames
	   Print to the	standard error the names of input files	as they	are
	   opened.

       --site-config-file=FILENAME
	   Read	the SiLK site configuration from the named file	FILENAME.
	   When	this switch is not provided, rwdedupe searches for the site
	   configuration file in the locations specified in the	"FILES"
	   section.

       --xargs
       --xargs=FILENAME
	   Read	the names of the input files from FILENAME or from the
	   standard input if FILENAME is not provided.	The input is expected
	   to have one filename	per line.  rwdedupe opens each named file in
	   turn	and reads records from it as if	the filenames had been listed
	   on the command line.

       --help
	   Print the available options and exit.

       --help-fields
	   Print the description and alias(es) of each field and exit.

       --version
	   Print the version number and	information about how SiLK was
	   configured, then exit the application.

LIMITATIONS
       When the	temporary files	and the	final output are stored	on the same
       file volume, rwdedupe will require approximately	twice as much free
       disk space as the size of input data.

       When the	temporary files	and the	final output are on different volumes,
       rwdedupe	will require between 1 and 1.5 times as	much free space	on the
       temporary volume	as the size of the input data.

EXAMPLE
       In the following	examples, the dollar sign ("$")	represents the shell
       prompt.	The text after the dollar sign represents the command line.

       Suppose you have	made several rwfilter(1) runs to find interesting
       traffic:

	$ rwfilter --start-date=2008/02/04 ... --pass=data1.rw
	$ rwfilter --start-date=2008/02/04 ... --pass=data2.rw
	$ rwfilter --start-date=2008/02/04 ... --pass=data3.rw
	$ rwfilter --start-date=2008/02/04 ... --pass=data4.rw

       You now want to merge that traffic into a single	output file, but you
       want to ensure that any records appearing in multiple output files are
       only counted once.  You can use rwdedupe	to merge the output files to a
       single file, data.rw:

	$ rwdedupe data1.rw data2.rw data3.rw data4.rw --output=data.rw

ENVIRONMENT
       SILK_TMPDIR
	   When	set and	--temp-directory is not	specified, rwdedupe writes the
	   temporary files it creates to this directory.  SILK_TMPDIR
	   overrides the value of TMPDIR.

       TMPDIR
	   When	set and	SILK_TMPDIR is not set,	rwdedupe writes	the temporary
	   files it creates to this directory.

       SILK_CLOBBER
	   The SiLK tools normally refuse to overwrite existing	files.
	   Setting SILK_CLOBBER	to a non-empty value removes this restriction.

       SILK_COMPRESSION_METHOD
	   This	environment variable is	used as	the value for
	   --compression-method	when that switch is not	provided.  Since SiLK
	   3.13.0.

       SILK_CONFIG_FILE
	   This environment variable is used as the value for
	   --site-config-file when that switch is not provided.

       SILK_DATA_ROOTDIR
	   This environment variable specifies the root directory of the data
	   repository.	As described in	the "FILES" section, rwdedupe may use
	   this	environment variable when searching for	the SiLK site
	   configuration file.

       SILK_PATH
	   This	environment variable gives the root of the install tree.  When
	   searching for configuration files, rwdedupe may use this
	   environment variable.  See the "FILES" section for details.

       SILK_TEMPFILE_DEBUG
	   When	set to 1, rwdedupe prints debugging messages to	the standard
	   error as it creates,	re-opens, and removes temporary	files.

FILES
       ${SILK_CONFIG_FILE}
       ${SILK_DATA_ROOTDIR}/silk.conf
       /data/silk.conf
       ${SILK_PATH}/share/silk/silk.conf
       ${SILK_PATH}/share/silk.conf
       /usr/local/share/silk/silk.conf
       /usr/local/share/silk.conf
	   Possible locations for the SiLK site	configuration file which are
	   checked when	the --site-config-file switch is not provided.

       ${SILK_TMPDIR}/
       ${TMPDIR}/
       /tmp/
	   Directory in	which to create	temporary files.

SEE ALSO
       rwfilter(1), rwfileinfo(1), sensor.conf(5), silk(7), yaf(1), zlib(3)

SiLK 3.19.1			  2021-02-28			   rwdedupe(1)
