rwsplit(1)			SiLK Tool Suite			    rwsplit(1)

NAME
       rwsplit - Divide	a SiLK file into a (sampled) collection	of subfiles

SYNOPSIS
	 rwsplit --basename=BASENAME
	       { --ip-limit=LIMIT | --flow-limit=LIMIT
		 | --packet-limit=LIMIT	| --byte-limit=LIMIT }
	       [--seed=NUMBER] [--sample-ratio=SAMPLE_RATIO]
	       [--file-ratio=FILE_RATIO] [--max-outputs=MAX_OUTPUTS]
	       [--note-add=TEXT] [--note-file-add=FILE]
	       [--compression-method=COMP_METHOD]
	       [--print-filenames] [--site-config-file=FILENAME]
	       [--xargs[=FILE] | FILE [FILES...]]

	 rwsplit --help

	 rwsplit --version

DESCRIPTION
       rwsplit reads SiLK Flow records from the	standard input or from files
       named on	the command line and writes the	flows into a set of subfiles
       based on	the splitting criterion.  In its simplest form,	rwsplit
       partitions the file, meaning that each input flow will appear in	one
       (and only one) of the subfiles.

       In addition to splitting	the file, rwsplit can generate files
       containing sample flows.	 Sampling is specified by using	the
       --sample-ratio and --file-ratio switches.

       rwsplit reads SiLK Flow records from the	files named on the command
       line or from the	standard input when no file names are specified	and
       --xargs is not present.	To read	the standard input in addition to the
       named files, use	"-" or "stdin" as a file name.	If an input file name
       ends in ".gz", the file is uncompressed as it is	read.  When the
       --xargs switch is provided, rwsplit reads the names of the files	to
       process from the	named text file	or from	the standard input if no file
       name argument is	provided to the	switch.	 The input to --xargs must
       contain one file	name per line.

       If you wish to use the size of the output files as the splitting
       criterion, use the --flow-limit switch.  The parameter to this switch
       should be the size of the desired output	files divided by the record
       size.  The record size can be determined	by rwfileinfo(1).  When	the
       output files are	compressed (see	the description	of
       --compression-method below), you	should assume about a 50% compression
       ratio.
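
       As a worked illustration of the arithmetic described above, the
       following Python sketch derives a --flow-limit value from a target
       file size.  The function name and the 52-byte record size are
       hypothetical; read the real record size from rwfileinfo(1).  The
       sketch assumes that "50% compression ratio" means the compressed
       file is about half the uncompressed size, and it ignores the file
       header.

```python
def flow_limit_for_size(target_bytes, record_size, compression_ratio=1.0):
    """Estimate a --flow-limit that yields subfiles of about target_bytes.

    compression_ratio is compressed size / uncompressed size: 1.0 for
    uncompressed output, about 0.5 as a rough guess for compressed output.
    """
    if record_size <= 0 or not 0 < compression_ratio <= 1:
        raise ValueError("record_size must be > 0 and ratio in (0, 1]")
    bytes_per_record_on_disk = record_size * compression_ratio
    return int(target_bytes // bytes_per_record_on_disk)


# Hypothetical numbers: 10 MiB target files, 52-byte records.
target = 10 * 1024 * 1024
uncompressed_limit = flow_limit_for_size(target, 52)
compressed_limit = flow_limit_for_size(target, 52, 0.5)
print(uncompressed_limit, compressed_limit)
```

       The resulting number would be passed as the LIMIT of --flow-limit;
       treat it as an estimate, since real compression varies with the data.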

OPTIONS
       Option names may	be abbreviated if the abbreviation is unique or	is an
       exact match for an option.  A parameter to an option may	be specified
       as --arg=param or --arg param, though the first form is required	for
       options that take optional parameters.

       The splitting criterion is defined using	one of the limit specifiers;
       one and only one	must be	specified.  They are:

       --ip-limit=LIMIT
	   Close the current subfile and begin a new subfile when the count of
	   unique source and destination IPs in	the current subfile meets or
	   exceeds LIMIT.  The next-hop-IP does	not count toward LIMIT.

       --flow-limit=LIMIT
	   Close the current subfile and begin a new subfile when the number
	   of SiLK Flow	records	in the current subfile meets LIMIT.

       --packet-limit=LIMIT
	   Close the current subfile and begin a new subfile when the sum of
	   the packet counts across all	SiLK Flow records in the current
	   subfile meets or exceeds LIMIT.

       --byte-limit=LIMIT
	   Close the current subfile and begin a new subfile when the sum of
	   the byte counts across all SiLK Flow	records	in the current subfile
	   meets or exceeds LIMIT.  This switch	does not specify the size of
	   the subfiles.

       The other switches are:

       --basename=BASENAME
	   Specifies the basename of the output	files; this switch is
	   required.  The flows	are written sequentially to a set of subfiles
	   whose names follow the format BASENAME.ORDER.rwf, where ORDER is an
	   8-digit zero-formatted sequence number (i.e., 00000000, 00000001,
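           and so on).  The sequence number begins at zero and increases by
           one for every file written, unless --file-ratio is specified, in
           which case a number is assigned to every candidate subfile but
           only the chosen subfiles are written, leaving gaps in the
           sequence.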

       --seed=NUMBER
	   Use NUMBER to seed the pseudo-random	number generator for the
	   --sample-ratio or --file-ratio switch.  This	can be used to put the
	   random number generator into	a known	state, which is	useful for
	   testing.

       --sample-ratio=SAMPLE_RATIO
	   Writes one flow record, chosen at random, from every	SAMPLE_RATIO
	   flows that are read.

       --file-ratio=FILE_RATIO
           Picks one subfile, chosen at random, out of every FILE_RATIO
	   names generated, for	writing	to disk.

       --max-outputs=MAX_OUTPUTS
           Limits the number of files that are written to disk to
           MAX_OUTPUTS.

       --note-add=TEXT
	   Add the specified TEXT to the header	of the output file as an
	   annotation.	This switch may	be repeated to add multiple
	   annotations to a file.  To view the annotations, use	the
	   rwfileinfo(1) tool.

       --note-file-add=FILENAME
	   Open	FILENAME and add the contents of that file to the header of
	   the output file as an annotation.	This switch may	be repeated to
	   add multiple	annotations.  Currently	the application	makes no
	   effort to ensure that FILENAME contains text; be careful that you
	   do not attempt to add a SiLK	data file as an	annotation.

       --compression-method=COMP_METHOD
	   Specify the compression library to use when writing output files.
	   If this switch is not given,	the value in the
	   SILK_COMPRESSION_METHOD environment variable	is used	if the value
	   names an available compression method.  When	no compression method
	   is specified, the output files are compressed using the default
	   chosen when SiLK was	compiled.  The valid values for	COMP_METHOD
	   are determined by which external libraries were found when SiLK was
	   compiled.  To see the available compression methods and the default
	   method, use the --help or --version switch.	SiLK can support the
	   following COMP_METHOD values	when the required libraries are
	   available.

	   none
	       Do not compress the output using	an external library.

	   zlib
	       Use the zlib(3) library for compressing the output.  Using zlib
	       produces	the smallest output files at the cost of speed.

	   lzo1x
	       Use the lzo1x algorithm from the	LZO real time compression
	       library for compression.	 This compression provides good
	       compression with	less memory and	CPU overhead.

	   snappy
	       Use the snappy library for compression, and always compress the
	       output regardless of the	destination.  This compression
	       provides	good compression with less memory and CPU overhead.
	       Since SiLK 3.13.0.

	   best
	       Use lzo1x if available, otherwise use snappy if available,
	       otherwise use zlib if available.

       --print-filenames
	   Print to the	standard error the names of input files	as they	are
	   opened.

       --site-config-file=FILENAME
	   Read	the SiLK site configuration from the named file	FILENAME.
	   When	this switch is not provided, rwsplit searches for the site
	   configuration file in the locations specified in the	"FILES"
	   section.

       --xargs
       --xargs=FILENAME
	   Read	the names of the input files from FILENAME or from the
	   standard input if FILENAME is not provided.	The input is expected
	   to have one filename	per line.  rwsplit opens each named file in
	   turn	and reads records from it as if	the filenames had been listed
	   on the command line.

       --help
	   Print the available options and exit.

       --version
	   Print the version number and	information about how SiLK was
	   configured, then exit the application.

EXAMPLES
       In the following	examples, the dollar sign ("$")	represents the shell
       prompt.	The text after the dollar sign represents the command line.
       Lines have been wrapped for improved readability, and the backslash
       ("\") is	used to	indicate a wrapped line.

       Assume a	source file source.rwf;	to split that file into	files that
       each contain about 100 unique IP	addresses:

	$ rwsplit --basename=result --ip-limit=100 source.rwf

       To split	source.rwf into	files that each	contain	100 flows:

	$ rwsplit --basename=result --flow-limit=100 source.rwf

       The following causes rwsplit to sample 1	out of every 10	records	from
       source.rwf; i.e., rwsplit will read 1000	flow records to	produce	each
       subfile:

	$ rwsplit --basename=result --flow-limit=100 --sample-ratio=10 source.rwf

       When --file-ratio is specified, the file	names are generated as usual
       (e.g., base-00000000, base-00000001, ...); however, one of these	names
       will be chosen randomly from each set of	--file-ratio candidates, and
       only that file will be written to disk.

	$ rwsplit --basename=result --flow-limit=100 --file-ratio=5 source.rwf
	$ ls
	result-00000002.rwf
	result-00000008.rwf
	result-00000013.rwf
	result-00000016.rwf
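
       The selection described above can be modeled with a short Python
       sketch.  It is an illustration, not rwsplit's implementation: the
       function name is hypothetical, the names follow the listing shown
       above, and Python's generator differs from the one rwsplit uses, so
       a given seed will not reproduce rwsplit's choices.

```python
import random


def kept_names(basename, n_candidates, file_ratio, seed=None):
    """Pick one sequence number at random from each run of file_ratio
    consecutive candidate names and return the names that would be kept."""
    rng = random.Random(seed)
    kept = []
    for start in range(0, n_candidates, file_ratio):
        group = range(start, min(start + file_ratio, n_candidates))
        chosen = rng.choice(group)
        kept.append(f"{basename}-{chosen:08d}.rwf")
    return kept


# 20 candidate subfiles with a file ratio of 5 leave 4 files on disk,
# one from each of the ranges 0-4, 5-9, 10-14, and 15-19.
names = kept_names("result", 20, 5, seed=1)
print("\n".join(names))
```

       The listing above fits this model: 2, 8, 13, and 16 each fall in a
       different run of five sequence numbers.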

LIMITATIONS
       rwsplit can take	exactly	1 partitioning switch per invocation.

       Partitioning is not exact: rwsplit keeps appending flow records to a
       file until it meets or exceeds the specified LIMIT.  For example, if
       you specify --ip-limit=100, rwsplit fills the file until it holds at
       least 100 IP addresses; if the file has 99 addresses and a new record
       with 2 previously unseen addresses arrives, rwsplit puts that record
       in the current file, resulting in a 101-address file.  Similarly, if
       you specify --byte-limit=2000 and rwsplit receives a flow record whose
       byte count is 10 kB, that record is placed in the current subfile.
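
       The "meets or exceeds" behavior can be illustrated with a small
       Python model.  This is a sketch of the rule as described here, not
       rwsplit's code; the function name and the sample addresses are
       hypothetical.

```python
def split_by_ip_limit(flows, limit):
    """Group (source_ip, dest_ip) pairs into subfiles.

    A record is always appended first; the subfile is closed only once
    its count of unique addresses meets or exceeds limit.
    Returns a list of (records, unique_address_count) tuples.
    """
    subfiles, current, seen = [], [], set()
    for sip, dip in flows:
        current.append((sip, dip))
        seen.update((sip, dip))
        if len(seen) >= limit:
            subfiles.append((current, len(seen)))
            current, seen = [], set()
    if current:
        # Whatever remains at end of input forms the last subfile.
        subfiles.append((current, len(seen)))
    return subfiles


# With a limit of 4, the first subfile holds 3 addresses after two
# records; the third record adds 2 more and pushes it to 5.
flows = [("10.0.0.1", "10.0.0.2"), ("10.0.0.1", "10.0.0.3"),
         ("10.0.0.4", "10.0.0.5"), ("10.0.0.6", "10.0.0.7"),
         ("10.0.0.8", "10.0.0.9")]
result = split_by_ip_limit(flows, 4)
unique_counts = [count for _, count in result]
print(unique_counts)
```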

       The switches --sample-ratio, --file-ratio, and --max-outputs are
       processed in that order.	 So, when you specify

        $ rwsplit --basename=result --sample-ratio=10 --ip-limit=100    \
               --file-ratio=10 --max-outputs=20

       rwsplit will pick 1 out of every 10 flow records, write the sampled
       records to a subfile until it has 100 IPs, pick 1 out of every 10
       subfiles to write, and write up to 20 files.  If the input holds 1000
       records, each with 2 unique IPs, then rwsplit will write at most 1
       file: sampling leaves about 100 records with 200 unique IP addresses,
       which fill only 2 candidate subfiles, and neither of them may be the
       one picked from its set of 10.
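
       The arithmetic behind this example can be checked with a few lines
       of Python.  The function below is a hypothetical helper that gives an
       upper bound under a simplifying assumption: no IP address is shared
       between records.

```python
def pipeline_estimate(n_records, ips_per_record, sample_ratio,
                      ip_limit, file_ratio, max_outputs):
    """Apply sampling, splitting, file selection, and the output cap in
    the documented order and return the intermediate counts."""
    sampled = n_records // sample_ratio             # --sample-ratio
    unique_ips = sampled * ips_per_record           # all addresses distinct
    candidates = -(-unique_ips // ip_limit)         # ceiling; --ip-limit
    selected_at_most = -(-candidates // file_ratio)  # ceiling; --file-ratio
    written_at_most = min(selected_at_most, max_outputs)  # --max-outputs
    return sampled, unique_ips, candidates, written_at_most


estimate = pipeline_estimate(1000, 2, sample_ratio=10, ip_limit=100,
                             file_ratio=10, max_outputs=20)
print(estimate)
```

       The last value is only a ceiling on the number of files written; as
       noted above, a partly filled set of candidates may yield no file.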

ENVIRONMENT
       SILK_CLOBBER
	   The SiLK tools normally refuse to overwrite existing	files.
	   Setting SILK_CLOBBER	to a non-empty value removes this restriction.

       SILK_COMPRESSION_METHOD
	   This	environment variable is	used as	the value for
	   --compression-method	when that switch is not	provided.  Since SiLK
	   3.13.0.

       SILK_CONFIG_FILE
           This environment variable is used as the value for
           --site-config-file when that switch is not provided.

       SILK_DATA_ROOTDIR
           This environment variable specifies the root directory of the data
	   repository.	As described in	the "FILES" section, rwsplit may use
	   this	environment variable when searching for	the SiLK site
	   configuration file.

       SILK_PATH
	   This	environment variable gives the root of the install tree.  When
	   searching for configuration files, rwsplit may use this environment
	   variable.  See the "FILES" section for details.

FILES
       ${SILK_CONFIG_FILE}
       ${SILK_DATA_ROOTDIR}/silk.conf
       /data/silk.conf
       ${SILK_PATH}/share/silk/silk.conf
       ${SILK_PATH}/share/silk.conf
       /usr/local/share/silk/silk.conf
       /usr/local/share/silk.conf
	   Possible locations for the SiLK site	configuration file which are
	   checked when	the --site-config-file switch is not provided.
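
       The list above can be expanded for a given environment with the
       following Python sketch.  It assumes that the locations are tried in
       the order listed and that an entry is skipped when its environment
       variable is unset; neither assumption is stated here, and the
       function name is hypothetical.

```python
import os
import posixpath


def site_config_candidates(environ=None):
    """Return the candidate silk.conf paths for the given environment,
    in the order they appear in the FILES section."""
    env = os.environ if environ is None else environ
    candidates = []
    if env.get("SILK_CONFIG_FILE"):
        candidates.append(env["SILK_CONFIG_FILE"])
    if env.get("SILK_DATA_ROOTDIR"):
        candidates.append(
            posixpath.join(env["SILK_DATA_ROOTDIR"], "silk.conf"))
    candidates.append("/data/silk.conf")
    if env.get("SILK_PATH"):
        root = env["SILK_PATH"]
        candidates.append(posixpath.join(root, "share", "silk", "silk.conf"))
        candidates.append(posixpath.join(root, "share", "silk.conf"))
    candidates.append("/usr/local/share/silk/silk.conf")
    candidates.append("/usr/local/share/silk.conf")
    return candidates


# Hypothetical environment with only SILK_PATH set.
paths = site_config_candidates({"SILK_PATH": "/opt/silk"})
print("\n".join(paths))
```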

SEE ALSO
       rwfileinfo(1), silk(7), zlib(3)

SiLK 3.19.1			  2020-08-27			    rwsplit(1)
