ZFS(4)		       FreeBSD Kernel Interfaces Manual			ZFS(4)

     zfs -- tuning of the ZFS kernel module

     The ZFS module supports these parameters:

     dbuf_cache_max_bytes=ULONG_MAX B (ulong)
	     Maximum size in bytes of the dbuf cache.  The target size is the
	     smaller of this value and 1/2^dbuf_cache_shift (1/32nd) of the
	     target ARC size.  The behavior of the dbuf cache and its
	     associated settings can be observed via the
	     /proc/spl/kstat/zfs/dbufstats kstat.

     dbuf_metadata_cache_max_bytes=ULONG_MAX B (ulong)
	     Maximum size in bytes of the metadata dbuf cache.  The target
	     size is the smaller of this value and
	     1/2^dbuf_metadata_cache_shift (1/64th) of the target ARC size.
	     The behavior of the metadata dbuf cache and its associated
	     settings can be observed via the /proc/spl/kstat/zfs/dbufstats
	     kstat.

     dbuf_cache_hiwater_pct=10%	(uint)
	     The percentage over dbuf_cache_max_bytes when dbufs must be
	     evicted directly.

     dbuf_cache_lowater_pct=10%	(uint)
	     The percentage below dbuf_cache_max_bytes when the	evict thread
	     stops evicting dbufs.
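
	     The two watermarks can be illustrated with a small Python sketch.
	     It assumes both percentages are applied to the same cache size
	     target; the function name is hypothetical and not part of ZFS.

```python
def dbuf_cache_watermarks(target_bytes, hiwater_pct=10, lowater_pct=10):
    """Return (lowater_bytes, hiwater_bytes) around the dbuf cache target.

    Assumption: both percentages are relative to the same target size.
    """
    hiwater = target_bytes + target_bytes * hiwater_pct // 100
    lowater = target_bytes - target_bytes * lowater_pct // 100
    return lowater, hiwater


# Hypothetical 512 MiB dbuf cache target.
low, high = dbuf_cache_watermarks(512 * 2**20)
print(f"evict thread stops below {low} bytes; direct eviction above {high} bytes")
```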

     dbuf_cache_shift=5	(int)
	     Set the size of the dbuf cache (dbuf_cache_max_bytes) to a	log2
	     fraction of the target ARC	size.

     dbuf_metadata_cache_shift=6 (int)
	     Set the size of the dbuf metadata cache
	     (dbuf_metadata_cache_max_bytes) to	a log2 fraction	of the target
	     ARC size.
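
	     As a rough illustration of the two shift parameters, the sketch
	     below computes the target size as the smaller of the explicit cap
	     and the log2 fraction of the ARC target.  The function name and
	     the 16 GiB ARC target are assumptions chosen for the example.

```python
ULONG_MAX = 2**64 - 1  # assumes a 64-bit unsigned long


def dbuf_cache_target(arc_target_bytes, max_bytes=ULONG_MAX, shift=5):
    """Smaller of the explicit cap and 1/2^shift of the ARC target size."""
    return min(max_bytes, arc_target_bytes >> shift)


arc_target = 16 * 2**30  # hypothetical 16 GiB ARC target
print(dbuf_cache_target(arc_target))           # dbuf cache, shift=5 (1/32nd)
print(dbuf_cache_target(arc_target, shift=6))  # metadata dbuf cache, shift=6 (1/64th)
```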

     dmu_object_alloc_chunk_shift=7 (128) (int)
	     dnode slots allocated in a	single operation as a power of 2.  The
	     default value minimizes lock contention for the bulk operation

     dmu_prefetch_max=134217728B (128MB) (int)
	     Limit the amount we can prefetch with one call to this amount in
	     bytes.  This helps	to limit the amount of memory that can be used
	     by	prefetching.

     ignore_hole_birth (int)
	     Alias for send_holes_without_birth_time.

     l2arc_feed_again=1|0 (int)
	     Turbo L2ARC warm-up.  When	the L2ARC is cold the fill interval
	     will be set as fast as possible.

     l2arc_feed_min_ms=200 (ulong)
	     Min feed interval in milliseconds.  Requires l2arc_feed_again=1
	     and is only applicable in related situations.

     l2arc_feed_secs=1 (ulong)
	     Seconds between L2ARC writing.

     l2arc_headroom=2 (ulong)
	     How far through the ARC lists to search for L2ARC cacheable con-
	     tent, expressed as	a multiplier of	l2arc_write_max.  ARC persis-
	     tence across reboots can be achieved with persistent L2ARC	by
	     setting this parameter to 0, allowing the full length of ARC
	     lists to be searched for cacheable	content.

     l2arc_headroom_boost=200% (ulong)
	     Scales l2arc_headroom by this percentage when L2ARC contents are
	     being successfully	compressed before writing.  A value of 100
	     disables this feature.
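
	     The interaction of l2arc_write_max, l2arc_headroom, and
	     l2arc_headroom_boost can be sketched as follows.  This is a
	     simplified model of the description above, not the kernel code;
	     the function name and the "compressing" flag are hypothetical.

```python
def l2arc_scan_limit(write_max=8 * 2**20, headroom=2,
                     headroom_boost=200, compressing=False):
    """Bytes of each ARC list to scan, or None to scan the full list."""
    if headroom == 0:
        return None  # a headroom of 0 means the whole list is searched
    limit = write_max * headroom
    if compressing:
        # headroom_boost is a percentage; 100 leaves the limit unchanged
        limit = limit * headroom_boost // 100
    return limit


print(l2arc_scan_limit())                  # defaults, no compression boost
print(l2arc_scan_limit(compressing=True))  # boosted by 200%
```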

     l2arc_mfuonly=0|1 (int)
	     Controls whether only MFU metadata	and data are cached from ARC
	     into L2ARC.  This may be desired to avoid wasting space on	L2ARC
	     when reading/writing large	amounts	of data	that are not expected
	     to	be accessed more than once.

	     The default is off, meaning both MRU and MFU data and metadata
	     are cached.  When turning off this	feature, some MRU buffers will
	     still be present in ARC and eventually cached on L2ARC.  If
	     l2arc_noprefetch=0, some prefetched buffers will be cached	to
	     L2ARC, and	those might later transition to	MRU, in	which case the
	     l2arc_mru_asize arcstat will not be 0.

	     Regardless	of l2arc_noprefetch, some MFU buffers might be evicted
	     from ARC, accessed	later on as prefetches and transition to MRU
	     as	prefetches.  If	accessed again they are	counted	as MRU and the
	     l2arc_mru_asize arcstat will not be 0.

	     The ARC status of L2ARC buffers when they were first cached in
	     L2ARC can be seen in the l2arc_mru_asize, l2arc_mfu_asize,	and
	     l2arc_prefetch_asize arcstats when	importing the pool or onlining
	     a cache device if persistent L2ARC	is enabled.

	     The evict_l2_eligible_mru arcstat does not	take into account if
	     this option is enabled as the information provided	by the
	     evict_l2_eligible_m[rf]u arcstats can be used to decide if	tog-
	     gling this	option is appropriate for the current workload.

     l2arc_meta_percent=33% (int)
	     Percent of	ARC size allowed for L2ARC-only	headers.  Since	L2ARC
	     buffers are not evicted on	memory pressure, too many headers on a
	     system with an irrationally large L2ARC can render	it slow	or un-
	     usable.  This parameter limits L2ARC writes and rebuilds to
	     achieve the target.

     l2arc_trim_ahead=0% (ulong)
	     Trims ahead of the	current	write size (l2arc_write_max) on	L2ARC
	     devices by	this percentage	of write size if we have filled	the
	     device.  If set to	100 we TRIM twice the space required to	accom-
	     modate upcoming writes.  A	minimum	of 64MB	will be	trimmed.  It
	     also enables TRIM of the whole L2ARC device upon creation or ad-
	     dition to an existing pool	or if the header of the	device is in-
	     valid upon	importing a pool or onlining a cache device.  A	value
	     of	0 disables TRIM	on L2ARC altogether and	is the default as it
	     can put significant stress	on the underlying storage devices.
	     This will vary depending on how well the specific device handles
	     these commands.

     l2arc_noprefetch=1|0 (int)
	     Do	not write buffers to L2ARC if they were	prefetched but not
	     used by applications.  In case there are prefetched buffers in
	     L2ARC and this option is later set, we do not read	the prefetched
	     buffers from L2ARC.  Unsetting this option	is useful for caching
	     sequential	reads from the disks to	L2ARC and serve	those reads
	     from L2ARC	later on.  This	may be beneficial in case the L2ARC
	     device is significantly faster in sequential reads	than the disks
	     of	the pool.

	     Use 1 to disable and 0 to enable caching/reading prefetches
	     to/from L2ARC.

     l2arc_norw=0|1 (int)
	     No	reads during writes.

     l2arc_write_boost=8388608B	(8MB) (ulong)
	     Cold L2ARC	devices	will have l2arc_write_max increased by this
	     amount while they remain cold.

     l2arc_write_max=8388608B (8MB) (ulong)
	     Max write bytes per interval.

     l2arc_rebuild_enabled=1|0 (int)
	     Rebuild the L2ARC when importing a	pool (persistent L2ARC).  This
	     can be disabled if	there are problems importing a pool or attach-
	     ing an L2ARC device (e.g. the L2ARC device	is slow	in reading
	     stored log	metadata, or the metadata has become somehow frag-

     l2arc_rebuild_blocks_min_l2size=1073741824B (1GB) (ulong)
	     Minimum size of an L2ARC device required in order to write log
	     blocks in it.  The	log blocks are used upon importing the pool to
	     rebuild the persistent L2ARC.

	     For L2ARC devices less than 1GB, the amount of data l2arc_evict()
	     evicts is significant compared to the amount of restored L2ARC
	     data.  In this case, do not write log blocks in L2ARC in order
	     not to waste space.

     metaslab_aliquot=524288B (512kB) (ulong)
	     Metaslab granularity, in bytes.  This is roughly similar to what
	     would be referred to as the "stripe size" in traditional RAID ar-
	     rays.  In normal operation, ZFS will try to write this amount of
	     data to a top-level vdev before moving on to the next one.

     metaslab_bias_enabled=1|0 (int)
	     Enable metaslab group biasing based on their vdevs' over- or un-
	     der-utilization relative to the pool.

     metaslab_force_ganging=16777217B (16MB + 1B) (ulong)
	     Make some blocks above a certain size be gang blocks.  This op-
	     tion is used by the test suite to facilitate testing.

     zfs_history_output_max=1048576B (1MB) (int)
	     When attempting to	log an output nvlist of	an ioctl in the	on-
	     disk history, the output will not be stored if it is larger than
	     this size (in bytes).  This must be less than DMU_MAX_ACCESS
	     (64MB).  This applies primarily to	zfs_ioc_channel_program() (cf.

     zfs_keep_log_spacemaps_at_export=0|1 (int)
	     Prevent log spacemaps from	being destroyed	during pool exports
	     and destroys.

     zfs_metaslab_segment_weight_enabled=1|0 (int)
	     Enable/disable segment-based metaslab selection.

     zfs_metaslab_switch_threshold=2 (int)
	     When using	segment-based metaslab selection, continue allocating
	     from the active metaslab until this option's worth	of buckets
	     have been exhausted.

     metaslab_debug_load=0|1 (int)
	     Load all metaslabs	during pool import.

     metaslab_debug_unload=0|1 (int)
	     Prevent metaslabs from being unloaded.

     metaslab_fragmentation_factor_enabled=1|0 (int)
	     Enable use	of the fragmentation metric in computing metaslab

     metaslab_df_max_search=16777216B (16MB) (int)
	     Maximum distance to search	forward	from the last offset.  Without
	     this limit, fragmented pools can see >100,000 iterations and
	     metaslab_block_picker() becomes the performance limiting factor
	     on	high-performance storage.

	     With the default setting of 16MB, we typically see	less than 500
	     iterations, even with very	fragmented ashift=9 pools.  The	maxi-
	     mum number	of iterations possible is metaslab_df_max_search /
	     2^(ashift+1).  With the default setting of	16MB this is 16*1024
	     (with ashift=9) or	2*1024 (with ashift=12).
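
	     The iteration bound quoted above follows directly from the
	     formula.  A short Python check, with a hypothetical function
	     name:

```python
def max_search_iterations(max_search_bytes=16 * 2**20, ashift=9):
    """Upper bound on iterations: metaslab_df_max_search / 2^(ashift+1)."""
    return max_search_bytes // 2 ** (ashift + 1)


for ashift in (9, 12):
    print(ashift, max_search_iterations(ashift=ashift))
```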

     metaslab_df_use_largest_segment=0|1 (int)
	     If	not searching forward (due to metaslab_df_max_search,
	     metaslab_df_free_pct, or metaslab_df_alloc_threshold), this tun-
	     able controls which segment is used.  If set, we will use the
	     largest free segment.  If unset, we will use a segment of at
	     least the requested size.

     zfs_metaslab_max_size_cache_sec=3600s (1h)	(ulong)
	     When we unload a metaslab,	we cache the size of the largest free
	     chunk.  We	use that cached	size to	determine whether or not to
	     load a metaslab for a given allocation.  As more frees accumulate
	     in	that metaslab while it's unloaded, the cached max size becomes
	     less and less accurate.  After a number of	seconds	controlled by
	     this tunable, we stop considering the cached max size and start
	     considering only the histogram instead.

     zfs_metaslab_mem_limit=25%	(int)
	     When we are loading a new metaslab, we check the amount of	memory
	     being used	to store metaslab range	trees.	If it is over a
	     threshold,	we attempt to unload the least recently	used metaslab
	     to	prevent	the system from	clogging all of	its memory with	range
	     trees.  This tunable sets the percentage of total system memory
	     that is the threshold.

     zfs_metaslab_try_hard_before_gang=0|1 (int)
	     If	unset, we will first try normal	allocation.
	     If	that fails then	we will	do a gang allocation.
	     If	that fails then	we will	do a "try hard"	gang allocation.
	     If	that fails then	we will	have a multi-layer gang	block.

	     If	set, we	will first try normal allocation.
	     If	that fails then	we will	do a "try hard"	allocation.
	     If	that fails we will do a	gang allocation.
	     If	that fails we will do a	"try hard" gang	allocation.
	     If	that fails then	we will	have a multi-layer gang	block.
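
	     The two orderings can be summarized in a few lines of Python.
	     The step labels are informal names for the stages listed above,
	     and the function is purely illustrative.

```python
def allocation_attempts(try_hard_before_gang):
    """Ordered list of allocation strategies tried for one allocation."""
    steps = ["normal"]
    if try_hard_before_gang:
        # When set, a "try hard" pass comes before any gang allocation.
        steps.append("try hard")
    steps += ["gang", "try hard gang", "multi-layer gang"]
    return steps


print(allocation_attempts(False))
print(allocation_attempts(True))
```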

     zfs_metaslab_find_max_tries=100 (int)
	     When not trying hard, we only consider this number	of the best
	     metaslabs.	 This improves performance, especially when there are
	     many metaslabs per	vdev and the allocation	can't actually be sat-
	     isfied (so	we would otherwise iterate all metaslabs).

     zfs_vdev_default_ms_count=200 (int)
	     When a vdev is added, target this number of metaslabs per top-
	     level vdev.

     zfs_vdev_default_ms_shift=29 (512MB) (int)
	     Default limit for metaslab	size.

     zfs_vdev_max_auto_ashift=ASHIFT_MAX (16) (ulong)
	     Maximum ashift used when optimizing for logical ->	physical sec-
	     tor size on new top-level vdevs.

     zfs_vdev_min_auto_ashift=ASHIFT_MIN (9) (ulong)
	     Minimum ashift used when creating new top-level vdevs.

     zfs_vdev_min_ms_count=16 (int)
	     Minimum number of metaslabs to create in a	top-level vdev.

     vdev_validate_skip=0|1 (int)
	     Skip label	validation steps during	pool import.  Changing is not
	     recommended unless	you know what you're doing and are recovering
	     a damaged label.

     zfs_vdev_ms_count_limit=131072 (128k) (int)
	     Practical upper limit of total metaslabs per top-level vdev.

     metaslab_preload_enabled=1|0 (int)
	     Enable metaslab group preloading.

     metaslab_lba_weighting_enabled=1|0	(int)
	     Give more weight to metaslabs with	lower LBAs, assuming they have
	     greater bandwidth,	as is typically	the case on a modern constant
	     angular velocity disk drive.

     metaslab_unload_delay=32 (int)
	     After a metaslab is used, we keep it loaded for this many TXGs,
	     to	attempt	to reduce unnecessary reloading.  Note that both this
	     many TXGs and metaslab_unload_delay_ms milliseconds must pass be-
	     fore unloading will occur.

     metaslab_unload_delay_ms=600000ms (10min) (int)
	     After a metaslab is used, we keep it loaded for this many
	     milliseconds, to attempt to reduce unnecessary reloading.  Note
	     that both this many milliseconds and metaslab_unload_delay TXGs
	     must pass before unloading will occur.
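
	     Because both delays must elapse, the unload condition is a
	     logical AND.  A minimal sketch, assuming inclusive comparisons
	     (the exact boundary behavior is an assumption) and a hypothetical
	     function name:

```python
def may_unload(txgs_since_use, ms_since_use,
               unload_delay_txgs=32, unload_delay_ms=600_000):
    """True only once BOTH the TXG delay and the time delay have passed."""
    return (txgs_since_use >= unload_delay_txgs
            and ms_since_use >= unload_delay_ms)


print(may_unload(txgs_since_use=40, ms_since_use=1_000))    # time delay not met
print(may_unload(txgs_since_use=40, ms_since_use=700_000))  # both delays met
```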

     reference_history=3 (int)
	     Maximum reference holders being tracked when
	     reference_tracking_enable is active.

     reference_tracking_enable=0|1 (int)
	     Track reference holders to	refcount_t objects (debug builds

     send_holes_without_birth_time=1|0 (int)
	     When set, the hole_birth optimization will	not be used, and all
	     holes will	always be sent during a	zfs send.  This	is useful if
	     you suspect your datasets are affected by a bug in	hole_birth.

     spa_config_path=/etc/zfs/zpool.cache (charp)
	     SPA config	file.

     spa_asize_inflation=24 (int)
	     Multiplication factor used	to estimate actual disk	consumption
	     from the size of data being written.  The default value is	a
	     worst case	estimate, but lower values may be valid	for a given
	     pool depending on its configuration.  Pool	administrators who un-
	     derstand the factors involved may wish to specify a more realis-
	     tic inflation factor, particularly	if they	operate	close to quota
	     or	capacity limits.

     spa_load_print_vdev_tree=0|1 (int)
	     Whether to	print the vdev tree in the debugging message buffer
	     during pool import.

     spa_load_verify_data=1|0 (int)
	     Whether to	traverse data blocks during an "extreme	rewind"	(-X)

	     An	extreme	rewind import normally performs	a full traversal of
	     all blocks	in the pool for	verification.  If this parameter is
	     unset, the	traversal skips	non-metadata blocks.  It can be	tog-
	     gled once the import has started to stop or start the traversal
	     of	non-metadata blocks.

     spa_load_verify_metadata=1|0 (int)
	     Whether to	traverse blocks	during an "extreme rewind" (-X)	pool

	     An	extreme	rewind import normally performs	a full traversal of
	     all blocks	in the pool for	verification.  If this parameter is
	     unset, the	traversal is not performed.  It	can be toggled once
	     the import	has started to stop or start the traversal.

     spa_load_verify_shift=4 (1/16th) (int)
	     Sets the maximum number of	bytes to consume during	pool import to
	     the log2 fraction of the target ARC size.

     spa_slop_shift=5 (1/32nd) (int)
	     Normally, we don't allow the last 3.125% (1/2^spa_slop_shift) of
	     space in the pool to be consumed.	This ensures that we don't run
	     the pool completely out of	space, due to unaccounted changes
	     (e.g. to the MOS).	 It also limits	the worst-case time to allo-
	     cate space.  If we	have less than this amount of free space, most
	     ZPL operations (e.g. write, create) will return ENOSPC.
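
	     The reserved fraction is a simple shift of the pool size.  The
	     sketch below models only the 1/2^spa_slop_shift fraction; any
	     additional bounds the implementation may apply are not modeled,
	     and the function name is hypothetical.

```python
def slop_space_bytes(pool_bytes, slop_shift=5):
    """Space held back from normal consumption: pool size / 2^slop_shift."""
    return pool_bytes >> slop_shift


pool = 32 * 2**40  # hypothetical 32 TiB pool
print(slop_space_bytes(pool))             # reserved bytes with the default shift
print(100 / 2**5)                         # reserved percentage for shift=5
print(slop_space_bytes(pool, slop_shift=6))
```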

     vdev_removal_max_span=32768B (32kB) (int)
	     During top-level vdev removal, chunks of data are copied from the
	     vdev which	may include free space in order	to trade bandwidth for
	     IOPS.  This parameter determines the maximum span of free space,
	     in	bytes, which will be included as "unnecessary" data in a chunk
	     of	copied data.

	     The default value here was	chosen to align	with
	     zfs_vdev_read_gap_limit, which is a similar concept when doing
	     regular reads (but	there's	no reason it has to be the same).

     vdev_file_logical_ashift=9	(512B) (ulong)
	     Logical ashift for	file-based devices.

     vdev_file_physical_ashift=9 (512B)	(ulong)
	     Physical ashift for file-based devices.

     zap_iterate_prefetch=1|0 (int)
	     If	set, when we start iterating over a ZAP	object,	prefetch the
	     entire object (all	leaf blocks).  However,	this is	limited	by

     zfetch_array_rd_sz=1048576B (1MB) (ulong)
	     If	prefetching is enabled,	disable	prefetching for	reads larger
	     than this size.

     zfetch_max_distance=8388608B (8MB)	(uint)
	     Max bytes to prefetch per stream.

     zfetch_max_idistance=67108864B (64MB) (uint)
	     Max bytes to prefetch indirects for per stream.

     zfetch_max_streams=8 (uint)
	     Max number	of streams per zfetch (prefetch	streams	per file).

     zfetch_min_sec_reap=2 (uint)
	     Min time before an	active prefetch	stream can be reclaimed

     zfs_abd_scatter_enabled=1|0 (int)
	     Enables the ARC to use scatter/gather lists.  When disabled, all
	     allocations are forced to be linear in kernel memory.  Disabling
	     can improve performance in some code paths at the expense of
	     fragmented kernel memory.

     zfs_abd_scatter_max_order=MAX_ORDER-1 (uint)
	     Maximum number of consecutive memory pages	allocated in a single
	     block for scatter/gather lists.

	     The value of MAX_ORDER depends on kernel configuration.

     zfs_abd_scatter_min_size=1536B (1.5kB) (uint)
	     This is the minimum allocation size that will use scatter (page-
	     based) ABDs.  Smaller allocations will use	linear ABDs.

     zfs_arc_dnode_limit=0B (ulong)
	     When the number of bytes consumed by dnodes in the ARC exceeds
	     this number of bytes, try to unpin some of it in response to
	     demand for non-metadata.  This value acts as a ceiling on the
	     amount of dnode metadata, and defaults to 0, which indicates that
	     a percentage of the ARC meta buffers, given by
	     zfs_arc_dnode_limit_percent, may be used for dnodes.

	     Also see zfs_arc_meta_prune which serves a	similar	purpose	but is
	     used when the amount of metadata in the ARC exceeds
	     zfs_arc_meta_limit	rather than in response	to overall demand for

     zfs_arc_dnode_limit_percent=10% (ulong)
	     Percentage	that can be consumed by	dnodes of ARC meta buffers.

	     See also zfs_arc_dnode_limit, which serves	a similar purpose but
	     has a higher priority if nonzero.

     zfs_arc_dnode_reduce_percent=10% (ulong)
	     Percentage	of ARC dnodes to try to	scan in	response to demand for
	     non-metadata when the number of bytes consumed by dnodes exceeds

     zfs_arc_average_blocksize=8192B (8kB) (int)
	     The ARC's buffer hash table is sized based	on the assumption of
	     an	average	block size of this value.  This	works out to roughly
	     1MB of hash table per 1GB of physical memory with 8-byte point-
	     ers.  For configurations with a known larger average block	size,
	     this value	can be increased to reduce the memory footprint.
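
	     The "1MB per 1GB" figure can be reproduced with simple
	     arithmetic: one pointer-sized bucket per expected block.  This is
	     an approximation of the sizing rule described above (any rounding
	     done by the implementation is ignored), with a hypothetical
	     function name.

```python
def arc_hash_table_bytes(physmem_bytes, average_blocksize=8192, pointer_bytes=8):
    """Approximate hash table size: one pointer per average-sized block."""
    buckets = physmem_bytes // average_blocksize
    return buckets * pointer_bytes


print(arc_hash_table_bytes(2**30))                           # 1 GiB, 8 KiB blocks
print(arc_hash_table_bytes(2**30, average_blocksize=65536))  # larger average block
```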

     zfs_arc_eviction_pct=200% (int)
	     When arc_is_overflowing(),	arc_get_data_impl() waits for this
	     percent of	the requested amount of	data to	be evicted.  For exam-
	     ple, by default, for every	2kB that's evicted, 1kB	of it may be
	     "reused" by a new allocation.  Since this is above	100%, it en-
	     sures that	progress is made towards getting arc_size under	arc_c.
	     Since this	is finite, it ensures that allocations can still hap-
	     pen, even during the potentially long time	that arc_size is more
	     than arc_c.
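
	     In other words, each allocation made while the ARC is overflowing
	     waits for a proportionally larger amount of eviction.  A minimal
	     sketch with a hypothetical function name:

```python
def bytes_to_wait_for(requested_bytes, eviction_pct=200):
    """Amount that must be evicted before a request of this size proceeds."""
    return requested_bytes * eviction_pct // 100


# With the default of 200%, a 1 kB request waits for 2 kB of eviction.
print(bytes_to_wait_for(1024))
```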

     zfs_arc_evict_batch_limit=10 (int)
	     Number of ARC headers to evict per sub-list before proceeding to
	     another sub-list.  This batch-style operation prevents entire
	     sub-lists from being evicted at once but comes at a cost of
	     additional unlocking and locking.

     zfs_arc_grow_retry=0s (int)
	     If set to a nonzero value, it will replace the arc_grow_retry
	     value with this value.  The arc_grow_retry value (default 5s) is
	     the number of seconds the ARC will wait before trying to resume
	     growth after a memory pressure event.

     zfs_arc_lotsfree_percent=10% (int)
	     Throttle I/O when free system memory drops	below this percentage
	     of	total system memory.  Setting this value to 0 will disable the

     zfs_arc_max=0B (ulong)
	     Max size of ARC in	bytes.	If 0, then the max size	of ARC is de-
	     termined by the amount of system memory installed.	 Under Linux,
	     half of system memory will	be used	as the limit.  Under FreeBSD,
	     the larger	of all_system_memory - 1GB and 5/8 * all_system_memory
	     will be used as the limit.	 This value must be at least 67108864B

	     This value	can be changed dynamically, with some caveats.	It
	     cannot be set back	to 0 while running, and	reducing it below the
	     current ARC size will not cause the ARC to	shrink without memory
	     pressure to induce	shrinking.
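
	     The platform-dependent default described above (used when
	     zfs_arc_max is 0) can be sketched as follows.  The function name
	     and platform strings are hypothetical, and only the rules stated
	     in this entry are modeled.

```python
GIB = 2**30


def default_arc_max(all_system_memory, platform):
    """Default ARC size limit in bytes when zfs_arc_max is left at 0."""
    if platform == "linux":
        return all_system_memory // 2
    if platform == "freebsd":
        return max(all_system_memory - GIB, all_system_memory * 5 // 8)
    raise ValueError(f"unknown platform: {platform}")


for mem_gib in (2, 16):
    mem = mem_gib * GIB
    print(mem_gib, default_arc_max(mem, "linux"), default_arc_max(mem, "freebsd"))
```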

     zfs_arc_meta_adjust_restarts=4096 (ulong)
	     The number of restart passes to make while scanning the ARC
	     attempting to free buffers in order to stay below
	     zfs_arc_meta_limit.  This value should not need to be tuned but is
	     available to facilitate performance analysis.

     zfs_arc_meta_limit=0B (ulong)
	     The maximum size in bytes that metadata buffers are allowed to
	     consume in the ARC.  When this limit is reached, metadata buffers
	     will be reclaimed, even if the overall arc_c_max has not been
	     reached.  It defaults to 0, which indicates that a percentage
	     based on zfs_arc_meta_limit_percent of the ARC may be used for
	     metadata.

	     This value may be changed dynamically, except that it must be set
	     to an explicit value (cannot be set back to 0).

     zfs_arc_meta_limit_percent=75% (ulong)
	     Percentage	of ARC buffers that can	be used	for metadata.

	     See also zfs_arc_meta_limit, which	serves a similar purpose but
	     has a higher priority if nonzero.
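
	     The "explicit value wins, otherwise use the percentage" rule is
	     shared by the metadata and dnode limits.  The sketch below
	     assumes the metadata percentage is taken of the maximum ARC size
	     and the dnode percentage of the resulting metadata limit; the
	     function name and the 8 GiB figure are hypothetical.

```python
def effective_limit(explicit_bytes, percent, base_bytes):
    """A nonzero explicit limit takes priority over the percentage."""
    if explicit_bytes != 0:
        return explicit_bytes
    return base_bytes * percent // 100


arc_max = 8 * 2**30  # hypothetical 8 GiB ARC maximum
meta_limit = effective_limit(0, 75, arc_max)      # zfs_arc_meta_limit=0
dnode_limit = effective_limit(0, 10, meta_limit)  # zfs_arc_dnode_limit=0
print(meta_limit, dnode_limit)
```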

     zfs_arc_meta_min=0B (ulong)
	     The minimum allowed size in bytes that metadata buffers may con-
	     sume in the ARC.

     zfs_arc_meta_prune=10000 (int)
	     The number	of dentries and	inodes to be scanned looking for en-
	     tries which can be	dropped.  This may be required when the	ARC
	     reaches the zfs_arc_meta_limit because dentries and inodes	can
	     pin buffers in the ARC.  Increasing this value will cause the
	     dentry and inode caches to be pruned more aggressively.  Setting
	     this value to 0 will disable pruning the inode and dentry caches.

     zfs_arc_meta_strategy=1|0 (int)
	     Define the	strategy for ARC metadata buffer eviction (meta	re-
	     claim strategy):
		 0 (META_ONLY)	evict only the ARC metadata buffers
		 1 (BALANCED)	additional data	buffers	may be evicted if re-
				quired to evict	the required number of meta-
				data buffers.

     zfs_arc_min=0B (ulong)
	     Min size of ARC in	bytes.	If set to 0, arc_c_min will default to
	     consuming the larger of 32MB or all_system_memory/32.
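
	     The default minimum can be written as a one-line rule.  A small
	     sketch with a hypothetical function name:

```python
def default_arc_min(all_system_memory):
    """Default arc_c_min when zfs_arc_min=0: larger of 32MB or memory/32."""
    return max(32 * 2**20, all_system_memory // 32)


print(default_arc_min(512 * 2**20))  # small system: the 32MB floor applies
print(default_arc_min(64 * 2**30))   # large system: memory/32 applies
```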

     zfs_arc_min_prefetch_ms=0ms (≡1s) (int)
	     Minimum time prefetched blocks are	locked in the ARC.

     zfs_arc_min_prescient_prefetch_ms=0ms (≡6s) (int)
	     Minimum time "prescient prefetched" blocks	are locked in the ARC.
	     These blocks are meant to be prefetched fairly aggressively ahead
	     of	the code that may use them.

     zfs_arc_prune_task_threads=1 (int)
	     Number of arc_prune threads.  FreeBSD does	not need more than
	     one.  Linux may theoretically use one per mount point up to num-
	     ber of CPUs, but that was not proven to be	useful.

     zfs_max_missing_tvds=0 (int)
	     Number of missing top-level vdevs which will be allowed during
	     pool import (only in read-only mode).

     zfs_max_nvlist_src_size=0 (ulong)
	     Maximum size in bytes allowed to be passed	as zc_nvlist_src_size
	     for ioctls	on /dev/zfs.  This prevents a user from	causing	the
	     kernel to allocate	an excessive amount of memory.	When the limit
	     is	exceeded, the ioctl fails with EINVAL and a description	of the
	     error is sent to the zfs-dbgmsg log.  This	parameter should not
	     need to be	touched	under normal circumstances.  If	0, equivalent
	     to	a quarter of the user-wired memory limit under FreeBSD and to
	     134217728B	(128MB)	under Linux.

     zfs_multilist_num_sublists=0 (int)
	     To allow more fine-grained locking, each ARC state contains a
	     series of lists for both data and metadata objects.  Locking is
	     performed at the level of these "sub-lists".  This parameter
	     controls the number of sub-lists per ARC state, and also applies
	     to other uses of the multilist data structure.

	     If	0, equivalent to the greater of	the number of online CPUs and

     zfs_arc_overflow_shift=8 (int)
	     The ARC size is considered to be overflowing if it exceeds the
	     current ARC target size (arc_c) by thresholds determined by this
	     parameter.  Exceeding by (arc_c >> zfs_arc_overflow_shift) * 0.5
	     starts the ARC reclamation process.  If that appears insufficient,
	     exceeding by (arc_c >> zfs_arc_overflow_shift) * 1.5 blocks new
	     buffer allocation until the reclaim thread catches up.  Once
	     started, the reclamation process continues until the ARC size
	     returns below the target size.

	     The default value of 8 causes the ARC to start reclamation	if it
	     exceeds the target	size by	0.2% of	the target size, and block al-
	     locations by 0.6%.
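
	     The two thresholds and the quoted percentages can be checked
	     with a short calculation.  The function name and the 1 GiB
	     target are assumptions for the example.

```python
def arc_overflow_thresholds(arc_c, overflow_shift=8):
    """Return (reclaim_start_bytes, block_allocations_bytes) above arc_c."""
    step = arc_c >> overflow_shift
    return step // 2, step * 3 // 2


arc_c = 2**30  # hypothetical 1 GiB ARC target
start, block = arc_overflow_thresholds(arc_c)
print(f"reclaim starts at +{100 * start / arc_c:.1f}% of the target")
print(f"allocations block at +{100 * block / arc_c:.1f}% of the target")
```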

     zfs_arc_p_min_shift=0 (int)
	     If nonzero, this will update arc_p_min_shift (default 4) with the
	     new value.  arc_p_min_shift is used as a shift of arc_c when
	     calculating the minimum arc_p size.

     zfs_arc_p_dampener_disable=1|0 (int)
	     Disable arc_p adapt dampener, which reduces the maximum single
	     adjustment	to arc_p.

     zfs_arc_shrink_shift=0 (int)
	     If	nonzero, this will update arc_shrink_shift (default 7) with
	     the new value.

     zfs_arc_pc_percent=0% (off) (uint)
	     Percent of	pagecache to reclaim ARC to.

	     This tunable allows the ZFS ARC to	play more nicely with the ker-
	     nel's LRU pagecache.  It can guarantee that the ARC size won't
	     collapse under scanning pressure on the pagecache,	yet still al-
	     lows the ARC to be	reclaimed down to zfs_arc_min if necessary.
	     This value	is specified as	percent	of pagecache size (as measured
	     by	NR_FILE_PAGES),	where that percent may exceed 100.  This only
	     operates during memory pressure/reclaim.

     zfs_arc_shrinker_limit=10000 (int)
	     This is a limit on	how many pages the ARC shrinker	makes avail-
	     able for eviction in response to one page allocation attempt.
	     Note that in practice, the	kernel's shrinker can ask us to	evict
	     up	to about four times this for one allocation attempt.

	     The default limit of 10000	(in practice, 160MB per	allocation
	     attempt with 4kB pages) limits the	amount of time spent attempt-
	     ing to reclaim ARC	memory to less than 100ms per allocation at-
	     tempt, even with a	small average compressed block size of ~8kB.

	     The parameter can be set to 0 (zero) to disable the limit,	and
	     only applies on Linux.
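
	     The "160MB per allocation attempt" figure comes from multiplying
	     the page limit by the page size and by the roughly fourfold
	     factor mentioned above.  A sketch, with a hypothetical function
	     name and the factor of four taken as an assumption:

```python
def shrinker_cap_bytes(limit_pages=10000, page_bytes=4096, kernel_factor=4):
    """Approximate bytes evictable per allocation attempt, None if unlimited."""
    if limit_pages == 0:
        return None  # 0 disables the limit
    return limit_pages * page_bytes * kernel_factor


cap = shrinker_cap_bytes()
print(cap, f"(about {cap / 1e6:.0f} MB)")
```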

     zfs_arc_sys_free=0B (ulong)
	     The target	number of bytes	the ARC	should leave as	free memory on
	     the system.  If zero, equivalent to the bigger of 512kB and

     zfs_autoimport_disable=1|0	(int)
	     Disable pool import at module load	by ignoring the	cache file

     zfs_checksum_events_per_second=20/s (uint)
	     Rate limit	checksum events	to this	many per second.  Note that
	     this should not be	set below the ZED thresholds (currently	10
	     checksums over 10 seconds)	or else	the daemon may not trigger any

     zfs_commit_timeout_pct=5% (int)
	     This controls the amount of time that a ZIL block (lwb) will re-
	     main "open" when it isn't "full", and it has a thread waiting for
	     it	to be committed	to stable storage.  The	timeout	is scaled
	     based on a	percentage of the last lwb latency to avoid signifi-
	     cantly impacting the latency of each individual transaction
	     record (itx).

     zfs_condense_indirect_commit_entry_delay_ms=0ms (int)
	     Vdev indirection layer (used for device removal) sleeps for this
	     many milliseconds during mapping generation.  Intended for	use
	     with the test suite to throttle vdev removal speed.

     zfs_condense_indirect_obsolete_pct=25% (int)
	     Minimum percent of	obsolete bytes in vdev mapping required	to at-
	     tempt to condense (see zfs_condense_indirect_vdevs_enable).  In-
	     tended for	use with the test suite	to facilitate triggering con-
	     densing as	needed.

     zfs_condense_indirect_vdevs_enable=1|0 (int)
	     Enable condensing indirect	vdev mappings.	When set, attempt to
	     condense indirect vdev mappings if	the mapping uses more than
	     zfs_condense_min_mapping_bytes bytes of memory and	if the obso-
	     lete space	map object uses	more than
	     zfs_condense_max_obsolete_bytes bytes on-disk.  The condensing
	     process is	an attempt to save memory by removing obsolete map-

     zfs_condense_max_obsolete_bytes=1073741824B (1GB) (ulong)
	     Only attempt to condense indirect vdev mappings if	the on-disk
	     size of the obsolete space	map object is greater than this	number
	     of	bytes (see zfs_condense_indirect_vdevs_enable).

     zfs_condense_min_mapping_bytes=131072B (128kB) (ulong)
	     Minimum size vdev mapping to attempt to condense (see

     zfs_dbgmsg_enable=1|0 (int)
	     Internally	ZFS keeps a small log to facilitate debugging.	The
	     log is enabled by default,	and can	be disabled by unsetting this
	     option.  The contents of the log can be accessed by reading
	     /proc/spl/kstat/zfs/dbgmsg.  Writing 0 to the file	clears the

	     This setting does not influence debug prints due to zfs_flags.

     zfs_dbgmsg_maxsize=4194304B (4MB) (int)
	     Maximum size of the internal ZFS debug log.

     zfs_dbuf_state_index=0 (int)
	     Historically used for controlling what reporting was available
	     under /proc/spl/kstat/zfs.	 No effect.

     zfs_deadman_enabled=1|0 (int)
	     When a pool sync operation	takes longer than
	     zfs_deadman_synctime_ms, or when an individual I/O	operation
	     takes longer than zfs_deadman_ziotime_ms, then the	operation is
	     considered	to be "hung".  If zfs_deadman_enabled is set, then the
	     deadman behavior is invoked as described by zfs_deadman_failmode.
	     By	default, the deadman is	enabled	and set	to wait	which results
	     in	"hung" I/Os only being logged.	The deadman is automatically
	     disabled when a pool gets suspended.
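
	     The "hung" test is a threshold comparison whose limit depends on
	     the kind of operation.  The sketch below uses the documented
	     defaults of zfs_deadman_synctime_ms and zfs_deadman_ziotime_ms;
	     the function name and the kind labels are hypothetical.

```python
def is_hung(kind, elapsed_ms, synctime_ms=600_000, ziotime_ms=300_000):
    """True if a pool sync ("sync") or single I/O ("zio") exceeds its limit."""
    if kind == "sync":
        limit = synctime_ms
    elif kind == "zio":
        limit = ziotime_ms
    else:
        raise ValueError(f"unknown operation kind: {kind}")
    return elapsed_ms > limit


print(is_hung("zio", 400_000))   # a single I/O outstanding for 400 s
print(is_hung("sync", 400_000))  # a pool sync running for 400 s
```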

     zfs_deadman_failmode=wait (charp)
	     Controls the failure behavior when	the deadman detects a "hung"
	     I/O operation.  Valid values are:
		 wait	   Wait	for a "hung" operation to complete.  For each
			   "hung" operation a "deadman"	event will be posted
			   describing that operation.
		 continue  Attempt to recover from a "hung" operation by re-
			   dispatching it to the I/O pipeline if possible.
		 panic	   Panic the system.  This can be used to facilitate
			   automatic fail-over to a properly configured	fail-
			   over	partner.

     zfs_deadman_checktime_ms=60000ms (1min) (int)
	     Check time	in milliseconds.  This defines the frequency at	which
	     we	check for hung I/O requests and	potentially invoke the
	     zfs_deadman_failmode behavior.

     zfs_deadman_synctime_ms=600000ms (10min) (ulong)
	     Interval in milliseconds after which the deadman is triggered and
	     also the interval after which a pool sync operation is considered
	     to	be "hung".  Once this limit is exceeded	the deadman will be
	     invoked every zfs_deadman_checktime_ms milliseconds until the
	     pool sync completes.

     zfs_deadman_ziotime_ms=300000ms (5min) (ulong)
	     Interval in milliseconds after which the deadman is triggered and
	     an	individual I/O operation is considered to be "hung".  As long
	     as the operation remains "hung", the deadman will be invoked every
	     zfs_deadman_checktime_ms milliseconds until the operation
	     completes.

     zfs_dedup_prefetch=0|1 (int)
	     Enable prefetching	dedup-ed blocks	which are going	to be freed.

     zfs_delay_min_dirty_percent=60% (int)
	     Start to delay each transaction once there	is this	amount of
	     dirty data, expressed as a	percentage of zfs_dirty_data_max.
	     This value	should be at least
	     zfs_vdev_async_write_active_max_dirty_percent.  See ZFS
	     TRANSACTION DELAY.

     zfs_delay_scale=500000 (int)
	     This controls how quickly the transaction delay approaches	infin-
	     ity.  Larger values cause longer delays for a given amount	of
	     dirty data.

	     For the smoothest delay, this value should	be about 1 billion di-
	     vided by the maximum number of operations per second.  This will
	     smoothly handle between ten times and a tenth of this number.

	     zfs_delay_scale * zfs_dirty_data_max must be smaller than 2^64.
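
	     The rule of thumb and the overflow constraint above can be
	     checked with a few lines of arithmetic.  The following is a
	     minimal sketch, not part of ZFS: the helper names, the assumed
	     peak of 2000 operations per second, and the assumed 4 GiB
	     zfs_dirty_data_max are all illustrative.

```python
# Illustrative arithmetic only; these helper names are hypothetical
# and do not exist in ZFS.
U64_LIMIT = 2**64


def suggested_delay_scale(max_ops_per_second):
    """Rule of thumb from the text: about 1 billion / peak ops per second."""
    return 1_000_000_000 // max_ops_per_second


def product_fits(delay_scale, dirty_data_max):
    """True if zfs_delay_scale * zfs_dirty_data_max stays below 2^64."""
    return delay_scale * dirty_data_max < U64_LIMIT


scale = suggested_delay_scale(2000)        # assumed peak of 2000 ops/s
print(scale)                               # 500000, the documented default
print(product_fits(scale, 4 * 1024**3))    # assumed 4 GiB dirty data limit
```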

     zfs_disable_ivset_guid_check=0|1 (int)
	     Disables requirement for IVset GUIDs to be	present	and match when
	     doing a raw receive of encrypted datasets.	 Intended for users
	     whose pools were created with OpenZFS pre-release versions	and
	     now have compatibility issues.

     zfs_key_max_salt_uses=400000000 (4*10^8) (ulong)
	     Maximum number of uses of a single	salt value before generating a
	     new one for encrypted datasets.  The default value is also the
	     maximum.

     zfs_object_mutex_size=64 (uint)
	     Size of the znode hashtable used for holds.

	     Due to the	need to	hold locks on objects that may not exist yet,
	     kernel mutexes are	not created per-object and instead a hashtable
	     is	used where collisions will result in objects waiting when
	     there is not actually contention on the same object.

     zfs_slow_io_events_per_second=20/s	(int)
	     Rate limit	delay and deadman zevents (which report	slow I/Os) to
	     this many per second.

     zfs_unflushed_max_mem_amt=1073741824B (1GB) (ulong)
	     Upper-bound limit for unflushed metadata changes to be held by
	     the log spacemap in memory, in bytes.

     zfs_unflushed_max_mem_ppm=1000ppm (0.1%) (ulong)
	     Part of overall system memory that	ZFS allows to be used for un-
	     flushed metadata changes by the log spacemap, in millionths.

     zfs_unflushed_log_block_max=262144	(256k) (ulong)
	     Describes the maximum number of log spacemap blocks allowed for
	     each pool.	 The default value means that the space	in all the log
	     spacemaps can add up to no	more than 262144 blocks	(which means
	     32GB of logical space before compression and ditto	blocks,	assum-
	     ing that blocksize	is 128kB).

	     This tunable is important because it involves a trade-off between
	     import time after an unclean export and the frequency of flushing
	     metaslabs.	 The higher this number	is, the	more log blocks	we al-
	     low when the pool is active which means that we flush metaslabs
	     less often	and thus decrease the number of	I/Os for spacemap up-
	     dates per TXG.  At	the same time though, that means that in the
	     event of an unclean export, there will be more log	spacemap
	     blocks for	us to read, inducing overhead in the import time of
	     the pool.  The lower the number, the more flushing occurs,
	     destroying log blocks sooner as they become obsolete faster,
	     which leaves fewer blocks to be read during import time after a
	     crash.

	     Each log spacemap block existing during pool import leads to ap-
	     proximately one extra logical I/O issued.	This is	the reason why
	     this tunable is exposed in terms of blocks rather than space
	     used.
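
	     The 32GB figure quoted for the default follows directly from the
	     block count and the assumed 128kB block size.  A minimal sketch
	     of that arithmetic, using a hypothetical helper name:

```python
# Illustrative arithmetic only; the function name is hypothetical.
def log_spacemap_logical_bytes(block_count, block_size=128 * 1024):
    """Logical space covered by log spacemap blocks, before compression
    and ditto blocks, assuming every block has the given size."""
    return block_count * block_size


total = log_spacemap_logical_bytes(262144)   # default zfs_unflushed_log_block_max
print(total // 1024**3)                      # 32 (GB of logical space)
```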

     zfs_unflushed_log_block_min=1000 (ulong)
	     If	the number of metaslabs	is small and our incoming rate is
	     high, we could get	into a situation that we are flushing all our
	     metaslabs every TXG.  Thus we always allow at least this many log
	     blocks.

     zfs_unflushed_log_block_pct=400% (ulong)
	     Tunable used to determine the number of blocks that can be	used
	     for the spacemap log, expressed as	a percentage of	the total num-
	     ber of metaslabs in the pool.

     zfs_unlink_suspend_progress=0|1 (uint)
	     When enabled, files will not be asynchronously removed from the
	     list of pending unlinks and the space they	consume	will be
	     leaked.  Once this	option has been	disabled and the dataset is
	     remounted,	the pending unlinks will be processed and the freed
	     space returned to the pool.  This option is used by the test
	     suite.

     zfs_delete_blocks=20480 (ulong)
	     This is used to define a large file for the purposes of deletion.
	     Files containing more than zfs_delete_blocks will be
	     deleted asynchronously, while smaller files are deleted syn-
	     chronously.  Decreasing this value	will reduce the	time spent in
	     an	unlink(2) system call, at the expense of a longer delay	before
	     the freed space is	available.

     zfs_dirty_data_max= (int)
	     Determines	the dirty space	limit in bytes.	 Once this limit is
	     exceeded, new writes are halted until space frees up.  This
	     parameter takes precedence over zfs_dirty_data_max_percent.  See
	     ZFS TRANSACTION DELAY.

	     Defaults to physical_ram/10, capped at zfs_dirty_data_max_max.

     zfs_dirty_data_max_max= (int)
	     Maximum allowable value of	zfs_dirty_data_max, expressed in
	     bytes.  This limit	is only	enforced at module load	time, and will
	     be ignored if zfs_dirty_data_max is later changed.  This parameter
	     takes precedence over zfs_dirty_data_max_max_percent.  See ZFS
	     TRANSACTION DELAY.

	     Defaults to physical_ram/4.

     zfs_dirty_data_max_max_percent=25%	(int)
	     Maximum allowable value of	zfs_dirty_data_max, expressed as a
	     percentage	of physical RAM.  This limit is	only enforced at mod-
	     ule load time, and	will be	ignored	if zfs_dirty_data_max is later
	     changed.  The parameter zfs_dirty_data_max_max takes precedence
	     over this one.  See ZFS TRANSACTION DELAY.

     zfs_dirty_data_max_percent=10% (int)
	     Determines	the dirty space	limit, expressed as a percentage of
	     all memory.  Once this limit is exceeded, new writes are halted
	     until space frees up.  The	parameter zfs_dirty_data_max takes
	     precedence	over this one.	See ZFS	TRANSACTION DELAY.

	     Subject to	zfs_dirty_data_max_max.
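
	     The defaults of the four dirty data tunables above relate to
	     each other roughly as follows.  This is a minimal sketch under
	     the stated defaults (10% and 25% of physical RAM); the function
	     name is hypothetical and the kernel module may round or clamp
	     differently.

```python
# Illustrative sketch; not the module's actual initialization code.
def default_dirty_data_limits(physical_ram, max_percent=10, max_max_percent=25):
    """Return (zfs_dirty_data_max, zfs_dirty_data_max_max) in bytes."""
    # Upper bound: zfs_dirty_data_max_max_percent of physical RAM.
    dirty_max_max = physical_ram * max_max_percent // 100
    # Limit: zfs_dirty_data_max_percent of RAM, capped at the upper bound.
    dirty_max = min(physical_ram * max_percent // 100, dirty_max_max)
    return dirty_max, dirty_max_max


ram = 16 * 1024**3                        # assumed 16 GiB of physical RAM
dirty_max, dirty_max_max = default_dirty_data_limits(ram)
print(dirty_max, dirty_max_max)
```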

     zfs_dirty_data_sync_percent=20% (int)
	     Start syncing out a transaction group if there's at least this
	     much dirty	data (as a percentage of zfs_dirty_data_max).  This
	     should be less than

     zfs_fallocate_reserve_percent=110%	(uint)
	     Since ZFS is a copy-on-write filesystem with snapshots, blocks
	     cannot be preallocated for	a file in order	to guarantee that
	     later writes will not run out of space.  Instead, fallocate(2)
	     space preallocation only checks that sufficient space is cur-
	     rently available in the pool or the user's	project	quota alloca-
	     tion, and then creates a sparse file of the requested size.  The
	     requested space is	multiplied by zfs_fallocate_reserve_percent to
	     allow additional space for	indirect blocks	and other internal
	     metadata.	Setting	this to	0 disables support for fallocate(2)
	     and causes	it to return EOPNOTSUPP.
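
	     The space check described above can be sketched as follows.  The
	     function name is hypothetical and the integer rounding is an
	     assumption; only the multiplication by
	     zfs_fallocate_reserve_percent and the EOPNOTSUPP result for 0
	     come from the text.

```python
import errno


# Illustrative sketch; not the kernel's implementation.
def fallocate_space_to_check(requested_bytes, reserve_percent=110):
    """Bytes that must currently be available for the request to succeed."""
    if reserve_percent == 0:
        # A setting of 0 disables fallocate(2) support entirely.
        raise OSError(errno.EOPNOTSUPP, "fallocate(2) support is disabled")
    return requested_bytes * reserve_percent // 100


print(fallocate_space_to_check(1000))   # 1100
```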

     zfs_fletcher_4_impl=fastest (string)
	     Select a fletcher 4 implementation.

	     Supported selectors are: fastest, scalar, sse2, ssse3, avx2,
	     avx512f, avx512bw,	and aarch64_neon.  All except fastest and
	     scalar require instruction	set extensions to be available,	and
	     will only appear if ZFS detects that they are present at runtime.
	     If	multiple implementations of fletcher 4 are available, the
	     fastest will be chosen using a micro benchmark.  Selecting	scalar
	     results in	the original CPU-based calculation being used.	Se-
	     lecting any option	other than fastest or scalar results in	vector
	     instructions from the respective CPU instruction set being	used.

     zfs_free_bpobj_enabled=1|0	(int)
	     Enable/disable the	processing of the free_bpobj object.

     zfs_async_block_max_blocks=ULONG_MAX (unlimited) (ulong)
	     Maximum number of blocks freed in a single	TXG.

     zfs_max_async_dedup_frees=100000 (10^5) (ulong)
	     Maximum number of dedup blocks freed in a single TXG.

     zfs_override_estimate_recordsize=0	(ulong)
	     If nonzero, override record size calculation for zfs send
	     estimates.

     zfs_vdev_async_read_max_active=3 (int)
	     Maximum asynchronous read I/O operations active to	each device.

     zfs_vdev_async_read_min_active=1 (int)
	     Minimum asynchronous read I/O operation active to each device.

     zfs_vdev_async_write_active_max_dirty_percent=60% (int)
	     When the pool has more than this much dirty data, use
	     zfs_vdev_async_write_max_active to	limit active async writes.  If
	     the dirty data is between the minimum and maximum,	the active I/O
	     limit is linearly interpolated.  See ZFS I/O SCHEDULER.

     zfs_vdev_async_write_active_min_dirty_percent=30% (int)
	     When the pool has less than this much dirty data, use
	     zfs_vdev_async_write_min_active to	limit active async writes.  If
	     the dirty data is between the minimum and maximum,	the active I/O
	     limit is linearly interpolated.  See ZFS I/O SCHEDULER.
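
	     The linear interpolation between the two dirty data thresholds
	     can be sketched as below.  This is an illustration using the
	     defaults listed in this page, with dirty data expressed as a
	     percentage; the function name and the integer rounding are
	     assumptions, not the scheduler's actual code.

```python
# Illustrative sketch; defaults are taken from the entries in this page.
def async_write_active_limit(dirty_pct, min_active=2, max_active=30,
                             min_dirty_pct=30, max_dirty_pct=60):
    """Active async write limit for a given amount of dirty data (percent)."""
    if dirty_pct <= min_dirty_pct:
        return min_active
    if dirty_pct >= max_dirty_pct:
        return max_active
    # Between the thresholds, scale linearly from min_active to max_active.
    span = max_dirty_pct - min_dirty_pct
    return min_active + (dirty_pct - min_dirty_pct) * (max_active - min_active) // span


for pct in (10, 30, 45, 60, 90):
    print(pct, async_write_active_limit(pct))
```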

     zfs_vdev_async_write_max_active=30	(int)
	     Maximum asynchronous write	I/O operations active to each device.

     zfs_vdev_async_write_min_active=2 (int)
	     Minimum asynchronous write	I/O operations active to each device.

	     Lower values are associated with better latency on	rotational me-
	     dia but poorer resilver performance.  The default value of	2 was
	     chosen as a compromise.  A	value of 3 has been shown to improve
	     resilver performance further at a cost of further increasing
	     latency.

     zfs_vdev_initializing_max_active=1	(int)
	     Maximum initializing I/O operations active to each device.  See
	     ZFS I/O SCHEDULER.

     zfs_vdev_initializing_min_active=1	(int)
	     Minimum initializing I/O operations active to each device.  See
	     ZFS I/O SCHEDULER.

     zfs_vdev_max_active=1000 (int)
	     The maximum number	of I/O operations active to each device.  Ide-
	     ally, this	will be	at least the sum of each queue's max_active.

     zfs_vdev_rebuild_max_active=3 (int)
	     Maximum sequential	resilver I/O operations	active to each device.

     zfs_vdev_rebuild_min_active=1 (int)
	     Minimum sequential	resilver I/O operations	active to each device.

     zfs_vdev_removal_max_active=2 (int)
	     Maximum removal I/O operations active to each device.  See ZFS
	     I/O SCHEDULER.

     zfs_vdev_removal_min_active=1 (int)
	     Minimum removal I/O operations active to each device.  See ZFS
	     I/O SCHEDULER.

     zfs_vdev_scrub_max_active=2 (int)
	     Maximum scrub I/O operations active to each device.  See ZFS I/O
	     SCHEDULER.

     zfs_vdev_scrub_min_active=1 (int)
	     Minimum scrub I/O operations active to each device.  See ZFS I/O
	     SCHEDULER.

     zfs_vdev_sync_read_max_active=10 (int)
	     Maximum synchronous read I/O operations active to each device.

     zfs_vdev_sync_read_min_active=10 (int)
	     Minimum synchronous read I/O operations active to each device.

     zfs_vdev_sync_write_max_active=10 (int)
	     Maximum synchronous write I/O operations active to	each device.

     zfs_vdev_sync_write_min_active=10 (int)
	     Minimum synchronous write I/O operations active to	each device.

     zfs_vdev_trim_max_active=2	(int)
	     Maximum trim/discard I/O operations active to each device.  See
	     ZFS I/O SCHEDULER.

     zfs_vdev_trim_min_active=1	(int)
	     Minimum trim/discard I/O operations active to each device.  See
	     ZFS I/O SCHEDULER.

     zfs_vdev_nia_delay=5 (int)
	     For non-interactive I/O (scrub, resilver, removal,	initialize and
	     rebuild), the number of concurrently-active I/O operations	is
	     limited to	zfs_*_min_active, unless the vdev is "idle".  When
	     there are no interactive I/O operations active (synchronous or
	     otherwise), and zfs_vdev_nia_delay operations have completed
	     since the last interactive operation, then the vdev is considered
	     to be "idle", and the number of concurrently-active
	     non-interactive operations is increased to zfs_*_max_active.  See
	     ZFS I/O SCHEDULER.

     zfs_vdev_nia_credit=5 (int)
	     Some HDDs tend to prioritize sequential I/O so strongly, that
	     concurrent	random I/O latency reaches several seconds.  On	some
	     HDDs this happens even if sequential I/O operations are submitted
	     one at a time, and	so setting zfs_*_max_active= 1 does not	help.
	     To	prevent	non-interactive	I/O, like scrub, from monopolizing the
	     device, no	more than zfs_vdev_nia_credit operations can be	sent
	     while there are outstanding incomplete interactive	operations.
	     This enforced wait	ensures	the HDD	services the interactive I/O
	     within a reasonable amount	of time.  See ZFS I/O SCHEDULER.

     zfs_vdev_queue_depth_pct=1000% (int)
	     Maximum number of queued allocations per top-level	vdev expressed
	     as	a percentage of	zfs_vdev_async_write_max_active, which allows
	     the system	to detect devices that are more	capable	of handling
	     allocations and to	allocate more blocks to	those devices.	This
	     allows for dynamic allocation distribution when devices are
	     imbalanced, as fuller devices will tend to be slower than empty
	     devices.

	     Also see zio_dva_throttle_enabled.
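
	     As a worked example of the percentage above, the defaults in
	     this page (1000% of a zfs_vdev_async_write_max_active of 30)
	     give 300 queued allocations per top-level vdev.  The sketch
	     below only reproduces that arithmetic; the function name is
	     hypothetical and the allocator may apply further adjustments
	     that are not described here.

```python
# Illustrative arithmetic only; not the allocator's actual code.
def max_queued_allocations(queue_depth_pct=1000, async_write_max_active=30):
    """Queued allocations per top-level vdev for the given settings."""
    return async_write_max_active * queue_depth_pct // 100


print(max_queued_allocations())   # 300 with the defaults in this page
```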

     zfs_expire_snapshot=300s (int)
	     Time before expiring .zfs/snapshot.

     zfs_admin_snapshot=0|1 (int)
	     Allow the creation, removal, or renaming of entries in the
	     .zfs/snapshot directory to	cause the creation, destruction, or
	     renaming of snapshots.  When enabled, this	functionality works
	     both locally and over NFS exports which have the no_root_squash
	     option set.

     zfs_flags=0 (int)
	     Set additional debugging flags.  The following flags may be bit-
	     wise-ored together:
	     |	  Value	  Symbolic Name		       Description							|
	     |	      1	  ZFS_DEBUG_DPRINTF	       Enable dprintf entries in the debug log.				|
	     |*	      2	  ZFS_DEBUG_DBUF_VERIFY	       Enable extra dbuf verifications.					|
	     |*	      4	  ZFS_DEBUG_DNODE_VERIFY       Enable extra dnode verifications.				|
	     |	      8	  ZFS_DEBUG_SNAPNAMES	       Enable snapshot name verification.				|
	     |	     16	  ZFS_DEBUG_MODIFY	       Check for illegally modified ARC	buffers.			|
	     |	     64	  ZFS_DEBUG_ZIO_FREE	       Enable verification of block frees.				|
	     |	    128	  ZFS_DEBUG_HISTOGRAM_VERIFY   Enable extra spacemap histogram verifications.			|
	     |	    256	  ZFS_DEBUG_METASLAB_VERIFY    Verify space accounting on disk matches in-memory range_trees.	|
	     |	    512	  ZFS_DEBUG_SET_ERROR	       Enable SET_ERROR	and dprintf entries in the debug log.		|
	     |	   1024	  ZFS_DEBUG_INDIRECT_REMAP     Verify split blocks created by device removal.			|
	     |	   2048	  ZFS_DEBUG_TRIM	       Verify TRIM ranges are always within the	allocatable range tree.	|
	     |	   4096	  ZFS_DEBUG_LOG_SPACEMAP       Verify that the log summary is consistent with the spacemap log	|
	     |						      and enable zfs_dbgmsgs for metaslab loading and flushing.	|
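
	     Because the flags are bitwise-ored together, a zfs_flags value
	     can be built from, or decoded into, the symbolic names in the
	     table above.  The following sketch copies the values from that
	     table; the helper function names are hypothetical.

```python
# Values copied from the zfs_flags table above.
ZFS_DEBUG_FLAGS = {
    "ZFS_DEBUG_DPRINTF": 1,
    "ZFS_DEBUG_DBUF_VERIFY": 2,
    "ZFS_DEBUG_DNODE_VERIFY": 4,
    "ZFS_DEBUG_SNAPNAMES": 8,
    "ZFS_DEBUG_MODIFY": 16,
    "ZFS_DEBUG_ZIO_FREE": 64,
    "ZFS_DEBUG_HISTOGRAM_VERIFY": 128,
    "ZFS_DEBUG_METASLAB_VERIFY": 256,
    "ZFS_DEBUG_SET_ERROR": 512,
    "ZFS_DEBUG_INDIRECT_REMAP": 1024,
    "ZFS_DEBUG_TRIM": 2048,
    "ZFS_DEBUG_LOG_SPACEMAP": 4096,
}


def combine_flags(*names):
    """Bitwise-or the named flags into a single zfs_flags value."""
    value = 0
    for name in names:
        value |= ZFS_DEBUG_FLAGS[name]
    return value


def decode_flags(value):
    """List the symbolic names whose bits are set in a zfs_flags value."""
    return [name for name, bit in ZFS_DEBUG_FLAGS.items() if value & bit]


value = combine_flags("ZFS_DEBUG_DPRINTF", "ZFS_DEBUG_SET_ERROR")
print(value)                # 513
print(decode_flags(value))  # ['ZFS_DEBUG_DPRINTF', 'ZFS_DEBUG_SET_ERROR']
```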

     zfs_free_leak_on_eio=0|1 (int)
	     If	destroy	encounters an EIO while	reading	metadata (e.g. indi-
	     rect blocks), space referenced by the missing metadata can	not be
	     freed.  Normally this causes the background destroy to become
	     "stalled",	as it is unable	to make	forward	progress.  While in
	     this stalled state, all remaining space to	free from the error-
	     encountering filesystem is	"temporarily leaked".  Set this	flag
	     to	cause it to ignore the EIO, permanently	leak the space from
	     indirect blocks that can not be read, and continue	to free	every-
	     thing else	that it	can.

	     The default "stalling" behavior is	useful if the storage par-
	     tially fails (i.e.	some but not all I/O operations	fail), and
	     then later	recovers.  In this case, we will be able to continue
	     pool operations while it is partially failed, and when it recov-
	     ers, we can continue to free the space, with no leaks.  Note,
	     however, that this	case is	actually fairly	rare.

	     Typically pools either
		 1. fail completely (but perhaps temporarily, e.g. due to a
		   top-level vdev going	offline), or
		 2. have localized, permanent errors (e.g. disk	returns	the
		   wrong data due to bit flip or firmware bug).
	     In	the former case, this setting does not matter because the pool
	     will be suspended and the sync thread will	not be able to make
	     forward progress regardless.  In the latter, because the error is
	     permanent,	the best we can	do is leak the minimum amount of
	     space, which is what setting this flag will do.  It is therefore
	     reasonable	for this flag to normally be set, but we chose the
	     more conservative approach	of not setting it, so that there is no
	     possibility of leaking space in the "partial temporary" failure

     zfs_free_min_time_ms=1000ms (1s) (int)
	     During a zfs destroy operation using the async_destroy feature, a
	     minimum of	this much time will be spent working on	freeing	blocks
	     per TXG.

     zfs_obsolete_min_time_ms=500ms (int)
	     Similar to	zfs_free_min_time_ms, but for cleanup of old indirec-
	     tion records for removed vdevs.

     zfs_immediate_write_sz=32768B (32kB) (long)
	     Largest data block	to write to the	ZIL.  Larger blocks will be
	     treated as	if the dataset being written to	had the
	     logbias=throughput	property set.

     zfs_initialize_value=16045690984833335022 (0xDEADBEEFDEADBEEE) (ulong)
	     Pattern written to	vdev free space	by zpool-initialize(8).

     zfs_initialize_chunk_size=1048576B	(1MB) (ulong)
	     Size of writes used by zpool-initialize(8).  This option is used
	     by	the test suite.

     zfs_livelist_max_entries=500000 (5*10^5) (ulong)
	     The threshold size	(in block pointers) at which we	create a new
	     sub-livelist.  Larger sublists are	more costly from a memory per-
	     spective but the fewer sublists there are,	the lower the cost of

     zfs_livelist_min_percent_shared=75% (int)
	     If	the amount of shared space between a snapshot and its clone
	     drops below this threshold, the clone turns off the livelist and
	     reverts to	the old	deletion method.  This is in place because
	     livelists no longer give us a benefit once a clone has been
	     overwritten enough.

     zfs_livelist_condense_new_alloc=0 (int)
	     Incremented each time an extra ALLOC blkptr is added to a
	     livelist entry while it is	being condensed.  This option is used
	     by	the test suite to track	race conditions.

     zfs_livelist_condense_sync_cancel=0 (int)
	     Incremented each time livelist condensing is canceled while in
	     spa_livelist_condense_sync().  This option	is used	by the test
	     suite to track race conditions.

     zfs_livelist_condense_sync_pause=0|1 (int)
	     When set, the livelist condense process pauses indefinitely be-
	     fore executing the	synctask - spa_livelist_condense_sync().  This
	     option is used by the test	suite to trigger race conditions.

     zfs_livelist_condense_zthr_cancel=0 (int)
	     Incremented each time livelist condensing is canceled while in
	     spa_livelist_condense_cb().  This option is used by the test
	     suite to track race conditions.

     zfs_livelist_condense_zthr_pause=0|1 (int)
	     When set, the livelist condense process pauses indefinitely be-
	     fore executing the	open context condensing	work in
	     spa_livelist_condense_cb().  This option is used by the test
	     suite to trigger race conditions.

     zfs_lua_max_instrlimit=100000000 (10^8) (ulong)
	     The maximum execution time	limit that can be set for a ZFS	chan-
	     nel program, specified as a number	of Lua instructions.

     zfs_lua_max_memlimit=104857600 (100MB) (ulong)
	     The maximum memory	limit that can be set for a ZFS	channel	pro-
	     gram, specified in	bytes.

     zfs_max_dataset_nesting=50	(int)
	     The maximum depth of nested datasets.  This value can be tuned
	     temporarily to fix existing datasets that exceed the predefined
	     limit.

     zfs_max_log_walking=5 (ulong)
	     The number	of past	TXGs that the flushing algorithm of the	log
	     spacemap feature uses to estimate incoming	log blocks.

     zfs_max_logsm_summary_length=10 (ulong)
	     Maximum number of rows allowed in the summary of the spacemap
	     log.

     zfs_max_recordsize=1048576	(1MB) (int)
	     We	currently support block	sizes from 512B	to 16MB.  The benefits
	     of	larger blocks, and thus	larger I/O, need to be weighed against
	     the cost of COWing	a giant	block to modify	one byte.  Addition-
	     ally, very	large blocks can have an impact	on I/O latency,	and
	     also potentially on the memory allocator.	Therefore, we do not
	     allow the recordsize to be	set larger than	this tunable.  Larger
	     blocks can	be created by changing it, and pools with larger
	     blocks can always be imported and used, regardless of this
	     setting.

     zfs_allow_redacted_dataset_mount=0|1 (int)
	     Allow datasets received with redacted send/receive	to be mounted.
	     Normally disabled because these datasets may be missing key data.

     zfs_min_metaslabs_to_flush=1 (ulong)
	     Minimum number of metaslabs to flush per dirty TXG.

     zfs_metaslab_fragmentation_threshold=70% (int)
	     Allow metaslabs to	keep their active state	as long	as their frag-
	     mentation percentage is no	more than this value.  An active
	     metaslab that exceeds this	threshold will no longer keep its ac-
	     tive status allowing better metaslabs to be selected.

     zfs_mg_fragmentation_threshold=95%	(int)
	     Metaslab groups are considered eligible for allocations if	their
	     fragmentation metric (measured as a percentage) is	less than or
	     equal to this value.  If a	metaslab group exceeds this threshold
	     then it will be skipped unless all	metaslab groups	within the
	     metaslab class have also crossed this threshold.

     zfs_mg_noalloc_threshold=0% (int)
	     Defines a threshold at which metaslab groups should be eligible
	     for allocations.  The value is expressed as a percentage of free
	     space beyond which	a metaslab group is always eligible for	allo-
	     cations.  If a metaslab group's free space	is less	than or	equal
	     to	the threshold, the allocator will avoid	allocating to that
	     group unless all groups in	the pool have reached the threshold.
	     Once all groups have reached the threshold, all groups are	al-
	     lowed to accept allocations.  The default value of	0 disables the
	     feature and causes all metaslab groups to be eligible for
	     allocations.

	     This parameter allows one to deal with pools having heavily im-
	     balanced vdevs such as would be the case when a new vdev has been
	     added.  Setting the threshold to a	non-zero percentage will stop
	     allocations from being made to vdevs that aren't filled to	the
	     specified percentage and allow lesser filled vdevs	to acquire
	     more allocations than they	otherwise would	under the old
	     zfs_mg_alloc_failures facility.
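
	     The eligibility rule described above can be sketched as follows.
	     This is an illustration of the documented behavior only: the
	     function name and the dictionary of per-group free space
	     percentages are hypothetical, and the real allocator considers
	     more state than this.

```python
# Illustrative sketch; not the allocator's actual code.
def eligible_groups(free_pct_by_group, noalloc_threshold=0):
    """Return the set of metaslab groups that may accept allocations."""
    everyone = set(free_pct_by_group)
    if noalloc_threshold == 0:
        # The default of 0 disables the feature: every group is eligible.
        return everyone
    # Groups with free space above the threshold are always eligible.
    above = {name for name, free in free_pct_by_group.items()
             if free > noalloc_threshold}
    # Once every group has reached the threshold, all groups are allowed.
    return above if above else everyone


groups = {"old-vdev": 5, "new-vdev": 90}    # assumed free space percentages
print(sorted(eligible_groups(groups, noalloc_threshold=10)))   # ['new-vdev']
```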

     zfs_ddt_data_is_special=1|0 (int)
	     If enabled, ZFS will place DDT data into the special allocation
	     class.

     zfs_user_indirect_is_special=1|0 (int)
	     If	enabled, ZFS will place	user data indirect blocks into the
	     special allocation	class.

     zfs_multihost_history=0 (int)
	     Historical	statistics for this many latest	multihost updates will
	     be	available in /proc/spl/kstat/zfs/<pool>/multihost.

     zfs_multihost_interval=1000ms (1s)	(ulong)
	     Used to control the frequency of multihost	writes which are per-
	     formed when the multihost pool property is	on.  This is one of
	     the factors used to determine the length of the activity check
	     during import.

	     The multihost write period	is zfs_multihost_interval /
	     leaf-vdevs.  On average a multihost write will be issued for each
	     leaf vdev every zfs_multihost_interval milliseconds.  In prac-
	     tice, the observed	period can vary	with the I/O load and this ob-
	     served value is the delay which is	stored in the uberblock.

     zfs_multihost_import_intervals=20 (uint)
	     Used to control the duration of the activity test on import.
	     Smaller values of zfs_multihost_import_intervals will reduce the
	     import time but increase the risk of failing to detect an active
	     pool.  The	total activity check time is never allowed to drop be-
	     low one second.

	     On	import the activity check waits	a minimum amount of time de-
	     termined by zfs_multihost_interval	*
	     zfs_multihost_import_intervals, or	the same product computed on
	     the host which last had the pool imported,	whichever is greater.
	     The activity check	time may be further extended if	the value of
	     MMP delay found in	the best uberblock indicates actual multihost
	     updates happened at longer	intervals than zfs_multihost_interval.
	     A minimum of 100ms	is enforced.

	     0 is equivalent to	1.

     zfs_multihost_fail_intervals=10 (uint)
	     Controls the behavior of the pool when multihost write failures
	     or	delays are detected.

	     When 0, multihost write failures or delays	are ignored.  The
	     failures will still be reported to	the ZED	which depending	on its
	     configuration may take action such	as suspending the pool or of-
	     flining a device.

	     Otherwise,	the pool will be suspended if
	     zfs_multihost_fail_intervals * zfs_multihost_interval millisec-
	     onds pass without a successful MMP	write.	This guarantees	the
	     activity test will	see MMP	writes if the pool is imported.	 1 is
	     equivalent	to 2; this is necessary	to prevent the pool from being
	     suspended due to normal, small I/O	latency	variations.
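
	     The multihost timing relationships described above can be worked
	     through numerically.  The sketch below covers only the write
	     period and the suspend timeout; the function names and the
	     assumed count of four leaf vdevs are illustrative, and the
	     import-time activity check is not modeled.

```python
# Illustrative arithmetic only; function names are hypothetical.
def multihost_write_period_ms(interval_ms, leaf_vdevs):
    """Average time between multihost writes across all leaf vdevs."""
    return interval_ms / leaf_vdevs


def suspend_timeout_ms(fail_intervals, interval_ms):
    """Time without a successful MMP write before the pool is suspended.

    Returns None when fail_intervals is 0 (failures are ignored).
    A value of 1 is treated as 2, as described above.
    """
    if fail_intervals == 0:
        return None
    return max(fail_intervals, 2) * interval_ms


print(multihost_write_period_ms(1000, 4))   # 250.0 with 4 assumed leaf vdevs
print(suspend_timeout_ms(10, 1000))         # 10000 with the defaults
```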

     zfs_no_scrub_io=0|1 (int)
	     Set to disable scrub I/O.	This results in	scrubs not actually
	     scrubbing data and simply doing a metadata crawl of the pool
	     instead.

     zfs_no_scrub_prefetch=0|1 (int)
	     Set to disable block prefetching for scrubs.

     zfs_nocacheflush=0|1 (int)
	     Disable cache flush operations on disks when writing.  Setting
	     this will cause pool corruption on	power loss if a	volatile out-
	     of-order write cache is enabled.

     zfs_nopwrite_enabled=1|0 (int)
	     Allow no-operation	writes.	 The occurrence	of nopwrites will fur-
	     ther depend on other pool properties (i.a.	the checksumming and
	     compression algorithms).

     zfs_dmu_offset_next_sync=0|1 (int)
	     Enable forcing TXG	sync to	find holes.  When enabled forces ZFS
	     to	act like prior versions	when SEEK_HOLE or SEEK_DATA flags are
	     used, which, when a dnode is dirty, causes	TXGs to	be synced so
	     that this data can	be found.

     zfs_pd_bytes_max=52428800B	(50MB) (int)
	     The number	of bytes which should be prefetched during a pool tra-
	     versal, like zfs send or other data crawling operations.

     zfs_traverse_indirect_prefetch_limit=32 (int)
	     The number of blocks pointed to by an indirect (non-L0) block which
	     should be prefetched during a pool	traversal, like	zfs send or
	     other data	crawling operations.

     zfs_per_txg_dirty_frees_percent=5%	(ulong)
	     Control percentage	of dirtied indirect blocks from	frees allowed
	     into one TXG.  After this threshold is crossed, additional	frees
	     will wait until the next TXG.  0 disables this throttle.

     zfs_prefetch_disable=0|1 (int)
	     Disable predictive	prefetch.  Note	that it	leaves "prescient"
	     prefetch (e.g. for zfs send) intact.  Unlike predictive
	     prefetch, prescient prefetch never	issues I/O that	ends up	not
	     being needed, so it can't hurt performance.

     zfs_qat_checksum_disable=0|1 (int)
	     Disable QAT hardware acceleration for SHA256 checksums.  May be
	     unset after the ZFS modules have been loaded to initialize	the
	     QAT hardware as long as support is	compiled in and	the QAT	driver
	     is	present.

     zfs_qat_compress_disable=0|1 (int)
	     Disable QAT hardware acceleration for gzip	compression.  May be
	     unset after the ZFS modules have been loaded to initialize	the
	     QAT hardware as long as support is	compiled in and	the QAT	driver
	     is	present.

     zfs_qat_encrypt_disable=0|1 (int)
	     Disable QAT hardware acceleration for AES-GCM encryption.	May be
	     unset after the ZFS modules have been loaded to initialize	the
	     QAT hardware as long as support is	compiled in and	the QAT	driver
	     is	present.

     zfs_vnops_read_chunk_size=1048576B	(1MB) (long)
	     Bytes to read per chunk.

     zfs_read_history=0	(int)
	     Historical	statistics for this many latest	reads will be avail-
	     able in /proc/spl/kstat/zfs/<pool>/reads.

     zfs_read_history_hits=0|1 (int)
	     Include cache hits in read history.

     zfs_rebuild_max_segment=1048576B (1MB) (ulong)
	     Maximum read segment size to issue	when sequentially resilvering
	     a top-level vdev.

     zfs_rebuild_scrub_enabled=1|0 (int)
	     Automatically start a pool	scrub when the last active sequential
	     resilver completes	in order to verify the checksums of all	blocks
	     which have	been resilvered.  This is enabled by default and
	     strongly recommended.

     zfs_rebuild_vdev_limit=33554432B (32MB) (ulong)
	     Maximum amount of I/O that	can be concurrently issued for a se-
	     quential resilver per leaf	device,	given in bytes.

     zfs_reconstruct_indirect_combinations_max=4096 (int)
	     If	an indirect split block	contains more than this	many possible
	     unique combinations when being reconstructed, consider it too
	     computationally expensive to check	them all.  Instead, try	at
	     most this many randomly selected combinations each	time the block
	     is	accessed.  This	allows all segment copies to participate
	     fairly in the reconstruction when all combinations	cannot be
	     checked and prevents repeated use of one bad copy.

     zfs_recover=0|1 (int)
	     Set to attempt to recover from fatal errors.  This	should only be
	     used as a last resort, as it typically results in leaked space,
	     or	worse.

     zfs_removal_ignore_errors=0|1 (int)
	     Ignore hard IO errors during device removal.  When	set, if	a de-
	     vice encounters a hard IO error during the	removal	process	the
	     removal will not be cancelled.  This can result in	a normally re-
	     coverable block becoming permanently damaged and is hence not
	     recommended.  This	should only be used as a last resort when the
	     pool cannot be returned to a healthy state prior to removing the
	     device.

     zfs_removal_suspend_progress=0|1 (int)
	     This is used by the test suite so that it can ensure that certain
	     actions happen while in the middle	of a removal.

     zfs_remove_max_segment=16777216B (16MB) (int)
	     The largest contiguous segment that we will attempt to allocate
	     when removing a device.  If there is a performance	problem	with
	     attempting	to allocate large blocks, consider decreasing this.
	     The default value is also the maximum.

     zfs_resilver_disable_defer=0|1 (int)
	     Ignore the resilver_defer feature, causing an operation that
	     would start a resilver to immediately restart the one in
	     progress.

     zfs_resilver_min_time_ms=3000ms (3s) (int)
	     Resilvers are processed by	the sync thread.  While	resilvering,
	     it	will spend at least this much time working on a	resilver be-
	     tween TXG flushes.

     zfs_scan_ignore_errors=0|1	(int)
	     If	set, remove the	DTL (dirty time	list) upon completion of a
	     pool scan (scrub),	even if	there were unrepairable	errors.	 In-
	     tended to be used during pool repair or recovery to stop resil-
	     vering when the pool is next imported.

     zfs_scrub_min_time_ms=1000ms (1s) (int)
	     Scrubs are processed by the sync thread.  While scrubbing, it
	     will spend at least this much time working on a scrub between TXG
	     flushes.

     zfs_scan_checkpoint_intval=7200s (2h) (int)
	     To	preserve progress across reboots, the sequential scan algo-
	     rithm periodically	needs to stop metadata scanning	and issue all
	     the verification I/O to disk.  The	frequency of this flushing is
	     determined	by this	tunable.

     zfs_scan_fill_weight=3 (int)
	     This tunable affects how scrub and	resilver I/O segments are or-
	     dered.  A higher number indicates that we care more about how
	     filled in a segment is, while a lower number indicates we care
	     more about	the size of the	extent without considering the gaps
	     within a segment.  This value is only tunable upon module
	     insertion.  Changing the value afterwards will have no effect on
	     scrub or resilver performance.

     zfs_scan_issue_strategy=0 (int)
	     Determines	the order that data will be verified while scrubbing
	     or	resilvering:
		 1  Data will be verified as sequentially as possible, given
		    the	amount of memory reserved for scrubbing	(see
		    zfs_scan_mem_lim_fact).  This may improve scrub perfor-
		    mance if the pool's	data is	very fragmented.
		 2  The	largest	mostly-contiguous chunk	of found data will be
		    verified first.  By	deferring scrubbing of small segments,
		    we may later find adjacent data to coalesce	and increase
		    the	segment	size.
		 0  Use	strategy 1 during normal verification and strategy 2
		    while taking a checkpoint.

     zfs_scan_legacy=0|1 (int)
	     If	unset, indicates that scrubs and resilvers will	gather meta-
	     data in memory before issuing sequential I/O.  Otherwise indi-
	     cates that	the legacy algorithm will be used, where I/O is	initi-
	     ated as soon as it	is discovered.	Unsetting will not affect
	     scrubs or resilvers that are already in progress.

     zfs_scan_max_ext_gap=2097152B (2MB) (int)
	     Sets the largest gap in bytes between scrub/resilver I/O opera-
	     tions that	will still be considered sequential for	sorting	pur-
	     poses.  Changing this value will not affect scrubs	or resilvers
	     that are already in progress.

     zfs_scan_mem_lim_fact=20^-1 (int)
	     Maximum fraction of RAM used for I/O sorting by sequential	scan
	     algorithm.	 This tunable determines the hard limit	for I/O	sort-
	     ing memory	usage.	When the hard limit is reached we stop scan-
	     ning metadata and start issuing data verification I/O.  This is
	     done until	we get below the soft limit.

     zfs_scan_mem_lim_soft_fact=20^-1 (int)
	     The fraction of the hard limit used to determine the soft limit
	     for I/O sorting by	the sequential scan algorithm.	When we	cross
	     this limit	from below no action is	taken.	When we	cross this
	     limit from	above it is because we are issuing verification	I/O.
	     In	this case (unless the metadata scan is done) we	stop issuing
	     verification I/O and start	scanning metadata again	until we get
	     to	the hard limit.

     zfs_scan_strict_mem_lim=0|1 (int)
	     Enforce tight memory limits on pool scans when a sequential scan
	     is	in progress.  When disabled, the memory	limit may be exceeded
	     by	fast disks.

     zfs_scan_suspend_progress=0|1 (int)
	     Freezes a scrub/resilver in progress without actually pausing it.
	     Intended for testing/debugging.

     zfs_scan_vdev_limit=4194304B (4MB)	(int)
	     Maximum amount of data that can be issued concurrently for scrubs
	     and resilvers per leaf device, given in bytes.

     zfs_send_corrupt_data=0|1 (int)
	     Allow sending of corrupt data (ignore read/checksum errors when
	     sending).

     zfs_send_unmodified_spill_blocks=1|0 (int)
	     Include unmodified spill blocks in the send stream.  Under
	     certain circumstances, previous versions of ZFS could incorrectly
	     remove the spill block from an existing object.  Including
	     unmodified copies of the spill blocks creates a
	     backwards-compatible stream which will recreate a spill block if
	     it was incorrectly removed.

     zfs_send_no_prefetch_queue_ff=20^-1 (int)
	     The fill fraction of the zfs send internal	queues.	 The fill
	     fraction controls the timing with which internal threads are wo-
	     ken up.

     zfs_send_no_prefetch_queue_length=1048576B	(1MB) (int)
	     The maximum number of bytes allowed in zfs send's internal
	     queues.

     zfs_send_queue_ff=20^-1 (int)
	     The fill fraction of the zfs send prefetch queue.  The fill
	     fraction controls the timing with which internal threads are
	     woken up.

     zfs_send_queue_length=16777216B (16MB) (int)
	     The maximum number of bytes that zfs send will prefetch.  This
	     value must be at least twice the maximum block size in use.

     zfs_recv_queue_ff=20^-1 (int)
	     The fill fraction of the zfs receive queue.  The fill fraction
	     controls the timing with which internal threads are woken up.

     zfs_recv_queue_length=16777216B (16MB) (int)
	     The maximum number	of bytes allowed in the	zfs receive queue.
	     This value	must be	at least twice the maximum block size in use.

     zfs_recv_write_batch_size=1048576B	(1MB) (int)
	     The maximum amount	of data, in bytes, that	zfs receive will write
	     in	one DMU	transaction.  This is the uncompressed size, even when
	     receiving a compressed send stream.  This setting will not	reduce
	     the write size below a single block.  Capped at a maximum of

     zfs_override_estimate_recordsize=0|1 (ulong)
	     Setting this variable overrides the default logic for estimating
	     block sizes when doing a zfs send.	 The default heuristic is that
	     the average block size will be the	current	recordsize.  Override
	     this value	if most	data in	your dataset is	not of that size and
	     you require accurate zfs send size	estimates.

     zfs_sync_pass_deferred_free=2 (int)
	     Flushing of data to disk is done in passes.  Defer	frees starting
	     in	this pass.

     zfs_spa_discard_memory_limit=16777216B (16MB) (int)
	     Maximum memory used for prefetching a checkpoint's	space map on
	     each vdev while discarding	the checkpoint.

     zfs_special_class_metadata_reserve_pct=25%	(int)
	     Only allow	small data blocks to be	allocated on the special and
	     dedup vdev	types when the available free space percentage on
	     these vdevs exceeds this value.  This ensures reserved space is
	     available for pool metadata as the special vdevs approach
	     capacity.

     zfs_sync_pass_dont_compress=8 (int)
	     Starting in this sync pass, disable compression (including	of
	     metadata).	 With the default setting, in practice,	we don't have
	     this many sync passes, so this has	no effect.

	     The original intent was that disabling compression would help the
	     sync passes to converge.  However, in practice, disabling
	     compression increases the average number of sync passes, because
	     when we turn compression off, many blocks' sizes will change, and
	     thus we have to re-allocate (not overwrite) them.  It also
	     increases the number of 128kB allocations (e.g. for indirect
	     blocks and spacemaps) because these will not be compressed.  The
	     128kB allocations are especially detrimental to performance on
	     highly fragmented systems, which may have very few free segments
	     of this size, and may need to load new metaslabs to satisfy these
	     allocations.

     zfs_sync_pass_rewrite=2 (int)
	     Rewrite new block pointers	starting in this pass.

     zfs_sync_taskq_batch_pct=75% (int)
	     This controls the number of threads used by dp_sync_taskq.	 The
	     default value of 75% will create a	maximum	of one thread per CPU.

     zfs_trim_extent_bytes_max=134217728B (128MB) (uint)
	     Maximum size of TRIM command.  Larger ranges will be split	into
	     chunks no larger than this	value before issuing.

     zfs_trim_extent_bytes_min=32768B (32kB) (uint)
	     Minimum size of TRIM commands.  TRIM ranges smaller than this
	     will be skipped, unless they're part of a larger range which was
	     chunked.  This is done because it's common	for these small	TRIMs
	     to	negatively impact overall performance.
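
	     The interaction of zfs_trim_extent_bytes_max and
	     zfs_trim_extent_bytes_min can be sketched as follows.  This is an
	     illustration of the description above, not the kernel code; the
	     function name and the (offset, length) tuples are hypothetical.

```python
def trim_commands(ranges, extent_bytes_max=134217728, extent_bytes_min=32768):
    """Yield (offset, length) TRIM commands for (offset, length) free ranges."""
    for start, length in ranges:
        if length < extent_bytes_min:
            # Ranges below the minimum are skipped entirely.
            continue
        end = start + length
        while start < end:
            # Split large ranges into chunks no larger than the maximum.
            # A short final chunk of a split range is still issued.
            chunk = min(extent_bytes_max, end - start)
            yield (start, chunk)
            start += chunk
```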

     zfs_trim_metaslab_skip=0|1	(uint)
	     Skip uninitialized	metaslabs during the TRIM process.  This op-
	     tion is useful for	pools constructed from large thinly-provi-
	     sioned devices where TRIM operations are slow.  As	a pool ages,
	     an	increasing fraction of the pool's metaslabs will be initial-
	     ized, progressively degrading the usefulness of this option.
	     This setting is stored when starting a manual TRIM	and will per-
	     sist for the duration of the requested TRIM.

     zfs_trim_queue_limit=10 (uint)
	     Maximum number of queued TRIMs outstanding	per leaf vdev.	The
	     number of concurrent TRIM commands	issued to the device is	con-
	     trolled by	zfs_vdev_trim_min_active and zfs_vdev_trim_max_active.

     zfs_trim_txg_batch=32 (uint)
	     The number	of transaction groups' worth of	frees which should be
	     aggregated	before TRIM operations are issued to the device.  This
	     setting represents	a trade-off between issuing larger, more effi-
	     cient TRIM	operations and the delay before	the recently trimmed
	     space is available	for use	by the device.

	     Increasing this value will allow frees to be aggregated for a
	     longer time.  This will result in larger TRIM operations and
	     potentially increased memory usage.  Decreasing this value will
	     have the opposite effect.  The default of 32 was determined to be
	     a reasonable compromise.

     zfs_txg_history=0 (int)
	     Historical	statistics for this many latest	TXGs will be available
	     in	/proc/spl/kstat/zfs/<pool>/TXGs.

     zfs_txg_timeout=5s	(int)
	     Flush dirty data to disk at least every this many seconds (maxi-
	     mum TXG duration).

     zfs_vdev_aggregate_trim=0|1 (int)
	     Allow TRIM I/Os to be aggregated.  This is normally not helpful
	     because the extents to be trimmed will have already been
	     aggregated by the metaslab.  This option is provided for
	     debugging and performance analysis.

     zfs_vdev_aggregation_limit=1048576B (1MB) (int)
	     Max vdev I/O aggregation size.

     zfs_vdev_aggregation_limit_non_rotating=131072B (128kB) (int)
	     Max vdev I/O aggregation size for non-rotating media.

     zfs_vdev_cache_bshift=16 (64kB) (int)
	     Shift size	to inflate reads to.

     zfs_vdev_cache_max=16384B (16kB) (int)
	     Inflate reads smaller than	this value to meet the
	     zfs_vdev_cache_bshift size	(default 64kB).

     zfs_vdev_cache_size=0 (int)
	     Total size	of the per-disk	cache in bytes.

	     Currently this feature is disabled, as it has been	found to not
	     be	helpful	for performance	and in some cases harmful.

     zfs_vdev_mirror_rotating_inc=0 (int)
	     A number by which the balancing algorithm increments the load
	     calculation for the purpose of selecting the least	busy mirror
	     member when an I/O	operation immediately follows its predecessor
	     on rotational vdevs for the purpose of making decisions based on
	     load.

     zfs_vdev_mirror_rotating_seek_inc=5 (int)
	     A number by which the balancing algorithm increments the load
	     calculation for the purpose of selecting the least	busy mirror
	     member when an I/O	operation lacks	locality as defined by
	     zfs_vdev_mirror_rotating_seek_offset.  Operations within this
	     offset that do not immediately follow the previous operation are
	     incremented by half.

     zfs_vdev_mirror_rotating_seek_offset=1048576B (1MB) (int)
	     The maximum distance for the last queued I/O operation in which
	     the balancing algorithm considers an operation to have locality.
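
	     Read together, the three rotating tunables above suggest the
	     following load increment.  This is a sketch only: the function
	     name, the meaning of last_offset (where the previous operation
	     ended), the integer halving, and the strict comparison against
	     the seek offset are assumptions.

```python
def rotating_load_increment(last_offset, offset,
                            rotating_inc=0,
                            rotating_seek_inc=5,
                            rotating_seek_offset=1048576):
    """Return the load increment for one mirror member on a rotational vdev."""
    if offset == last_offset:
        # The operation immediately follows its predecessor.
        return rotating_inc
    if abs(offset - last_offset) < rotating_seek_offset:
        # Nearby but not sequential: half of the seek increment
        # (integer division is an assumption of this sketch).
        return rotating_seek_inc // 2
    # No locality: the full seek increment.
    return rotating_seek_inc
```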

     zfs_vdev_mirror_non_rotating_inc=0	(int)
	     A number by which the balancing algorithm increments the load
	     calculation for the purpose of selecting the least	busy mirror
	     member on non-rotational vdevs when I/O operations	do not immedi-
	     ately follow one another.

     zfs_vdev_mirror_non_rotating_seek_inc=1 (int)
	     A number by which the balancing algorithm increments the load
	     calculation for the purpose of selecting the least	busy mirror
	     member when an I/O	operation lacks	locality as defined by the
	     zfs_vdev_mirror_rotating_seek_offset.  Operations within this
	     offset that do not immediately follow the previous operation are
	     incremented by half.

     zfs_vdev_read_gap_limit=32768B (32kB) (int)
	     Aggregate read I/O	operations if the on-disk gap between them is
	     within this threshold.

     zfs_vdev_write_gap_limit=4096B (4kB) (int)
	     Aggregate write I/O operations if the on-disk gap between them is
	     within this threshold.
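
	     A minimal sketch of gap-based aggregation, assuming I/Os sorted
	     by offset and an inclusive gap comparison.  The function name is
	     hypothetical, and aggregation_limit stands in for
	     zfs_vdev_aggregation_limit.

```python
def aggregate(ios, gap_limit, aggregation_limit=1048576):
    """Merge (offset, size) I/Os, sorted by offset, into larger spans."""
    merged = []
    for offset, size in ios:
        if merged:
            prev_offset, prev_size = merged[-1]
            gap = offset - (prev_offset + prev_size)
            span = offset + size - prev_offset
            # Merge only if the on-disk gap is within the threshold and the
            # resulting span does not exceed the aggregation limit.
            if 0 <= gap <= gap_limit and span <= aggregation_limit:
                merged[-1] = (prev_offset, span)
                continue
        merged.append((offset, size))
    return merged
```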

     zfs_vdev_raidz_impl=fastest (string)
	     Select the	raidz parity implementation to use.

	     Variants that don't depend	on CPU-specific	features may be	se-
	     lected on module load, as they are	supported on all systems.  The
	     remaining options may only	be set after the module	is loaded, as
	     they are available	only if	the implementations are	compiled in
	     and supported on the running system.

	     Once the module is	loaded,
	     /sys/module/zfs/parameters/zfs_vdev_raidz_impl will show the
	     available options,	with the currently selected one	enclosed in
	     square brackets.

	     fastest	       selected	by built-in benchmark
	     original	       original	implementation
	     scalar	       scalar implementation
	     sse2	       SSE2 instruction	set		     64-bit x86
	     ssse3	       SSSE3 instruction set		     64-bit x86
	     avx2	       AVX2 instruction	set		     64-bit x86
	     avx512f	       AVX512F instruction set		     64-bit x86
	     avx512bw	       AVX512F & AVX512BW instruction sets   64-bit x86
	     aarch64_neon      NEON				     Aarch64/64-bit ARMv8
	     aarch64_neonx2    NEON with more unrolling		     Aarch64/64-bit ARMv8
	     powerpc_altivec   Altivec				     PowerPC

     zfs_vdev_scheduler	(charp)
	     DEPRECATED.  Prints warning to kernel log for compatibility.

     zfs_zevent_len_max=512 (int)
	     Max event queue length.  Events in	the queue can be viewed	with

     zfs_zevent_retain_max=2000	(int)
	     Maximum recent zevent records to retain for duplicate checking.
	     Setting this to 0 disables	duplicate detection.

     zfs_zevent_retain_expire_secs=900s	(15min)	(int)
	     Lifespan for a recent ereport that was retained for duplicate
	     checking.

     zfs_zil_clean_taskq_maxalloc=1048576 (int)
	     The maximum number	of taskq entries that are allowed to be
	     cached.  When this	limit is exceeded transaction records (itxs)
	     will be cleaned synchronously.

     zfs_zil_clean_taskq_minalloc=1024 (int)
	     The number	of taskq entries that are pre-populated	when the taskq
	     is	first created and are immediately available for	use.

     zfs_zil_clean_taskq_nthr_pct=100% (int)
	     This controls the number of threads used by dp_zil_clean_taskq.
	     The default value of 100% will create a maximum of one thread per
	     CPU.

     zil_maxblocksize=131072B (128kB) (int)
	     This sets the maximum block size used by the ZIL.  On very
	     fragmented pools, lowering this (typically to 36kB) can improve
	     performance.

     zil_nocacheflush=0|1 (int)
	     Disable the cache flush commands that are normally	sent to	disk
	     by	the ZIL	after an LWB write has completed.  Setting this	will
	     cause ZIL corruption on power loss	if a volatile out-of-order
	     write cache is enabled.

     zil_replay_disable=0|1 (int)
	     Disable intent logging replay.  Can be disabled for recovery from
	     corrupted ZIL.

     zil_slog_bulk=786432B (768kB) (ulong)
	     Limit SLOG write size per commit executed with synchronous
	     priority.  Any writes above that will be executed with lower
	     (asynchronous) priority to limit potential SLOG device abuse by a
	     single active ZIL writer.

     zfs_embedded_slog_min_ms=64 (int)
	     Usually, one metaslab from	each normal-class vdev is dedicated
	     for use by	the ZIL	to log synchronous writes.  However, if	there
	     are fewer than zfs_embedded_slog_min_ms metaslabs in the vdev,
	     this functionality	is disabled.  This ensures that	we don't set
	     aside an unreasonable amount of space for the ZIL.

     zio_deadman_log_all=0|1 (int)
	     If	non-zero, the zio deadman will produce debugging messages (see
	     zfs_dbgmsg_enable)	for all	zios, rather than only for leaf	zios
	     possessing	a vdev.	 This is meant to be used by developers	to
	     gain diagnostic information for hang conditions which don't in-
	     volve a mutex or other locking primitive: typically conditions in
	     which a thread in the zio pipeline	is looping indefinitely.

     zio_slow_io_ms=30000ms (30s) (int)
	     When an I/O operation takes more than this	much time to complete,
	     it's marked as slow.  Each	slow operation causes a	delay zevent.
	     Slow I/O counters can be seen with	zpool status -s.

     zio_dva_throttle_enabled=1|0 (int)
	     Throttle block allocations	in the I/O pipeline.  This allows for
	     dynamic allocation	distribution when devices are imbalanced.
	     When enabled, the maximum number of pending allocations per top-
	     level vdev	is limited by zfs_vdev_queue_depth_pct.

     zio_requeue_io_start_cut_in_line=0|1 (int)
	     Prioritize	requeued I/O.

     zio_taskq_batch_pct=80% (uint)
	     Percentage	of online CPUs which will run a	worker thread for I/O.
	     These workers are responsible for I/O work	such as	compression
	     and checksum calculations.	 Fractional number of CPUs will	be
	     rounded down.

	     The default value of 80% was chosen to avoid using all CPUs which
	     can result in latency issues and inconsistent application
	     performance, especially when slower compression and/or
	     checksumming is enabled.

     zio_taskq_batch_tpq=0 (uint)
	     Number of worker threads per taskq.  Lower	values improve I/O or-
	     dering and	CPU utilization, while higher reduces lock contention.

	     If 0, generate a system-dependent value close to 6 threads per
	     taskq.

     zvol_inhibit_dev=0|1 (uint)
	     Do	not create zvol	device nodes.  This may	slightly improve
	     startup time on systems with a very large number of zvols.

     zvol_major=230 (uint)
	     Major number for zvol block devices.

     zvol_max_discard_blocks=16384 (ulong)
	     Discard (TRIM) operations done on zvols will be done in batches
	     of	this many blocks, where	block size is determined by the
	     volblocksize property of a	zvol.

     zvol_prefetch_bytes=131072B (128kB) (uint)
	     When adding a zvol	to the system, prefetch	this many bytes	from
	     the start and end of the volume.  Prefetching these regions of
	     the volume	is desirable, because they are likely to be accessed
	     immediately by blkid(8) or	the kernel partitioner.

     zvol_request_sync=0|1 (uint)
	     When processing I/O requests for a	zvol, submit them syn-
	     chronously.  This effectively limits the queue depth to 1 for
	     each I/O submitter.  When unset, requests are handled asyn-
	     chronously	by a thread pool.  The number of requests which	can be
	     handled concurrently is controlled	by zvol_threads.

     zvol_threads=32 (uint)
	     Max number of threads which can handle zvol I/O requests
	     concurrently.

     zvol_volmode=1 (uint)
	     Defines zvol block	devices	behaviour when volmode=default:
		 1  equivalent to full
		 2  equivalent to dev
		 3  equivalent to none

     ZFS issues	I/O operations to leaf vdevs to	satisfy	and complete I/O oper-
     ations.  The scheduler determines when and	in what	order those operations
     are issued.  The scheduler	divides	operations into	five I/O classes, pri-
     oritized in the following order: sync read, sync write, async read, async
     write, and	scrub/resilver.	 Each queue defines the	minimum	and maximum
     number of concurrent operations that may be issued	to the device.	In ad-
     dition, the device	has an aggregate maximum, zfs_vdev_max_active.	Note
     that the sum of the per-queue minima must not exceed the aggregate	maxi-
     mum.  If the sum of the per-queue maxima exceeds the aggregate maximum,
     then the number of	active operations may reach zfs_vdev_max_active, in
     which case	no further operations will be issued, regardless of whether
     all per-queue minima have been met.

     For many physical devices,	throughput increases with the number of	con-
     current operations, but latency typically suffers.	 Furthermore, physical
     devices typically have a limit at which more concurrent operations	have
     no	effect on throughput or	can actually cause it to decrease.

     The scheduler selects the next operation to issue by first	looking	for an
     I/O class whose minimum has not been satisfied.  Once all are satisfied
     and the aggregate maximum has not been hit, the scheduler looks for
     classes whose maximum has not been	satisfied.  Iteration through the I/O
     classes is	done in	the order specified above.  No further operations are
     issued if the aggregate maximum number of concurrent operations has been
     hit, or if	there are no operations	queued for an I/O class	that has not
     hit its maximum.  Every time an I/O operation is queued or	an operation
     completes,	the scheduler looks for	new operations to issue.
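
     The selection order described above can be sketched as follows.  This is
     an illustration, not the kernel implementation; the class labels, the
     function name, and the dictionary arguments are hypothetical.

```python
# I/O classes in priority order, as listed in the text above.
CLASSES = ["sync_read", "sync_write", "async_read", "async_write", "scrub"]


def next_class(queued, active, min_active, max_active, aggregate_max):
    """Pick the I/O class to issue from next, or None if nothing may issue."""
    # Nothing is issued once the aggregate maximum has been hit.
    if sum(active.values()) >= aggregate_max:
        return None
    # First pass: a class with queued work whose minimum is not satisfied.
    for c in CLASSES:
        if queued[c] > 0 and active[c] < min_active[c]:
            return c
    # Second pass: a class with queued work that is still below its maximum.
    for c in CLASSES:
        if queued[c] > 0 and active[c] < max_active[c]:
            return c
    return None
```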

     In	general, smaller max_actives will lead to lower	latency	of synchronous
     operations.  Larger max_actives may lead to higher	overall	throughput,
     depending on underlying storage.

     The ratio of the queues' max_actives determines the balance of perfor-
     mance between reads, writes, and scrubs.  For example, increasing
     zfs_vdev_scrub_max_active will cause the scrub or resilver	to complete
     more quickly, but reads and writes to have higher latency and lower
     throughput.

     All I/O classes have a fixed maximum number of outstanding	operations,
     except for	the async write	class.	Asynchronous writes represent the data
     that is committed to stable storage during	the syncing stage for transac-
     tion groups.  Transaction groups enter the	syncing	state periodically, so
     the number	of queued async	writes will quickly burst up and then bleed
     down to zero.  Rather than	servicing them as quickly as possible, the I/O
     scheduler changes the maximum number of active async write	operations ac-
     cording to	the amount of dirty data in the	pool.  Since both throughput
     and latency typically increase with the number of concurrent operations
     issued to physical	devices, reducing the burstiness in the	number of con-
     current operations	also stabilizes	the response time of operations	from
     other (and in particular synchronous) queues.  In broad strokes, the
     I/O scheduler will	issue more concurrent operations from the async	write
     queue as there's more dirty data in the pool.

   Async Writes
     The number	of concurrent operations issued	for the	async write I/O	class
     follows a piece-wise linear function defined by a few adjustable points:

	    |		   o---------| <-- zfs_vdev_async_write_max_active
       ^    |		  /^	     |
       |    |		 / |	     |
     active |		/  |	     |
      I/O   |	       /   |	     |
     count  |	      /	   |	     |
	    |	     /	   |	     |
	    |-------o	   |	     | <-- zfs_vdev_async_write_min_active
	    0%	    |	   |	   100%	of zfs_dirty_data_max
		    |	   |
		    |	   `-- zfs_vdev_async_write_active_max_dirty_percent
		    `--------- zfs_vdev_async_write_active_min_dirty_percent

     Until the amount of dirty data exceeds a minimum percentage of the	dirty
     data allowed in the pool, the I/O scheduler will limit the	number of con-
     current operations	to the minimum.	 As that threshold is crossed, the
     number of concurrent operations issued increases linearly to the maximum
     at the specified maximum percentage of the dirty data allowed in the
     pool.

     Ideally, the amount of dirty data on a busy pool will stay	in the sloped
     part of the function between
     zfs_vdev_async_write_active_min_dirty_percent and
     zfs_vdev_async_write_active_max_dirty_percent.  If	it exceeds the maximum
     percentage, this indicates	that the rate of incoming data is greater than
     the rate that the backend storage can handle.  In this case, we must fur-
     ther throttle incoming writes, as described in the	next section.
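
     The piece-wise linear function above can be sketched as follows, assuming
     integer percentages and truncating division.  The function name is
     hypothetical, and the argument values in the usage note are illustrative
     rather than defaults.

```python
def async_write_active(dirty_pct, min_active, max_active,
                       min_dirty_pct, max_dirty_pct):
    """Return the async write concurrency limit for a dirty-data percentage.

    dirty_pct is the amount of dirty data as a percentage of
    zfs_dirty_data_max; the other arguments stand in for the
    zfs_vdev_async_write_* tunables named in the diagram.
    """
    if dirty_pct <= min_dirty_pct:
        return min_active
    if dirty_pct >= max_dirty_pct:
        return max_active
    # Linear interpolation along the sloped part of the function.
    return min_active + ((max_active - min_active) *
                         (dirty_pct - min_dirty_pct) //
                         (max_dirty_pct - min_dirty_pct))
```

     For example, with a minimum of 2, a maximum of 10, and thresholds of 30%
     and 60%, async_write_active(45, 2, 10, 30, 60) yields 6.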

     We	delay transactions when	we've determined that the backend storage
     isn't able	to accommodate the rate	of incoming writes.

     If	there is already a transaction waiting,	we delay relative to when that
     transaction will finish waiting.  This way	the calculated delay time is
     independent of the	number of threads concurrently executing transactions.

     If	we are the only	waiter,	wait relative to when the transaction started,
     rather than the current time.  This credits the transaction for "time al-
     ready served", e.g. reading indirect blocks.

     The minimum time for a transaction	to take	is calculated as
	   min_time = min(zfs_delay_scale * (dirty - min) / (max - dirty), 100ms)

     The delay has two degrees of freedom that can be adjusted via tunables.
     The percentage of dirty data at which we start to delay is	defined	by
     zfs_delay_min_dirty_percent.  This	should typically be at or above
     zfs_vdev_async_write_active_max_dirty_percent, so that we only start to
     delay after writing at full speed has failed to keep up with the incoming
     write rate.  The scale of the curve is defined by zfs_delay_scale.
     Roughly speaking, this variable determines	the amount of delay at the
     midpoint of the curve.

      10ms +-------------------------------------------------------------*+
	   |								 *|
       9ms +								 *+
	   |								 *|
       8ms +								 *+
	   |								* |
       7ms +								* +
	   |								* |
       6ms +								* +
	   |								* |
       5ms +							       *  +
	   |							       *  |
       4ms +							       *  +
	   |							       *  |
       3ms +							      *	  +
	   |							      *	  |
       2ms +						  (midpoint) *	  +
	   |						      |	   **	  |
       1ms +						      v	***	  +
	   |		 zfs_delay_scale ---------->	 ********	  |
	 0 +-------------------------------------*********----------------+
	   0%			 <- zfs_dirty_data_max ->		100%

     Note that since the delay is added to the outstanding time remaining on
     the most recent transaction, it's effectively the inverse of IOPS.  Here,
     the midpoint of 500us translates to 2000 IOPS.  The shape of the curve
     was chosen	such that small	changes	in the amount of accumulated dirty
     data in the first three quarters of the curve yield relatively small dif-
     ferences in the amount of delay.
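
     The delay formula and the midpoint arithmetic can be checked with a short
     sketch.  It assumes that zfs_delay_scale is expressed in nanoseconds,
     that min is zfs_delay_min_dirty_percent of zfs_dirty_data_max, that max
     is zfs_dirty_data_max, and that the delay is capped at 100ms; the
     function name is hypothetical.

```python
def min_time_ns(dirty, dirty_max, delay_min_dirty_percent, delay_scale_ns,
                delay_cap_ns=100_000_000):
    """Return the minimum transaction time in nanoseconds.

    dirty and dirty_max are byte counts; delay_cap_ns (100ms) is an
    assumption of this sketch.
    """
    delay_min = dirty_max * delay_min_dirty_percent // 100
    if dirty <= delay_min:
        # Below the threshold no delay is applied.
        return 0
    if dirty >= dirty_max:
        # Avoid dividing by zero at the limit; use the cap instead.
        return delay_cap_ns
    delay = delay_scale_ns * (dirty - delay_min) // (dirty_max - dirty)
    return min(delay, delay_cap_ns)
```

     With a scale of 500000 and dirty data midway between min and max, the
     result is 500000ns (500us), which corresponds to 2000 operations per
     second.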

     The effects can be	easier to understand when the amount of	delay is rep-
     resented on a logarithmic scale:

     100ms +-------------------------------------------------------------++
	   +								  +
	   |								  |
	   +								 *+
      10ms +								 *+
	   +							       ** +
	   |						  (midpoint)  **  |
	   +						      |	    **	  +
       1ms +						      v	****	  +
	   +		 zfs_delay_scale ---------->	    *****	  +
	   |						 ****		  |
	   +					      ****		  +
     100us +					    **			  +
	   +					   *			  +
	   |					  *			  |
	   +					 *			  +
      10us +					 *			  +
	   +								  +
	   |								  |
	   +								  +
	   0%			 <- zfs_dirty_data_max ->		100%

     Note here that only as the	amount of dirty	data approaches	its limit does
     the delay start to	increase rapidly.  The goal of a properly tuned	system
     should be to keep the amount of dirty data	out of that range by first en-
     suring that the appropriate limits	are set	for the	I/O scheduler to reach
     optimal throughput	on the back-end	storage, and then by changing the
     value of zfs_delay_scale to increase the steepness	of the curve.

FreeBSD	13.0			 June 1, 2021			  FreeBSD 13.0

