FreeBSD Manual Pages


ZPOOLCONCEPTS(8)	  BSD System Manager's Manual	      ZPOOLCONCEPTS(8)

     zpoolconcepts -- overview of ZFS storage pools

   Virtual Devices (vdevs)
     A "virtual	device"	describes a single device or a collection of devices
     organized according to certain performance	and fault characteristics.
     The following virtual devices are supported:

     disk    A block device, typically located under /dev.  ZFS	can use	indi-
	     vidual slices or partitions, though the recommended mode of oper-
	     ation is to use whole disks.  A disk can be specified by a	full
	     path, or it can be	a shorthand name (the relative portion of the
	     path under	/dev).	A whole	disk can be specified by omitting the
	     slice or partition	designation.  For example, sda is equivalent
	     to	/dev/sda.  When	given a	whole disk, ZFS	automatically labels
	     the disk, if necessary.

     file    A regular file.  Using files as a backing store is strongly
	     discouraged.  The file vdev type is designed primarily for
	     experimental purposes, as the fault tolerance of a file is only
	     as good as the file system of which it is a part.  A file must
	     be specified by a full path.

     mirror  A mirror of two or	more devices.  Data is replicated in an	iden-
	     tical fashion across all components of a mirror.  A mirror	with N
	     disks of size X can hold X	bytes and can withstand	(N-1) devices
	     failing without losing data.

     raidz, raidz1, raidz2, raidz3
	     A variation on RAID-5 that	allows for better distribution of par-
	     ity and eliminates	the RAID-5 "write hole"	(in which data and
	     parity become inconsistent after a power loss).  Data and parity
	     are striped across all disks within a raidz group.

	     A raidz group can have single-, double-, or triple-parity,	mean-
	     ing that the raidz	group can sustain one, two, or three failures,
	     respectively, without losing any data.  The raidz1	vdev type
	     specifies a single-parity raidz group; the	raidz2 vdev type spec-
	     ifies a double-parity raidz group;	and the	raidz3 vdev type spec-
	     ifies a triple-parity raidz group.	 The raidz vdev	type is	an
	     alias for raidz1.

	     A raidz group with	N disks	of size	X with P parity	disks can hold
	     approximately (N-P)*X bytes and can withstand P device(s) failing
	     without losing data.  The minimum number of devices in a raidz
	     group is one more than the	number of parity disks.	 The recom-
	     mended number is between 3	and 9 to help increase performance.
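
	     The capacity rule above can be sketched in a few lines of Python.
	     This is only an illustration of the (N-P)*X approximation; the
	     function name and the TIB constant are hypothetical and not part
	     of ZFS.

```python
def raidz_usable_bytes(n_disks: int, disk_bytes: int, parity: int = 1) -> int:
    """Approximate usable bytes of one raidz group: (N - P) * X."""
    if parity not in (1, 2, 3):
        raise ValueError("raidz parity must be 1, 2, or 3")
    # The minimum group size is one more than the number of parity disks.
    if n_disks < parity + 1:
        raise ValueError("a raidz group needs at least parity + 1 devices")
    return (n_disks - parity) * disk_bytes


TIB = 1024 ** 4
# Six 4 TiB disks in raidz2 hold roughly four disks' worth of data.
print(raidz_usable_bytes(6, 4 * TIB, parity=2) // TIB)  # prints 16
```

	     The result is an upper bound; real pools lose some additional
	     space to padding and metadata.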

     draid, draid1, draid2, draid3
	     A variant of raidz that provides integrated distributed hot
	     spares, which allow for faster resilvering while retaining the
	     benefits of raidz.  A dRAID vdev is constructed from multiple
	     internal raidz groups, each with D data devices and P parity
	     devices.  These groups are distributed over all of the children
	     in order to fully utilize the available disk performance.

	     Unlike raidz, dRAID uses a	fixed stripe width (padding as neces-
	     sary with zeros) to allow fully sequential	resilvering.  This
	     fixed stripe width significantly affects both usable capacity and
	     IOPS.  For	example, with the default D=8 and 4k disk sectors the
	     minimum allocation	size is	32k.  If using compression, this rela-
	     tively large allocation size can reduce the effective compression
	     ratio.  When using ZFS volumes with dRAID, the default
	     volblocksize property is increased to account for the allocation
	     size.  If a dRAID pool will hold a significant number of small
	     blocks, it is recommended to also add a mirrored special vdev to
	     store those blocks.
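
	     The 32k figure above is D multiplied by the sector size.  The
	     following Python sketch shows that arithmetic and, assuming the
	     data portion of a block is rounded up to a whole number of full
	     stripes, why a well-compressed block may still occupy a full
	     stripe.  Function names are hypothetical and parity sectors are
	     not counted.

```python
def draid_min_allocation(data_devices: int = 8, sector_bytes: int = 4096) -> int:
    """Smallest data allocation: one full stripe of D sectors."""
    return data_devices * sector_bytes


def draid_padded_bytes(block_bytes: int, data_devices: int = 8,
                       sector_bytes: int = 4096) -> int:
    """Round a block up to a whole number of full data stripes (assumed model)."""
    stripe = draid_min_allocation(data_devices, sector_bytes)
    stripes = -(-block_bytes // stripe)  # ceiling division
    return stripes * stripe


print(draid_min_allocation())         # prints 32768
print(draid_padded_bytes(20 * 1024))  # prints 32768
```

	     In this model a block that compresses to 20k still consumes 32k
	     of data space, which is the reduced effective compression ratio
	     described above.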

	     With regard to IOPS, performance is similar to raidz, since for any
	     read all D	data disks must	be accessed.  Delivered	random IOPS
	     can be reasonably approximated as floor((N-S)/(D+P))*<single-

	     Like raidz, a dRAID vdev can have single-, double-, or triple-
	     parity.  The draid1, draid2, and draid3 types can be used to
	     specify the parity level.  The draid vdev type is an alias for
	     draid1.

	     A dRAID with N disks of size X, D data disks per redundancy
	     group, P parity level, and	S distributed hot spares can hold ap-
	     proximately (N-S)*(D/(D+P))*X bytes and can withstand P device(s)
	     failing without losing data.
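
	     As a worked instance of the formula above, the following Python
	     sketch evaluates (N-S)*(D/(D+P))*X with integer arithmetic.  The
	     function name is hypothetical, and the check that at least one
	     redundancy group fits is an assumption made for the example.

```python
def draid_usable_bytes(n_disks: int, disk_bytes: int, data: int = 8,
                       parity: int = 1, spares: int = 0) -> int:
    """Approximate usable bytes of a dRAID vdev: (N - S) * D / (D + P) * X."""
    if parity not in (1, 2, 3):
        raise ValueError("dRAID parity must be 1, 2, or 3")
    # Assumption for this sketch: one full group must fit on the non-spare disks.
    if data < 1 or spares < 0 or n_disks - spares < data + parity:
        raise ValueError("not enough devices for one redundancy group")
    return (n_disks - spares) * data * disk_bytes // (data + parity)


TIB = 1024 ** 4
# 11 disks of 4 TiB, draid2, D=8, one distributed spare: 10 * 8/10 * 4 TiB.
print(draid_usable_bytes(11, 4 * TIB, data=8, parity=2, spares=1) // TIB)  # prints 32
```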

	     A non-default dRAID configuration can be specified	by appending
	     one or more of the following optional arguments to the draid
	     keyword:

	     parity - The parity level (1-3).

	     data - The	number of data devices per redundancy group.  In gen-
	     eral a smaller value of D will increase IOPS, improve the com-
	     pression ratio, and speed up resilvering at the expense of	total
	     usable capacity.  Defaults	to 8, unless N-P-S is less than	8.

	     children -	The expected number of children.  Useful as a cross-
	     check when	listing	a large	number of devices.  An error is	re-
	     turned when the provided number of	children differs.

	     spares - The number of distributed	hot spares.  Defaults to zero.
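
	     The defaults and the children cross-check listed above can be
	     modelled as follows.  This Python sketch is not the ZFS
	     implementation: the function and argument names are hypothetical,
	     and it assumes that when N-P-S is less than 8 the default number
	     of data devices is N-P-S itself, which the text above does not
	     state explicitly.

```python
from typing import Optional


def resolve_draid(listed_children: int, parity: int = 1,
                  data: Optional[int] = None, spares: int = 0,
                  children: Optional[int] = None) -> dict:
    """Fill in dRAID defaults and apply the children cross-check."""
    if parity not in (1, 2, 3):
        raise ValueError("parity must be 1, 2, or 3")
    # children is only a cross-check against the devices actually listed.
    if children is not None and children != listed_children:
        raise ValueError("expected %d children, got %d"
                         % (children, listed_children))
    if data is None:
        # Assumption: below 8, the default is N - P - S.
        data = min(8, listed_children - parity - spares)
    if data < 1:
        raise ValueError("no room for data devices")
    return {"parity": parity, "data": data,
            "children": listed_children, "spares": spares}


print(resolve_draid(20, parity=2)["data"])           # prints 8
print(resolve_draid(6, parity=1, spares=1)["data"])  # prints 4
```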

     spare   A pseudo-vdev which keeps track of	available hot spares for a
	     pool.  For	more information, see the Hot Spares section.

     log     A separate	intent log device.  If more than one log device	is
	     specified,	then writes are	load-balanced between devices.	Log
	     devices can be mirrored.  However,	raidz vdev types are not sup-
	     ported for	the intent log.	 For more information, see the Intent
	     Log section.

     dedup   A device dedicated	solely for deduplication tables.  The redun-
	     dancy of this device should match the redundancy of the other
	     normal devices in the pool. If more than one dedup	device is
	     specified, then allocations are load-balanced between those
	     devices.

     special
	     A device dedicated	solely for allocating various kinds of inter-
	     nal metadata, and optionally small	file blocks.  The redundancy
	     of	this device should match the redundancy	of the other normal
	     devices in	the pool. If more than one special device is speci-
	     fied, then	allocations are	load-balanced between those devices.

	     For more information on special allocations, see the Special
	     Allocation	Class section.

     cache   A device used to cache storage pool data.	A cache	device cannot
	     be	configured as a	mirror or raidz	group.	For more information,
	     see the Cache Devices section.

     Virtual devices cannot be nested, so a mirror or raidz virtual device can
     only contain files	or disks.  Mirrors of mirrors (or other	combinations)
     are not allowed.

     A pool can	have any number	of virtual devices at the top of the configu-
     ration (known as "root vdevs").  Data is dynamically distributed across
     all top-level devices to balance data among devices.  As new virtual
     devices are added, ZFS automatically places data on the newly available
     devices.

     Virtual devices are specified one at a time on the	command	line, sepa-
     rated by whitespace.  The keywords	mirror and raidz are used to distin-
     guish where a group ends and another begins.  For example,	the following
     creates two root vdevs, each a mirror of two disks:

     # zpool create mypool mirror sda sdb mirror sdc sdd
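
     The grouping rule above can be illustrated by assembling the argument
     list in Python.  The helper below is hypothetical; it only builds the
     arguments and does not run zpool.

```python
from typing import List, Sequence, Tuple


def zpool_create_argv(pool: str,
                      vdevs: Sequence[Tuple[str, Sequence[str]]]) -> List[str]:
    """Build 'zpool create' arguments from (keyword, devices) groups."""
    argv = ["zpool", "create", pool]
    for keyword, devices in vdevs:
        # Each keyword ends the previous group and starts a new one.
        argv.append(keyword)
        argv.extend(devices)
    return argv


argv = zpool_create_argv("mypool", [("mirror", ["sda", "sdb"]),
                                    ("mirror", ["sdc", "sdd"])])
print(" ".join(argv))
# prints zpool create mypool mirror sda sdb mirror sdc sdd
```

     Because a group extends until the next keyword, the second mirror
     keyword is what separates sdc and sdd from the first mirror.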

   Device Failure and Recovery
     ZFS supports a rich set of	mechanisms for handling	device failure and
     data corruption.  All metadata and data are checksummed, and ZFS
     automatically repairs bad data from a good copy when corruption is
     detected.

     In	order to take advantage	of these features, a pool must make use	of
     some form of redundancy, using either mirrored or raidz groups.  While
     ZFS supports running in a non-redundant configuration, where each root
     vdev is simply a disk or file, this is strongly discouraged.  A single
     case of bit corruption can	render some or all of your data	unavailable.

     A pool's health status is described by one	of three states: online, de-
     graded, or	faulted.  An online pool has all devices operating normally.
     A degraded	pool is	one in which one or more devices have failed, but the
     data is still available due to a redundant	configuration.	A faulted pool
     has corrupted metadata, or	one or more faulted devices, and insufficient
     replicas to continue functioning.
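
     The relationship between redundancy and the three pool states can be
     sketched as a small model.  This Python example is a simplification
     that counts only failed devices per top-level vdev; the function names
     are hypothetical and ZFS itself considers more conditions.

```python
PARITY = {"raidz": 1, "raidz1": 1, "raidz2": 2, "raidz3": 3}
SEVERITY = ["online", "degraded", "faulted"]


def tolerated_failures(kind: str, n_devices: int) -> int:
    """Device failures a vdev survives without losing data."""
    if kind == "mirror":
        return n_devices - 1
    return PARITY.get(kind, 0)  # a plain disk or file tolerates none


def vdev_state(kind: str, n_devices: int, failed: int) -> str:
    if failed == 0:
        return "online"
    if failed <= tolerated_failures(kind, n_devices):
        return "degraded"  # data still available through redundancy
    return "faulted"       # insufficient replicas


def pool_state(vdevs) -> str:
    """The pool is as healthy as its least healthy top-level vdev."""
    return max((vdev_state(*v) for v in vdevs), key=SEVERITY.index)


print(pool_state([("mirror", 2, 1), ("raidz2", 6, 0)]))  # prints degraded
print(pool_state([("disk", 1, 1), ("mirror", 2, 0)]))    # prints faulted
```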

     The health of a top-level vdev, such as a mirror or raidz device, is
     potentially impacted by the state of its associated vdevs, or component
     devices.  A top-level vdev or component device is in one of the
     following states:

     DEGRADED  One or more top-level vdevs is in the degraded state because
	       one or more component devices are offline.  Sufficient replicas
	       exist to	continue functioning.

	       One or more component devices is	in the degraded	or faulted
	       state, but sufficient replicas exist to continue	functioning.
	       The underlying conditions are as	follows:

	       +o   The number of checksum errors exceeds acceptable levels and
		   the device is degraded as an	indication that	something may
		   be wrong.  ZFS continues to use the device as necessary.

	       +o   The number of I/O errors exceeds acceptable levels.	The
		   device could	not be marked as faulted because there are in-
		   sufficient replicas to continue functioning.

     FAULTED   One or more top-level vdevs is in the faulted state because one
	       or more component devices are offline.  Insufficient replicas
	       exist to	continue functioning.

	       One or more component devices is	in the faulted state, and in-
	       sufficient replicas exist to continue functioning.  The under-
	       lying conditions	are as follows:

	       +o   The device could be opened, but the contents	did not	match
		   expected values.

	       +o   The number of I/O errors exceeds acceptable levels and the
		   device is faulted to	prevent	further	use of the device.

     OFFLINE   The device was explicitly taken offline by the zpool offline
	       command.

     ONLINE    The device is online and	functioning.

     REMOVED   The device was physically removed while the system was running.
	       Device removal detection	is hardware-dependent and may not be
	       supported on all	platforms.

     UNAVAIL   The device could	not be opened.	If a pool is imported when a
	       device was unavailable, then the	device will be identified by a
	       unique identifier instead of its	path since the path was	never
	       correct in the first place.

     If	a device is removed and	later re-attached to the system, ZFS attempts
     to	put the	device online automatically.  Device attach detection is hard-
     ware-dependent and	might not be supported on all platforms.

   Hot Spares
     ZFS allows	devices	to be associated with pools as "hot spares".  These
     devices are not actively used in the pool,	but when an active device
     fails, it is automatically	replaced by a hot spare.  To create a pool
     with hot spares, specify a spare vdev with any number of devices.  For
     example:

     # zpool create pool mirror	sda sdb	spare sdc sdd

     Spares can	be shared across multiple pools, and can be added with the
     zpool add command and removed with	the zpool remove command.  Once	a
     spare replacement is initiated, a new spare vdev is created within	the
     configuration that	will remain there until	the original device is re-
     placed.  At this point, the hot spare becomes available again if another
     device fails.

     If a pool has a shared spare that is currently being used, the pool
     cannot be exported, since other pools may use this shared spare, which
     may lead to data corruption.

     Shared spares add some risk.  If the pools	are imported on	different
     hosts, and	both pools suffer a device failure at the same time, both
     could attempt to use the spare at the same	time.  This may	not be de-
     tected, resulting in data corruption.

     An	in-progress spare replacement can be cancelled by detaching the	hot
     spare.  If	the original faulted device is detached, then the hot spare
     assumes its place in the configuration, and is removed from the spare
     list of all active	pools.

     The draid vdev type provides distributed hot spares.  These hot spares
     are named after the dRAID vdev they are a part of (for example,
     draid1-2-3 specifies spare 3 of vdev 2, which is a single-parity dRAID)
     and may only be used by that dRAID vdev.  Otherwise, they behave the
     same as normal hot spares.
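
     The naming example above suggests a pattern of parity level, vdev
     number, and spare number.  The Python sketch below parses names of that
     shape; the pattern is inferred from the single example given, and the
     class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class DraidSpare(NamedTuple):
    parity: int
    vdev: int
    spare: int


# Inferred shape: draid<parity>-<vdev>-<spare>
_SPARE_NAME = re.compile(r"draid([1-3])-(\d+)-(\d+)")


def parse_draid_spare(name: str) -> Optional[DraidSpare]:
    """Return the fields of a distributed spare name, or None if no match."""
    match = _SPARE_NAME.fullmatch(name)
    if match is None:
        return None
    parity, vdev, spare = (int(group) for group in match.groups())
    return DraidSpare(parity, vdev, spare)


print(parse_draid_spare("draid1-2-3"))
# prints DraidSpare(parity=1, vdev=2, spare=3)
```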

     Spares cannot replace log devices.

   Intent Log
     The ZFS Intent Log	(ZIL) satisfies	POSIX requirements for synchronous
     transactions.  For	instance, databases often require their	transactions
     to	be on stable storage devices when returning from a system call.	 NFS
     and other applications can	also use fsync(2) to ensure data stability.
     By	default, the intent log	is allocated from blocks within	the main pool.
     However, it might be possible to get better performance using separate
     intent log	devices	such as	NVRAM or a dedicated disk.  For	example:

     # zpool create pool sda sdb log sdc
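
     For context, this is the kind of synchronous write that the intent log
     services.  The Python sketch below is an application-side illustration
     only: it writes a record and calls fsync(2) through os.fsync before
     returning.  The function and file names are hypothetical.

```python
import os
import tempfile


def durable_write(path: str, payload: bytes) -> None:
    """Write payload and return only after fsync(2) has completed."""
    with open(path, "wb") as handle:
        handle.write(payload)
        handle.flush()             # push Python's buffer to the kernel
        os.fsync(handle.fileno())  # ask the OS to commit to stable storage


with tempfile.TemporaryDirectory() as tmp:
    record = os.path.join(tmp, "journal.bin")
    durable_write(record, b"commit 42\n")
    print(os.path.getsize(record))  # prints 10
```

     Workloads that issue many such calls are the ones most likely to
     benefit from a separate log device.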

     Multiple log devices can also be specified, and they can be mirrored.
     See the EXAMPLES section for an example of mirroring multiple log
     devices.

     Log devices can be	added, replaced, attached, detached and	removed.  In
     addition, log devices are imported	and exported as	part of	the pool that
     contains them.  Mirrored devices can be removed by	specifying the top-
     level mirror vdev.

   Cache Devices
     Devices can be added to a storage pool as "cache devices".	 These devices
     provide an	additional layer of caching between main memory	and disk.  For
     read-heavy	workloads, where the working set size is much larger than what
     can be cached in main memory, using cache devices allows much more of
     this working set to be served from low-latency media.  Using cache
     devices provides the greatest performance improvement for random-read
     workloads of mostly static content.

     To	create a pool with cache devices, specify a cache vdev with any	number
     of	devices.  For example:

     # zpool create pool sda sdb cache sdc sdd

     Cache devices cannot be mirrored or part of a raidz configuration.	 If a
     read error	is encountered on a cache device, that read I/O	is reissued to
     the original storage pool device, which might be part of a	mirrored or
     raidz configuration.

     The content of the cache devices is persistent across reboots and is
     restored asynchronously into L2ARC when the pool is imported (persistent
     L2ARC).  This can be disabled by setting l2arc_rebuild_enabled = 0.  For
     cache devices smaller than 1GB, the metadata structures required for
     rebuilding the L2ARC are not written, in order not to waste space.  This
     can be changed with l2arc_rebuild_blocks_min_l2size.  The cache device
     header (512 bytes) is updated even if no metadata structures are
     written.  Setting l2arc_headroom = 0 will result in scanning the
     full-length ARC lists for cacheable content to be written in L2ARC
     (persistent ARC).

     If a cache device is added with zpool add, its label and header will be
     overwritten and its contents will not be restored in L2ARC, even if the
     device was previously part of the pool.  If a cache device is onlined
     with zpool online, its contents will be restored in L2ARC.  This is
     useful in case of memory pressure, where the contents of the cache
     device are not fully restored in L2ARC.  The user can offline and then
     online the cache device when there is less memory pressure in order to
     fully restore its contents to L2ARC.

   Pool	checkpoint
     Before starting critical procedures that include destructive actions
     (e.g., zfs destroy), an administrator can checkpoint the pool's state
     and, in the case of a mistake or failure, rewind the entire pool back to
     the checkpoint.  Otherwise, the checkpoint can be discarded when the
     procedure has completed successfully.

     A pool checkpoint can be thought of as a pool-wide snapshot and should be
     used with care, as it contains every part of the pool's state, from
     properties to vdev configuration.  Thus, while a pool has a checkpoint,
     certain operations are not allowed: specifically, vdev removal, attach,
     and detach; mirror splitting; and changing the pool's guid.  Adding a
     new vdev is supported, but in the case of a rewind it will have to be
     added again.  Finally, users of this feature should keep in mind that
     scrubs in a pool that has a checkpoint do not repair checkpointed data.
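
     The restrictions above can be summarized as a lookup.  In this Python
     sketch the mapping of each restricted operation to a zpool subcommand
     name (remove, attach, detach, split, reguid) is an assumption made for
     illustration, and the helper is hypothetical.

```python
# Assumed subcommand names for the operations listed in the text.
BLOCKED_WHILE_CHECKPOINTED = frozenset(
    {"remove", "attach", "detach", "split", "reguid"}
)


def allowed_with_checkpoint(subcommand: str, has_checkpoint: bool) -> bool:
    """True if the operation is permitted given the pool's checkpoint state."""
    return not (has_checkpoint and subcommand in BLOCKED_WHILE_CHECKPOINTED)


print(allowed_with_checkpoint("detach", has_checkpoint=True))  # prints False
print(allowed_with_checkpoint("add", has_checkpoint=True))     # prints True
```

     Note that a vdev added while a checkpoint exists is allowed, but it is
     not part of the checkpointed state and must be added again after a
     rewind.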

     To	create a checkpoint for	a pool:

     # zpool checkpoint	pool

     To later rewind the pool to its checkpointed state, first export the pool
     and then rewind it during import:

     # zpool export pool
     # zpool import --rewind-to-checkpoint pool

     To	discard	the checkpoint from a pool:

     # zpool checkpoint	-d pool

     Dataset reservations (controlled by the reservation or refreservation zfs
     properties) may be	unenforceable while a checkpoint exists, because the
     checkpoint	is allowed to consume the dataset's reservation.  Finally,
     data that is part of the checkpoint but has been freed in the current
     state of the pool won't be	scanned	during a scrub.

   Special Allocation Class
     The allocations in	the special class are dedicated	to specific block
     types.  By	default	this includes all metadata, the	indirect blocks	of
     user data,	and any	deduplication tables.  The class can also be provi-
     sioned to accept small file blocks.

     A pool must always	have at	least one normal (non-dedup/special) vdev be-
     fore other	devices	can be assigned	to the special class. If the special
     class becomes full, then allocations intended for it will spill back into
     the normal	class.

     Deduplication tables can be excluded from the special class by setting
     the zfs_ddt_data_is_special zfs module parameter to false (0).

     Inclusion of small	file blocks in the special class is opt-in. Each
     dataset can control the size of small file	blocks allowed in the special
     class by setting the special_small_blocks dataset property. It defaults
     to zero, so you must opt in by setting it to a non-zero value.  See
     zfs(8) for more information on setting this property.
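
     The rules in this section can be read as a decision procedure.  The
     Python sketch below is a simplified model, not the ZFS allocator: the
     function and the block kind labels are hypothetical, it assumes the
     pool has no separate dedup vdev, and it assumes that blocks of size
     less than or equal to special_small_blocks qualify as small.

```python
def allocation_class(block_kind: str, block_bytes: int = 0,
                     special_small_blocks: int = 0,
                     ddt_data_is_special: bool = True,
                     special_full: bool = False) -> str:
    """Return 'special' or 'normal' for a block, per the rules above."""
    if special_full:
        return "normal"  # a full special class spills back to normal
    if block_kind in ("metadata", "indirect"):
        return "special"
    if block_kind == "dedup_table":
        # Models zfs_ddt_data_is_special; 0 excludes dedup tables.
        return "special" if ddt_data_is_special else "normal"
    if block_kind == "file_data":
        # Opt-in: zero (the default) sends no file blocks to special.
        if special_small_blocks > 0 and block_bytes <= special_small_blocks:
            return "special"
    return "normal"


print(allocation_class("metadata"))                                        # prints special
print(allocation_class("file_data", 16384))                                # prints normal
print(allocation_class("file_data", 16384, special_small_blocks=32768))    # prints special
```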

BSD				August 9, 2019				   BSD

