Skip site navigation (1)Skip section navigation (2)

FreeBSD Manual Pages


home | help
GEOM(4)		       FreeBSD Kernel Interfaces Manual		       GEOM(4)

     GEOM -- modular disk I/O request transformation framework

     The GEOM framework	provides an infrastructure in which ``classes''	can
     perform transformations on	disk I/O requests on their path	from the upper
     kernel to the device drivers and back.

     Transformations in	a GEOM context range from the simple geometric dis-
     placement performed in typical disk partitioning modules over RAID	algo-
     rithms and	device multipath resolution to full blown cryptographic	pro-
     tection of	the stored data.

     Compared to traditional ``volume management'', GEOM differs from most and
     in	some cases all previous	implementations	in the following ways:

     +o	 GEOM is extensible.  It is trivially simple to	write a	new class of
	 transformation	and it will not	be given stepchild treatment.  If
	 someone for some reason wanted	to mount IBM MVS diskpacks, a class
	 recognizing and configuring their VTOC	information would be a trivial

     +o	 GEOM is topologically agnostic.  Most volume management implementa-
	 tions have very strict	notions	of how classes can fit together, very
	 often one fixed hierarchy is provided,	for instance, subdisk -	plex -

     Being extensible means that new transformations are treated no differ-
     ently than	existing transformations.

     Fixed hierarchies are bad because they make it impossible to express the
     intent efficiently.  In the fixed hierarchy above,	it is not possible to
     mirror two	physical disks and then	partition the mirror into subdisks,
     instead one is forced to make subdisks on the physical volumes and	to
     mirror these two and two, resulting in a much more	complex	configuration.
     GEOM on the other hand does not care in which order things	are done, the
     only restriction is that cycles in	the graph will not be allowed.

     GEOM is quite object oriented and consequently the	terminology borrows a
     lot of context and	semantics from the OO vocabulary:

     A ``class'', represented by the data structure g_class implements one
     particular	kind of	transformation.	 Typical examples are MBR disk parti-
     tion, BSD disklabel, and RAID5 classes.

     An	instance of a class is called a	``geom'' and represented by the	data
     structure g_geom.	In a typical i386 FreeBSD system, there	will be	one
     geom of class MBR for each	disk.

     A ``provider'', represented by the	data structure g_provider, is the
     front gate	at which a geom	offers service.	 A provider is ``a disk-like
     thing which appears in /dev'' - a logical disk in other words.  All
     providers have three main properties: ``name'', ``sectorsize'' and

     A ``consumer'' is the backdoor through which a geom connects to another
     geom provider and through which I/O requests are sent.

     The topological relationship between these	entities are as	follows:

     +o	 A class has zero or more geom instances.

     +o	 A geom	has exactly one	class it is derived from.

     +o	 A geom	has zero or more consumers.

     +o	 A geom	has zero or more providers.

     +o	 A consumer can	be attached to zero or one providers.

     +o	 A provider can	have zero or more consumers attached.

     All geoms have a rank-number assigned, which is used to detect and	pre-
     vent loops	in the acyclic directed	graph.	This rank number is assigned
     as	follows:

     1.	  A geom with no attached consumers has	rank=1.

     2.	  A geom with attached consumers has a rank one	higher than the	high-
	  est rank of the geoms	of the providers its consumers are attached

     In	addition to the	straightforward	attach,	which attaches a consumer to a
     provider, and detach, which breaks	the bond, a number of special topolog-
     ical maneuvers exists to facilitate configuration and to improve the
     overall flexibility.

     TASTING is	a process that happens whenever	a new class or new provider is
     created, and it provides the class	a chance to automatically configure an
     instance on providers which it recognizes as its own.  A typical example
     is	the MBR	disk-partition class which will	look for the MBR table in the
     first sector and, if found	and validated, will instantiate	a geom to mul-
     tiplex according to the contents of the MBR.

     A new class will be offered to all	existing providers in turn and a new
     provider will be offered to all classes in	turn.

     Exactly what a class does to recognize if it should accept	the offered
     provider is not defined by	GEOM, but the sensible set of options are:

     +o	 Examine specific data structures on the disk.

     +o	 Examine properties like ``sectorsize''	or ``mediasize'' for the

     +o	 Examine the rank number of the	provider's geom.

     +o	 Examine the method name of the	provider's geom.

     ORPHANIZATION is the process by which a provider is removed while it
     potentially is still being	used.

     When a geom orphans a provider, all future	I/O requests will ``bounce''
     on	the provider with an error code	set by the geom.  Any consumers
     attached to the provider will receive notification	about the orphaniza-
     tion when the event loop gets around to it, and they can take appropriate
     action at that time.

     A geom which came into being as a result of a normal taste	operation
     should self-destruct unless it has	a way to keep functioning whilst lack-
     ing the orphaned provider.	 Geoms like disk slicers should	therefore
     self-destruct whereas RAID5 or mirror geoms will be able to continue as
     long as they do not lose quorum.

     When a provider is	orphaned, this does not	necessarily result in any
     immediate change in the topology: any attached consumers are still
     attached, any opened paths	are still open,	any outstanding	I/O requests
     are still outstanding.

     The typical scenario is:

	   +o   A device	driver detects a disk has departed and orphans the
	       provider	for it.
	   +o   The geoms on top	of the disk receive the	orphanization event
	       and orphan all their providers in turn.	Providers which	are
	       not attached to will typically self-destruct right away.	 This
	       process continues in a quasi-recursive fashion until all	rele-
	       vant pieces of the tree have heard the bad news.
	   +o   Eventually the buck stops when it reaches geom_dev at the top
	       of the stack.
	   +o   Geom_dev	will call destroy_dev(9) to stop any more requests
	       from coming in.	It will	sleep until any	and all	outstanding
	       I/O requests have been returned.	 It will explicitly close
	       (i.e.: zero the access counts), a change	which will propagate
	       all the way down	through	the mesh.  It will then	detach and
	       destroy its geom.
	   +o   The geom	whose provider is now detached will destroy the
	       provider, detach	and destroy its	consumer and destroy its geom.
	   +o   This process percolates all the way down	through	the mesh,
	       until the cleanup is complete.

     While this	approach seems byzantine, it does provide the maximum flexi-
     bility and	robustness in handling disappearing devices.

     The one absolutely	crucial	detail to be aware of is that if the device
     driver does not return all	I/O requests, the tree will not	unravel.

     SPOILING is a special case	of orphanization used to protect against stale
     metadata.	It is probably easiest to understand spoiling by going through
     an	example.

     Imagine a disk, da0, on top of which an MBR geom provides da0s1 and
     da0s2, and	on top of da0s1	a BSD geom provides da0s1a through da0s1e, and
     that both the MBR and BSD geoms have autoconfigured based on data struc-
     tures on the disk media.  Now imagine the case where da0 is opened	for
     writing and those data structures are modified or overwritten: now	the
     geoms would be operating on stale metadata	unless some notification sys-
     tem can inform them otherwise.

     To	avoid this situation, when the open of da0 for write happens, all
     attached consumers	are told about this and	geoms like MBR and BSD will
     self-destruct as a	result.	 When da0 is closed, it	will be	offered	for
     tasting again and,	if the data structures for MBR and BSD are still
     there, new	geoms will instantiate themselves anew.

     Now for the fine print:

     If	any of the paths through the MBR or BSD	module were open, they would
     have opened downwards with	an exclusive bit thus rendering	it impossible
     to	open da0 for writing in	that case.  Conversely,	the requested exclu-
     sive bit would render it impossible to open a path	through	the MBR	geom
     while da0 is open for writing.

     From this it also follows that changing the size of open geoms can	only
     be	done with their	cooperation.

     Finally: the spoiling only	happens	when the write count goes from zero to
     non-zero and the retasting	happens	only when the write count goes from
     non-zero to zero.

     INSERT/DELETE are very special operations which allow a new geom to be
     instantiated between a consumer and a provider attached to	each other and
     to	remove it again.

     To	understand the utility of this,	imagine	a provider being mounted as a
     file system.  Between the DEVFS geom's consumer and its provider we
     insert a mirror module which configures itself with one mirror copy and
     consequently is transparent to the	I/O requests on	the path.  We can now
     configure yet a mirror copy on the	mirror geom, request a synchroniza-
     tion, and finally drop the	first mirror copy.  We have now, in essence,
     moved a mounted file system from one disk to another while	it was being
     used.  At this point the mirror geom can be deleted from the path again;
     it	has served its purpose.

     CONFIGURE is the process where the	administrator issues instructions for
     a particular class	to instantiate itself.	There are multiple ways	to
     express intent in this case - a particular	provider may be	specified with
     a level of	override forcing, for instance,	a BSD disklabel	module to
     attach to a provider which	was not	found palatable	during the TASTE oper-

     Finally, I/O is the reason	we even	do this: it concerns itself with send-
     ing I/O requests through the graph.

     I/O REQUESTS, represented by struct bio, originate	at a consumer, are
     scheduled on its attached provider	and, when processed, are returned to
     the consumer.  It is important to realize that the	struct bio which
     enters through the	provider of a particular geom does not ``come out on
     the other side''.	Even simple transformations like MBR and BSD will
     clone the struct bio, modify the clone, and schedule the clone on their
     own consumer.  Note that cloning the struct bio does not involve cloning
     the actual	data area specified in the I/O request.

     In	total, four different I/O requests exist in GEOM: read,	write, delete,
     and ``get attribute''.

     Read and write are	self explanatory.

     Delete indicates that a certain range of data is no longer	used and that
     it	can be erased or freed as the underlying technology supports.  Tech-
     nologies like flash adaptation layers can arrange to erase	the relevant
     blocks before they	will become reassigned and cryptographic devices may
     want to fill random bits into the range to	reduce the amount of data
     available for attack.

     It	is important to	recognize that a delete	indication is not a request
     and consequently there is no guarantee that the data actually will	be
     erased or made unavailable	unless guaranteed by specific geoms in the
     graph.  If	``secure delete'' semantics are	required, a geom should	be
     pushed which converts delete indications into (a sequence of) write

     ``Get attribute'' supports	inspection and manipulation of out-of-band
     attributes	on a particular	provider or path.  Attributes are named	by
     ASCII strings and they will be discussed in a separate section below.

     (Stay tuned while the author rests	his brain and fingers: more to come.)

     Several flags are provided	for tracing GEOM operations and	unlocking pro-
     tection mechanisms	via the	kern.geom.debugflags sysctl.  All of these
     flags are off by default, and great care should be	taken in turning them

     0x01 (G_T_TOPOLOGY)
	     Provide tracing of	topology change	events.

     0x02 (G_T_BIO)
	     Provide tracing of	buffer I/O requests.

     0x04 (G_T_ACCESS)
	     Provide tracing of	access check controls.

     0x08 (unused)

     0x10 (allow foot shooting)
	     Allow writing to Rank 1 providers.	 This would, for example,
	     allow the super-user to overwrite the MBR on the root disk	or
	     write random sectors elsewhere to a mounted disk.	The implica-
	     tions are obvious.

     0x40 (G_F_DISKIOCTL)
	     This is unused at this time.

     0x80 (G_F_CTLDUMP)
	     Dump contents of gctl requests.

     libgeom(3), disk(9), DECLARE_GEOM_CLASS(9), g_access(9), g_attach(9),
     g_bio(9), g_consumer(9), g_data(9), g_event(9), g_geom(9),	g_provider(9),

     This software was developed for the FreeBSD Project by Poul-Henning Kamp
     and NAI Labs, the Security	Research Division of Network Associates, Inc.
     under DARPA/SPAWAR	contract N66001-01-C-8035 (``CBOSS''), as part of the
     DARPA CHATS research program.

     The first precursor for GEOM was a	gruesome hack to Minix 1.2 and was
     never distributed.	 An earlier attempt to implement a less	general	scheme
     in	FreeBSD	never succeeded.

     Poul-Henning Kamp <>

FreeBSD	9.1			 May 25, 2006			   FreeBSD 9.1


Want to link to this manual page? Use this URL:

home | help