GEOM(4)			 BSD Kernel Interfaces Manual		       GEOM(4)

     GEOM -- modular disk I/O request transformation framework.

     The GEOM framework	provides an infrastructure in which modules can	per-
     form transformations on disk I/O requests on their	path from the upper
     kernel to the device drivers and back.

     Transformations in a GEOM context range from the simple geometric
     displacement performed in typical disklabel modules, through RAID
     algorithms and device multipath resolution, to full-blown cryptographic
     protection of the stored data.

     Compared to traditional "volume management", GEOM differs from most and
     in	some cases all previous	implementations	in the following ways:

     +o  GEOM is extensible.  It is trivially simple to write a new class of
         transformation, and it will not be given stepchild treatment.  If
         someone for some reason wanted to mount IBM MVS diskpacks, a class
         recognizing and configuring their VTOC information would be a
         trivial matter.

     +o  GEOM is topologically agnostic.  Most volume management
         implementations have very strict notions of how classes can fit
         together; very often only one fixed hierarchy is provided, for
         instance subdisk - plex -

     Being extensible means that new transformations are treated no differ-
     ently than	existing transformations.

     Fixed hierarchies are bad because they make it impossible to express
     the intent efficiently.  In the fixed hierarchy above it is not
     possible to mirror two physical disks and then partition the mirror
     into subdisks; instead one is forced to make subdisks on the physical
     volumes and to mirror these two and two, resulting in a much more
     complex configuration.  GEOM, on the other hand, does not care in which
     order things are done; the only restriction is that cycles in the graph
     will not be allowed.

     GEOM is quite object-oriented, and consequently the terminology borrows
     a lot of context and semantics from the OO vocabulary:

     A "class", represented by the data structure "g_class", implements one
     particular kind of transformation.  Typical examples are MBR disk
     partition, BSD disklabel, and RAID5 classes.

     An instance of a class is called a "geom" and is represented by the
     data structure "g_geom".  In a typical i386 FreeBSD system, there will
     be one geom of class MBR for each disk.

     A "provider", represented by the data structure "g_provider", is the
     front gate	at which a geom	offers service.	 A provider is "a disk-like
     thing which appears in /dev" - a logical disk in other words.  All
     providers have three main properties: name, sectorsize and	size.

     A "consumer" is the backdoor through which a geom connects to another
     geom's provider and through which I/O requests are sent.

     The topological relationships between these entities are as follows:

     +o	 A class has zero or more geom instances.

     +o	 A geom	has exactly one	class it is derived from.

     +o	 A geom	has zero or more consumers.

     +o	 A geom	has zero or more providers.

     +o	 A consumer can	be attached to zero or one providers.

     +o	 A provider can	have zero or more consumers attached.
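
     The relationships listed above can be sketched as a small data model.
     This is an illustrative Python sketch only, not the kernel's C data
     structures: the names GClass, Geom, Provider and Consumer, and all of
     their fields and methods, are assumptions chosen to mirror the
     terminology of this page.

```python
class GClass:
    """One kind of transformation, for example MBR (hypothetical model)."""
    def __init__(self, name):
        self.name = name
        self.geoms = []            # a class has zero or more geom instances


class Geom:
    """An instance of a class."""
    def __init__(self, gclass, name):
        self.gclass = gclass       # a geom has exactly one class
        self.name = name
        self.consumers = []        # a geom has zero or more consumers
        self.providers = []        # a geom has zero or more providers
        gclass.geoms.append(self)


class Provider:
    """The front gate of a geom: name, sectorsize and size."""
    def __init__(self, geom, name, sectorsize, size):
        self.geom = geom
        self.name = name
        self.sectorsize = sectorsize
        self.size = size
        self.consumers = []        # zero or more consumers may be attached
        geom.providers.append(self)


class Consumer:
    """The backdoor of a geom."""
    def __init__(self, geom):
        self.geom = geom
        self.provider = None       # attached to zero or one provider
        geom.consumers.append(self)

    def attach(self, provider):
        if self.provider is not None:
            raise ValueError("consumer is already attached")
        self.provider = provider
        provider.consumers.append(self)

    def detach(self):
        if self.provider is not None:
            self.provider.consumers.remove(self)
            self.provider = None


# A disk geom offering "da0", with an MBR geom attached on top of it.
disk_class, mbr_class = GClass("DISK"), GClass("MBR")
disk = Geom(disk_class, "da0")
da0 = Provider(disk, "da0", sectorsize=512, size=512 * 1000)
mbr = Geom(mbr_class, "da0")
c = Consumer(mbr)
c.attach(da0)
```

     In this sketch the "zero or one" rule for consumers is enforced in
     attach(), while a provider simply keeps a list of whatever consumers
     are attached to it.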

     All geoms have a rank number assigned, which is used to detect and
     prevent loops in the acyclic directed graph.  This rank number is
     assigned as follows:

     1.   A geom with no attached consumers has rank 1.

     2.   A geom with attached consumers has a rank one higher than the
          highest rank of the geoms of the providers its consumers are
          attached to.
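
     The two ranking rules can be written down directly.  The following is
     a minimal Python sketch under stated assumptions: the graph is given
     as a dictionary (a hypothetical representation) mapping each geom name
     to the names of the geoms whose providers its consumers are attached
     to, and would_loop() is one plausible way to refuse an attach that
     closes a cycle, not necessarily how the kernel does it.

```python
def rank(geom, attached):
    """Return the rank of `geom` in an acyclic graph.

    `attached` maps a geom name to the list of geoms below it
    (hypothetical representation, not a kernel structure).
    """
    below = attached.get(geom, [])
    if not below:
        return 1                      # rule 1: no attached consumers
    # rule 2: one higher than the highest rank found below
    return 1 + max(rank(g, attached) for g in below)


def would_loop(upper, lower, attached):
    """True if attaching a consumer of `upper` to a provider of `lower`
    would close a cycle, i.e. `upper` is already reachable below `lower`."""
    stack, seen = [lower], set()
    while stack:
        g = stack.pop()
        if g == upper:
            return True
        if g not in seen:
            seen.add(g)
            stack.extend(attached.get(g, []))
    return False


attached = {
    "da0": [], "da1": [],
    "mbr": ["da0"],
    "bsd": ["mbr"],
    "mirror": ["da0", "da1"],
}
ranks = {g: rank(g, attached) for g in attached}
```

     With this example graph the disks get rank 1, the MBR and mirror geoms
     rank 2, and the BSD geom rank 3.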

     In addition to the straightforward attach, which attaches a consumer to
     a provider, and detach, which breaks the bond, a number of special
     topological maneuvers exist to facilitate configuration and to improve
     the overall flexibility.

     TASTING is a process that happens whenever a new class or new provider
     is created; it is the class' chance to automatically configure an
     instance on providers that it recognizes as its own.  A typical example
     is the MBR disk-partition class, which will look for the MBR table in
     the first sector and, if it is found and validated, will instantiate a
     geom to multiplex according to the contents of the MBR.

     A new class will be offered to all	existing providers in turn and a new
     provider will be offered to all classes in	turn.

     Exactly what a class does to recognize whether it should accept the
     offered provider is not defined by GEOM, but the sensible set of
     options is:

     +o	 Examine specific data structures on the disk.

     +o	 Examine properties like sectorsize or mediasize for the provider.

     +o	 Examine the rank number of the	provider's geom.

     +o	 Examine the method name of the	provider's geom.
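
     The offering of classes to providers described above can be sketched
     as follows.  This is a simplified Python illustration, not the kernel
     interface: the Registry class, its methods, and the mbr_taste()
     function are hypothetical names, and a provider is reduced to the bytes
     of its first sector.  The only check performed is the first option
     above (examining data on the disk), using the 0x55 0xAA boot signature
     that ends an MBR sector.

```python
class Registry:
    """Keeps track of classes, providers and the geoms tasting created."""
    def __init__(self):
        self.classes = {}      # class name -> taste function
        self.providers = {}    # provider name -> bytes of first sector
        self.geoms = []        # (class name, provider name) pairs

    def _offer(self, cname, pname):
        # The class decides by itself whether the provider is its own.
        if self.classes[cname](self.providers[pname]):
            self.geoms.append((cname, pname))

    def add_class(self, cname, taste):
        # A new class is offered all existing providers in turn.
        self.classes[cname] = taste
        for pname in list(self.providers):
            self._offer(cname, pname)

    def add_provider(self, pname, sector0):
        # A new provider is offered to all classes in turn.
        self.providers[pname] = sector0
        for cname in list(self.classes):
            self._offer(cname, pname)


def mbr_taste(sector0):
    """Accept a provider whose first sector ends with the MBR signature."""
    return len(sector0) >= 512 and sector0[510:512] == b"\x55\xaa"


reg = Registry()
reg.add_provider("da0", bytes(510) + b"\x55\xaa")   # has an MBR signature
reg.add_class("MBR", mbr_taste)                      # tastes da0, accepts
reg.add_provider("da1", bytes(512))                  # blank disk, rejected
```

     A real class would validate far more than the signature before
     instantiating a geom; the point here is only the two directions in
     which offers are made.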

     ORPHANIZATION is the process by which a provider is removed while it po-
     tentially is still	being used.

     When a geom makes a provider an orphan, all future	I/O requests will
     "bounce" on the provider with an error code set by	the geom.  Any con-
     sumers attached to	the provider will receive notification about the orph-
     anization and need	to take	appropriate action.

     A geom which came into being as a result of a normal taste operation
     should self-destruct unless it has a way to keep functioning.  Geoms
     like disklabels and stripes should therefore self-destruct, whereas
     RAID5 or mirror geoms can continue to function as long as they do not
     lose quorum.

     When a provider is orphaned, this does not result in any immediate
     change in the topology: any attached consumers are still attached, and
     any opened paths are still open.  It is the responsibility of the geoms
     above to close and detach as soon as this can happen.

     The typical scenario is that a device driver notices a disk has gone
     and orphans the provider for it.  The geoms on top receive the
     orphanization event and orphan all their providers in turn.  Providers
     which are not attached to are destroyed right away.  Eventually, at the
     top level, the geom which interfaces to DEVFS receives an orphan event
     on its consumer; it calls destroy_dev(9), does an explicit close if the
     device was open, and then detaches its consumer.  The provider below is
     now no longer attached to and can be destroyed.  If the geom has no
     more providers, it can detach its consumer and self-destruct, and so
     the carnage passes back down the tree until the original provider is
     detached from and can be destroyed by the geom serving the device
     driver.

     While this	approach seems byzantine, it does provide the maximum flexi-
     bility in handling	disappearing devices.
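
     The order of events in the typical scenario can be traced with a toy
     model.  The Python sketch below is a deliberate simplification: each
     geom is reduced to a single provider stacked on at most one provider
     below, the propagation is a plain recursive call rather than queued
     events, and the names Node and orphan() are hypothetical.

```python
class Node:
    """A geom with one provider, stacked on at most one node below."""
    def __init__(self, name, below=None):
        self.name = name
        self.below = below
        self.above = []                  # nodes attached to our provider
        if below is not None:
            below.above.append(self)


def orphan(node, log):
    """Orphan the provider of `node` and record the resulting events."""
    log.append("orphan " + node.name)
    for upper in list(node.above):
        orphan(upper, log)               # geoms on top orphan theirs in turn
        node.above.remove(upper)         # the upper geom detaches its consumer
    # Nothing is attached any more, so the provider can be destroyed.
    log.append("destroy " + node.name)


da0 = Node("da0")
da0s1 = Node("da0s1", below=da0)
da0s1a = Node("da0s1a", below=da0s1)

log = []
orphan(da0, log)
```

     The log shows orphan events travelling up the stack first and
     destruction travelling back down afterwards, ending with the provider
     that was orphaned originally.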

     SPOILING is a special case	of orphanization used to protect against stale
     metadata.	It is probably easiest to understand spoiling by going through
     an	example.

     Imagine a disk, "da0", on top of which an MBR geom provides "da0s1" and
     "da0s2", and on top of "da0s1" a BSD geom provides "da0s1a" through
     "da0s1e".  Both the MBR and BSD geoms have autoconfigured based on data
     structures on the disk media.  Now imagine the case where "da0" is
     opened for writing and those data structures are modified or
     overwritten: the geoms would now be operating on stale metadata unless
     some notification system can inform them otherwise.  To avoid this
     situation, when the open of "da0" for write happens, all attached
     consumers are told about this, and geoms like MBR and BSD will
     self-destruct as a result.  When "da0" is closed again, it will be
     offered for tasting again, and if the data structures for MBR and BSD
     are still there, new geoms will instantiate themselves anew.

     Now for the fine print:

     If any of the paths through the MBR or BSD module were open, they would
     have opened downwards with an exclusive bit, rendering it impossible to
     open "da0" for writing in that case.  Conversely, the requested
     exclusive bit would render it impossible to open a path through the MBR
     geom while "da0" is open for writing.

     From this it also follows that changing the size of open geoms can	only
     be	done through their cooperation.

     Finally: the spoiling only	happens	when the write count goes from zero to
     non-zero and the retasting	only when the write count goes back to zero.
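
     The write count rule can be illustrated with a small counter.  This
     Python sketch is an assumption-laden illustration, not kernel code:
     the Provider class, its open_write() and close_write() methods, and
     the events list are hypothetical names used only to show when spoiling
     and retasting would be triggered.

```python
class Provider:
    """Tracks the write count of a provider and the events it triggers."""
    def __init__(self, name):
        self.name = name
        self.write_count = 0
        self.events = []                    # record of "spoil" / "retaste"

    def open_write(self):
        self.write_count += 1
        if self.write_count == 1:           # zero -> non-zero: spoil
            self.events.append("spoil")

    def close_write(self):
        if self.write_count == 0:
            raise ValueError("provider is not open for writing")
        self.write_count -= 1
        if self.write_count == 0:           # back to zero: retaste
            self.events.append("retaste")


da0 = Provider("da0")
da0.open_write()     # first writer: attached consumers are spoiled
da0.open_write()     # second writer: nothing further happens
da0.close_write()    # one writer remains: nothing happens
da0.close_write()    # last writer gone: provider is offered for tasting
```

     Only the first open for writing and the last close produce an event;
     the intermediate transitions are silent.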

     INSERT/DELETE is a very special operation which allows a new geom to be
     instantiated between a consumer and a provider attached to each other,
     and to be removed again.

     To understand the utility of this, imagine a provider being mounted as
     a file system.  Between the DEVFS geom's consumer and its provider we
     insert a mirror module which configures itself with one mirror copy and
     consequently is transparent to the I/O requests on the path.  We can
     now configure yet another mirror copy on the mirror geom, request a
     synchronization, and finally drop the first mirror copy.  We have now
     in essence moved a mounted file system from one disk to another while
     it was being used.  At this point the mirror geom can be deleted from
     the path again; it has served its purpose.
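
     The migration trick can be mimicked with an in-memory mirror.  The
     Python sketch below assumes that disks are plain bytearray objects of
     equal size and that synchronization is a single copy; the Mirror class
     and its methods are hypothetical and say nothing about how an actual
     mirror class is implemented.

```python
class Mirror:
    """A mirror over one or more equally sized in-memory 'disks'."""
    def __init__(self, first):
        self.copies = [first]          # one copy: transparent to I/O

    def write(self, offset, data):
        for disk in self.copies:       # every copy receives every write
            disk[offset:offset + len(data)] = data

    def read(self, offset, length):
        return bytes(self.copies[0][offset:offset + length])

    def add_copy(self, disk):
        disk[:] = self.copies[0]       # synchronize the new copy
        self.copies.append(disk)

    def drop_copy(self, disk):
        if len(self.copies) == 1:
            raise ValueError("cannot drop the last copy")
        # Compare by identity: synchronized copies have equal contents.
        self.copies = [d for d in self.copies if d is not disk]


old = bytearray(b"filesystem data!")
new = bytearray(len(old))

m = Mirror(old)          # inserted on the path with a single copy
m.add_copy(new)          # configure a second copy and synchronize it
m.write(0, b"FILE")      # the file system keeps writing meanwhile
m.drop_copy(old)         # drop the first copy; data now lives on `new`
```

     After the last step the mirror holds only the new disk, so removing
     the mirror from the path again would leave the consumer attached to
     the new disk.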

     CONFIGURE is the process where the administrator issues instructions
     for a particular class to instantiate itself.  There are multiple ways
     to express intent in this case; for instance, a particular provider can
     be specified with a level of override, forcing a BSD disklabel module
     to attach to a provider which was not found palatable during the TASTE
     operation.

     Finally, I/O is the reason we even do this: it concerns itself with
     sending I/O requests through the graph.

     I/O REQUESTS, represented by struct bio, originate at a consumer, are
     scheduled on its attached provider, and, when processed, are returned
     to the consumer.  It is important to realize that the struct bio which
     enters through the provider of a particular geom does not "come out on
     the other side".  Even simple transformations like MBR and BSD will
     clone the struct bio, modify the clone, and schedule the clone on their
     own consumer.  Note that cloning the struct bio does not involve
     cloning the actual data area specified in the I/O request.
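
     The clone-and-modify pattern can be shown with a stand-in for struct
     bio.  This Python sketch is illustrative only: the Bio class, its
     field names, and the functions clone_bio() and slice_start() are
     assumptions, and the slice offset of 63 sectors is just an example
     value.

```python
class Bio:
    """A stand-in for struct bio with hypothetical field names."""
    def __init__(self, cmd, offset, length, data, parent=None):
        self.cmd = cmd
        self.offset = offset
        self.length = length
        self.data = data        # reference to the data area, never copied
        self.parent = parent    # the request this one was cloned from


def clone_bio(bio):
    """Clone the request; the data area is shared, not duplicated."""
    return Bio(bio.cmd, bio.offset, bio.length, bio.data, parent=bio)


def slice_start(bio, slice_offset):
    """What a partitioning geom might do with a request on its provider:
    clone it and shift the clone by the start of the slice before
    scheduling the clone on its own consumer."""
    clone = clone_bio(bio)
    clone.offset += slice_offset
    return clone


buf = bytearray(512)
request = Bio("read", 0, 512, buf)          # arrives at provider "da0s1"
below = slice_start(request, 63 * 512)      # leaves through the consumer
```

     The original request is left untouched, so it can be completed and
     returned to its consumer once the clone has been processed.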

     In total, six different I/O requests exist in GEOM: read, write,
     delete, format, get attribute, and set attribute.

     Read and write are	self explanatory.

     Delete indicates that a certain range of data is no longer used and
     that it can be erased or freed as the underlying technology supports.
     Technologies like flash adaptation layers can arrange to erase the
     relevant blocks before they are reassigned, and cryptographic devices
     may want to fill random bits into the range to reduce the amount of
     data available for attack.

     It is important to recognize that a delete indication is not a request,
     and consequently there is no guarantee that the data actually will be
     erased or made unavailable unless guaranteed by specific geoms in the
     graph.  If "secure delete" semantics are required, a geom should be
     pushed which converts delete indications into (a sequence of) write
     requests.
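
     A geom providing "secure delete" semantics could translate a delete
     indication along the following lines.  This is a minimal Python sketch
     under stated assumptions: requests are modelled as plain tuples, the
     function name delete_to_writes() and the chunk size are hypothetical,
     and the range is overwritten with zero bytes, which is only one
     possible policy.

```python
def delete_to_writes(offset, length, chunk=65536):
    """Yield ("write", offset, data) tuples covering the deleted range.

    The range is split into pieces of at most `chunk` bytes so that no
    single write request becomes arbitrarily large.
    """
    end = offset + length
    while offset < end:
        n = min(chunk, end - offset)
        yield ("write", offset, bytes(n))    # bytes(n) is n zero bytes
        offset += n


# A delete of 10000 bytes starting at offset 1000, in 4096-byte writes.
writes = list(delete_to_writes(1000, 10000, chunk=4096))
```

     Each generated write would then be scheduled on the consumer of the
     geom like any other write request.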

     Get attribute and set attribute support inspection and manipulation of
     out-of-band attributes on a particular provider or path.  Attributes
     are named by ASCII strings, and they will be discussed in a separate
     section.

     (stay tuned while the author rests	his brain and fingers: more to come.)

     This software was developed for the FreeBSD Project by Poul-Henning Kamp
     and NAI Labs, the Security	Research Division of Network Associates, Inc.
     under DARPA/SPAWAR	contract N66001-01-C-8035 ("CBOSS"), as	part of	the
     DARPA CHATS research program.

     The first precursor for GEOM was a	gruesome hack to Minix 1.2 and was
     never distributed.	 An earlier attempt to implement a less	general	scheme
     in	FreeBSD	never succeeded.

     Poul-Henning Kamp

BSD				March 27, 2002				   BSD

