NVMe over Fabrics

Contact: John Baldwin <jhb@FreeBSD.org>

NVMe over Fabrics enables communication with a storage device using the NVMe protocol over a network fabric. This is similar to using iSCSI to export a storage device over a network using SCSI commands.

NVMe over Fabrics currently defines network transports for Fibre Channel, RDMA, and TCP.

The work in the nvmf2 branch includes a userland library (lib/libnvmf) which contains an abstraction for transports and an implementation of a TCP transport. It also includes changes to nvmecontrol(8) to add 'discover', 'connect', and 'disconnect' commands to manage connections to a remote controller.
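For illustration, a session with the new subcommands might look roughly like the following. The exact option handling in the branch may differ, and the address and subsystem NQN shown here are placeholders.

    # List the subsystems advertised by a remote discovery controller
    nvmecontrol discover 192.0.2.10:4420

    # Connect to one of the advertised subsystems over TCP
    nvmecontrol connect 192.0.2.10:4420 nqn.2014-08.org.example:storage0

    # Drop the association again
    nvmecontrol disconnect nqn.2014-08.org.example:storage0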

The branch also contains an in-kernel Fabrics implementation. nvmf_transport.ko contains a transport abstraction that sits between the nvmf host (initiator in SCSI terms) and the individual transports, nvmf_tcp.ko contains an implementation of the TCP transport layer, and nvmf.ko contains an NVMe over Fabrics host (initiator) which connects to a remote controller and exports remote namespaces as disk devices.

As with the nvme(4) driver for NVMe over PCI Express, namespaces are exported via /dev/nvmeXnsY devices which only support simple operations, but they are also exported as ndaX disk devices via CAM. Unlike nvme(4), nvmf(4) does not support the nvd(4) disk driver. nvmecontrol(8) can be used with remote namespaces and remote controllers, for example to fetch log pages or display identify information.
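On a test machine, the pieces described above might be exercised roughly as follows. The module names come from the branch; the device unit numbers are examples only, and kldload is assumed to resolve the nvmf_transport.ko dependency automatically.

    # Load the TCP transport and the Fabrics host
    kldload nvmf_tcp
    kldload nvmf

    # After nvmecontrol connect, a remote namespace shows up both as a
    # simple /dev/nvmeXnsY device and as a CAM nda(4) disk
    ls /dev/nvme0ns1 /dev/nda0

    # nvmecontrol also works against the remote controller
    nvmecontrol identify nvme0
    nvmecontrol logpage -p 2 nvme0    # SMART / health information log page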

Note that nvmf(4) is still fairly simple and handling of several error cases remains a TODO. If an error occurs, the queues (and the backing network connections) are dropped, but the devices remain, with I/O requests paused. nvmecontrol reconnect can be used to establish a new set of network connections and resume operation. Unlike iSCSI, where a persistent daemon (iscsid(8)) reconnects automatically after an error, reconnection must currently be done by hand.
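A manual reconnect might then look like the following; the argument form (naming the paused device and giving the new address) is an assumption based on the description above rather than settled syntax.

    # Establish a new set of connections for the paused host association
    nvmecontrol reconnect nvme0 192.0.2.10:4420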

The current code is very new and likely not robust. It is certainly not ready for production use. Experienced users who do not mind all their data vanishing in a puff of smoke after a kernel panic, and who have an interest in NVMe over Fabrics, can start testing it at their own risk.

The next main task is to implement a Fabrics controller (target in SCSI terms): probably a simple one in userland first, followed by a "real" one that offloads data handling to the kernel but is integrated with ctld(8), so that individual disk devices can be exported via iSCSI, NVMe, or both, using a single configuration file and a single daemon to manage all of it. This may require a fair bit of refactoring in ctld to make it less iSCSI-specific. Working on the controller side will also help validate some of the currently under-tested API design decisions in the transport-independent layer. I think it probably does not make sense to merge any of the NVMe over Fabrics changes into the tree until after this step.

Sponsored by: Chelsio Communications

