7. Verbs Interface
rdma.ibverbs implements a set of Python extension objects and functions that wrap the OFA verbs interface from libibverbs. The wrapper presents the verbs interface in an object-oriented style and exposes most of its functionality to Python.
A basic example for getting a verbs instance and a protection domain is:
import rdma
import rdma.ibverbs as ibv
end_port = rdma.get_end_port()
with rdma.get_verbs(end_port) as ctx:
    pd = ctx.pd()
Verbs objects that have an underlying kernel allocation are all context managers and have a close() method, and each object also keeps track of its own children. For example, closing a rdma.ibverbs.Context will close all rdma.ibverbs.PD and rdma.ibverbs.CQ objects created by it. This makes resource cleanup straightforward in most cases.
As with file objects, users should take care to call the close() method once the instance is no longer needed. Focusing on the rdma.ibverbs.Context and rdma.ibverbs.PD is generally sufficient because of the built-in resource cleanup.
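The parent and child cleanup behaviour described above can be pictured with a small pure-Python sketch. The Resource class below is a hypothetical stand-in written for illustration, not the library's implementation; it only shows how a parent that remembers its children can close them when it is closed or when its with block exits.

```python
class Resource(object):
    """Hypothetical stand-in for a verbs object (not part of rdma.ibverbs)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.closed = False
        self._children = []
        if parent is not None:
            # The parent remembers every object created from it.
            parent._children.append(self)

    def close(self):
        # Closing is idempotent, like closing a file twice.
        if self.closed:
            return
        # Children are closed first, newest to oldest, then the parent.
        for child in reversed(self._children):
            child.close()
        self._children = []
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # never swallow exceptions


with Resource("ctx") as ctx:
    pd = Resource("pd", parent=ctx)
    cq = Resource("cq", parent=ctx)

# Leaving the with block closed ctx and, through it, pd and cq.
closed_names = [r.name for r in (ctx, pd, cq) if r.closed]
```

With the real library the same shape applies to the objects returned by rdma.get_verbs() and ctx.pd(): wrapping the outermost object in a with statement is usually enough.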
The IB verbs structures (e.g. ibv_qp_attr) are mapped into Python objects (e.g. rdma.ibverbs.qp_attr). As Python objects they work similarly to the C syntax with structure member assignment, but they can also be initialized with a keyword argument list to the constructor, which can save a considerable number of lines.
There are also convenience wrapper functions that create qp_attr, ah_attr and sge objects with a reduced number of arguments.
Errors from verbs are raised as a rdma.SysError which includes the libibverbs function that failed and the associated errno.
Note
Despite the name ‘ibverbs’, the verbs interface is a generic interface that is supported by all RDMA devices. Different technologies have various limitations, and support for anything but IB through this library is not complete.
7.1. Verbs and rdma.path.IBPath
The raw verbs interface for creating QPs is simplified to rely on the standard IBPath structure, which should be filled in with all the necessary parameters. The wrapper QP modify methods modify_to_init(), modify_to_rtr(), and modify_to_rts() can set up a QP without additional information.
The attributes in an IBPath are used as follows when modifying a QP:
Path Attribute | Usage |
---|---|
end_port.port_id | qp_attr.port_num |
pkey | qp_attr.pkey_index |
qkey | qp_attr.qkey |
MTU | qp_attr.path_mtu |
retries | qp_attr.retry_cnt |
min_rnr_timer | qp_attr.min_rnr_timer |
packet_life_time | qp_attr.timeout |
dack_resp_time | qp_attr.timeout |
sack_resp_time | |
dqpn | qp_attr.dest_qp_num |
sqpn | |
dqpsn | qp_attr.rq_psn |
sqpsn | qp_attr.sq_psn |
drdatomic | qp_attr.max_dest_rd_atomic |
srdatomic | qp_attr.max_rd_atomic |
IBPath structures can also be used anywhere an ah_attr could be used, including for creating AH instances and with modify(). With this usage the IBPath caches the created AH, so getting the AH for a path a second time does not rebuild it. This means callers generally don’t have to worry about creating and maintaining AHs explicitly.
The attributes in an IBPath are used as follows when creating an AH:
Path Attribute | Usage |
---|---|
has_grh | ah_attr.is_global |
DGID | ah_attr.grh.dgid |
SGID | ah_attr.grh.sgid_index |
flow_label | ah_attr.grh.flow_label |
hop_limit | ah_attr.grh.hop_limit |
traffic_class | ah_attr.grh.traffic_class |
DLID | ah_attr.dlid |
SLID | ah_attr.SLID_bits |
SL | ah_attr.SL |
rate | ah_attr.static_rate |
end_port.port_id | ah_attr.port_num |
7.2. Usage Examples
This is not intended to be a verbs primer. Generally the API follows that of the normal OFA verbs (with ibv_ prefixes removed), which in turn follows the API documented by the IBA specification. Many helper functions are provided to handle common situations in a standard way; these are generally preferred.
7.2.1. UD QP Setup
Setting up a QP for UD communication is very simple. There are two major cases: communication with a single end port, and communication with multiple end ports. The single case is:
path = IBPath(end_port,dqpn=1,qkey=IBA.IB_DEFAULT_QP1_QKEY,DGID=...)
with rdma.get_gmp_mad(path.end_port,verbs=ctx) as umad:
    rdma.path.resolve_path(umad,path,reversible=True)
with ctx.pd() as pd:
    depth = 16
    # CQs are created from the context, not from the PD
    cq = ctx.cq(2*depth)
    qp = pd.qp(ibv.IBV_QPT_UD,depth,cq,depth,cq)
    path.sqpn = qp.qp_num
    # Post receive work requests to qp here
    qp.establish(path)
    qp.post_send(ibv.send_wr(opcode=ibv.IBV_WR_SEND,
                             send_flags=ibv.IBV_SEND_SIGNALED,
                             ah=pd.ah(path),
                             remote_qpn=path.dqpn,
                             remote_qkey=path.qkey,
                             ...))
Notice that the path is used to configure the pkey and qkey values of the UD QP during initialization, and is also used to create the AH for the send work request.
The case for multiple destinations is very similar; however, all destinations must share the same PKey and QKey. For instance, assuming there is a list of DGIDs:
with rdma.get_gmp_mad(path.end_port) as umad:
    paths = [rdma.path.resolve_path(umad,IBPath(end_port,DGID=I,
                                                 qkey=IBA.IB_DEFAULT_QP1_QKEY),
                                    reversible=True,
                                    properties={'PKey': IBA.DEFAULT_PKEY})
             for I in destinations]
This resolves all the DGIDs into paths with the same QKey and PKey. paths[-1] can be used to set up the QP, and all the paths can be used interchangeably in work requests.
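Because every destination has to share one PKey and QKey, it can be useful to check that before configuring the QP. The following is a minimal sketch of such a check; DemoPath is a hypothetical stand-in carrying only the pkey and qkey attributes named in the path table, common_keys() is an illustrative helper that is not part of the library, and the key values are example numbers.

```python
from collections import namedtuple

# Hypothetical stand-in carrying only the fields this check needs.
DemoPath = namedtuple("DemoPath", ["DGID", "pkey", "qkey"])


def common_keys(paths):
    """Return the single (pkey, qkey) pair shared by paths, or raise ValueError."""
    keys = set((p.pkey, p.qkey) for p in paths)
    if len(keys) != 1:
        raise ValueError("paths do not share one PKey/QKey: %r" % sorted(keys))
    return keys.pop()


paths = [
    DemoPath("fe80::1", 0xFFFF, 0x80010000),
    DemoPath("fe80::2", 0xFFFF, 0x80010000),
]
pkey, qkey = common_keys(paths)
```

An empty list, or a list whose paths disagree, raises ValueError before any QP is touched.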
7.2.2. UD response path
Constructing the reply path and generating a send WR from a UD WC is very straightforward:
wcs = cq.poll()
for wc in wcs:
    path = ibv.WCPath(self.end_port,wc,
                      buf,0,
                      pkey=qp_pkey,
                      qkey=qp_qkey)
    path.reverse()
    ah = pd.ah(path)
    wr = ibv.send_wr(opcode=ibv.IBV_WR_SEND,
                     ah=ah,
                     remote_qpn=path.dqpn,
                     remote_qkey=path.qkey,
                     ...)
buf,0 is the buffer and offset of the memory posted in the recv request. Remember that on UD QPs the first 40 bytes of the receive buffer are reserved for a GRH, which is accessed by rdma.ibverbs.WCPath().
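Since the first 40 bytes of a UD receive buffer are reserved for the GRH, the application payload starts after them. The sketch below shows the offset arithmetic on a plain bytearray; ud_payload() and GRH_LEN are names introduced here for illustration, and the sketch assumes the completion's byte count includes the 40 byte GRH area.

```python
GRH_LEN = 40  # bytes reserved at the start of every UD receive buffer


def ud_payload(buf, off, byte_len):
    """Return the message bytes that follow the GRH area.

    byte_len is assumed to count the 40 byte GRH area as well as the
    payload, as the byte count of a UD receive completion is assumed to.
    """
    if byte_len < GRH_LEN:
        raise ValueError("completion is shorter than the GRH area")
    return bytes(buf[off + GRH_LEN:off + byte_len])


# Simulate a receive buffer: 40 bytes of GRH space followed by the message.
buf = bytearray(256)
message = b"Hello"
buf[GRH_LEN:GRH_LEN + len(message)] = message
payload = ud_payload(buf, 0, GRH_LEN + len(message))
```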
7.2.3. No CM QP Setup
The library has built-in support for correctly establishing IB connections without using a CM by exchanging information over a side channel (e.g. a TCP socket). Side A would do this:
qp = pd.qp(ibv.IBV_QPT_RC,...)
path = rdma.path.IBPath(end_port,SGID=end_port.default_gid)
rdma.path.fill_path(qp,path)
path.reverse(for_reply=False)
send_to_side_b(pickle.dumps(path))
path = pickle.loads(recv_from_side_b())
path.reverse(for_reply=False)
path.set_end_port(end_port.parent)
qp.establish(path.forward_path,ibv.IBV_ACCESS_REMOTE_WRITE)
# Synchronize transition to RTS
send_to_side_b(True)
recv_from_side_b()
Side B would do this:
qp = pd.qp(ibv.IBV_QPT_RC,...)
path = pickle.loads(recv_from_side_a())
path.end_port = end_port
rdma.path.fill_path(qp,path)
with rdma.get_gmp_mad(path.end_port) as umad:
    rdma.path.resolve_path(umad,path)
send_to_side_a(pickle.dumps(path))
qp.establish(path.forward_path,ibv.IBV_ACCESS_REMOTE_WRITE)
# Synchronize transition to RTS
recv_from_side_a()
send_to_side_a(True)
rdma.path.fill_path() sets up most of the QP related path parameters and rdma.path.resolve_path() gets the path record(s) from the SA.
This procedure implements the same process and information exchange that the normal IB CM would do, including negotiating responder resources and having the capability to set up asymmetric paths.
Any QP type is supported by this basic procedure; the extra information exchanged is simply not used.
Note
Pickle is only used as an easy example here. Real cases should do something else, as unpickling untrusted data is dangerous. The Path object has a __reduce__() method which can be used to implement a protocol-appropriate encoding.
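One possible alternative to pickle is to send only plain field values in a text format and rebuild the object from an explicit allow list. The following is a rough sketch of that idea using JSON. DemoPath, encode_path(), decode_path() and ALLOWED are hypothetical names, and the (constructor, args, state) tuple returned by __reduce__() here is an assumption made for the sketch; the shape returned by the real Path object should be checked before relying on it.

```python
import json


class DemoPath(object):
    """Hypothetical stand-in for a path object; the real IBPath has more state."""

    def __init__(self, **fields):
        self.__dict__.update(fields)

    def __reduce__(self):
        # Assumed shape for this sketch: (constructor, args, state dict).
        return (DemoPath, (), dict(self.__dict__))


# Only classes listed here may be rebuilt from received data.
ALLOWED = {"DemoPath": DemoPath}


def encode_path(path):
    """Serialize the state reported by __reduce__() as JSON text."""
    cls, _args, state = path.__reduce__()
    return json.dumps({"type": cls.__name__, "state": state}, sort_keys=True)


def decode_path(text):
    """Rebuild a path from JSON text, rejecting unknown types."""
    msg = json.loads(text)
    cls = ALLOWED.get(msg.get("type"))
    if cls is None:
        raise ValueError("unexpected object type: %r" % (msg.get("type"),))
    return cls(**msg["state"])


original = DemoPath(dqpn=1, sqpn=77, qkey=0x80010000, MTU=4)
wire = encode_path(original)
restored = decode_path(wire)
```

The demo uses integer fields only. Values that JSON cannot represent directly, such as GIDs, would need their own text encoding.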
7.2.4. WC Error handling
The class rdma.ibverbs.WCError is an exception that can be thrown when a WC error is detected. It formats the information in the WC and provides a way for the catcher to determine the failed QP:
wcs = cq.poll()
for wc in wcs:
    if wc.status != ibv.IBV_WC_SUCCESS:
        raise ibv.WCError(wc,cq)
Depending on the situation, QP errors may not be recoverable, in which case the whole QP should be torn down.
7.2.5. Completion Channels
Additional helpers are provided to simplify completion channel processing, suitable for single threaded applications. The basic usage for a completion channel is:
import select
# To set up the completion channel
cc = ctx.comp_channel()
poll = select.poll()
cc.register_poll(poll)
cq = ctx.cq(2*depth,cc)
def get_wcs():
    cq.req_notify()
    while True:
        ret = poll.poll()
        for I in ret:
            if cc.check_poll(I) is not None:
                wcs = cq.poll()
                if wcs is not None:
                    return wcs
wcs = get_wcs()
Obviously the methodology becomes more complex if additional things are polled for. The basic idea is that rdma.ibverbs.CompChannel.check_poll() takes care of all the details and returns the CQ that has available work completions.
Using CQPoller the above example can be further simplified:
cc = ctx.comp_channel()
cq = ctx.cq(2*depth,cc)
poller = rdma.vtools.CQPoller(cq)
for wc in poller.iterwc(timeout=1):
    print(wc)
CQPoller also monitors for asynchronous events and will call rdma.ibverbs.Context.handle_async_event(), which will produce exceptions for failure conditions and update the end port cache as necessary.
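Completion channels are edge triggered, so the order of operations in a hand-written wait loop matters: arm the CQ with req_notify(), poll it, and only then sleep. The sketch below illustrates that ordering against a fake CQ. FakeCQ, wait_for_wcs() and fake_wait() are hypothetical names used only for this illustration and involve no verbs objects.

```python
class FakeCQ(object):
    """Hypothetical stand-in that mimics only req_notify() and poll()."""

    def __init__(self):
        self.pending = []
        self.arm_count = 0

    def req_notify(self):
        self.arm_count += 1

    def poll(self):
        # Hand back everything queued so far and empty the queue.
        wcs, self.pending = self.pending, []
        return wcs


def wait_for_wcs(cq, wait):
    """Arm the CQ, drain it, and block in wait() only if nothing was found."""
    while True:
        cq.req_notify()   # arm first ...
        wcs = cq.poll()   # ... then drain anything that raced in
        if wcs:
            return wcs
        wait()            # nothing queued, so it is safe to sleep


cq = FakeCQ()
sleeps = []


def fake_wait():
    # Stands in for blocking on the completion channel; a completion "arrives".
    sleeps.append(1)
    cq.pending.append("wc-1")


first = wait_for_wcs(cq, fake_wait)

# A completion that is already queued is returned without sleeping at all.
cq.pending.append("wc-2")
second = wait_for_wcs(cq, fake_wait)
```

Polling after arming is what closes the window in which a completion could arrive unnoticed between the last poll and the sleep.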
7.2.6. Memory
Memory registrations are made explicit: as with verbs, everything that is passed into a work request must have an associated memory registration. A MR object can be created for anything that supports the Python buffer protocol, and writable MRs require a mutable Python buffer. Some useful examples:
s = "Hello";
mr = pd.mr(s,ibv.IBV_ACCESS_REMOTE_READ);
s = bytearray(256);
mr = pd.mr(s,ibv.IBV_ACCESS_REMOTE_WRITE);
s = mmap.mmap(-1,256);
mr = pd.mr(s,ibv.IBV_ACCESS_REMOTE_WRITE);
SGEs are constructed through the MR:
sge = mr.sge();
sge = mr.sge(length=128,off=10);
A tool is provided for managing a finite pool of fixed size buffers. This construct is very useful for applications using the SEND verb:
pool = rdma.vtools.BufferPool(pd,count=100,size=1024);
pool.post_recvs(qp,50);
buf_idx = pool.pop();
pool.copy_to("Hello message!",buf_idx);
qp.post_send(pool.make_send_wr(buf_idx,pool.size,path));
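To make the idea of a fixed size buffer pool concrete, here is a minimal pure-Python sketch of the bookkeeping involved: one contiguous block of memory divided into count buffers of size bytes, plus a free list of buffer indexes. DemoPool is a hypothetical stand-in and not the implementation of rdma.vtools.BufferPool; in particular push() is an invented helper, whereas the real class recovers buffers through finish_wcs().

```python
class DemoPool(object):
    """Hypothetical stand-in showing the bookkeeping only; no verbs objects."""

    def __init__(self, count, size):
        self.count = count
        self.size = size
        # One contiguous block, as a single registration would cover.
        self._mem = bytearray(count * size)
        self._free = list(range(count))

    def pop(self):
        """Return the index of a free buffer (IndexError if none are left)."""
        return self._free.pop()

    def push(self, buf_idx):
        """Give a buffer back once its work request has completed."""
        self._free.append(buf_idx)

    def copy_to(self, data, buf_idx, offset=0):
        """Copy data into buffer buf_idx starting at offset."""
        if offset + len(data) > self.size:
            raise ValueError("data does not fit in one buffer")
        start = buf_idx * self.size + offset
        self._mem[start:start + len(data)] = data

    def copy_from(self, buf_idx, offset=0, length=None):
        """Return a copy of (part of) buffer buf_idx as a bytearray."""
        if length is None or offset + length > self.size:
            length = self.size - offset
        start = buf_idx * self.size + offset
        return bytearray(self._mem[start:start + length])


pool = DemoPool(count=4, size=16)
idx = pool.pop()
pool.copy_to(b"Hello message!", idx)
echo = pool.copy_from(idx, length=14)
pool.push(idx)
```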
7.3. rdma.vtools module
rdma.vtools provides various support functions to make verbs programming easier.
-
class
rdma.vtools.
BufferPool
(pd, count, size)¶ Bases:
object
Hold onto a block of fixed size buffers and provide some helpers for using them as send and receive buffers with a QP.
This can be used to provide send buffers for a QP, as well as receive buffers for a QP or a SRQ. Generally the qp argument to methods of this class can be a rdma.ibverbs.QP or rdma.ibverbs.SRQ.
A rdma.ibverbs.MR is created in pd with count buffers of size bytes.
-
BUF_ID_MASK
= 0¶ Mask to convert a wr_id back into a buf_idx.
-
NO_WR_ID
= 4294967295¶ Constant value to set wr_id to when it is not being used.
-
RECV_FLAG
= 0¶ Constant value to or into wr_id to indicate it was posted as a recv.
-
close
()¶ Close held objects
-
copy_from
(buf_idx, offset=0, length=4294967295)¶ Return a copy of buffer buf_idx. buf_idx may be a wr_id.
Return type: bytearray
-
copy_to
(buf, buf_idx, offset=0, length=4294967295)¶ Copy buf into the buffer buf_idx
-
count
= 0¶ Number of buffers.
-
finish_wcs
(qp, wcs)¶ Process work completion list wcs to recover buffers attached to completed work and re-post recv buffers to qp. Every work request with an attached buffer must have a signaled completion to recover the buffer.
wcs may be a single wc.
Raises: rdma.ibverbs.WCError – For WC’s marked as error.
-
make_send_wr
(buf_idx, buf_len, path=None)¶ Return a
rdma.ibverbs.send_wr
for buf_idx and path. If path is None then the wr does not contain path information (eg for connected QPs)
-
make_sge
(buf_idx, buf_len)¶ Return a
rdma.ibverbs.SGE
for buf_idx.
-
pop
()¶ Return a new buffer index.
-
post_recvs
(qp, count)¶ Post count buffers for receive to qp, which may be any object with a post_recv method.
-
size
= 0¶ Size of a single buffer.
-
-
class
rdma.vtools.
CQPoller
(cq, async_events=True, solicited_only=False)¶ Bases:
object
Simple wrapper for a rdma.ibverbs.CQ and rdma.ibverbs.CompChannel to provide a blocking API for getting work completions.
cq is the completion queue to read work completions from. If the cq does not have a completion channel then this will spin loop on cq; otherwise it sleeps on the completion channel.
If async_events is True then the async event queue will be monitored while sleeping.
-
iterwc
(count=None, timeout=None, wakeat=None)¶ Generator that returns work completions from the CQ. If not None at most count wcs will be returned. timeout is the number of seconds this function can run for, and wakeat is the value of rdma.tools.clock_monotonic() after which iteration stops.
Return type: rdma.ibverbs.wc
-
sleep
(wakeat)¶ Go to sleep until the cq gets a completion. wakeat is the value of rdma.tools.clock_monotonic() after which the function returns None. Returns True if the completion channel triggered.
If no completion channel is in use this just returns True.
Note: It is necessary to call rdma.ibverbs.CQ.req_notify() on the CQ, then poll the CQ before calling sleep(). Otherwise the edge triggered nature of the completion channels can cause deadlock.
-
timedout
= False¶ True if iteration was stopped due to a timeout
-
wakeat
= None¶ Value of rdma.tools.clock_monotonic() to stop iterating. This can be altered while iterating.
-
7.4. rdma.ibverbs module
Note
Unfortunately Sphinx does not do a very good job auto documenting extension modules, and all the function arguments are stripped out. Until this is resolved the documentation after this point is incomplete.
The rdma.ibverbs module wraps all of the functions in libibverbs that are not duplicated elsewhere in the library. For instance, device discovery uses the rdma.devices module, not the functions from libibverbs.
-
class
rdma.ibverbs.
AH
¶ Bases:
object
Address handle, this is a context manager.
-
close
()¶ Free the verbs AH handle.
-
-
exception
rdma.ibverbs.
AsyncError
¶ Bases:
rdma.RDMAError
Raised when an asynchronous error event is received.
-
class
rdma.ibverbs.
CQ
¶ Bases:
object
Completion queue, this is a context manager.
-
close
()¶ Free the verbs CQ handle.
-
comp_chan
¶
-
comp_events
¶
-
ctx
¶
-
poll
()¶ Perform the poll_cq operation and return a list of work completions.
-
req_notify
()¶ Request event notification for CQEs added to the CQ.
-
resize
()¶ Resize the CQ to have at least cqes entries.
-
-
class
rdma.ibverbs.
CompChannel
¶ Bases:
object
Completion channel, this is a context manager.
-
check_poll
()¶ Returns a rdma.ibverbs.CQ that got at least one completion event, or None. This updates the comp channel and keeps track of received events, and appropriately calls ibv_ack_cq_events internally. After this call the CQ must be re-armed via rdma.ibverbs.CQ.req_notify().
-
close
()¶ Free the verbs completion channel handle.
-
ctx
¶
-
fileno
()¶ Return the FD associated with this completion channel.
-
register_poll
()¶ Add the FD associated with this object to
select.poll
object poll.
-
-
class
rdma.ibverbs.
Context
¶ Bases:
object
Verbs context handle, this is a context manager. Call rdma.get_verbs() to get an instance of this.
-
check_poll
()¶ Return True if pevent indicates that
get_async_event()
will return data.
-
close
()¶ Free the verbs context handle and all resources allocated by it.
-
comp_channel
()¶ Create a new
rdma.ibverbs.CompChannel
for this context.
-
cq
()¶ Create a new
rdma.ibverbs.CQ
for this context.
-
end_port
¶
-
from_qp_num
()¶ Return a
rdma.ibverbs.QP
for the qp number num or None if one was not found.
-
get_async_event
()¶ Get a single async event for this context. The return result is a namedtuple of (event_type, obj) where obj will be the rdma.ibverbs.CQ, rdma.ibverbs.QP, rdma.ibverbs.SRQ, rdma.devices.EndPort or rdma.devices.RDMADevice associated with the event.
-
handle_async_event
()¶ This provides a generic handler for async events. Depending on the event it will either raise a rdma.ibverbs.AsyncError exception or reload cached information in the end port.
-
node
¶
-
pd
()¶ Create a new
rdma.ibverbs.PD
for this context.
-
query_device
()¶ Return a rdma.ibverbs.device_attr for the device.
Return type: rdma.ibverbs.device_attr
-
query_port
()¶ Return a rdma.ibverbs.port_attr for the port_id. If port_id is None then the port info is returned for the end port this context was created against.
Return type: rdma.ibverbs.port_attr
-
register_poll
()¶ Add the async event FD associated with this object to
select.poll
object poll.
-
-
class
rdma.ibverbs.
MR
¶ Bases:
object
Memory registration, this is a context manager.
-
addr
¶
-
close
()¶ Free the verbs MR handle.
-
ctx
¶
-
length
¶
-
lkey
¶
-
pd
¶
-
rkey
¶
-
sge
()¶ Create a rdma.ibverbs.sge referring to length bytes of this MR starting at off. If length is -1 (default) then the entire MR from off to the end is used.
-
-
class
rdma.ibverbs.
PD
¶ Bases:
object
Protection domain handle, this is a context manager.
-
ah
()¶ Create a new rdma.ibverbs.AH for this protection domain. attr may be a rdma.ibverbs.ah_attr or rdma.path.IBPath. When used with a IBPath this function will cache the AH in the IBPath. rdma.path.Path.drop_cache() must be called to release all references to the AH.
-
close
()¶ Free the verbs pd handle.
-
ctx
¶
-
from_qp_num
()¶ Return a
rdma.ibverbs.QP
for the qp number num or None if one was not found.
-
mr
()¶ Create a new
rdma.ibverbs.MR
for this protection domain.
-
qp
()¶ Create a new
rdma.ibverbs.QP
for this protection domain. This version expresses the QP creation attributes as keyword arguments.
-
qp_raw
()¶ Create a new rdma.ibverbs.QP for this protection domain. init is a rdma.ibverbs.qp_init_attr.
-
srq
()¶ Create a new rdma.ibverbs.SRQ for this protection domain. init is a rdma.ibverbs.srq_init_attr.
-
-
class
rdma.ibverbs.
QP
¶ Bases:
object
Queue pair, this is a context manager.
-
attach_mcast
()¶ Attach this QP to receive the multicast group described by path.DGID and path.DLID.
-
close
()¶ Free the verbs QP handle.
-
ctx
¶
-
detach_mcast
()¶ Detach this QP from the multicast group described by path.DGID and path.DLID.
-
establish
()¶ Perform modify_to_init(), modify_to_rtr() and modify_to_rts(). This function is most useful for UD QPs which do not require any external sequencing.
-
max_recv_sge
¶
-
max_recv_wr
¶
-
max_send_sge
¶
-
max_send_wr
¶
-
modify
()¶ When modifying a QP the value attr.ah_attr may be a rdma.ibverbs.ah_attr or rdma.path.IBPath.
-
modify_to_init
()¶ Modify the QP to the INIT state.
-
modify_to_rtr
()¶ Modify the QP to the RTR state.
-
modify_to_rts
()¶ Modify the QP to the RTS state.
-
pd
¶
-
post_recv
()¶ wrlist may be a single
rdma.ibverbs.recv_wr
or a list of them.
-
post_send
()¶ wrlist may be a single
rdma.ibverbs.send_wr
or a list of them.
-
qp_num
¶
-
qp_type
¶
-
query
()¶ Return information about the QP. mask selects which fields to return.
Return type: tuple(rdma.ibverbs.qp_attr, rdma.ibverbs.qp_init_attr)
-
state
¶
-
-
class
rdma.ibverbs.
SRQ
¶ Bases:
object
Shared Receive queue, this is a context manager.
-
close
()¶ Free the verbs SRQ handle.
-
ctx
¶
-
modify
()¶ Modify the srq_limit and max_wr values of SRQ. If the argument is None it is not changed.
-
pd
¶
-
post_recv
()¶ wrlist may be a single
rdma.ibverbs.recv_wr
or a list of them.
-
query
()¶ Return a
rdma.ibverbs.srq_attr
.
-
-
exception
rdma.ibverbs.
WCError
¶ Bases:
rdma.RDMAError
Raised when a WC is completed with error. Note: Not all adaptors support returning the opcode and qp_num in an error WC. For those that do the values are decoded.
wc is the error wc, msg is an additional descriptive message, cq is the CQ the error WC was received on and obj is a rdma.ibverbs.SRQ or rdma.ibverbs.QP if one is known. is_rq is True if the WC is known to apply to the receive queue of the QP, False if the WC is known to apply to the send queue of the QP, and None if unknown.
-
cq
= None¶
-
is_rq
= None¶
-
qp
= None¶
-
srq
= None¶
-
-
rdma.ibverbs.
WCPath
()¶ Create a rdma.path.IBPath from a work completion. buf should be the receive buffer when this is used with a UD QP; the first 40 bytes of that buffer could be a GRH. off is the offset into buf. kwargs are applied to rdma.path.IBPath.
Note: wc.pkey_index is not used. If the WC is associated with a GSI QP (unlikely) then the caller can pass pkey_index=wc.pkey_index as an argument.
-
exception
rdma.ibverbs.
WRError
¶ Bases:
rdma.SysError
Raised when an error occurs posting work requests. bad_index is the index into the work request list that failed to post.
-
rdma.ibverbs.
wc_status_str
()¶ Convert a
rdma.ibverbs.wc.status
value into a string.