Python RDMA

Management, Diagnostics and Testing

Presented By: Jason Gunthorpe - CTO Obsidian Research
Date: OFA Monterey 2011-04-05

What is in it?

  • RDMA device discovery:

    for I in rdma.get_devices(): print I.name;
    
  • RDMA Verbs:

    with rdma.get_verbs(path.end_port) as ctx:
        print ctx.query_device();
    
  • IB Management:

    cpi = umad.SubnAdmGet(IBA.MADClassPortInfo);
    

Plus more!

What is Python?

Python is a modern, very high level, multi-paradigm programming language:

  • Emphasis on readability and clarity
  • Modern high level features: exceptions, garbage collection, dynamic ‘duck’ typing, closures
  • Very popular for web development, finance and system administration
  • Included by default and used in major Linux distributions for many years

Cognitive Dissonance

Python is slow!

RDMA is for high performance!

Why!!?!?

Sometimes correct and simple is more important than fast...

... and good algorithms can help.

Package Contents

  • RDMA Device Discovery
  • Definitions from the IBA
  • IB MAD RPC handling and parallelism
  • IB subnet topology database
  • libibverbs interface (Pyrex)
  • ibtool command line program
  • Code-generated and hand-written documentation
  • Test suite

Pure Python except for rdma.ibverbs!

GPL licensed

Package Contents (2)

OFA Module         Python-rdma
libibmad           Near 100% coverage via rdma.madtransactor and rdma.IBA
libibumad          100% coverage via rdma.umad
libibverbs         100% coverage via rdma.ibverbs (through Pyrex)
libibnetdisc       ~80% coverage. No support for switch chassis grouping.
librdmacm          Not covered
libibcm            Not covered
infiniband-diags   45 commands re-implemented, 2 un-implemented. Review ibtool
ibutils            Good coverage of the internal APIs but no coverage for the user tools.
perftest           rdma_bw is implemented as an example.

It works!

$ ibtool rdma_bw 127.0.0.1
path to peer IBPath(end_port='mlx4_0/1',
    DGID=GID('fe80::2:c903:9:1edd'),
    DLID=1, MTU=4, packet_life_time=0,
    SGID=GID('fe80::2:c903:9:1edd'),
    SLID=1, dack_resp_time=15L, dqpn=524361L,
    dqpsn=6645404, drdatomic=0,
    rate=3, sack_resp_time=15L, sqpn=524360L,
    sqpsn=1754047, srdatomic=0)
MR peer raddr=7fd268a9c000 peer rkey=8002200
1000 iterations of 1048576 is 1048576000 bytes
3065.7 MB/sec

MT26428 using internal loopback, 2.8GHz i5-2300

ibtool

Re-implementation of infiniband-diags using Python as the implementation language:

  • One language
  • Greater consistency
  • Higher performance

Also:

  • Test the Python RDMA core library
  • Access the unique features of the Python RDMA via the command line
  • Serve as programming examples

45 commands are implemented, > 90% complete

ibtool (2)

Mostly looks the same:

$ ibtool ibaddr 7
GID fe80::17:77ff:feb6:2ca4 LID start 7 end 7
$ ibtool ibswitches
Switch  : 0017:77ff:feb6:2ca4 ports 2 "Obsidian Longbow X100 - LBXR43D1FF" base port 0 lid 7 lmc 0
Switch  : 0017:77ff:fef9:6e79 ports 2 "Obsidian Longbow X100 - LBXREAF28B" base port 0 lid 9 lmc 0
$ ibtool smpquery -P 2 NI -D 0,2
# Node info: DR Path (0, 2)
BaseVers:........................1
ClassVers:.......................1
NodeType:........................2
NumPorts:........................2
SystemGuid:......................0017:77ff:fef9:6e79
Guid:............................0017:77ff:fef9:6e79
PortGuid:........................0017:77ff:fef9:6e79
PartCap:.........................1
DevId:...........................0x0009
Revision:........................0x00010001
LocalPort:.......................1
VendorId:........................0x001777

ibtool (3)

Some are new:

$ ibtool perfquery --vl-xmit-wait  9
# Port counters: Lid 9 (fe80::17:77ff:fef9:6e79) port 1
PortSelect:......................1
CounterSelect:...................0x0000
PortVLXmitWait[0]:...............606
$ ibtool subnet_diff ref
Current subnet has 4 end ports, reference subnet has 4 end ports
 All end ports in the current subnet are in the reference subnet.
 All end ports in the reference subnet are in the current subnet.
Current subnet has 3 nodes, reference subnet has 3 nodes
 All nodes in the current subnet are in the reference subnet.
 All nodes in the reference subnet are in the current subnet.
Current subnet has 3 links, reference subnet has 3 links
 All links in the current subnet are in the reference subnet.
 All links in the reference subnet are in the current subnet.
 All links in the current subnet have the same rate in the reference subnet.
Current subnet has 4 LIDs, reference subnet has 4 LIDs
 All LIDs in the current subnet are the same as the reference subnet.

ibtool (4)

Section 8 of the Python RDMA manual details the various differences between ibtool and infiniband-diags:

  • Greater alignment with the IBA, PR usage, timeout computations, support for routed GIDs, etc.
  • Everything supports GID/GUID/LID/DR path as a TARGET
  • Better diagnostics and debug output, including packet decodes
  • --sa and support for GMP over verbs let ibtool return info without access to /dev/umad
  • LID and SA based subnet discovery options
  • Consistent support for a discovery caching file

Library Tour - Device Discovery

  • rdma.devices module - trundles through sysfs and gets devices, end ports.
  • Common basis for all other modules - umad and ibverbs are all opened based on these objects.
  • Find devices by string:

    Format        Example
    device        mlx4_0
    Node GUID     0002:c903:0000:1491

  • Find ports by string:

    Format        Example
    device        mlx4_0 (defaults to the first port)
    device/port   mlx4_0/1
    Port GID      fe80::2:c903:0:1491
    Port GUID     0002:c903:0000:1491
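
The lookup formats listed above can be told apart purely by their shape. The following is a minimal sketch of that idea, not the rdma.devices implementation: `classify_port_spec` and its return values are hypothetical names chosen for illustration, and the real library also has to check the result against what sysfs reports.

```python
import ipaddress
import re

# GUIDs as printed on the slide: four colon-separated groups of four hex digits.
GUID_RE = re.compile(r"^[0-9a-fA-F]{4}(:[0-9a-fA-F]{4}){3}$")


def classify_port_spec(spec):
    """Return (kind, value) for a device or end port lookup string.

    Hypothetical helper for illustration only.
    """
    if GUID_RE.match(spec):
        return ("guid", spec.lower())
    if ":" in spec:
        # Anything else containing ':' is assumed to be a GID in IPv6 notation.
        return ("gid", str(ipaddress.IPv6Address(spec)))
    if "/" in spec:
        device, port = spec.split("/", 1)
        return ("device/port", (device, int(port)))
    # A bare device name defaults to the first port.
    return ("device", (spec, 1))


examples = ["mlx4_0", "mlx4_0/1", "fe80::2:c903:0:1491", "0002:c903:0000:1491"]
results = [classify_port_spec(s) for s in examples]
```

Checking the GUID pattern before the GID test matters, because both forms contain colons.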

Library Tour - Device Discovery (2)

Library features flow into ibtool:

$ ibtool ibaddr -P fe80::2:c903:0:14a6 9 -d
D: Using end port mlx4_0/2 fe80::2:c903:0:14a6
D: SMP Path 10 -> 9 SL=0 PKey=0xffff DQPN=0
      IBPath(end_port='mlx4_0/2', DLID=10,
             SLID=10, dqpn=0, qkey=0x0,
             sqpn=0)
D: RPC MAD_METHOD_GET(1) SMPFormat(1.1)
   SMPNodeInfo(17) completed to
   'Path 10 -> 9 SL=0 PKey=0xffff DQPN=0'
   len 256.
D: RPC MAD_METHOD_GET(1) SMPFormat(1.1)
   SMPPortInfo(21) completed to
   'Path 10 -> 9 SL=0 PKey=0xffff DQPN=0'
   len 256.
GID fe80::17:77ff:fef9:6e79 LID start 9 end 9

Library Tour - IBA

Structures and constants from the IBA:

  • Starts out as XML describing the precise on-the-wire structure layout
  • Processed via script into Python classes with pack, unpack and printer functions
  • 106 structures from IBA
  • Useful constants and the value-to-string and string-to-value conversions are hand written
  • Auto generate tricky things like SAFormat.componentMask

Library Tour - IBA (2)

Everything can be decoded and dumped:

$ ibtool ibaddr 9 -dd
D: Reply MAD_METHOD_GET_RESP(129) SMPFormat(1.1) SMPNodeInfo(17)
  0 01010181 baseVersion=1,mgmtClass=1,classVersion=1,method=129
  4 00000000 status=0,classSpecific=0
  8 000079FF transactionID=134139628569652
 12 D0E94434
   + data SMPNodeInfo
 64 01010202 baseVersion=1,classVersion=1,nodeType=2,numPorts=2
 68 001777FF systemImageGUID=GUID('0017:77ff:fef9:6e79')
 72 FEF96E79
 76 001777FF nodeGUID=GUID('0017:77ff:fef9:6e79')
 80 FEF96E79
 84 001777FF portGUID=GUID('0017:77ff:fef9:6e79')
 88 FEF96E79
 92 00010009 partitionCap=1,deviceID=9
 96 00010001 revision=65537
100 02001777 localPortNum=2,vendorID=6007
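
To make the decode concrete, here is a small sketch that unpacks the 40 bytes of SMPNodeInfo shown in the dump (offsets 64 to 103) with the standard `struct` module. It is not the generated rdma.IBA code: the `NODE_INFO` layout string, `FIELDS`, `format_guid` and `unpack_node_info` are names invented for this example, with the field order taken from the dump itself.

```python
import struct

# Big-endian layout in the order the dump prints the fields:
# 4 one-byte fields, 3 eight-byte GUIDs, two 16-bit fields, a 32-bit
# revision, a one-byte port number and a 24-bit vendor ID.
NODE_INFO = struct.Struct(">BBBB8s8s8sHHIB3s")

FIELDS = ("baseVersion", "classVersion", "nodeType", "numPorts",
          "systemImageGUID", "nodeGUID", "portGUID",
          "partitionCap", "deviceID", "revision",
          "localPortNum", "vendorID")


def format_guid(raw):
    """Format 8 raw bytes as xxxx:xxxx:xxxx:xxxx."""
    text = raw.hex()
    return ":".join(text[i:i + 4] for i in range(0, 16, 4))


def unpack_node_info(payload):
    """Decode a 40-byte NodeInfo payload into a dict of named fields."""
    values = dict(zip(FIELDS, NODE_INFO.unpack(payload)))
    for name in ("systemImageGUID", "nodeGUID", "portGUID"):
        values[name] = format_guid(values[name])
    values["vendorID"] = int.from_bytes(values["vendorID"], "big")
    return values


# The words at offsets 64..100 of the dump, copied verbatim.
payload = bytes.fromhex(
    "01010202"
    "001777fffef96e79"
    "001777fffef96e79"
    "001777fffef96e79"
    "00010009"
    "00010001"
    "02001777"
)
info = unpack_node_info(payload)
```

The decoded values match the annotations printed next to each word in the dump, for example `vendorID` 6007 and `revision` 65537.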

Library Tour - IBA (3)

Dynamic language with introspection makes this dead easy:

$ ibtool query SubnAdmGetTable SANodeRecord \
  -f nodeInfo.systemImageGUID=0017:77ff:fef9:6e79
Reply structure #0
  LID..............................9
  nodeInfo.NumPorts................2
  nodeInfo.SystemImageGUID.........0017:77ff:fef9:6e79
  nodeInfo.PortGUID................0017:77ff:fef9:6e79
  nodeInfo.VendorID................0x001777
  nodeDescription.NodeString.......'Obsidian Longbow X100 - LBXREAF28B'

45 LOC! - perform any RPC, with any arguments and pretty print the result. Widely used in implementing ibtool.

Library Tour - MAD Handling

  • Two MAD QP interfaces - rdma.umad (SMP and GMP) and rdma.vmad (only GMP)
  • Simplified programming model for issuing RPC MADs, RPC errors are converted into exceptions. Checks, parsing and RMPP are centralized.
  • rdma.SATransactor transparently converts SMP RPCs into SA RPCs - enables all tools to use VMAD and return data from the SA.
  • rdma.sched parallelizes MAD RPCs - extremely easy to use, major performance win. Used extensively in ibtool

Library Tour - MAD Handling (2)

$ ibtool ibaddr 10 --sa -d
D: RPC MAD_METHOD_GET(1) SAFormat(3.2)
     SANodeRecord(17) completed to 'Path 8 -> 8 SL=0 PKey=0xffff DQPN=1' len 256.
D: RPC MAD_METHOD_GET(1) SAFormat(3.2)
     SAPortInfoRecord(18) completed to 'Path 8 -> 8 SL=0 PKey=0xffff DQPN=1' len 256.
GID fe80::2:c903:0:14a6 LID start 10 end 10
$ ibtool ibnetdiscover --sa -d
D: Performing discovery using mode 'SA'
D: RPC MAD_METHOD_GET_TABLE(18) SAFormat(3.2)
    SANodeRecord(17) completed to 'Path 8 -> 8 SL=0 PKey=0xffff DQPN=1' len 504.
D: RPC MAD_METHOD_GET_TABLE(18) SAFormat(3.2)
    SAPortInfoRecord(18) completed to 'Path 8 -> 8 SL=0 PKey=0xffff DQPN=1' len 568.
D: RPC MAD_METHOD_GET_TABLE(18) SAFormat(3.2)
    SALinkRecord(32) completed to 'Path 8 -> 8 SL=0 PKey=0xffff DQPN=1' len 104.

Library Tour - MAD Parallelism

Python Co-Routines - one thread, multiple execution contexts:

def get_pinf(sched,path,idx):
  pinf = yield sched.SubnGet(IBA.SMPPortInfo,
                 path,idx);
sched.mqueue(get_pinf(sched,path,idx)
        for idx in range(1,ninf.numPorts+1));

Run numPorts copies of get_pinf in parallel. Automatically limits outstanding RPCs, tracks completion, manages timeouts, etc.
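
How a single thread can drive many such coroutines is easiest to see in a toy model. The sketch below is not rdma.sched: `TinySched`, its `transport` callable and the `peak` counter are all assumptions made for illustration, and a real scheduler would send MADs and wait for replies and timeouts instead of calling a function.

```python
from collections import deque


class TinySched:
    """Toy single-threaded scheduler for generator-based coroutines."""

    def __init__(self, transport, max_outstanding=4):
        self.transport = transport            # callable: request -> reply
        self.max_outstanding = max_outstanding
        self.peak = 0                         # most requests in flight at once

    def run(self, coroutines):
        waiting = iter(coroutines)            # coroutines not started yet
        in_flight = deque()                   # (coroutine, pending request)
        while True:
            # Start new coroutines until the outstanding limit is reached.
            while len(in_flight) < self.max_outstanding:
                co = next(waiting, None)
                if co is None:
                    break
                self._advance(co, None, in_flight, first=True)
            if not in_flight:
                return
            self.peak = max(self.peak, len(in_flight))
            # "Complete" the oldest request and resume its coroutine.
            co, request = in_flight.popleft()
            self._advance(co, self.transport(request), in_flight)

    def _advance(self, co, reply, in_flight, first=False):
        try:
            request = next(co) if first else co.send(reply)
        except StopIteration:
            return                            # coroutine finished
        in_flight.append((co, request))


def get_pinf(results, idx):
    # Each yield hands one request to the scheduler and receives the reply.
    reply = yield ("PortInfo", idx)
    results[idx] = reply


results = {}
sched = TinySched(transport=lambda req: "%s reply for port %d" % req,
                  max_outstanding=2)
sched.run(get_pinf(results, idx) for idx in range(1, 6))
```

Five coroutines run to completion while never more than two requests are outstanding, which is the limiting behaviour described above in miniature.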

Library Tour - IB Subnet

Fetch, store and manipulate an IB subnet:

  • Discovery via DR SMP, LID SMP or SA SubnAdmGetTable
  • Incremental out of order loading
  • Save/Load to a Python pickle
  • Iterate, BFS iterate, lookup by GUID, etc.
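
A rough sketch of those ideas using plain dictionaries: links can be added in any order, the topology can be walked breadth first, and the whole structure survives a pickle round trip. The function names and the node labels are hypothetical and do not reflect the rdma.subnet API.

```python
import pickle
from collections import deque


def add_link(subnet, a, b):
    """Record a link; nodes are created on demand, in any order."""
    subnet.setdefault(a, set()).add(b)
    subnet.setdefault(b, set()).add(a)


def iter_bfs(subnet, start):
    """Yield node names in breadth-first order from `start`."""
    seen = {start}
    todo = deque([start])
    while todo:
        node = todo.popleft()
        yield node
        for peer in sorted(subnet[node]):
            if peer not in seen:
                seen.add(peer)
                todo.append(peer)


subnet = {}
# Deliberately loaded out of order.
add_link(subnet, "sw2", "hca2")
add_link(subnet, "hca1", "sw1")
add_link(subnet, "sw1", "sw2")

# Save and load through pickle; a cache file would hold these same bytes.
restored = pickle.loads(pickle.dumps(subnet))
order = list(iter_bfs(restored, "hca1"))
```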

Library Tour - IB Subnet (2)

All ibtool functions that use discovery support common options and caching:

$ ibtool ibnetdiscover --cache disc \
        --refresh-cache
$ ibtool ibcheckerrors --cache disc
## Summary: 4 nodes checked, 0 bad nodes found
##          8 ports checked, 0 ports with bad state found
##          4 ports checked, 0 ports have errors beyond threshold

No MADs will be issued by ibcheckerrors

Library Tour - Verbs

Easy to use wrappers around verbs:

with get_verbs(path.end_port) as ctx:
    cq = ctx.cq(100,ctx.comp_channel());
    pd = ctx.pd();
    qp = pd.qp(ibv.IBV_QPT_UD,100,100,cq);
  • Errors are raised as exceptions
  • libibverbs functions cast into objects
  • Reference counting and Python context managers ensure correct resource cleanup
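
The cleanup guarantee can be illustrated without any RDMA hardware. In this sketch `Resource` is a hypothetical stand-in for a verbs object, not an rdma.ibverbs class: each object remembers its dependents, and leaving the `with` block destroys dependents before the objects they rely on, even when an exception is raised.

```python
class Resource:
    """Hypothetical stand-in for a verbs object (context, PD, CQ, QP)."""

    def __init__(self, name, log, *parents):
        self.name = name
        self.log = log
        self.children = []
        self.closed = False
        for parent in parents:
            parent.children.append(self)

    def close(self):
        if self.closed:
            return
        # Destroy dependents first, newest first, then this object.
        for child in reversed(self.children):
            child.close()
        self.closed = True
        self.log.append("close " + self.name)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # never swallow the exception


log = []
try:
    with Resource("ctx", log) as ctx:
        pd = Resource("pd", log, ctx)
        cq = Resource("cq", log, ctx)
        qp = Resource("qp", log, pd, cq)   # a QP depends on both PD and CQ
        raise RuntimeError("simulated verbs failure")
except RuntimeError:
    pass
```

The QP is torn down before the CQ and PD it uses, and the context goes last.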

Library Tour - Verbs (2)

Simplifications for WC processing:

poller = CQPoller(cq);
for wc in poller.iterwc(timeout=1):
    if wc.status != ibv.IBV_WC_SUCCESS:
        raise ibv.WCError(wc,cq,obj=qp);
  • Iterate over WCs, block with poll
  • Transparently handle async events
  • Place a timeout around the entire for loop
  • Messy details to prevent races are hidden
  • WC errors raise as exceptions and pretty print
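
Placing one timeout around the whole loop, instead of around each wait, comes down to computing a deadline once and shrinking every wait to the time that is left. The sketch below shows only that pattern, with a `queue.Queue` standing in for the completion queue; `iter_with_deadline` is a hypothetical helper, and how the real CQPoller behaves when the timeout expires is not assumed here.

```python
import queue
import time


def iter_with_deadline(completions, timeout):
    """Yield queued items until `timeout` seconds have elapsed in total."""
    deadline = time.monotonic() + timeout
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return
        try:
            # Block for at most the time left on the overall deadline.
            item = completions.get(timeout=remaining)
        except queue.Empty:
            return
        yield item


completions = queue.Queue()
for n in range(3):
    completions.put("wc%d" % n)

started = time.monotonic()
got = list(iter_with_deadline(completions, timeout=0.2))
elapsed = time.monotonic() - started
```

The loop drains the three queued items immediately, then waits out the rest of the 0.2 second budget once, not once per item.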

Library Tour - Verbs (3)

Tight integration with IBPath concept:

path = get_mad_path(umad,"10");
qp.establish(path);
qp.post_send(ibv.send_wr(
   opcode=ibv.IBV_WR_SEND,ah=pd.ah(path),
   remote_qpn=path.dqpn,remote_qkey=path.qkey));
  • Caches AH construction
  • Verbs modify_qp draws information from the path (e.g. pkey, qkey, psn, etc.)
  • Works for UD, UC and RC
  • Can also get a path from a WC

Summary

  • Great for writing management tools
  • Very time efficient for test development, training and prototyping
  • ibtool is an improved, simpler and more maintainable version of the diags programs

Thanks!

Read the manual!
Try it out!
          (__)
          (oo)
   /------\/
 / |    ||
*  /\---/\
   ~~   ~~