anetczuk.github.io
Feb 6, 2022 • 4 min read

Performance of Python binding libraries

I’ve recently made two projects involving call of C/C++ libraries from Python, namely technique called Python bindings. First project, dk-mage, uses Swig bindings generator. The other project, ReverseCRC, demonstrates use of following binding libraries: ctypes, CFFI and Swig.

Use case for first project is just to use functionality provided by C++ library.

Purpose of binding in second project is to improve CPU performance by providing C implementation of algorithms. Unsurprisingly, each library have different performance overhead.

Table of Contents

  1. Task
  2. Measurements
  3. Ease of use
  4. Conclusion
  5. Components versions
  6. References

Task

Main purpose of ReverseCRC project is to find CRC key based on given data frames (raw data and resulting CRC values). Solution for reversing CRC consists of tuple of values: polynomial, registry initial value and result xor value. Then search space is following: 2 ^ (3 * bit_length(CRC)).

Among others, application operates in BACKWARD mode. In order to solve CRC problem, in first step it calculates potential candidates for the solution (pairs of init and xor values) using first data frame and then uses it as input to reduced exhaustive search against rest of data frames. This approach reduces search space by one dimension.

Measurements

Each plot consists of 4 lines corresponding to different binding:

Moreover, because of nature of project, there are two kinds of measurements taken:

Each measurment is taken against following input:

Search space is following: 2^(3*16) = 2^48 ~= 281 474 980 000. For this reason one polynomial values was preselected.

As base for measurments serves BACKWARD task.

Following image shows performance comparison in direct mode: Performance against BACKWARD task

Numbers:

data_size      cffi    ctypes    swigoo   swigraw
        8  0.669771  1.200688  3.040263  1.263048
       16  0.841836  1.689529  4.114837  1.782414
       24  1.000750  2.192260  5.087483  2.290749
       32  1.187108  2.697481  6.078609  2.828679

Next image shows presents the same measurement, but taken in cached mode: Performance against BACKWARD task

Numbers:

data_size      cffi    ctypes    swigoo   swigraw
        8  0.506151  0.594125  0.647912  0.548720
       16  0.552889  0.636546  0.696747  0.594223
       24  0.600424  0.685254  0.742332  0.642770
       32  0.643498  0.738896  0.783708  0.691665

As it can be clearly seen, a lot of CPU cycles goes to conversion of data frames to raw C arrays. When data is cached, then there is only constant time difference between libraries resulting from different intermediate layers. The fastest one is cffi implementation, because it uses intermediate layer generated and compiled in C. The slowest one is swigoo implementation. It’s intermediate layer consists of Python wrapper and C wrapper at the same time. It gives flexibility in cost of aditional CPU cycles.

Ease of use

Each of libraries under consideration have it’s pros and cons.

ctypes is simplest one – do not require additional building steps. One have to write intermediate layer by himself. In case of simple C functions it is straightforward, but difficulty arises in case of structs and multitude of functions to export.

cffi prepares intermediate layer based on provided C declarations. It compiles it into C wrapper ready for import. It automatically converts lists and arrays between languages (at least in case of numeric containers).

swig follows the same path – it generates required bindings based on C/C++ declarations. Aside of the declaration it requires interface files describing generation rules. Swig is easy to use in case of simple code transformations (dk-mage is good example), but it can be not as obvious in case of fancy things like adding auto memory management to C library (see swigoo implementation).

Conclusion

It seems that in case of non trivial C++ projects swig is easiest to use. In case of macro-generated C code there is no help in swig nor in cffi. For both of them macros have to be unrolled in order to be feed to the generators. If performance is crucial, then cffi is the choice.

Components versions

Given analysis was performed based on following versions and revisions:

References

Post by: anetczuk