Working with Foreign Functions

Suppose your Python library needs to load some sensitive binary data from a file into a contiguous block of memory (e.g., in order to use it for some application-specific operation). Furthermore, you have some additional requirements that must be met for security and auditing purposes:

  • you need to ensure that your code does not inadvertently cause the interpreter to copy any part of the data loaded from the file into some other region of memory,
  • you need to log the memory address at which the data was stored, and
  • you need to clear the memory region that held the data by overwriting it with random bytes.

One strategy you might employ in order to maintain tight control over what your code is doing is to use C functions found in a compiled shared library to read the data from disk, to load that data into a region of memory, and at the end to clear that region. What minimal collection of built-in Python features will you need to invoke functions that are found in a shared library? How can you transform Python values (such as strings representing the location of the file) into an appropriate form on which the function can operate?

Python offers a rich set of capabilities via the built-in ctypes library that make it possible to invoke (or wrap in a Python function) foreign functions that have been implemented using another language (such as C/C++) and compiled into shared libraries. This article reviews the basics of employing foreign functions by demonstrating how to load the C standard library available on most operating systems (the GNU C Library on most Linux systems) and how to apply it to the above use case. The same techniques can be used for any shared library. An alternative approach used by some popular Python packages is briefly reviewed, as well.

Loading a Shared Library

To load a shared library file for which you know the relative or absolute path, you can normally use the LoadLibrary method of the cdll instance of the LibraryLoader class found in ctypes (on Windows, the windll instance is also available for libraries that use the stdcall calling convention). For the purposes of the use case in this article, it is sufficient to load the C standard library. In the example below, the system function from the platform library is used to distinguish between Windows and Linux/macOS environments. On Windows, the C runtime msvcrt is loaded through cdll because its functions use the cdecl calling convention. In the Linux/macOS case, the find_library function from the ctypes.util submodule (which must be imported explicitly) is used to determine the name or path of the shared library.

In [1]:
import ctypes
import ctypes.util  # find_library lives in a submodule that must be imported explicitly
import platform

if platform.system() == "Windows":
    libc = ctypes.cdll.msvcrt  # C runtime functions use the cdecl convention
else:
    libc = ctypes.cdll.LoadLibrary(ctypes.util.find_library("c"))
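
The find_library function returns None when it cannot locate a library, so a more defensive loader may be worthwhile. The sketch below is one possible approach, not the only one: the helper name load_c_library is hypothetical, and the fallback assumes a POSIX system, where CDLL(None) exposes the symbols already loaded into the interpreter process (which normally include the C library).

```python
import ctypes
import ctypes.util
import platform


def load_c_library():
    """Return a handle to the C runtime library (hypothetical helper)."""
    if platform.system() == "Windows":
        # The Microsoft C runtime uses the cdecl convention, hence cdll.
        return ctypes.cdll.msvcrt
    path = ctypes.util.find_library("c")  # a name or path, or None if not found
    if path is None:
        # Assumption: on POSIX systems, CDLL(None) gives access to the
        # symbols already loaded into the current process.
        return ctypes.CDLL(None)
    return ctypes.CDLL(path)


libc = load_c_library()
```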

Invoking Foreign Functions

The first portion of your workflow involves loading a file into memory. The Python code below writes a file to disk that contains a sequence of 32 random bytes. The file can be used to test the workflow.

In [2]:
from secrets import token_bytes
with open("data.txt", "wb") as file:
    file.write(token_bytes(32))

The C function fopen expects two arguments: a pointer to the first character of a string that represents the path of the file, and a pointer to the first character of the string that represents the mode (i.e., reading or writing) in which the file is opened. You can use the c_char_p type to turn Python byte strings into a representation in memory that can be handled by the C function. Note the use of the encode string method to provide an explicit encoding for the string as a byte sequence.

In [3]:
from ctypes import c_char_p
file = c_char_p("data.txt".encode("ascii"))
mode = c_char_p("rb".encode("ascii"))
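
For paths that may contain non-ASCII characters, one option is the standard library function os.fsencode, which encodes a string using the file system encoding. The sketch below assumes that this encoding is the one the C library expects on your platform; the value attribute shows the bytes that a c_char_p instance points to.

```python
import os
from ctypes import c_char_p

# Encode the path with the file system encoding rather than plain ASCII.
path = c_char_p(os.fsencode("data.txt"))
mode = c_char_p(b"rb")  # a bytes literal needs no explicit encoding step

# The value attribute returns the bytes up to the terminating NUL character.
path_bytes = path.value
mode_bytes = mode.value
```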

Unfortunately, ctypes offers no way to examine the libc object that was created by the LibraryLoader instance in order to determine what symbols are defined within it. However, in this case we know that the functions fopen, fread, and fclose must exist. For each of these functions, an instance of the _FuncPtr class can be retrieved as an attribute of libc.

In [4]:
fopen = libc.fopen
fread = libc.fread
fclose = libc.fclose

Before you can safely invoke these functions, you need to specify their argument types and their return types. Without this information, ctypes assumes that every function returns a C int, which can truncate pointer values on 64-bit platforms. You can first consult the GNU C Library documentation to find the signature for each of the C functions you would like to use. Then, the appropriate data type classes can be used to assign the correct sequence of argument types and the correct return type to the argtypes and restype attributes, respectively, of each function object.

In [5]:
from ctypes import c_int, c_size_t, c_void_p

fopen.argtypes = [c_char_p, c_char_p]
fopen.restype = c_void_p

fread.argtypes = [c_void_p, c_size_t, c_size_t, c_void_p]
fread.restype = c_size_t

fclose.argtypes = [c_void_p]
fclose.restype = c_int
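
Declaring argtypes also lets ctypes convert and validate arguments at call time. The sketch below illustrates this with strlen, a function that is not part of the workflow in this article and is used here only because it is easy to check. It assumes a platform on which the C library can be loaded as shown (passing None to CDLL is a POSIX-only fallback).

```python
import ctypes
import ctypes.util
import platform
from ctypes import c_char_p, c_size_t

if platform.system() == "Windows":
    libc = ctypes.cdll.msvcrt
else:
    # If find_library returns None, CDLL(None) falls back to the symbols
    # already loaded into the process (POSIX only).
    libc = ctypes.CDLL(ctypes.util.find_library("c"))

strlen = libc.strlen
strlen.argtypes = [c_char_p]
strlen.restype = c_size_t

# A bytes object is converted to a char pointer automatically.
length = strlen(b"data.txt")

# A str object does not match c_char_p, so ctypes rejects the call
# before the C function is ever invoked.
try:
    strlen("data.txt")
    rejected = False
except ctypes.ArgumentError:
    rejected = True
```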

It is now possible to invoke these functions on some inputs. You can allocate a memory buffer for the 32 bytes of data that you will be loading from the file using the create_string_buffer function.

In [6]:
from ctypes import create_string_buffer
data = create_string_buffer(32)  # 32 zero-initialized bytes

You can now open the file, load the data, and close the file.

In [7]:
fp = fopen(file, mode)
fread(data, 32, 1, fp)
fclose(fp)
bytes(data).hex()
Out[7]:
'4084da448bca81ec463d465cf7159f27adb7cc468675dad0251a4511bce60e20'
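
The calls above do not check for failure: fopen returns a NULL pointer if the file cannot be opened, and fread returns the number of complete items it read. The steps can be wrapped in a Python function that checks both results, as in the sketch below. The helper name read_into_buffer and its error messages are hypothetical, and the loading code assumes the C library can be found as in the earlier cells.

```python
import ctypes
import ctypes.util
import os
import platform
from ctypes import c_char_p, c_int, c_size_t, c_void_p, create_string_buffer

if platform.system() == "Windows":
    libc = ctypes.cdll.msvcrt
else:
    # CDLL(None) is used if find_library returns None (POSIX only).
    libc = ctypes.CDLL(ctypes.util.find_library("c"))

libc.fopen.argtypes = [c_char_p, c_char_p]
libc.fopen.restype = c_void_p
libc.fread.argtypes = [c_void_p, c_size_t, c_size_t, c_void_p]
libc.fread.restype = c_size_t
libc.fclose.argtypes = [c_void_p]
libc.fclose.restype = c_int


def read_into_buffer(path, size):
    """Read exactly size bytes from path into a new ctypes buffer."""
    fp = libc.fopen(os.fsencode(path), b"rb")
    if fp is None:  # with restype c_void_p, a NULL pointer becomes None
        raise OSError(f"could not open {path!r}")
    buffer = create_string_buffer(size)
    try:
        # Request one item of size bytes; the result is the item count.
        items = libc.fread(buffer, size, 1, fp)
    finally:
        libc.fclose(fp)
    if items != 1:
        raise OSError(f"could not read {size} bytes from {path!r}")
    return buffer
```

A call such as read_into_buffer("data.txt", 32) would then return the populated buffer or raise an exception.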

You can determine the memory address corresponding to the memory buffer data using the addressof function.

In [8]:
from ctypes import addressof
hex(addressof(data))
Out[8]:
'0x5acbae0'
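
To satisfy the logging requirement, the address can be passed to the standard logging module. The sketch below also shows, as an illustration, that a second ctypes array created with from_address refers to the same memory rather than to a copy. The logger configuration and the message text are assumptions made for this example.

```python
import logging
from ctypes import addressof, c_char, create_string_buffer

logger = logging.getLogger(__name__)

data = create_string_buffer(32)
address = addressof(data)
logger.info("buffer stored at %s", hex(address))

# Create a second array object over the same 32 bytes; nothing is copied.
view = (c_char * 32).from_address(address)
view[0] = b"\x7f"

# The write through the view is visible through the original buffer.
first = data[0]
```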

You can now clear the memory region. The example below uses the memset C function for this purpose. An example that uses a random sequence generator that is appropriate for cryptographic applications appears in the next section.

In [9]:
libc.memset.argtypes = [c_void_p, c_int, c_size_t]
libc.memset.restype = c_void_p  # memset returns a pointer to the region
libc.memset(data, 0, 32)
bytes(data).hex()
Out[9]:
'0000000000000000000000000000000000000000000000000000000000000000'
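
The original requirement asks for the region to be overwritten with random bytes rather than zeros. One way to do this with only the standard library is sketched below, using secrets.token_bytes as the random source together with the memmove and memset helpers that ctypes itself provides. The initial buffer contents are placeholder data chosen for this example.

```python
import ctypes
from ctypes import create_string_buffer, sizeof
from secrets import token_bytes

# Placeholder contents standing in for the sensitive data.
data = create_string_buffer(b"sensitive bytes", 32)
size = sizeof(data)

# Copy freshly generated random bytes over the whole region in place.
ctypes.memmove(data, token_bytes(size), size)
overwritten = data.raw  # a copy of the (now random) contents, for inspection

# Optionally zero the region afterwards.
ctypes.memset(data, 0, size)
```

Only the random bytes ever exist as a separate Python object here; the buffer itself is modified in place.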

Alternative Approaches

The C Foreign Function Interface (CFFI) library is similar to the built-in ctypes module and is used by some popular packages, including the PyNaCl library that acts as a Python interface for the cryptographic library libsodium. In the example below, the C implementation of the randombytes function is invoked on a character buffer bs and then the contents of that buffer are displayed. Note that the example imports the _sodium module, whose leading underscore marks it as private to PyNaCl, so it may change between releases.

In [10]:
from nacl import _sodium
lib = _sodium.lib

from cffi import FFI
ffi = FFI()

bs = ffi.new("unsigned char[]", 8)
lib.randombytes(bs, 8)
bytes(bs).hex()
Out[10]:
'cbc9af4a2028531d'

You might choose to call the C implementation directly for a variety of reasons, including to improve performance. This can help when a higher-level library method allocates new memory for a byte sequence during every invocation, while your own solution can reuse the same memory over and over to store each new batch of bytes. In the example below, the time to invoke the C function over one million iterations is measured.

In [11]:
import time
start = time.perf_counter()
for _ in range(10**6):
    lib.randombytes(bs, 8)
time.perf_counter() - start
Out[11]:
2.1746324000000006

The example below measures the amount of time it takes to invoke the Python wrapper in PyNaCl over the same number of iterations. The longer running time may be the result of a number of factors; regardless of the underlying reason that may apply for any particular function, the example demonstrates that direct access to the C function gives you more control over those factors.

In [12]:
from nacl.bindings import randombytes
start = time.perf_counter()
for _ in range(10**6):
    bs = randombytes(8)
print(time.perf_counter() - start)
5.350842000000001

Further Reading

If you are interested in learning what other C functions can be used in the manner described in this article, you can review The GNU C Library Reference Manual. In addition to ctypes and CFFI, there exist specialized variants such as the NumPy-specific numpy.ctypeslib library (which comes with features that make it easier to package and deliver NumPy data structures to C functions). It is also possible to implement extension modules for Python in C/C++; the Python documentation provides definitions and guidelines for writing C/C++ code that interacts in appropriate ways with the Python interpreter. If you would like to leverage even more interoperation between Python and C/C++ code, you can investigate the Cython compiler.