libnuma

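By default the system manages a process's NUMA policy automatically. To control NUMA usage yourself, you have to set the policy through an API. The only library I know of so far for programming NUMA policy is libnuma ¹, and there is not much material about it online. Because my own understanding of NUMA is still shallow, the purpose of many of its APIs is not obvious to me. This article briefly walks through the development workflow and the API usage I have needed so far.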

Development workflow

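Install the development package: yum -y install numactl-devel
Write the code. Before using any function provided by libnuma, you must check whether the current system supports NUMA by calling int numa_available(void) . If it returns -1 , the behavior of every other function in libnuma is undefined.
Compile the code. Because it uses libnuma , the library must be named in the link options.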

  // numa_demo.c

  #include <numa.h>

  int main() {
    if (numa_available() == -1) {
      return -1;
    }
    return 0;
  }

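The compile command is shown below. -lnuma is placed after the source file, because some linkers only resolve symbols from libraries listed after the objects that use them:

  gcc -o numa_demo numa_demo.c -lnuma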

If the code above compiles successfully, the development environment is ready. The libnuma manual is available through the man 3 numa command.

Data structures

libnuma mainly uses two structures: struct bitmask and nodemask_t .

`struct bitmask`

  struct bitmask {
      unsigned long size; /* number of bits in the map */
      unsigned long *maskp;
  };

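When I first saw this structure, I wondered why bitmask keeps the header and the bit array in two separate blocks of memory (the Linux kernel does the same) instead of storing them contiguously: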

  struct bitmask {
    unsigned long size;
    unsigned long maskp[0];
  };

After reading the numactl source, I found that maskp can point at memory owned by something else, which would be hard to support with a contiguous layout.

  void numa_bind_v1(const nodemask_t *nodemask)
  {
          struct bitmask bitmask;

          bitmask.maskp = (unsigned long *)nodemask;
          bitmask.size  = sizeof(nodemask_t);
          numa_run_on_node_mask_v2_int(&bitmask);
          numa_set_membind_v2_int(&bitmask);
  }

`nodemask_t`

  #if defined(__x86_64__) || defined(__i386__)
  #define NUMA_NUM_NODES  128 
  #else
  #define NUMA_NUM_NODES  2048
  #endif

  typedef struct {
          unsigned long n[NUMA_NUM_NODES/(sizeof(unsigned long)*8)];
  } nodemask_t;
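To make the sizing arithmetic concrete, here is a minimal sketch that mirrors the typedef above under hypothetical names ( DEMO_NUM_NODES , demo_nodemask_t and the two helpers are illustrative copies, not part of libnuma), so that it compiles without numa.h . It assumes 8-bit bytes, as the original header does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical copy of the header excerpt above, renamed to avoid clashes. */
#if defined(__x86_64__) || defined(__i386__)
#define DEMO_NUM_NODES 128
#else
#define DEMO_NUM_NODES 2048
#endif

#define DEMO_BITS_PER_LONG (sizeof(unsigned long) * 8)

typedef struct {
    unsigned long n[DEMO_NUM_NODES / (sizeof(unsigned long) * 8)];
} demo_nodemask_t;

/* Number of unsigned long words that back the mask. */
static size_t demo_nodemask_words(void) {
    return sizeof(demo_nodemask_t) / sizeof(unsigned long);
}

/* Set the bit for `node`; returns 0 on success, -1 if node is out of range. */
static int demo_nodemask_set(demo_nodemask_t *mask, unsigned int node) {
    assert(mask != NULL);
    if (node >= DEMO_NUM_NODES) {
        return -1;
    }
    mask->n[node / DEMO_BITS_PER_LONG] |= 1UL << (node % DEMO_BITS_PER_LONG);
    return 0;
}

/* Returns 1 if the bit for `node` is set, 0 otherwise (including out of range). */
static int demo_nodemask_isset(const demo_nodemask_t *mask, unsigned int node) {
    assert(mask != NULL);
    if (node >= DEMO_NUM_NODES) {
        return 0;
    }
    return (int)((mask->n[node / DEMO_BITS_PER_LONG] >> (node % DEMO_BITS_PER_LONG)) & 1UL);
}
```

Whatever the word size, the array always holds exactly DEMO_NUM_NODES bits, so the structure has a fixed size known at compile time. That is what allows numa_bind_v1 above to wrap a nodemask_t in a struct bitmask without copying.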

API details

int numa_available(void)

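This function checks whether the current system supports NUMA. It returns -1 if NUMA is not supported and 0 if it is. It must be the first libnuma function you call: if it returns -1 , the behavior of every other function is undefined.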

struct bitmask *numa_allocate_cpumask()

Allocates a bitmask whose size matches the kernel's CPU mask. In other words, the mask is large enough to represent NR_CPUS CPUs. That number of CPUs can be obtained by calling numa_num_possible_cpus() .

void numa_free_cpumask()

struct bitmask *numa_get_run_node_mask(void)

int numa_max_possible_node(void)

Returns the highest NUMA node number that the current libnuma library supports. Because node numbers start at 0 , this value plus 1 equals the result of numa_num_possible_nodes() .

int numa_num_possible_nodes()

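Returns the upper limit on the number of NUMA nodes that the current libnuma library supports. Because node numbers start at 0 , this value minus 1 equals the result of numa_max_possible_node() .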

int numa_max_node(void)

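Returns the highest node number currently available. Because numbering starts at 0 , this value is not the node count. Only when NUMA node numbers are contiguous ² is this value plus 1 the number of nodes. Operating on any node number above this value is undefined.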

int numa_num_configured_nodes()

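Returns the number of memory nodes in the current system. The count includes any nodes that are currently disabled ³. It is derived from the node entries under /sys/devices/system/node .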

int numa_node_to_cpus(int node, struct bitmask *mask)

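Given a node number, this function fills in the mask of logical CPUs that belong to that node. The result is stored in the mask parameter, which should be allocated beforehand with numa_allocate_cpumask() so that it is guaranteed to be large enough for every possible logical CPU. If mask is too small, the function returns -1 and sets errno to ERANGE . On success it returns 0 .

The following example lists the logical CPUs that belong to node 1: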

  struct bitmask *cpus_mask = numa_allocate_cpumask();
  int node_id = 1;
  int ret = numa_node_to_cpus(node_id, cpus_mask);
  if (ret == 0) {
    // A set bit i means that logical CPU i belongs to node_id.
    for (unsigned int i = 0; i < cpus_mask->size; i++) {
      if (numa_bitmask_isbitset(cpus_mask, i)) {
        printf("cpu %u is on node %d\n", i, node_id);
      }
    }
  }
  numa_free_cpumask(cpus_mask);

struct bitmask *numa_bitmask_alloc(unsigned int n)

struct bitmask *numa_bitmask_clearall(struct bitmask *bmp)

struct bitmask *numa_bitmask_clearbit(struct bitmask *bmp, unsigned int n)

int numa_bitmask_equal(const struct bitmask *bmp1, const struct bitmask *bmp2)

void numa_bitmask_free(struct bitmask *bmp)

int numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int n)

This function tests whether bit n of the mask bmp is set. It returns 1 if the bit is set and 0 otherwise. If n is beyond the size of bmp , the return value is also 0 .

unsigned int numa_bitmask_nbytes(struct bitmask *bmp)

struct bitmask *numa_bitmask_setall(struct bitmask *bmp)

struct bitmask *numa_bitmask_setbit(struct bitmask *bmp, unsigned int n)

void copy_bitmask_to_nodemask(struct bitmask *bmp, nodemask_t *nodemask)

void copy_nodemask_to_bitmask(nodemask_t *nodemask, struct bitmask *bmp)

void copy_bitmask_to_bitmask(struct bitmask *bmpfrom, struct bitmask *bmpto)

unsigned int numa_bitmask_weight(const struct bitmask *bmp)

Footnotes

¹ https://github.com/numactl/numactl

² Are node numbers always contiguous?

³ Under what circumstances is a node disabled?