How to use all available cores? #25716
Replies: 1 comment
Answering this one for myself: you should use |
Here's a short program that compares (unsharded) `jax.jit` and `jax.shard_map` on a simple monolithic array operation.
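Something along these lines (not the exact script, which isn't reproduced here; the array sizes, the choice of 16 host-platform devices, and the sin/cos-plus-matmul workload are all just placeholders):

```python
import os
# Expose multiple CPU "devices" so shard_map has something to shard over.
# Must be set before importing jax; 16 roughly matches the 12+4 cores.
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=16"

from functools import partial
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map  # newer JAX also exposes jax.shard_map

x = jnp.ones((16, 2048, 2048))

@jax.jit
def monolithic(x):
    # One big batched matmul over the whole array.
    return jnp.sin(x) @ jnp.cos(x)

mesh = Mesh(np.array(jax.devices()), axis_names=("i",))

@jax.jit
@partial(shard_map, mesh=mesh, in_specs=P("i"), out_specs=P("i"))
def sharded(x):
    # Same work, but each device gets one (1, 2048, 2048) block.
    return jnp.sin(x) @ jnp.cos(x)

# Run each variant repeatedly and watch core usage in htop.
for _ in range(20):
    monolithic(x).block_until_ready()
for _ in range(20):
    sharded(x).block_until_ready()
```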
The numbers are basically arbitrary; they just let me open `htop` and eyeball how many cores each operation is using:

- The `jax.jit` operation uses only 4 cores, no matter what. It reaches ~360% CPU utilization.
- `shard_map` uses all available cores. It reaches ~1350% CPU utilization on my 12+4 perf+efficiency M3 MacBook.

Here are some other permutations of the XLA_FLAGS settings:
- Trying to disable all threading, unsuccessfully: `jax.jit` ~360%, `shard_map` ~360% (with a mesh of size (1, 1, 1, 1)).
- Using 2 devices: `jax.jit` ~360%, `shard_map` ~1300% (with a mesh of size (2, 1, 1, 1)).
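Roughly, those settings were along these lines (a reconstruction, not the exact invocations; `--xla_cpu_multi_thread_eigen=false intra_op_parallelism_threads=1` is the combination commonly suggested for single-threaded XLA CPU execution, and the flags have to be set before importing jax):

```python
import os

# "Disable all threading" attempt, with a single CPU device (mesh of size (1, 1, 1, 1)).
os.environ["XLA_FLAGS"] = (
    "--xla_force_host_platform_device_count=1 "
    "--xla_cpu_multi_thread_eigen=false "
    "intra_op_parallelism_threads=1"
)

# For the 2-device case (mesh of size (2, 1, 1, 1)),
# use --xla_force_host_platform_device_count=2 instead.
```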
What I'm trying to do
My goal is really just to get the best possible performance out of my machine. My real workload is a plasma simulation written with the `jax.numpy` API, and the core timestepping loop has an iteration time of, say, 300 ms for a nontrivial problem. Based on the experiments above, I have to conclude that I have zero idea how to control which CPU cores the code runs on: I can't even reach full utilization of the available CPUs, let alone implement more sophisticated ideas like thread pinning.
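(For the record, the per-iteration figure comes from a timing loop like the one below; `step` here is just a stand-in for the real timestep, and the `block_until_ready()` call matters because JAX dispatches work asynchronously.)

```python
import time
import jax
import jax.numpy as jnp

@jax.jit
def step(state):
    # Stand-in for one timestep of the actual jax.numpy simulation.
    return jnp.sin(state) * 1.0001

state = jnp.ones((4096, 4096))
state = step(state)            # warm-up run triggers compilation
state.block_until_ready()

t0 = time.perf_counter()
n = 10
for _ in range(n):
    state = step(state)
state.block_until_ready()      # wait for async dispatch before stopping the clock
print(f"{(time.perf_counter() - t0) / n * 1e3:.1f} ms per iteration")
```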
I think there are two possible routes:

1. Stick with the `jax.jit` decorators on my `jax.numpy` code, and hope that the Eigen thread pool is sophisticated enough to make full use of the 40 cores on my production machines. What is the flag that controls the number of threads used by Eigen? Why is it apparently limited to 4 in my tests, no matter what I do?
2. Rewrite the code with `shard_map`. This is appealing in some sense because it is quite similar to MPI, which is a standard tool in our toolkit (a sketch of the analogy follows below). However, MPI is only reliably fast if data is local to a core and there aren't rogue thread pools stomping all over each other. If there's no way to ensure that each shard gets one thread, then I can't see any reason to use `shard_map` at all.
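To make the MPI analogy concrete, here is a minimal sketch (illustrative only, not from my simulation): inside a `shard_map` body each shard behaves like an MPI rank, and `psum` plays the role of `MPI_Allreduce`. Whether each of those shards actually gets its own dedicated thread is exactly the open question above.

```python
from functools import partial
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, PartitionSpec as P
from jax.experimental.shard_map import shard_map

ndev = jax.device_count()
mesh = Mesh(np.array(jax.devices()), axis_names=("i",))

@jax.jit
@partial(shard_map, mesh=mesh, in_specs=P("i"), out_specs=P())
def global_sum(x):
    local = jnp.sum(x)                         # per-shard work, like per-rank work in MPI
    return jax.lax.psum(local, axis_name="i")  # collective, like MPI_Allreduce

# The leading axis must divide evenly across the devices in the mesh.
print(global_sum(jnp.ones((ndev, 1000))))
```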