
[pull] master from tensorflow:master #169

Open
wants to merge 1,648 commits into master from tensorflow:master
Conversation


@pull pull bot commented Dec 8, 2024

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

@pull pull bot added the ⤵️ pull label Dec 8, 2024
rtg0795 and others added 29 commits January 13, 2025 13:07
PiperOrigin-RevId: 715080397
Imported from GitHub PR openxla/xla#21375

In later stages of optimization, there are instances of copy fusion on the parameter of the while body. With this, we need to allow inlining of fusions while getting the induction variable index, otherwise we cannot deduce the tuple index.
Copybara import of the project:

--
ae85690876a106c4d74715fed299779e29e8e641 by Shraiysh Vaishay <svaishay@nvidia.com>:

[ds-fusion] Get While loop analysis with copy fusion

In later stages of optimization, there are instances of copy fusion on
the parameter of the while body. With this, we need to allow inlining of
fusions while getting the induction variable index, otherwise we cannot
deduce the tuple index.

Merging this change closes #21375

PiperOrigin-RevId: 715080653
Optimize this by replacing multiplication
with advancing the pointer every iteration. Also avoid reloading depth/etc.
from args every time.
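The optimization described above replaces a per-iteration index multiplication with a pointer that advances by the stride each iteration. A minimal sketch (illustrative names, not the actual depthwise-conv kernel), with the stride hoisted out of the loop rather than reloaded from arguments:

```cpp
#include <cassert>
#include <vector>

// Instead of computing data[i * stride] each iteration (one multiply
// per step), advance a pointer by `stride` (one add per step). The
// stride is a local, so it is not reloaded from args every iteration.
float SumStrided(const float* data, int n, int stride) {
  float sum = 0.0f;
  const float* p = data;  // advancing pointer replaces i * stride
  for (int i = 0; i < n; ++i) {
    sum += *p;
    p += stride;  // single add per iteration, no multiply
  }
  return sum;
}
```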

After fixing the benchmark for depthwise conv and running it, the results are noisy but appear positive overall.

name                                                                                   old cpu/op   new cpu/op   delta
BM_ConvFloatDepthwiseFwdCPU1_conv0_float/real_time  [32_112_112_3_8_24_3_3_1_2_cpu1 ]  33.4µs ±16%  34.7µs ±28%     ~     (p=0.284 n=38+39)
BM_ConvFloatDepthwiseFwdCPU4_conv0_float/real_time  [32_112_112_3_8_24_3_3_1_2_cpu4 ]  27.3µs ±57%  26.6µs ±52%     ~     (p=0.556 n=40+40)
BM_ConvFloatDepthwiseFwdCPU1_conv1_float/real_time  [32_112_112_64_1_64_3_3_1_2_cpu1]  35.6µs ±24%  36.3µs ±27%     ~     (p=0.283 n=35+40)
BM_ConvFloatDepthwiseFwdCPU4_conv1_float/real_time  [32_112_112_64_1_64_3_3_1_2_cpu4]  30.0µs ±27%  31.1µs ±33%     ~     (p=0.377 n=36+34)
BM_ConvFloatDepthwiseFwdCPU1_conv2_float/real_time  [32_56_56_128_1_128_3_3_1_2_cpu1]  32.8µs ±14%  33.1µs ±18%     ~     (p=0.761 n=33+38)
BM_ConvFloatDepthwiseFwdCPU4_conv2_float/real_time  [32_56_56_128_1_128_3_3_1_2_cpu4]  25.7µs ±57%  26.4µs ±55%     ~     (p=0.609 n=40+40)
BM_ConvFloatDepthwiseFwdCPU1_conv3_float/real_time  [32_56_56_128_1_128_3_3_2_2_cpu1]  32.2µs ±17%  31.7µs ±12%     ~     (p=0.204 n=37+35)
BM_ConvFloatDepthwiseFwdCPU4_conv3_float/real_time  [32_56_56_128_1_128_3_3_2_2_cpu4]  27.8µs ±32%  27.0µs ±24%     ~     (p=0.341 n=34+39)
BM_ConvFloatDepthwiseFwdCPU1_conv4_float/real_time  [32_28_28_128_1_128_3_3_1_2_cpu1]  32.1µs ±13%  31.9µs ±12%     ~     (p=0.470 n=39+36)
BM_ConvFloatDepthwiseFwdCPU4_conv4_float/real_time  [32_28_28_128_1_128_3_3_1_2_cpu4]  26.2µs ±30%  25.5µs ±44%     ~     (p=0.677 n=38+37)
BM_ConvFloatDepthwiseFwdCPU1_conv5_float/real_time  [32_14_14_512_1_512_3_3_1_2_cpu1]  31.5µs ±18%  31.7µs ±17%     ~     (p=0.742 n=38+39)
BM_ConvFloatDepthwiseFwdCPU4_conv5_float/real_time  [32_14_14_512_1_512_3_3_1_2_cpu4]  28.5µs ±28%  27.3µs ±29%     ~     (p=0.208 n=35+37)
BM_ConvFloatDepthwiseFwdCPU1_conv6_float/real_time  [32_7_7_1024_1_1024_3_3_1_2_cpu1]  29.3µs ±16%  28.9µs ±21%     ~     (p=0.334 n=39+31)
BM_ConvFloatDepthwiseFwdCPU4_conv6_float/real_time  [32_7_7_1024_1_1024_3_3_1_2_cpu4]  8.35µs ±62%  7.08µs ±46%  -15.24%  (p=0.026 n=40+37)
BM_ConvFloatDepthwiseFwdCPU1_conv7_float/real_time  [32_112_112_3_8_24_3_3_2_2_cpu1 ]  31.2µs ±17%  31.4µs ±22%     ~     (p=0.987 n=35+38)
BM_ConvFloatDepthwiseFwdCPU4_conv7_float/real_time  [32_112_112_3_8_24_3_3_2_2_cpu4 ]  25.9µs ±45%  26.5µs ±32%     ~     (p=0.859 n=39+38)
BM_ConvFloatDepthwiseFwdCPU1_conv8_float/real_time  [32_112_112_3_8_24_3_3_2_1_cpu1 ]  30.0µs ±16%  30.5µs ±18%     ~     (p=0.228 n=34+33)
BM_ConvFloatDepthwiseFwdCPU4_conv8_float/real_time  [32_112_112_3_8_24_3_3_2_1_cpu4 ]  26.2µs ±41%  24.4µs ±53%     ~     (p=0.288 n=36+40)
BM_ConvFloatDepthwiseFwdCPU1_conv9_float/real_time  [1_100_100_72_1_72_3_3_1_2_cpu1 ]  26.5µs ±16%  25.6µs ±15%     ~     (p=0.051 n=34+37)
BM_ConvFloatDepthwiseFwdCPU4_conv9_float/real_time  [1_100_100_72_1_72_3_3_1_2_cpu4 ]  6.33µs ±37%  5.60µs ±36%  -11.46%  (p=0.011 n=40+35)
BM_ConvFloatDepthwiseFwdCPU1_conv10_float/real_time [1_100_100_72_1_72_5_5_1_2_cpu1 ]  26.4µs ±13%  27.8µs ±20%     ~     (p=0.140 n=33+40)
BM_ConvFloatDepthwiseFwdCPU4_conv10_float/real_time [1_100_100_72_1_72_5_5_1_2_cpu4 ]  14.6µs ±79%   9.2µs ±90%  -36.75%  (p=0.000 n=40+40)

PiperOrigin-RevId: 715085497
This function is the struct size checker that's used only on the plugin side (and is only valid there, since it checks whether the struct size is greater than or equal to the expected size). Make the error text it generates clear that the plugin version is later than the framework's and is therefore an unsupported combination.
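A minimal sketch of such a plugin-side check (illustrative names, not the actual TF C API): if the struct the framework passes is smaller than what the plugin was compiled to expect, the plugin is newer than the framework and the combination is unsupported.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Plugin-side forward-compatibility check: the framework's struct must
// be at least as large as the plugin expects. A smaller struct means
// the plugin was built against a newer API than the framework supports.
bool CheckStructSize(size_t actual, size_t expected, std::string* error) {
  if (actual < expected) {
    *error =
        "plugin was built against a newer API version than the "
        "framework provides; this combination is unsupported";
    return false;
  }
  return true;
}
```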

PiperOrigin-RevId: 715086761
…ns when breaking circular cp for better traceability.

PiperOrigin-RevId: 715104228
…is CL makes two minor tweaks to the `BasicStringArray` class (the string array implementation in the PjRt-IFRT backend): (1) `CopyToHostBuffer` now supports the host buffer semantics of `kImmutableUntilTransferCompletes`. (2) `FullyReplicated` now works with `ConcreteSharding`.

PiperOrigin-RevId: 715110567
PiperOrigin-RevId: 715112095
…duce more of its member function to C Api.

PiperOrigin-RevId: 715113720
…ting during instruction fusion.

The computations are not being sorted in a semantically meaningful order; they are sorted by instruction count with ties being broken consistently but arbitrarily (based on a hash of the string representation of the computation). There is therefore no reason why these passes need to traverse the computations in this specific order.
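The ordering described above can be sketched as follows (illustrative types, not the actual XLA pass code): sort by instruction count, breaking ties consistently but arbitrarily via a hash of the computation's string representation.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-in for an HLO computation: its instruction count and its
// string representation.
struct Computation {
  int instruction_count;
  std::string repr;
};

// Sort by instruction count; ties are broken by a hash of the string
// representation, which is consistent within a run but semantically
// arbitrary.
void SortComputations(std::vector<Computation>& cs) {
  std::sort(cs.begin(), cs.end(),
            [](const Computation& a, const Computation& b) {
              if (a.instruction_count != b.instruction_count)
                return a.instruction_count < b.instruction_count;
              return std::hash<std::string>{}(a.repr) <
                     std::hash<std::string>{}(b.repr);
            });
}
```

Since the tiebreak carries no semantic meaning, passes that iterate over this order should not depend on it.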

PiperOrigin-RevId: 715122377
PiperOrigin-RevId: 715131342
…ot_handler.cc.

This function creates a copy for the HloInstruction in the given PartitionedHlo and returns a new PartitionedHlo for the copy. This can be reused by other operators (like gather/scatter).

PiperOrigin-RevId: 715138672
Breaking internal tests

Reverts 5e78ccd

PiperOrigin-RevId: 715160248
…stom op.

Attributes will determine behavior

PiperOrigin-RevId: 715161504
PiperOrigin-RevId: 715162849
Only bitcast requires the layout to be known when evaluating HLO.
In all other cases, we can evaluate without knowing the layout.
This is needed for collective pipelining, where we have to analyse while loops before layouts are assigned.

PiperOrigin-RevId: 715163612
A gather has two operands, input and indices. If they point to the same instruction, create a copy for indices.

A scatter has n inputs, 1 indices, and n updates (2n+1 operands in total). Overlap is allowed among the n inputs, and likewise among the n updates. We need to create a copy if
* indices overlap with any input or update
* an update overlaps with any input

The added copy will be removed if it is redundant in the following memory related passes (e.g., CopyInsertion).
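The scatter rule above can be sketched as a small predicate (illustrative representation: operands identified by instruction id, so equal ids mean the operands point to the same instruction):

```cpp
#include <cassert>
#include <vector>

// Returns true if a copy must be inserted for a scatter's operands:
// indices may not alias any input or update, and an update may not
// alias an input. Inputs may alias each other, as may updates.
bool ScatterNeedsCopy(const std::vector<int>& inputs, int indices,
                      const std::vector<int>& updates) {
  for (int in : inputs)
    if (in == indices) return true;  // indices alias an input
  for (int up : updates) {
    if (up == indices) return true;  // indices alias an update
    for (int in : inputs)
      if (up == in) return true;     // an update aliases an input
  }
  return false;
}
```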

PiperOrigin-RevId: 715164959
This CL takes care of
1. Migrating external projects dependencies from

```
tensorflow/compiler/xla:test
tensorflow/compiler/xla:test_helpers
tensorflow/compiler/xla/service:pattern_matcher_gmock
```

to `tensorflow/compiler/xla/hlo/testlib:*`

PiperOrigin-RevId: 715167258
…dleDotHelper`.

`HandleDotHelper` is called once for a single dot operation, while `PartitionDot` can be called many times. We need to consider adding a copy only once.

PiperOrigin-RevId: 715189518
Imported from GitHub PR openxla/xla#21104

The transpose folding pass doesn't preserve the backend config when creating the new dot with the transpose folded. This change copies the old dot's config to the new dot.
Copybara import of the project:

--
d2d6b628af1cab777a210e4ac62184e52fe9f4a9 by TJ Xu <tjx@nvidia.com>:

Preserve backend config when folding transpose

--
6b5fa3a1cb70a790803e3ac57ff8329690e88e5e by TJ Xu <tjx@nvidia.com>:

use SetupDerivedInstruction instead of just copying the backend config

Merging this change closes #21104

PiperOrigin-RevId: 715204523
This method was renamed, but the staging function was kept; switch to the renamed variant.

PiperOrigin-RevId: 715208433
PiperOrigin-RevId: 715229132
cota and others added 30 commits January 17, 2025 07:09
Add a "simple" scatter benchmark with a reduce combiner.

PiperOrigin-RevId: 716654572
global_data.h is deprecated

PiperOrigin-RevId: 716654755
These will soon be shared between CPU and GPU.

PiperOrigin-RevId: 716666785
Updating:
 - `env.h`
 - `env_time.h`
 - `errors.h`
 - `file_statistics.h`
 - `file_system.h`
 - `file_system_helper.h`
 - `logging.h`
 - `macros.h`
 - `status.h`
 - `status_matchers.h`
 - `status_to_from_proto.h`
 - `statusor.h`
 - `test.h`
 - `test_benchmark.h`
 - `threadpool.h`
 - `threadpool_async_executor.h`
 - `threadpool_interface.h`
 - `threadpool_options.h`
 - `types.h`

and associated targets.

PiperOrigin-RevId: 716721921
Add DutyCycleCombiner for handling intra and inter chip duty cycle aggregation.
Fix DutyCycleTracker bugs with idleness and duplicate active times.
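The duplicate-active-time bug mentioned above comes down to not double-counting overlapping active intervals when aggregating. A minimal sketch of the aggregation (illustrative names, not the actual profiler code): merge overlapping intervals first, then sum.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using Interval = std::pair<long, long>;  // [start, end) active span

// Total active time with overlapping or duplicate intervals merged,
// so the same active span is never counted twice. Duty cycle is then
// TotalActive(...) / total_wall_time.
long TotalActive(std::vector<Interval> iv) {
  std::sort(iv.begin(), iv.end());
  long total = 0, cur_start = 0, cur_end = 0;
  bool open = false;
  for (const auto& [s, e] : iv) {
    if (!open || s > cur_end) {
      if (open) total += cur_end - cur_start;  // close previous run
      cur_start = s;
      cur_end = e;
      open = true;
    } else {
      cur_end = std::max(cur_end, e);  // extend overlapping run
    }
  }
  if (open) total += cur_end - cur_start;
  return total;
}
```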

PiperOrigin-RevId: 716741025
…ting behavior unchanged.

Only when coordination_agent_recoverable is set does the agent try to reconnect to the cluster, which can lead to an AlreadyExists error. In that case the error can be handled by checking that the existing topology is the same as the new one.
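A minimal sketch of that handling (illustrative names, not the actual coordination-service API): a recoverable agent that hits AlreadyExists on reconnect is accepted only if the topology it reports matches the one already stored.

```cpp
#include <cassert>
#include <string>

// Decide whether a reconnect that raised AlreadyExists is acceptable:
// only a recoverable agent may reconnect, and only if its topology
// matches what the service already has on record.
bool AcceptReconnect(bool recoverable, const std::string& stored_topology,
                     const std::string& new_topology) {
  if (!recoverable) return false;          // existing behavior: reject
  return stored_topology == new_topology;  // benign duplicate register
}
```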

PiperOrigin-RevId: 716748621
…ets.

This is a step towards encouraging external projects to migrate to the already
migrated hlo sub-components.

PiperOrigin-RevId: 716762154
… reshape ops

Allow a single bounded dynamic dimension. This is likely a short-term fix, as bounded dynamism as a whole likely needs a lot of thought, but this solution with a single bounded dim is unambiguous, so it should be safe.

PiperOrigin-RevId: 716763010
Updates LLVM usage to match
[bf17016a92bc](llvm/llvm-project@bf17016a92bc)

PiperOrigin-RevId: 716773763
The attribute should be named `channel_handle`, not `channel_id`.

PiperOrigin-RevId: 716822019
…on host memory as host compute. This, of course, excludes DynamicUpdateSlices which are used for host offloading DMAs.

PiperOrigin-RevId: 716839236
PiperOrigin-RevId: 716844952
PiperOrigin-RevId: 716860231
PiperOrigin-RevId: 716914254
PiperOrigin-RevId: 716914439
…size, not just the size of the newly added values.
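The pitfall being fixed is a common one with `reserve`-style APIs: they take the total final capacity, not an increment. A minimal sketch (illustrative, using `std::vector` rather than the actual container in the change):

```cpp
#include <cassert>
#include <vector>

// reserve() takes the total capacity. Passing only src.size() would be
// a no-op whenever dst already holds that many elements, defeating the
// point of reserving; we must pass the final total size.
void AppendAll(std::vector<int>& dst, const std::vector<int>& src) {
  dst.reserve(dst.size() + src.size());  // total size, not just src.size()
  dst.insert(dst.end(), src.begin(), src.end());
}
```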

PiperOrigin-RevId: 716915408
PiperOrigin-RevId: 716930131
PiperOrigin-RevId: 716939752
PiperOrigin-RevId: 716945103
PiperOrigin-RevId: 716955798
PiperOrigin-RevId: 716956415
PiperOrigin-RevId: 716958380