Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redundant accesses due to missing predicates in resize #3710

Open
naoyam opened this issue Jan 15, 2025 · 0 comments · May be fixed by #3713
Open

Redundant accesses due to missing predicates in resize #3710

naoyam opened this issue Jan 15, 2025 · 0 comments · May be fixed by #3713
Assignees

Comments

@naoyam
Copy link
Collaborator

naoyam commented Jan 15, 2025

Just realized a pad may be scheduled in a redundant way with the resize scheduler. For example:

TEST_F(PredicateIndexingTest, SplitAndResize) {
  Fusion fusion;
  FusionGuard fg(&fusion);

  const int64_t i0 = 4;
  const int64_t i1 = 32;

  // [i0*i1]
  auto tv0 = makeContigConcreteTensor({i0 * i1});
  fusion.addInput(tv0);

  auto tv1 = set(tv0);
  // [i0, i1]
  auto tv2 = reshape(tv1, {IrBuilder::create<Val>(i0), IrBuilder::create<Val>(i1)});
  // [i0, i1 * 2]
  auto tv3 = pad(tv2, {zero, IrBuilder::create<Val>(i1)});
  auto tv4 = set(tv3);
  fusion.addOutput(tv4);

  scheduler_tools::propagateResizeToInputs(tv3->definition());

Here, it just manually reproduces what the resize scheduler would do by propagating the resize of the pad. The generated kernel would look like:

__global__ void CUDAGeneratedKernel(Tensor<float, 1, 1> T0, Tensor<float, 2, 2> T4) {
  float T1[256LL];
  #pragma unroll
  for(nvfuser_index_t i0 = 0LL; i0 < 4LL; ++i0) {
    nvfuser_index_t i1;
    i1 = 64LL * i0;
    #pragma unroll
    for(nvfuser_index_t i2 = 0LL; i2 < 64LL; ++i2) {
      T1[(i1 + i2)] = 0LL;
    }
  }
  #pragma unroll
  for(nvfuser_index_t i0 = 0LL; i0 < 4LL; ++i0) {
    nvfuser_index_t i3;
    i3 = 32LL * i0;
    nvfuser_index_t i4;
    i4 = 64LL * i0;
    #pragma unroll
    for(nvfuser_index_t i2 = 0LL; i2 < 64LL; ++i2) {
      if (((96LL + i2) < 128LL)) {
        T1[(i4 + i2)]
           = T0[(i3 + i2)];
      }
    }
  }
  float T2[256LL];
  #pragma unroll
  for(nvfuser_index_t i5 = 0LL; i5 < 4LL; ++i5) {
    nvfuser_index_t i6;
    i6 = 64LL * i5;
    #pragma unroll
    for(nvfuser_index_t i7 = 0LL; i7 < 64LL; ++i7) {
      nvfuser_index_t i8;
      i8 = i6 + i7;
      T2[i8]
         = T1[i8];
    }
  }
  // Alias Allocation - register
  auto& T3 = T1;
  #pragma unroll
  for(nvfuser_index_t i9 = 0LL; i9 < 4LL; ++i9) {
    nvfuser_index_t i10;
    i10 = 64LL * i9;
    #pragma unroll
    for(nvfuser_index_t i11 = 0LL; i11 < 64LL; ++i11) {
      nvfuser_index_t i12;
      i12 = i10 + i11;
      T3[i12]
         = (i11 < 32LL) ? T2[i12] : 0.000000000e+00f;
    }
  }
  #pragma unroll
  for(nvfuser_index_t i13 = 0LL; i13 < 4LL; ++i13) {
    nvfuser_index_t i14;
    i14 = 64LL * i13;
    #pragma unroll
    for(nvfuser_index_t i15 = 0LL; i15 < 64LL; ++i15) {
      nvfuser_index_t i16;
      i16 = i14 + i15;
      T4[i16]
         = T3[i16];
    }
  }
}

Here, the important part is the predicate of the T0 read:

  for(nvfuser_index_t i0 = 0LL; i0 < 4LL; ++i0) {
    nvfuser_index_t i3;
    i3 = 32LL * i0;
    nvfuser_index_t i4;
    i4 = 64LL * i0;
    #pragma unroll
    for(nvfuser_index_t i2 = 0LL; i2 < 64LL; ++i2) {
      if (((96LL + i2) < 128LL)) {   // <-- here
        T1[(i4 + i2)]
           = T0[(i3 + i2)];
      }
    }
  }

This is because T1 is scheduled as:

T1_l_float[iS11{4}, iS13{64}]
 logical domain : (iS1{128})
 contiguity: t
  Outer split: iS1{128} by factor 4 -> iS11{4}, iS12{32}
  Resize: iS12{32} by 0 and 32 -> iS13{64}
 loop domain : (iS11{4}, iS13{64})

The loop indices of i0 and i2 correspond to the loop IDs of iS11 and iS13, respectively. We predicate the logical domain of T1 by generating the corresponding index for iS1.

While this should have no issue with the correctness, half of the accesses would be redundant as they would be masked out by the actual pad expression:

      T3[i12]
         = (i11 < 32LL) ? T2[i12] : 0.000000000e+00f;

This is because of the split of the reshape. Due to the split, indices that exceed the resize input extent of iS12 may not be predicated out. This is quite similar to the non-divisible split, but because of the resize it can happen even with divisible splits like this case.

I don't think this is a correctness issue, but it's likely a performance issue unless nvrtc is smart enough to figure out the accesses are indeed redundant and eliminate the redundant accesses, which is unlikely.

@naoyam naoyam self-assigned this Jan 15, 2025
naoyam added a commit that referenced this issue Jan 15, 2025
naoyam added a commit that referenced this issue Jan 15, 2025
@naoyam naoyam linked a pull request Jan 15, 2025 that will close this issue
naoyam added a commit that referenced this issue Jan 17, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant