
[CPU] Enable memory reuse for nested graphs #27521

Open · EgorDuplensky wants to merge 2 commits into master from enable_memory_reuse_for_nested_graphs

Conversation

EgorDuplensky
Contributor

@EgorDuplensky EgorDuplensky commented Nov 12, 2024

Details:

  • All nested graphs are now part of the global memory reuse logic
  • The core memory reuse logic is untouched (only slightly refactored)
  • Instead of solving memory reuse separately for every graph / subgraph, all the edges and global execution indices are first collected from the "virtually flattened" graph, and then memory reuse is solved once for the whole model.
  • All the nodes with nested graphs are updated, including:
    1. LoRa
    2. Composite
    3. If
    4. TensorIterator
    5. Convolution + Sum fallback subgraph
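A minimal sketch of the flattened-reuse idea above (hypothetical, simplified types, not the actual plugin code): every graph, nested or not, registers its edges into one shared allocation context, and the reuse problem is solved once over global execution indices.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch, not the actual OpenVINO code: edges from the parent
// graph and from all nested subgraphs are registered into one shared
// AllocationContext, then memory reuse is solved once for the whole model.
struct Edge {
    int firstUse, lastUse;  // lifetime in global execution indices
    long size;              // bytes required
    long offset = -1;       // assigned by the solver
};

struct AllocationContext {
    std::vector<Edge*> edges;  // collected from the "virtually flattened" graph

    void registerEdge(Edge* e) { edges.push_back(e); }

    // Greedy solver: an edge reuses an existing slot whenever the slot's
    // previous occupant's lifetime has already ended.
    long solve() {
        struct Slot { long offset; long size; int busyUntil; };
        std::vector<Slot> slots;
        long total = 0;
        std::sort(edges.begin(), edges.end(),
                  [](const Edge* a, const Edge* b) { return a->firstUse < b->firstUse; });
        for (Edge* e : edges) {
            bool placed = false;
            for (Slot& s : slots) {
                if (s.busyUntil < e->firstUse && s.size >= e->size) {
                    e->offset = s.offset;      // reuse memory that is already free
                    s.busyUntil = e->lastUse;
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                slots.push_back({total, e->size, e->lastUse});
                e->offset = total;
                total += e->size;              // grow the shared arena
            }
        }
        return total;  // peak memory of the whole flattened model
    }
};
```

Because a nested graph's edges enter the same context as the parent's, a subgraph tensor can reuse memory freed by a parent-graph tensor, which a per-graph solver could not do.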

Tickets:

  • ticket-id

@EgorDuplensky EgorDuplensky requested review from a team as code owners November 12, 2024 10:53
@github-actions github-actions bot added labels: category: inference (OpenVINO Runtime library - Inference), category: CPU (OpenVINO CPU plugin), category: build (OpenVINO cmake script / infra) Nov 12, 2024
@EgorDuplensky EgorDuplensky changed the title Enable memory reuse for nested graps Enable memory reuse for nested graphs Nov 12, 2024
@EgorDuplensky EgorDuplensky changed the title Enable memory reuse for nested graphs [CPU] Enable memory reuse for nested graphs Nov 12, 2024
@EgorDuplensky EgorDuplensky force-pushed the enable_memory_reuse_for_nested_graphs branch from da1d1a0 to 18563de Compare November 12, 2024 15:27
@EgorDuplensky EgorDuplensky force-pushed the enable_memory_reuse_for_nested_graphs branch 2 times, most recently from 4734e61 to e140018 Compare November 13, 2024 13:32
@github-actions github-actions bot removed the category: inference (OpenVINO Runtime library - Inference) label Nov 13, 2024
@EgorDuplensky
Contributor Author

@maxnick Ready for review. Could you please take a look?

@EgorDuplensky EgorDuplensky force-pushed the enable_memory_reuse_for_nested_graphs branch from e140018 to d5bda07 Compare November 13, 2024 13:51
@EgorDuplensky
Contributor Author

Added a fix for the Convolution + Sum fallback graph.
Now such a graph is also part of memory reuse.

src/plugins/intel_cpu/src/compiled_model.cpp (outdated, resolved)
@@ -82,6 +82,7 @@ class Edge {
}

std::string name() const;
const MemoryDesc& getDesc() const;
Contributor

This method is private on purpose. As long as it's private, a node implementation may retrieve a tensor descriptor from an edge only via the Memory object itself. But moving this method to the public scope introduces two ways of getting a memory descriptor: using getDesc and using the memory object. Of course, these two approaches provide different memory descriptors, but making getDesc public increases code complexity by providing more degrees of freedom. Is it strictly necessary to move this method to the public scope?

Contributor Author

@EgorDuplensky commented Nov 25, 2024

I have added asserts, so this method can only be called before memory is allocated.
This is actually a more robust approach than just using the friend mechanism and relying on the graph not using it in a wrong way.
Also, there is no way to prevent a developer from accessing all these data structures via a chain of other existing public methods (getParent()->getSelectedPrimitiveDescriptor()->etc.).
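A rough illustration of such a guard (hypothetical names, and an exception in place of a debug assert so the behavior is observable):

```cpp
#include <stdexcept>

// Hypothetical sketch: a descriptor accessor that is only legal to call
// before the edge's memory has been allocated. Once allocation happens,
// descriptors must be taken from the Memory object instead.
struct MemoryDesc { long size; };

class Edge {
public:
    void setDesc(MemoryDesc d) { desc = d; }
    void allocate() { allocated = true; }

    const MemoryDesc& getDesc() const {
        // Guard: after allocation this accessor is off limits.
        if (allocated)
            throw std::logic_error("getDesc() called after memory allocation");
        return desc;
    }

private:
    MemoryDesc desc{0};
    bool allocated = false;
};
```

The guard narrows the window in which the public accessor is usable, which is the robustness argument made above.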

Contributor


This is more robust only from the scope-change standpoint. But it doesn't help the user of the interface to use the object in a particular way, as it gives more degrees of freedom, which results in more opportunities for mistakes.
It looks like this method was moved to the public scope to be accessed from the auxiliary static functions, which were introduced to keep the graph code clearer. Even though the downsides of changing the scope of this method outweigh the benefits, it's clear that it would be excessive to solve the problem of accessing the memory descriptor via special routines that would basically do the same thing as this method. So we have two solutions here:

  1. Convert the static routines to private Graph methods, so that the getDesc call, remaining private, is still accessible, as Graph is still a "friend" of the Edge class
  2. Rename the method so it becomes clear to the user of the interface when to use it: "getDesc" -> "getCompileTimeDesc" / "getInitializationDesc" or something like that, just to help people use the method properly and avoid mistakes.

Contributor Author

@EgorDuplensky commented Dec 4, 2024


For me, these are just two different workarounds, neither better nor worse.
The only 'real' fix for this problem is to make the selectedPrimitiveDescriptor-related data structures temporary (which they actually are).
Collecting those descriptors, processing them and applying them should just be an intermediate stage in the graph pipeline, like many other stages.
So it should not be possible to access them at the inference stage, simply because they no longer exist by then.
And they should not be named "getCompileTimeDesc", because one would only be able to access them during the CompileModel stage anyway.
And still, currently nothing prevents a developer from accessing all these data structures via other public methods.
Nothing prevents the Graph from accessing them at the wrong stage of the pipeline.

Contributor


Then let's help the interface users avoid unnecessary mistakes and give the method a more specific name, like "getOriginalDesc" or "getInitialDesc", or any other variant that helps make the purpose of this call clear.

src/plugins/intel_cpu/src/edge.h (outdated, resolved)
Comment on lines 69 to 76
/**
* Obsolete way of creating graph
* To enable layout propagation and global memory reuse
* two-stage creation should be used instead:
* - Init()
* - Allocate()
*/
template<typename NET>
void CreateGraph(NET &model, const GraphContext::CPtr context);
Contributor


  1. Can we use the OPENVINO_DEPRECATED("Obsolete way of creating graph") macro?
  2. Why can't we simply remove these methods?

Contributor Author


I am just not 100% sure that we don't need some wrapper method for the Init() + Activate() calls because, again, I am not sure whether we will be able to use this two-step initialization in the scope of CompiledModel / InferRequest before we actually try to do so.
What do you think?

Contributor


As far as I can see, after the changes made in this PR there remain only two places where CreateGraph is called: executor_factory.hpp and graph_emitter.hpp, both of which are just placeholders for future functionality, so they may easily be changed. Therefore, it looks like CreateGraph(NET &model, const GraphContext::CPtr context) can simply be removed. However, it may require a few changes to the unit tests. It may actually be done in a separate PR; I'm just afraid we'll simply forget about it.

Contributor Author

@EgorDuplensky commented Dec 4, 2024


Let me rephrase. I am not sure we don't want to keep a simplified version of graph creation for cases where one (i.e. unit tests) does not want to call two functions explicitly, because one does not care about layout propagation and memory reuse.
From an API point of view, it also seems fine to have two versions:

  1. Just create me a graph (CreateGraph()) using some default memory configuration
  2. Prepare this, apply that, set up something, and then create me a more proper / optimized version of the graph (Init() -> Activate())
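The two flows above could coexist roughly like this (a sketch with hypothetical, simplified signatures; the real Graph API differs):

```cpp
#include <cassert>

// Hypothetical sketch of the two creation flows discussed in this thread:
//  1. CreateGraph(model)           - one call, default memory configuration
//  2. Init(model) then Activate()  - two stages, allowing the caller to hook
//     in layout propagation / global memory reuse between the two calls
struct Model {};

class Graph {
public:
    void Init(const Model&) { stage = Stage::Initialized; }

    void Activate() {
        assert(stage == Stage::Initialized && "Init() must be called first");
        stage = Stage::Ready;
    }

    // Convenience wrapper for callers (e.g. unit tests) that do not care
    // about the intermediate stage.
    void CreateGraph(const Model& m) {
        Init(m);
        Activate();
    }

    bool isReady() const { return stage == Stage::Ready; }

private:
    enum class Stage { NotReady, Initialized, Ready } stage = Stage::NotReady;
};
```

The wrapper keeps the simple one-call API while making the two-stage path the primary one.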

Contributor


If we are not sure, then just leave this method as it is. If we are sure that it is obsolete, then remove it. As simple as that.

Contributor Author


I have updated the comments for the CreateGraph method.

src/plugins/intel_cpu/src/graph.h (outdated, resolved)
Comment on lines +300 to +332
virtual bool canBeSkipped() const {
return getSelectedPrimitiveDescriptor()->hasZeroInputDims();
}
Contributor


The name is rather vague. Where can it be skipped from? Apparently we need to change the name to make it clearer.
What about isNop?

Contributor Author


Let's move this naming discussion to the end of the queue.

Contributor


Is now not the right time?

src/plugins/intel_cpu/src/nodes/composite.cpp (outdated, resolved)
src/plugins/intel_cpu/src/nodes/conv.cpp (resolved)
src/plugins/intel_cpu/src/nodes/lora.cpp (outdated, resolved)
@EgorDuplensky EgorDuplensky force-pushed the enable_memory_reuse_for_nested_graphs branch 2 times, most recently from 408e748 to 11a5115 Compare November 25, 2024 15:28
src/plugins/intel_cpu/src/graph.cpp (outdated, resolved)
Comment on lines 758 to 757
// @todo can we add some meaningful assert here?
for (auto &edge : cluster) {
if (edge->getParent()->isConstant() && edge->getStatus() == Edge::Status::NeedAllocation) {
allocateConstantEdge(edge);
}
}
return false;
Contributor


To my understanding, only the base edge may have the NeedAllocation status, and it is already known as the result of the search above. Do we really need to iterate once again?

Contributor Author


Is it true for constants as well? I was trying to replicate the logic of the original loop where possible, so as not to break anything.
I will try to remove it.

Contributor Author


@maxnick Can we apply it for the String logic as well?

Contributor


Yes, we can. The whole idea of a cluster is that it's a set of edges sharing the same memory block. And this memory block is located inside a specific memory object of a specific edge. So there cannot be two edges with the NeedAllocation status; otherwise it would not be clear which memory block should be shared across the edges in the cluster.
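The invariant described here, exactly one NeedAllocation edge per cluster, can be sketched as (hypothetical, simplified types, not the actual Edge class):

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch: a cluster is a set of edges sharing one memory
// block; exactly one edge (the base edge) owns the allocation, the rest
// reuse its memory block.
enum class Status { NeedAllocation, SharedBuffer };

struct Edge { Status status; };

// Returns the single base edge while checking the cluster invariant.
Edge* findBaseEdge(const std::vector<Edge*>& cluster) {
    Edge* base = nullptr;
    for (Edge* e : cluster) {
        if (e->status == Status::NeedAllocation) {
            assert(base == nullptr && "only one NeedAllocation edge per cluster");
            base = e;
        }
    }
    return base;
}
```

With this invariant, a second pass over the cluster looking for NeedAllocation edges is redundant once the base edge has been found.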

src/plugins/intel_cpu/src/graph.cpp (outdated, resolved)
Comment on lines 118 to 123
const auto& subgraphInputMemory = subgraphInputNode->getDstMemoryAtPort(0);
auto mem = std::make_shared<Memory>(getEngine(), subgraphInputMemory->getDescPtr(), subgraphInputMemory->getMemoryBlock());
subgraphMemoryPtrs.push_back(mem);
Contributor


Can't we then simply store subgraphInputMemory in subgraphMemoryPtrs in order to update its memory descriptor during inference?

Contributor Author


I thought that was one of the ideas behind the workaround: not to use the original memory.
Am I missing something?
Or do you mean it is no longer applicable given this PR's changes?

Contributor


The thing is, on master, mem is used inside the subgraph: we create it first using the memory block from the parent graph, store it in subgraphMemoryPtrs, and then pass it to the subgraph via the Activate call. But in this PR we bind the edges, not the memory objects, so they share a common memory block but automatically have different memory objects (we don't need to create new memory objects for the subgraph input edges and pass them to the subgraph via Activate). So mem is simply stored in subgraphMemoryPtrs and shares the same memory block, but it is not the same memory object used inside the subgraph. However, we still have to transfer the shape, as the internal subgraph input memory object is different from what is used in the parent graph; someone needs to transfer the shape. Thus, instead of creating separate mem objects, we need to store the input memory objects of the internal subgraph, so we can access them efficiently and update their shapes.
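The arrangement described above, two Memory objects sharing one memory block while each carries its own shape, can be sketched as (hypothetical, simplified types, not the actual Memory class):

```cpp
#include <memory>
#include <vector>

// Hypothetical sketch: the parent-graph and subgraph input memories are
// distinct Memory objects sharing one MemoryBlock. Because the objects
// differ, the parent's new shape must be transferred to the subgraph
// memory explicitly on each inference.
struct MemoryBlock { std::vector<char> data; };

struct Memory {
    std::shared_ptr<MemoryBlock> block;  // shared storage
    std::vector<long> shape;             // per-object shape

    void redefineShape(const std::vector<long>& s) {
        shape = s;
        std::size_t bytes = 1;
        for (long d : s) bytes *= static_cast<std::size_t>(d);
        if (block->data.size() < bytes)
            block->data.resize(bytes);   // grow the shared storage if needed
    }
};
```

Storing the subgraph's own input Memory objects (instead of creating separate ones) is what makes this explicit shape transfer cheap to perform at infer time.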

@EgorDuplensky EgorDuplensky force-pushed the enable_memory_reuse_for_nested_graphs branch 2 times, most recently from 7f3fadf to 9e170ae Compare December 10, 2024 17:56
Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Dec 26, 2024
@mg-intel mg-intel removed the Stale label Jan 2, 2025
Contributor

This PR will be closed in a week because of 2 weeks of no activity.

@github-actions github-actions bot added the Stale label Jan 17, 2025
@mg-intel mg-intel removed the Stale label Jan 17, 2025
@EgorDuplensky EgorDuplensky force-pushed the enable_memory_reuse_for_nested_graphs branch 9 times, most recently from 9e2648a to efb84d1 Compare January 21, 2025 19:24
Sync node indexes must be registered to global allocation context
in order.
@EgorDuplensky EgorDuplensky force-pushed the enable_memory_reuse_for_nested_graphs branch from af3a926 to 705eabf Compare January 25, 2025 22:43
Labels
category: build (OpenVINO cmake script / infra), category: CPU (OpenVINO CPU plugin)
3 participants