Commit
Support lazy tensor allocation (#193)
Summary:

Support lazy tensor allocation.

The current algorithm for allocating tensors in et_replay is to find the tensors that cannot be generated when all ops are replayed, pre-allocate them before replay starts, and keep them across iterations. However, this algorithm leads to OOM when replaying the Llama4 70B model.

This PR introduces the TensorAllcationMode class:

```python
class TensorAllcationMode(Enum):
    """
    Enum to represent the tensor allocation mode
    """

    # Allocate input tensors that can not be generated when replaying the trace
    # at the beginning and reuse them for all iterations.
    PRE_ALLOCATE = 1

    # Allocate tensors on the fly and free them after they are out of scope
    LAZY_ALLOCATE = 2
```

In LAZY_ALLOCATE mode, tensors are kept in tensor_storage_map and tensor_registry, while replay_tensor_id_to_last_node_id_map and tensor_storage_id_to_last_node_id_map track the last node id that accesses each tensor and tensor storage. Once replay passes that last node, the tensor or tensor storage is deleted accordingly.

The diff also introduces another option, --device-memory-threshold. With LAZY_ALLOCATE, this option frees all tensors when the ratio of allocated device memory to total device memory exceeds device-memory-threshold. It keeps the replay running, at the cost of extra overhead from freeing and re-allocating memory. Llama4 7B does not need this option when the ET is captured with unique storage ids (https://www.internalfb.com/diff/D66849516).

This fixes the OOM issue in Llama4 70B.

Reviewed By: sanrise

Differential Revision: D66487952
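A minimal sketch of the LAZY_ALLOCATE bookkeeping described above. The enum and map names come from the summary; the surrounding class, its method names, and the torch.cuda-based threshold check are illustrative assumptions, not the actual et_replay implementation.

```python
from enum import Enum

import torch


class TensorAllcationMode(Enum):
    PRE_ALLOCATE = 1
    LAZY_ALLOCATE = 2


class LazyReplayBookkeeping:
    """Hypothetical helper showing how lazy allocation could track tensor lifetimes."""

    def __init__(self, device_memory_threshold: float = 0.9) -> None:
        self.device_memory_threshold = device_memory_threshold
        self.tensor_registry: dict[int, torch.Tensor] = {}
        self.tensor_storage_map: dict[int, torch.UntypedStorage] = {}
        # Last node id that touches each replay tensor / storage,
        # computed by scanning the trace once before replay starts.
        self.replay_tensor_id_to_last_node_id_map: dict[int, int] = {}
        self.tensor_storage_id_to_last_node_id_map: dict[int, int] = {}

    def free_dead_tensors(self, current_node_id: int) -> None:
        # Drop tensors whose last accessing node has already been replayed.
        for tensor_id, last_node_id in list(
            self.replay_tensor_id_to_last_node_id_map.items()
        ):
            if current_node_id > last_node_id:
                self.tensor_registry.pop(tensor_id, None)
                del self.replay_tensor_id_to_last_node_id_map[tensor_id]
        # Same idea for the underlying tensor storages.
        for storage_id, last_node_id in list(
            self.tensor_storage_id_to_last_node_id_map.items()
        ):
            if current_node_id > last_node_id:
                self.tensor_storage_map.pop(storage_id, None)
                del self.tensor_storage_id_to_last_node_id_map[storage_id]

    def maybe_flush_all(self, device: torch.device) -> None:
        # --device-memory-threshold behaviour: if allocated device memory
        # exceeds the configured fraction of total device memory, free every
        # cached tensor and let replay re-allocate them on demand.
        total = torch.cuda.get_device_properties(device).total_memory
        allocated = torch.cuda.memory_allocated(device)
        if allocated / total > self.device_memory_threshold:
            self.tensor_registry.clear()
            self.tensor_storage_map.clear()
            torch.cuda.empty_cache()
```

Freeing by last-use node id bounds live memory to what the remaining trace still needs, while the threshold-based flush acts as a safety valve when that is not enough (for example, when the ET lacks unique storage ids), trading allocation overhead for avoiding OOM.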