Ubuntu 20.04 LTS, NVIDIA 3070 GPU (Driver 510.85.02, CUDA Version 11.6)
I can run the example as is and it trains successfully, but it is very slow and does not appear to use all of my CPU cores. However, at what appears to be the end of epoch 2 (the last progress printout reports Iteration 80, Epoch 2/6, with two full bars), it crashes with this message:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: SendError { .. }', burn/burn/src/train/checkpoint/async_checkpoint.rs:68:40
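For context, a `SendError` generally means the receiving end of a channel no longer exists. The following is only a minimal sketch of that failure mode, assuming the async checkpointer hands work to a background thread over a `std::sync::mpsc` channel (I have not confirmed this from the source); `send_after_worker_exit` and the worker are hypothetical names, not burn code.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a background worker that stops early.
/// Returns the result of sending a message after the worker has exited.
fn send_after_worker_exit() -> Result<(), mpsc::SendError<usize>> {
    let (sender, receiver) = mpsc::channel::<usize>();

    let worker = thread::spawn(move || {
        // Handle a single message, then return. Returning drops `receiver`.
        let _first_epoch = receiver.recv();
    });

    // The receiver is still owned by the worker, so this send succeeds.
    sender.send(1).expect("receiver is still alive for the first send");

    // Wait until the worker has finished and the receiver has been dropped.
    worker.join().expect("worker thread should not panic");

    // No receiver is left, so this returns Err(SendError(2)).
    // Calling `.unwrap()` here instead would panic.
    sender.send(2)
}

fn main() {
    match send_after_worker_exit() {
        Ok(()) => println!("send succeeded"),
        Err(err) => println!("send failed: {err:?}"),
    }
}
```

If the real code follows this pattern, the `SendError` on the main thread would be a symptom of the background thread having stopped for some other reason, rather than the root cause.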
I changed the example to use the Tch backend by changing main to this:
fn main() {
    use burn::tensor::backend::TchADBackend;
    // Run on the CPU through the Tch backend.
    let device = TchDevice::Cpu;
    training::run::<TchADBackend<f32>>(device);
    println!("Done.");
}
This appears to train using my full CPU at great speed, but it crashed on both attempts, in two different ways. The first crash produced the same message as above. On the second attempt, run under the VS Code debugger, it crashed in a different way:
thread '<unnamed>' panicked at 'attempt to subtract with overflow', burn/burn/src/train/checkpoint/file.rs:41:60
In that case, epoch was 1 and self.num_keep was 2.
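With those values, subtracting `num_keep` from `epoch` as unsigned integers cannot produce a valid result, which matches the panic text. The sketch below assumes the code at `file.rs:41` computes something like `epoch - self.num_keep` on `usize` values (an assumption; the actual expression is not shown here). `epoch_to_delete` is a hypothetical helper showing how `checked_sub` avoids the panic.

```rust
/// Hypothetical helper: the epoch whose checkpoint could be removed,
/// or `None` when fewer than `num_keep` epochs have passed.
fn epoch_to_delete(epoch: usize, num_keep: usize) -> Option<usize> {
    // `epoch - num_keep` panics in debug builds when num_keep > epoch.
    // `checked_sub` returns None in that case instead of panicking.
    epoch.checked_sub(num_keep)
}

fn main() {
    // Values observed in the debugger: epoch = 1, num_keep = 2.
    match epoch_to_delete(1, 2) {
        Some(old_epoch) => println!("would delete checkpoint for epoch {old_epoch}"),
        None => println!("nothing to delete yet"),
    }
}
```

Note that release builds do not check for overflow by default, so the same subtraction would wrap to a very large number there instead of panicking.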
I changed the example main as follows to try to use my GPU:
fn main() {
    use burn::tensor::backend::TchADBackend;
    // Same as above, but targeting the GPU.
    let device = TchDevice::Cuda(0);
    training::run::<TchADBackend<f32>>(device);
    println!("Done.");
}
My first question: what does the magic number in TchDevice::Cuda(XXX) represent?
Even with various values for that number (0, 1, 1024), the application crashes on the line model.to_device(device);
I always get this error message, which I have been unable to resolve:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err`
value: Torch("Could not run 'aten::empty_strided' with arguments from the 'CUDA' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::empty_strided' is only available for these backends: [Dense, Conjugate, Negative, UNKNOWN_TENSOR_TYPE_ID, QuantizedXPU, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseCPU, SparseCUDA, SparseHIP, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, SparseXPU, UNKNOWN_TENSOR_TYPE_ID, SparseVE, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, NestedTensorCUDA, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID, UNKNOWN_TENSOR_TYPE_ID].\n\nCPU: registered at aten/src/ATen/RegisterCPU.cpp:37386 [kernel]\nMeta: registered at aten/src/ATen/RegisterMeta.cpp:31637 [kernel]\nQuantizedCPU: registered at aten/src/ATen/RegisterQuantizedCPU.cpp:1294 [kernel]\nBackendSelect: registered at aten/src/ATen/RegisterBackendSelect.cpp:726 [kernel]\nPython: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:133 [backend fallback]\nNamed: registered at ../aten/src/ATen/core/NamedRegistrations.cpp:7 [backend fallback]\nConjugate: fallthrough registered at ../aten/src/ATen/ConjugateFallback.cpp:22 [kernel]\nNegative: fallthrough registered at ../aten/src/ATen/native/NegateFallback.cpp:22 [kernel]\nZeroTensor: fallthrough registered at ../aten/src/ATen/ZeroTensorFallback.cpp:90 [kernel]\nADInplaceOrView: fallthrough registered at ../aten/src/ATen/core/VariableFallbackKernel.cpp:64 [backend fallback]\nAutogradOther: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 
[autograd kernel]\nAutogradCPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradCUDA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nUNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradXLA: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradMPS: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradIPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradXPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradHPU: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nUNKNOWN_TENSOR_TYPE_ID: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradLazy: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradPrivateUse1: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradPrivateUse2: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nAutogradPrivateUse3: registered at ../torch/csrc/autograd/generated/VariableType_2.cpp:14210 [autograd kernel]\nTracer: registered at ../torch/csrc/autograd/generated/TraceType_2.cpp:14069 [kernel]\nAutocastCPU: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:481 [backend fallback]\nAutocast: fallthrough registered at ../aten/src/ATen/autocast_mode.cpp:324 [backend fallback]\nBatched: registered at ../aten/src/ATen/BatchingRegistrations.cpp:1064 [backend fallback]\nVmapMode: fallthrough registered at ../aten/src/ATen/VmapModeRegistrations.cpp:33 [backend fallback]\nFunctionalize: registered at ../aten/src/ATen/FunctionalizeFallbackKernel.cpp:89 [backend 
fallback]\nPythonTLSSnapshot: registered at ../aten/src/ATen/core/PythonFallbackKernel.cpp:137 [backend fallback]\n\nException raised from reportError at ../aten/src/ATen/core/dispatch/OperatorEntry.cpp:447 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6b (0x7f95aa2a79cb in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libc10.so)\nframe #1: c10::impl::OperatorEntry::reportError(c10::DispatchKey) const + 0x36b (0x7f95ab5e252b in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #2: + 0x1b4df9b (0x7f95abe40f9b in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #3: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRef, c10::ArrayRef, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional) + 0xac (0x7f95ac011e6c in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #4: + 0x1fac735 (0x7f95ac29f735 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #5: at::_ops::empty_strided::call(c10::ArrayRef, c10::ArrayRef, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional) + 0x174 (0x7f95ac054114 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #6: at::empty_strided(c10::ArrayRef, c10::ArrayRef, c10::TensorOptions) + 0xd8 (0x55f15452c2a8 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #7: 
at::native::_to_copy(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x1447 (0x7f95aba2cf97 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #8: + 0x21479e3 (0x7f95ac43a9e3 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #9: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x10d (0x7f95abd9d78d in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #10: + 0x1faef51 (0x7f95ac2a1f51 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #11: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x10d (0x7f95abd9d78d in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #12: + 0x2fd82be (0x7f95ad2cb2be in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #13: + 0x2fd883b (0x7f95ad2cb83b in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #14: at::_ops::_to_copy::call(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, 
c10::optionalc10::MemoryFormat) + 0x202 (0x7f95abe1a1e2 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #15: at::native::to(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, bool, c10::optionalc10::MemoryFormat) + 0x13e (0x7f95aba22dde in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #16: + 0x2251799 (0x7f95ac544799 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #17: at::_ops::to_dtype_layout::call(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, bool, c10::optionalc10::MemoryFormat) + 0x216 (0x7f95abf47b26 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/build/torch-sys-d8a9710e31a4996b/out/libtorch/libtorch/lib/libtorch_cpu.so)\nframe #18: at::Tensor::to(c10::TensorOptions, bool, bool, c10::optionalc10::MemoryFormat) const + 0xf0 (0x55f1545286e4 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #19: + 0x247491 (0x55f154531491 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #20: + 0x225035 (0x55f15450f035 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #21: + 0x226137 (0x55f154510137 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #22: + 0xdaf55 (0x55f1543c4f55 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #23: + 0xaa848 (0x55f154394848 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #24: + 0x9ec37 (0x55f154388c37 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #25: + 0x15bf7e 
(0x55f154445f7e in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #26: + 0x114627 (0x55f1543fe627 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #27: + 0x15b097 (0x55f154445097 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #28: + 0x1304b7 (0x55f15441a4b7 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #29: + 0x15b81b (0x55f15444581b in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #30: + 0x12f1b7 (0x55f1544191b7 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #31: + 0x11c5d6 (0x55f1544065d6 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #32: + 0xa0170 (0x55f15438a170 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #33: + 0xb54cb (0x55f15439f4cb in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #34: + 0x130afe (0x55f15441aafe in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #35: + 0x133c81 (0x55f15441dc81 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #36: + 0x34b21f (0x55f15463521f in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #37: + 0x133c5a (0x55f15441dc5a in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #38: + 0xa01d1 (0x55f15438a1d1 in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\nframe #39: __libc_start_main + 0xf3 (0x7f95a9eb9083 in /lib/x86_64-linux-gnu/libc.so.6)\nframe #40: + 0x7962e (0x55f15436362e in /home/matthew/Projects/Rust/machine_learning/burn/target/debug/mnist)\n")', /home/matthew/.cargo/registry/src/github.com-1ecc6299db9ec823/tch-0.8.0/src/wrappers/tensor_generated.rs:12977:27