
RelationRNN training issues: maxSeqLen crash and zero or infinite loss #4

Open
Vimos opened this issue May 2, 2018 · 1 comment


Vimos commented May 2, 2018

If you use the default maxSeqLen, you get a cublas runtime error:

➜  RelationRNN git:(master) ✗ th train_rel_rnn.lua               
[INFO - 2018_05_02_20:11:11] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:11] - "SeqRankingLoader Configurations:"
[INFO - 2018_05_02_20:11:11] - "    number of batch : 296"
[INFO - 2018_05_02_20:11:11] - "    data batch size : 256"
[INFO - 2018_05_02_20:11:11] - "    neg sample size : 1024"
[INFO - 2018_05_02_20:11:11] - "    neg sample range: 7524"
[INFO - 2018_05_02_20:11:11] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:11] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:11:11] - "    inputSize   :   300"
[INFO - 2018_05_02_20:11:11] - "    hiddenSize  :   256"
[INFO - 2018_05_02_20:11:11] - "    maxSeqLen   :    40"
[INFO - 2018_05_02_20:11:11] - "    maxBatch    :   256"
[INFO - 2018_05_02_20:11:11] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:11] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:11:11] - "    inputSize   :   512"
[INFO - 2018_05_02_20:11:11] - "    hiddenSize  :   256"
[INFO - 2018_05_02_20:11:11] - "    maxSeqLen   :    40"
[INFO - 2018_05_02_20:11:11] - "    maxBatch    :   256"
/home/vimos/.torch/install/bin/luajit: /home/vimos/.torch/install/share/lua/5.1/nn/Container.lua:67: 
In 5 module of nn.Sequential:
/home/vimos/Data/git/QA/CFO/src/model/BiGRU.lua:241: cublas runtime error : an internal operation failed at /home/vimos/.torch/extra/cutorch/lib/THC/THCBlas.cu:246
stack traceback:
	[C]: in function 'mm'
	/home/vimos/Data/git/QA/CFO/src/model/BiGRU.lua:241: in function 'updateGradInput'
	/home/vimos/.torch/install/share/lua/5.1/nn/Module.lua:31: in function </home/vimos/.torch/install/share/lua/5.1/nn/Module.lua:29>
	[C]: in function 'xpcall'
	/home/vimos/.torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
	/home/vimos/.torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
	train_rel_rnn.lua:174: in main chunk
	[C]: in function 'dofile'
	...mos/.torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x559ae9bad710

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
	[C]: in function 'error'
	/home/vimos/.torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
	/home/vimos/.torch/install/share/lua/5.1/nn/Sequential.lua:84: in function 'backward'
	train_rel_rnn.lua:174: in main chunk
	[C]: in function 'dofile'
	...mos/.torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
	[C]: at 0x559ae9bad710
THCudaCheckWarn FAIL file=/home/vimos/.torch/extra/cutorch/lib/THC/THCStream.cpp line=50 error=77 : an illegal memory access was encountered
THCudaCheckWarn FAIL file=/home/vimos/.torch/extra/cutorch/lib/THC/THCStream.cpp line=50 error=77 : an illegal memory access was encountered
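
For reference, the illegal memory access happens inside BiGRU:updateGradInput, which is consistent with a batch whose sequences are longer than the buffers pre-allocated for maxSeqLen time steps. A hypothetical sanity check (the loader call and names below are assumptions for illustration, not the repo's actual API) could catch this before the backward pass:

-- Hypothetical check, not part of CFO: make sure the batch fits the
-- buffers that BiGRU pre-allocates for maxSeqLen time steps.
local seq = loader:nextBatch(1)   -- assumed to return a seqLen x batchSize tensor
if seq:size(1) > opt.maxSeqLen then
    error(string.format('batch sequence length %d exceeds maxSeqLen %d',
                        seq:size(1), opt.maxSeqLen))
end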

But the crash can be avoided by passing a larger maxSeqLen:

➜  RelationRNN git:(master) ✗ th train_rel_rnn.lua -maxSeqLen 42
[INFO - 2018_05_02_20:11:52] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:52] - "SeqRankingLoader Configurations:"
[INFO - 2018_05_02_20:11:52] - "    number of batch : 296"
[INFO - 2018_05_02_20:11:52] - "    data batch size : 256"
[INFO - 2018_05_02_20:11:52] - "    neg sample size : 1024"
[INFO - 2018_05_02_20:11:52] - "    neg sample range: 7524"
[INFO - 2018_05_02_20:11:52] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:52] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:11:52] - "    inputSize   :   300"
[INFO - 2018_05_02_20:11:52] - "    hiddenSize  :   256"
[INFO - 2018_05_02_20:11:52] - "    maxSeqLen   :    42"
[INFO - 2018_05_02_20:11:52] - "    maxBatch    :   256"
[INFO - 2018_05_02_20:11:52] - "--------------------------------------------------"
[INFO - 2018_05_02_20:11:52] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:11:52] - "    inputSize   :   512"
[INFO - 2018_05_02_20:11:52] - "    hiddenSize  :   256"
[INFO - 2018_05_02_20:11:52] - "    maxSeqLen   :    42"
[INFO - 2018_05_02_20:11:52] - "    maxBatch    :   256"
[INFO - 2018_05_02_20:11:56] - "iter  100, loss = 0.00198258"........] ETA: 3h29m | Step: 42ms       
[INFO - 2018_05_02_20:12:00] - "iter  200, loss = 0.00000000"........] ETA: 3h25m | Step: 41ms       
[INFO - 2018_05_02_20:12:04] - "epoch   1, loss 0.00066979"..........] ETA: 3h28m | Step: 42ms       
[INFO - 2018_05_02_20:12:04] - "iter  300, loss = 0.00000000"........] ETA: 3h27m | Step: 42ms       
[INFO - 2018_05_02_20:12:09] - "iter  400, loss = 0.00000000"........] ETA: 3h28m | Step: 42ms       
[INFO - 2018_05_02_20:12:13] - "iter  500, loss = 0.00000000"........] ETA: 3h25m | Step: 41ms       
[INFO - 2018_05_02_20:12:17] - "epoch   2, loss 0.00000000"..........] ETA: 3h26m | Step: 41ms       
[INFO - 2018_05_02_20:12:17] - "iter  600, loss = 0.00000000"........] ETA: 3h26m | Step: 41ms       
[INFO - 2018_05_02_20:12:21] - "iter  700, loss = 0.00000000"........] ETA: 3h28m | Step: 42ms       
[INFO - 2018_05_02_20:12:25] - "iter  800, loss = 0.00000000"........] ETA: 3h27m | Step: 42ms       
[INFO - 2018_05_02_20:12:29] - "epoch   3, loss 0.00000000"..........] ETA: 3h25m | Step: 41ms       
[INFO - 2018_05_02_20:12:30] - "iter  900, loss = 0.00000000"........] ETA: 3h25m | Step: 41ms       
[INFO - 2018_05_02_20:12:34] - "iter 1000, loss = 0.00000000"........] ETA: 3h27m | Step: 42ms 

However, the loss drops to 0 after the first epoch, or blows up to infinity with a different seed:

➜  RelationRNN git:(master) ✗ th train_rel_rnn.lua -maxSeqLen 42 -seed 12
[INFO - 2018_05_02_20:26:49] - "--------------------------------------------------"
[INFO - 2018_05_02_20:26:49] - "SeqRankingLoader Configurations:"
[INFO - 2018_05_02_20:26:49] - "    number of batch : 296"
[INFO - 2018_05_02_20:26:49] - "    data batch size : 256"
[INFO - 2018_05_02_20:26:49] - "    neg sample size : 1024"
[INFO - 2018_05_02_20:26:49] - "    neg sample range: 7524"
[INFO - 2018_05_02_20:26:49] - "--------------------------------------------------"
[INFO - 2018_05_02_20:26:49] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:26:49] - "    inputSize   :   300"
[INFO - 2018_05_02_20:26:49] - "    hiddenSize  :   256"
[INFO - 2018_05_02_20:26:49] - "    maxSeqLen   :    42"
[INFO - 2018_05_02_20:26:49] - "    maxBatch    :   256"
[INFO - 2018_05_02_20:26:49] - "--------------------------------------------------"
[INFO - 2018_05_02_20:26:49] - "BiGRU Configuration:"
[INFO - 2018_05_02_20:26:49] - "    inputSize   :   512"
[INFO - 2018_05_02_20:26:49] - "    hiddenSize  :   256"
[INFO - 2018_05_02_20:26:49] - "    maxSeqLen   :    42"
[INFO - 2018_05_02_20:26:49] - "    maxBatch    :   256"
[INFO - 2018_05_02_20:26:53] - "iter  100, loss = 81231552070126006809284050944.00000000" 41ms       
[INFO - 2018_05_02_20:26:57] - "iter  200, loss = 0.00000000"........] ETA: 3h15m | Step: 39ms       
[INFO - 2018_05_02_20:27:01] - "epoch   1, loss 27443091915583111597203128320.00000000"p: 40ms       
[INFO - 2018_05_02_20:27:01] - "iter  300, loss = 0.00000000"........] ETA: 3h17m | Step: 40ms  
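
For what it's worth, with a margin-based ranking loss a value of exactly 0 just means every negative sample already satisfies the margin, while the huge loss under -seed 12 looks like gradients exploding in the first iterations. A minimal diagnostic sketch one could drop into the training loop (the names loss, gradParams and iter are assumptions about the loop, not the repo's actual variables):

-- Hypothetical diagnostics, not in train_rel_rnn.lua: abort on NaN/Inf loss
-- and clip gradients before the parameter update.
if loss ~= loss or loss == math.huge then   -- NaN or Inf check
    error(string.format('training diverged at iter %d, loss = %f', iter, loss))
end
gradParams:clamp(-5, 5)   -- crude gradient clipping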
@ThisIsSoMe

I have not run the code, but I would like to know how many usable examples (those where the subject mention can be found in the question) remain in the train (75910) and test (21678) sets after preprocessing. Would you mind answering this? I would appreciate it.
