
Python client fails to start when receiving info or warning message from imported python modules #22

Closed
clavicule opened this issue Oct 27, 2016 · 17 comments


@clavicule

hi @kengz!

(hopefully I am not mistaken) it looks like client.py fails to start and returns an error as soon as any message is received, regardless of the type of message (warning or info, not necessarily errors).

Here is how I got to this conclusion:

  1. I grabbed the dnn_titanic_train.py (actually all AI scripts too) from AIVA v3 and brought them over to my local clone of AIVA v4
  2. I installed all required python modules in my virtual env, especially tensorflow with GPU support

==> When starting the node app (npm start --debug), I get the following:
ERROR I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally

This is actually not an error message: when starting a Python console and running import tensorflow, I get several messages telling me the CUDA libs were loaded successfully. I can also run the MNIST training example from TF, etc.

So, to double-check that it wasn't caused by the GPU TF version, I made another virtual env, this time with the CPU tensorflow version. Then I got warning messages from tensorflow:
ERROR <...>/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 <...>, DeprecationWarning)
which I took care of by upgrading the dnn_titanic_train.py code so that it works with tensorflow v0.11RC1 (the DNNClassifier class has changed as well, etc.).

Then I got another tensorflow warning I couldn't get past:
ERROR WARNING:tensorflow:float64 is not supported by many models, consider casting to float32
which should be fixed in the official tensorflow v0.11 release (link to issue).

In theory, to make it work, I would have to go back to tensorflow v0.9 or earlier (provided I don't encounter other warning messages). We can also forget about GPU tensorflow.

Bottom line: these messages are not error messages, but they are interpreted as such. Can you confirm this problem?
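A minimal standalone sketch (my assumption of what happens, not aiva code) that shows why: TensorFlow writes its startup banner to stderr, so anything that tags a child process's stderr as ERROR will mislabel it. Assuming node and the same python3 on the PATH:

// minimal check: TF's "I tensorflow/..." banner arrives on stderr, not stdout
const { spawn } = require('child_process')

const py = spawn('python3', ['-c', 'import tensorflow'])
py.stdout.on('data', (d) => console.log('stdout:', d.toString()))  // stays empty
py.stderr.on('data', (d) => console.log('stderr:', d.toString()))  // banner lands here
py.on('exit', (code) => console.log('exit code:', code))           // 0 on a healthy import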

Thanks for your insight!

(btw, I'm on the cgkb branch, but I don't think it matters)

@kengz
Owner

kengz commented Oct 28, 2016

Hi, a few guesses and suggestions:

In the terminal log, if you see the py client joining and not leaving, that indicates the core logic is working. Like so:

[Fri Oct 28 2016 03:33:38 GMT+0000 (UTC)] DEBUG js TLbm59Gc70vTpkTHAAAA joined, 3 remains
[Fri Oct 28 2016 03:33:41 GMT+0000 (UTC)] DEBUG py pGWlCSrxORy3YBjvAAAB joined, 2 remains

Otherwise, if it fails to join in time (you will see a log line stating that), the python process needs more time to load. Check this issue and comment to see how to increase the wait time.
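For reference, the join/timeout pattern looks roughly like this (a minimal sketch, not aiva's actual implementation; the counter and messages are modeled on the log lines in this thread):

// sketch: the server expects N io clients and errors out if they don't all join in time
let remaining = 4
const waitTimeMs = 10000  // the configurable wait time discussed in the linked issue

const timer = setTimeout(() => {
  if (remaining > 0) {
    console.error('Expected number of IO clients failed to join in time')
    process.exit(1)
  }
}, waitTimeMs)

function onJoin(id) {
  remaining -= 1
  console.log(`${id} joined, ${remaining} remains`)
  if (remaining === 0) clearTimeout(timer)
}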

If the client.py can't even get started, the problem is inside the py process.

Are the py dependencies installed inside a virtual env? The start-io.js script spawns a new child bash process from the main node process. If you don't get an ImportError, that seems to be working; ok, check the next point.

If running the tensorflow code by itself in pure py (that is, just python <the tf script>.py) works, then that should be ok too. Otherwise, it may throw exceptions that crash the process, and that'd be the culprit. See next.

Just to test that tf is working from inside client.py, I've changed the lib/py/hello.py a little, like so:

# the ioid of this script for JSON payload 'from'
from os.path import basename
ioid = basename(__file__)  # 'hello.py'

# Write the module functions with the proper reply
##########################################

import tensorflow as tf

matrix1 = tf.constant([[3., 3.]])
matrix2 = tf.constant([[2.],[2.]])
product = tf.matmul(matrix1, matrix2)

def foo(input):
    with tf.Session() as sess:
        result = sess.run(product)
        print(result)
        # ==> [[ 12.]]
    return str(result)

# module method. Feel free to add more
def sayHi(msg):
    # the reply JSON payload.
    reply = {
        'output': foo(msg.get('input')),
        'to': msg.get('from'),
        'from': ioid,
        'hash': msg.get('hash')
    }
    # the py client will send this to target <to>
    return reply

Then, after starting aiva and telling it "hello-io py", it will return "[[ 12.]]". If this works too, good.

Overall:

  • run the tensorflow py process by itself, without aiva;
  • run the modified hello.py in aiva;
  • make sure client.py doesn't crash and joins the server given enough time; test the script to make sure it replies;
  • make sure tensorflow runs the same with aiva as without it (otherwise there might be a difference in virtual envs).

Let me know at which point they fail.

@clavicule
Author

Hi! Thank you for the quick reply. From reading your post, it looks like the culprit is the wait time (the py client doesn't join). I will try this weekend and report back (I'm away from my computer at the moment :'( )

On a side note, I didn't encounter any trouble making the KB graph with spaCy while chatting with the bot (and it's an awesome piece of work you have accomplished!).

@kengz
Owner

kengz commented Oct 28, 2016

Good to see that works for you :)
I'll leave this issue open till then; let me know if it's resolved and I'll close it.

@clavicule
Author

Sorry for getting back so late: it turns out it wasn't a timeout issue (although that's a very good parameter to keep in mind, thanks for sharing it):
it was simply a dumb mistake on my end... I was working on several branches at the same time and was (wrongly) 500% sure my TF script was running properly. There actually was a problem in the TF script making it fail (wrong path to my NN model... duh!). Unfortunately, the error messages caught by AIVA were the warning messages, which misled me into believing the TF script was alright.

I'm really sorry about the false alarm; I am closing the issue. Apologies for opening it in the first place.
Thanks very much for your help!
cheers

@kengz
Owner

kengz commented Oct 31, 2016

No worries :)

@clavicule
Author

:/ Sorry to go back and forth, but it looks like there is still an issue:
I made 2 python virtualenvs that are identical (double-checked by comparing their pip freeze > req.txt files) with only one exception: one has CPU tensorflow installed, the other has GPU tensorflow. Using the hello.py you wrote, the CPU tensorflow version gives me the right output:

[Mon Oct 31 2016 23:19:26 GMT+0000 (utc)] INFO Starting socketIO client for python3 at 6466
[Mon Oct 31 2016 23:19:26 GMT+0000 (utc)] DEBUG js XXX joined, 3 remains
[Mon Oct 31 2016 23:19:26 GMT+0000 (utc)] DEBUG Set TELEGRAM webhook url: https://xxxxxxx.ngrok.io:8443
[Mon Oct 31 2016 23:19:26 GMT+0000 (utc)] INFO Deploying bot x with adapter TELEGRAM
[Mon Oct 31 2016 23:19:27 GMT+0000 (utc)] DEBUG py XXX joined, 2 remains
[Mon Oct 31 2016 23:19:27 GMT+0000 (utc)] INFO Telegram Adapter Bot XXX:XXX Loaded...
[Mon Oct 31 2016 23:19:27 GMT+0000 (utc)] INFO Telegram Adapter Started...
[Mon Oct 31 2016 23:19:27 GMT+0000 (utc)] DEBUG Started global js socketIO client for TELEGRAM at 6466
[Mon Oct 31 2016 23:19:28 GMT+0000 (utc)] DEBUG global-client-js XXX joined, 1 remains
[Mon Oct 31 2016 23:19:29 GMT+0000 (utc)] INFO Telegram Bot Identified: x
[Mon Oct 31 2016 23:19:31 GMT+0000 (utc)] DEBUG cgkb-py XXX joined, 0 remains
[Mon Oct 31 2016 23:19:31 GMT+0000 (utc)] INFO All 4 IO clients have joined
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] INFO Receiving message_id: 67
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG IO on pass, msg: {
  "hash": "global-client-js_XXX",
  "from": "global-client-js",
  "input": "hello-io py",
  "to": "nlp.cgkb-py",
  "intent": "parse"
} fn: undefined
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG Added a callback to hasher for global-client
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG IO on pass, msg: {
  "hash": "global-client-js_XXX",
  "from": "global-client-js",
  "input": "Hello from user.",
  "to": "hello.py",
  "intent": "sayHi"
} fn: undefined
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG hash string exists for global-client
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG hasher.handle invoking cb for global-client
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG hash string exists for global-client
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG hasher.handle invoking cb for global-client
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG IO on pass, msg: {
  "to": "global-client-js",
  "from": "hello.py",
  "hash": "global-client-js_XXX",
  "output": "[[ 12.]]"
} fn: undefined
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG hash string exists for global-client
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG hasher.handle invoking cb for global-client
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG hash string exists for global-client
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] DEBUG hasher.handle invoking cb for global-client
[Mon Oct 31 2016 23:19:42 GMT+0000 (utc)] INFO Sending message to room: x
[Mon Oct 31 2016 23:19:43 GMT+0000 (utc)] INFO Sending message to room: x

On the other hand, the GPU tensorflow fails (an info message is treated as an error):

[Mon Oct 31 2016 23:15:01 GMT+0000 (utc)] INFO Starting socketIO client for python3 at 6466
[Mon Oct 31 2016 23:15:01 GMT+0000 (utc)] DEBUG js XXX joined, 3 remains
[Mon Oct 31 2016 23:15:01 GMT+0000 (utc)] ERROR I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally

[Mon Oct 31 2016 23:15:01 GMT+0000 (utc)] DEBUG Set TELEGRAM webhook url: https://xxxxxxx.ngrok.io:8443
[Mon Oct 31 2016 23:15:01 GMT+0000 (utc)] INFO Deploying bot x with adapter TELEGRAM
[Mon Oct 31 2016 23:15:02 GMT+0000 (utc)] INFO Telegram Adapter Bot XXX:XXX Loaded...
[Mon Oct 31 2016 23:15:02 GMT+0000 (utc)] INFO Telegram Adapter Started...
[Mon Oct 31 2016 23:15:02 GMT+0000 (utc)] DEBUG Started global js socketIO client for TELEGRAM at 6466
[Mon Oct 31 2016 23:15:03 GMT+0000 (utc)] DEBUG global-client-js XXX joined, 2 remains
[Mon Oct 31 2016 23:15:04 GMT+0000 (utc)] INFO Telegram Bot Identified: x
[Mon Oct 31 2016 23:15:06 GMT+0000 (utc)] DEBUG cgkb-py XXX joined, 1 remains
[Mon Oct 31 2016 23:15:31 GMT+0000 (utc)] ERROR {}
[Mon Oct 31 2016 23:15:31 GMT+0000 (utc)] ERROR Expected number of IO clients failed to join in time
[Mon Oct 31 2016 23:15:31 GMT+0000 (utc)] INFO Shutting down
[Mon Oct 31 2016 23:15:31 GMT+0000 (utc)] INFO Exit: killed ioClient.js children
[Mon Oct 31 2016 23:15:31 GMT+0000 (utc)] INFO Exit: killed ioClient.js children
error: Forever detected script exited with code: 1

(note that here I had the timeout set to 30000 ms, but I also tried 150000 ms and got the same result). If I run hello.py within the Python console, I correctly get the answer from the foo function:

I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
>>> foo(0)
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:925] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties: 
name: GeForce GTX 970M
major: 5 minor: 2 memoryClockRate (GHz) 1.038
pciBusID 0000:01:00.0
Total memory: 2.94GiB
Free memory: 2.55GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0)
[[ 12.]]
'[[ 12.]]'

I'm afraid this brings us back to warning/info messages from python being treated as errors :S
What do you think?

@clavicule clavicule reopened this Oct 31, 2016
@kengz
Owner

kengz commented Nov 1, 2016

Hmm, that's quite curious. It seems in the latter case the python process just crashes and quits, because there's no py XXXX joined, ... after js has joined, and we can see the:

ERROR I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally

and later the socketIO server just shuts things down as expected, since it knows the py client is missing.

So, we can focus on that. Perhaps it's the python socketIO client misbehaving with the GPU TF. One way to check is to actually run python client.py, which lives inside lib/. It imports the scripts from the lib/py/ folder here, then starts the io client and leaves it open here.

Let me know if running that causes it to explode.

@clavicule
Author

It is getting stranger by the day:
running python client.py gives the same behavior on both CPU and GPU tensorflow:

  • from py import * is successful (in the case of GPU tensorflow, we see the 4 CUDA libraries load successfully: successfully opened CUDA library XXX.so locally)
  • client.wait() is reached and the script waits indefinitely

I ran another experiment, this time not importing tensorflow but from sklearn import cross_validation, which is supposed to emit the following warning message:

DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

AIVA has no problem starting and doesn't consider it an error. I'm puzzled :/
So, my assumption of warning messages being treated as errors doesn't seem to be correct. Somehow the GPU tensorflow is making the python process crash, but only within the AIVA environment.
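To convince myself, here's a minimal sketch (assuming node) showing that a warning on stderr is not fatal by itself: the python child prints to stderr yet exits 0.

// standalone check: a DeprecationWarning goes to stderr, yet the process exits 0,
// so stderr output alone cannot be what kills the spawned py client
const { spawn } = require('child_process')

const py = spawn('python3', ['-W', 'default::DeprecationWarning', '-c',
  "import warnings; warnings.warn('deprecated', DeprecationWarning)"])
py.stderr.on('data', (d) => console.log('stderr:', d.toString()))  // the warning text
py.on('exit', (code) => console.log('exit code:', code))           // 0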

I will try an environment with a previous tensorflow version to see if it makes any difference; I doubt it will, but better to double-check.

@kengz
Owner

kengz commented Nov 1, 2016

Hmm strange indeed. I have a few ideas:

  1. The aiva logger is in nodejs. It catches the piped stdout/stderr from the spawned python process, so it seems this python process goes nuts: the logger catches the stderr string from it, logs the message, and then the python process quits (see the sketch below).
  2. The spawned python process might or might not have picked up the virtual env. Maybe it's worth modifying the script in start-io.js (find the python3 command) so that it starts with a target virtualenv explicitly. This may explain why it works when running client.py separately but fails when run from within aiva.
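A minimal sketch of point 1 (this is not the actual start-io.js, just the presumed shape of the pipe):

// presumed shape of the spawn-and-log path: everything the child writes to stderr
// is funneled into the error logger, which is why benign TF "I ..." info lines
// surface as ERROR
const { spawn } = require('child_process')

const log = {
  debug: (s) => console.log('DEBUG', s),
  error: (s) => console.log('ERROR', s),  // TF info/warning lines get this label
}

function startIOClient(cmd, args) {
  const child = spawn(cmd, args)
  child.stdout.on('data', (d) => log.debug(d.toString().trim()))
  child.stderr.on('data', (d) => log.error(d.toString().trim()))
  child.on('exit', (code) => log.debug(`io client exited with code ${code}`))
  return child
}

startIOClient('python3', ['-c', 'import tensorflow'])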

@clavicule
Author

You are probably onto something with this idea. I modified src/start-io.js so that it launches the python3 of my virtualenv:

    lang = _.replace(lang, 'python3', PYPATH)
    global.log.info(`AYA Starting socketIO client for ${lang} at ${process.env.IOPORT}`) // you'll see below why I put AYA as a marker

and I got this output:

[Tue Nov 01 2016 16:14:30 GMT+0000 (utc)] DEBUG import js lib from client.js
[Tue Nov 01 2016 16:14:30 GMT+0000 (utc)] INFO Starting socketIO client for js at 6466
[Tue Nov 01 2016 16:14:30 GMT+0000 (utc)] INFO AYA Starting socketIO client for /path/to/virtualenv/python at 6466
[Tue Nov 01 2016 16:14:30 GMT+0000 (utc)] DEBUG Set TELEGRAM PORT: 8443
[Tue Nov 01 2016 16:14:30 GMT+0000 (utc)] DEBUG ngrok options: {
  "proto": "http",
  "addr": 8443,
  "authtoken": "XXX"
}
[Tue Nov 01 2016 16:14:30 GMT+0000 (utc)] INFO Starting socketIO client for python3 at 6466
[Tue Nov 01 2016 16:14:30 GMT+0000 (utc)] ERROR Traceback (most recent call last):
  File "/path/to/aiva/node_modules/spacy-nlp/src/client.py", line 12, in <module>
    import websocket
ImportError: No module named 'websocket'

[Tue Nov 01 2016 16:14:30 GMT+0000 (utc)] DEBUG js XXX joined, 3 remains
[Tue Nov 01 2016 16:14:30 GMT+0000 (utc)] ERROR An incompatible websocket library is conflicting with the one we need.
You can remove the incompatible library and install the correct one

Note that client.py is properly started with the correct python env:
AYA Starting socketIO client for /path/to/virtualenv/python at 6466

but then we see this one:
Starting socketIO client for python3 at 6466

meaning a new python process is being started with the default system env (which doesn't have the modules installed, hence the websocket import error). So the question is: where is this new python launch coming from?
I searched the entire AIVA directory for a log message like Starting socketIO client for (and other variations) but couldn't find anything except the explicit Starting socketIO client for js here.
I looked at other usages of the variable ioClientCmds without success either.

It must be coming from somewhere, right? (Judging by the traceback path node_modules/spacy-nlp/src/client.py, my guess is that the spacy-nlp dependency starts its own python client.)

@clavicule
Author

More test results:

  • I installed tensorflow and dependencies on my default python system env, I am not within any virtualenv
  • I freshly checked out the AIVA master branch
  • I edited lib/py/hello.py to add import tensorflow as tf before the foo() function definition (I am not even scripting any TF instruction, just importing it).

I get the same:
ERROR I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally

Next test: earlier version of tensorflow :)

@clavicule
Author

last test:

  • installed tensorflow v0.9 (also installed CUDA 7.5 and cuDNN v4.0 to be compatible with TF v0.9)
  • updated $PATH, $CUDA_HOME and $LD_LIBRARY_PATH to make sure they point to the CUDA 7.5 toolkit when tensorflow is running
  • ran a basic tensorflow script on the GPU: no problem
  • tried npm start --debug again

==> same issue:
ERROR I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally

I am not sure where to go from here; do you have any other ideas? Do you also encounter the problem on your end? I assumed from the beginning that it couldn't be a problem with the machine (since I can train and run inference with a tensorflow CNN on the GPU), but maybe it is?

I can still run the CPU version of tensorflow, which is largely enough to test the interactions between tensorflow and AIVA, but I assume it could eventually be a problem for others too if they try to run a GPU-intensive tensorflow script, right?

I guess an alternative would be to run the tensorflow script as a service on another machine (e.g. AWS) and make the call from AIVA over an https request; would that be a good-practice approach in your opinion?

@kengz
Owner

kengz commented Nov 3, 2016

Yeah, I haven't run it with such a heavy tensorflow process yet. I'm afraid that once CUDA gets involved there's a lot more going on; for example, if the environment variables don't get passed into the bash process spawned from node, CUDA / the GPU TF backend might go crazy because it's missing some things.
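If the env really is the problem, here's one untested sketch to try (node's spawn does inherit process.env by default, so this only makes it explicit; stdio: 'inherit' also keeps TF's stderr banner out of the error logger):

// untested mitigation sketch: pass the parent env through explicitly (CUDA cares
// about LD_LIBRARY_PATH and CUDA_HOME in particular) and inherit stdio so the
// TF banner goes to the terminal instead of being logged as ERROR
const { spawn } = require('child_process')

const child = spawn('python3', ['lib/client.py'], {
  env: process.env,   // make sure LD_LIBRARY_PATH / CUDA_HOME etc. reach the child
  stdio: 'inherit',
})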

Ohh I got an idea that will likely work well.

  • Deactivate python in default.json so it doesn't run there.
  • Run AIVA independently without the python client; wait till it says everything (socketIO etc.) is ready.
  • Run client.py independently, i.e. manually run python client.py after your aiva (and the server) is ready. It will join the server just fine (careful with the port if you change it).

Give it a try!

@clavicule
Author

Yay! It works! Thanks!
It confirms your theory of the spawned child process having trouble with CUDA.
So the workaround is to have the python client join separately on its own after AIVA has been started.

@kengz
Owner

kengz commented Nov 4, 2016

Awesome, glad to know. I'll close this, but I will look for a new way around the nodejs spawned shell process so this doesn't happen again. Thanks for discovering the issue!

@kengz kengz closed this as completed Nov 4, 2016
@clavicule
Author

thanks a lot for working through it!

@ryanblakeley

I'm running into something similar, but the proposed solution didn't work for me.

When I follow the install and setup instructions exactly and run npm start, there is a part of the log where it looks like two python clients are running:

[Tue Mar 27 2018 20:41:04 GMT+0000 (UTC)] INFO Starting socketIO client for js at 6466
[Tue Mar 27 2018 20:41:04 GMT+0000 (UTC)] INFO Starting socketIO client for python at 6466
[Tue Mar 27 2018 20:41:04 GMT+0000 (UTC)] INFO Starting socketIO client for python at 6466
...
File "/Users/ryanblakeley/Code/aiva/node_modules/spacy-nlp/src/client.py", line 12, in <module>
    import websocket
ImportError: No module named websocket

When I flip the python client to false in default.json, that same log section looks like:

[Tue Mar 27 2018 20:55:49 GMT+0000 (UTC)] INFO Starting socketIO client for js at 6466
[Tue Mar 27 2018 20:55:49 GMT+0000 (UTC)] INFO Starting socketIO client for python at 6466
[Tue Mar 27 2018 20:55:49 GMT+0000 (UTC)] INFO Deploying bot aivadev with adapter SLACK
[Tue Mar 27 2018 20:55:49 GMT+0000 (UTC)] INFO js fAtBTrvCivRxSSSCAAAA joined, 2 remains
[Tue Mar 27 2018 20:55:49 GMT+0000 (UTC)] ERROR Traceback (most recent call last):
  File "/Users/ryanblakeley/Code/aiva/node_modules/spacy-nlp/src/client.py", line 12, in <module>
    import websocket

[Tue Mar 27 2018 20:55:49 GMT+0000 (UTC)] ERROR ImportError: No module named websocket

So the duplicate python client line is gone, but the dependency error is still present.

As mentioned in #45, python3 -m pip show websocket looks like a positive result:

Name: websocket
Version: 0.2.1
Summary: Websocket implementation for gevent
Home-page: http://pypi.python.org/pypi/websocket
Author: UNKNOWN
Author-email: UNKNOWN
License: UNKNOWN
Location: /usr/local/lib/python3.6/site-packages
Requires: greenlet, gevent
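
Note that this is the websocket package (0.2.1, the old gevent-based one), not websocket-client, which is the library the socketIO client actually expects; the "incompatible websocket library" error earlier in this thread points at exactly that conflict. A quick way to tell the two apart (a sketch; create_connection is websocket-client's API, and spawning from node mirrors how aiva runs python):

// quick check: websocket-client exposes create_connection; the gevent-based
// `websocket` 0.2.1 package does not
const { execFile } = require('child_process')

execFile('python3', ['-c',
  'import websocket; print(hasattr(websocket, "create_connection"))'],
  (err, stdout, stderr) => {
    if (err) return console.error(stderr)
    console.log(stdout.trim())  // "True" means websocket-client is installed
  })

If it prints False, the usual fix would be pip3 uninstall websocket followed by pip3 install websocket-client.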
