postgresMgr.js null dereference #10

AmlingPalantir · 2015-09-18T22:36:53Z

Observed in (a private branch based on) 99fd1bf "Uncaught TypeError: Cannot call method 'removeAllListeners' of null" on line analogous to line 1634 in 99fd1bf.

Admittedly we are on a private branch with other changes and it is based pretty far back in history but I believe the problem is still relevant.

Having plastered the code with debug logging and reproduced, the basic order of events is that that query error handler happens, nulls out _pgClient, and then that client error handler happens and crashes.

The text was updated successfully, but these errors were encountered:

davepacheco · 2015-09-18T22:39:28Z

How reproducible is this problem for you? Are you (or can you) run with --abort-on-uncaught-exception and share the resulting core file?

AmlingPalantir · 2015-09-18T22:41:15Z

Uh, pretty reproducible (as in I can make it happen again with stress testing, not that it happens often). Due to the vagaries of how we're running it I'm slightly reticent to go down this path, especially when I think the nature of the crash is relatively clear from the debug logging I had added (both error handlers are called and in an order that makes the second one crash).

davepacheco · 2015-09-18T22:47:14Z

I think I understand what you're saying, but without understanding this better, I'm not sure what the most appropriate fix is. (I don't want to blindly add a defensive check that papers over other potential problems.) Do you happen to still have the debug output, or anything about which errors were emitted? Was it the same underlying error that was reported on both the query and the client?

Thanks again for reporting this.

AmlingPalantir · 2015-09-18T22:53:45Z

I super hate to be that guy, but I'm not sure what I can or can't share so I'm in a weird spot. At the least the error the query handler logged was:

{"name":"manatee-sitter","hostname":"XXX","pid":126104,"component":"PostgresMgr","level":40,"err":{"message":"terminating connection due to administrator command","name":"error","stack":"error: terminating connection due to administrator command\n at Connection.parseE (PATH/manatee/node_modules/pg/lib/connection.js:521:11)\n at Connection.parseMessage (PATH/manatee/node_modules/pg/lib/connection.js:351:17)\n at Socket. (PATH/manatee/node_modules/pg/lib/connection.js:101:22)\n at Socket.emit (events.js:95:17)\n at Socket. (stream_readable.js:765:14)\n at Socket.emit (events.js:92:17)\n at emitReadable (_stream_readable.js:427:10)\n at emitReadable (_stream_readable.js:423:5)\n at readableAddChunk (_stream_readable.js:166:9)\n at Socket.Readable.push (_stream_readable.js:128:10)","code":"57P01"},"msg":"got err","time":"2015-09-17T22:49:38.024Z","src":{"file":"PATH/manatee/lib/postgresMgr.js","line":1675,"func":"PostgresMgr._pgQueryFini"},"v":0}

The error the client error handler logged (right before dying) was:

{"name":"manatee-sitter","hostname":"XXX","pid":126104,"component":"PostgresMgr","level":50,"err":{"message":"read ECONNRESET","name":"Error","stack":"Error: read ECONNRESET\n at errnoException (net.js:905:11)\n at TCP.onread (net.js:559:19)","code":"ECONNRESET"},"msg":"got pg client error","time":"2015-09-17T22:49:38.026Z","src":{"file":"PATH/manatee/lib/postgresMgr.js","line":1615},"v":0}

This was immediately after it decided it was going to stop postgres to assume the role of sync (last normal line above first error was the "trying SIGINT" message).

davepacheco · 2015-09-18T22:57:19Z

Thanks. Interesting that they were two different errors.

AmlingPalantir · 2015-09-18T23:10:22Z

Sorry to be so sparing with the actual data, but the powers that be move
slowly. Here is an extremely minimal slice of that debugging log (just time
and msg, and maybe a minor comment of mine in square brackets):

2015-09-17T22:49:36.019Z _pgQueryKick dispatching request
2015-09-17T22:49:36.019Z start query 'end' callback
2015-09-17T22:49:36.019Z start _pgQueryFini
2015-09-17T22:49:36.020Z finish _pgQueryFini
2015-09-17T22:49:36.020Z end query 'end' callback
2015-09-17T22:49:37.021Z _pgQueryKick dispatching request
2015-09-17T22:49:37.021Z start query 'end' callback
2015-09-17T22:49:37.022Z start _pgQueryFini
2015-09-17T22:49:37.022Z finish _pgQueryFini
2015-09-17T22:49:37.022Z end query 'end' callback
2015-09-17T22:49:38.012Z incoming event [event was activeChange]
2015-09-17T22:49:38.019Z incoming event [event was clusterStateChange]
2015-09-17T22:49:38.020Z assuming role [role was sync]
2015-09-17T22:49:38.020Z PostgresMgr.reconfigure: entering
2015-09-17T22:49:38.021Z PostgresMgr._standby: entering
2015-09-17T22:49:38.021Z PostgresMgr.initDb: stop postgres
2015-09-17T22:49:38.021Z PostgresMgr.stop: entering
2015-09-17T22:49:38.021Z PostgresMgr.stop: trying SIGINT
2015-09-17T22:49:38.022Z _pgQueryKick dispatching request
2015-09-17T22:49:38.023Z start query 'error' callback
2015-09-17T22:49:38.023Z start _pgQueryFini
2015-09-17T22:49:38.024Z got err [this was the first error I pasted above]
2015-09-17T22:49:38.025Z _pgQueryFini: nuking _pgClient due to error
2015-09-17T22:49:38.026Z finish _pgQueryFini
2015-09-17T22:49:38.026Z finish query 'error' callback
2015-09-17T22:49:38.026Z start client error callback
2015-09-17T22:49:38.026Z got pg client error [this was the second error I pasted above]

AmlingPalantir · 2015-10-01T20:30:50Z

I put up a sketch of a rewrite of some of the state management which fixes the problem, for values of "fixes" equal to "based on my limited understanding of javascript" and "could no longer repro crash".

I got the go-ahead to post a core dump so if you'd still rather see that I guess I can remove the proposed fix locally, reproduce again, etc. I guess let me know either way.

davepacheco · 2015-10-06T17:58:24Z

Thanks for taking a swing at fixing this. I'm not sure I understand the new invariants, though. Before the change, the invariants were that there was only ever one active client, and it was represented by self._pgClient. There was only ever one outstanding request, and it's represented by self._pgRequestOutstanding. Your change obviously assumes that it's possible to get a client's error callback fired even after it's no longer the only client. Is that what happened in your test case? The change also assumes that when a query ends, it may not be the currently outstanding request. How can that happen?

At 1645 in the new file, it seems like you leak the 'error' listener with each new query. If we issue a request that succeeds, but there's a subsequent client error (say, 10 requests later), won't we fire 'error' callbacks for all of the successful requests that happened before it?

Thanks again for taking a look. I know this code is rather hairy, and it's also tricky to get right. (It could also be made a lot simpler with a fix for node-postgres#718.)

AmlingPalantir · 2015-10-06T19:14:31Z

Thanks for taking a swing at fixing this. I'm not sure I understand the new
invariants, though. Before the change, the invariants were that there was
only ever one active client, and it was represented by self._pgClient. There
was only ever one outstanding request, and it's represented by
self._pgRequestOutstanding. Your change obviously assumes that it's possible
to get a client's error callback fired even after it's no longer the only
client. Is that what happened in your test case?

Uh, I am not strictly sure whether or not I observed that. I chose
"self._pgClient !== selfClient" over "self._pgClient !== null" because I think
it's going to be safer and more resilient (either to mistakes or to future
mucking in this code).

The change also assumes that when a query ends, it may not be the currently
outstanding request. How can that happen?

Again, I don't have an execution path that leads to that. Maybe client errors
(killing request) but query actually then completes? I don't know if that's
possible due to extreme lack of knowledge of what the postgres library
guarantees or doesn't. As I implied earlier, most choices in what I wrote are
informed by fear and desire for resilience in the face of uncertainty (about
what current libraries do, about what people might change surrounding code to
do, etc.).

At 1645 in the new file, it seems like you leak the 'error' listener with
each new query. If we issue a request that succeeds, but there's a subsequent
client error (say, 10 requests later), won't we fire 'error' callbacks for
all of the successful requests that happened before it?

Uh, sure does seem that way. At the least _pgQueryFini won't do anything
stupid since it will check the outstanding request and log the error as
spurious (presumably several times). Maybe the other two handlers (query error
and query completion) should remove that one listener specifically? Let me go
actually learn enough JS to figure out how to do that...

AmlingPalantir · 2015-10-09T20:07:40Z

Updated the PR (now that the new cut has been through a few nights of not breaking).

AmlingPalantir mentioned this issue Sep 30, 2015

Fix PG client/request queue state management. #11

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

postgresMgr.js null dereference #10

postgresMgr.js null dereference #10

AmlingPalantir commented Sep 18, 2015

davepacheco commented Sep 18, 2015

AmlingPalantir commented Sep 18, 2015

davepacheco commented Sep 18, 2015

AmlingPalantir commented Sep 18, 2015

davepacheco commented Sep 18, 2015

AmlingPalantir commented Sep 18, 2015

AmlingPalantir commented Oct 1, 2015

davepacheco commented Oct 6, 2015

AmlingPalantir commented Oct 6, 2015

AmlingPalantir commented Oct 9, 2015

postgresMgr.js null dereference #10

postgresMgr.js null dereference #10

Comments

AmlingPalantir commented Sep 18, 2015

davepacheco commented Sep 18, 2015

AmlingPalantir commented Sep 18, 2015

davepacheco commented Sep 18, 2015

AmlingPalantir commented Sep 18, 2015

davepacheco commented Sep 18, 2015

AmlingPalantir commented Sep 18, 2015

AmlingPalantir commented Oct 1, 2015

davepacheco commented Oct 6, 2015

AmlingPalantir commented Oct 6, 2015

AmlingPalantir commented Oct 9, 2015