-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
lib/client.js this.activeQuery sometimes null #3174
Comments
First of all: sorry that's happening. That's troubling to hear. Do you have any self contained way to reproduce the issue or more about your environment you could describe? Having a way to reproduce this would be wonderful as I'm sure its an easy fix when it comes down to it. That code is fine to completely skip or comment out on a fork of your own. I doubt you're even using the Are you using pg-bouncer? I'll take a look at the code a bit more in the coming weeks & keep this issue in mind. I'm in the process of spinning down my full-time job right now so I can focus more on open source, starting early April! |
Hello again and thanks for the quick reply!! After further review of past 24 hours we have also found another seemingly related trace:
This seems to be pointing to: node-postgres/packages/pg/lib/client.js Line 382 in 91de4b9
Another optimistic edit would be change this: _handleCommandComplete(msg) {
// delegate commandComplete to active query
this.activeQuery.handleCommandComplete(msg, this.connection)
} To this: _handleCommandComplete(msg) {
// delegate commandComplete to active query
if (this.activeQuery) {
this.activeQuery.handleCommandComplete(msg, this.connection)
}
} More about our environment: We are Timescale customers and we use their managed db as a service In our prod environment we do make use of pgbouncer we are at version 1.21.0-p2 Additionally we are not using the pg native option Thanks again! |
Many of our services use a mix of SELECT, UPDATE, and INSERT however I can say that for one of the services which has been effected today this service is 100% SELECT statements and the service is not using prepared statements. |
Thanks! The only thing I can think of that's triggering this is postgres response messages are coming in "out of order" and the client is getting confused. Not sure if pgbouncer or pgproxy is the culprit or something else but this is how the (extended) protocol works (mostly found here)
Only when the pg client receives My hunch (might be wildly wrong) is that one of the proxies or pg-bouncer is getting the postgres response messages out of order or sent to the wrong client? Particularly when you say this:
If a service isn't using prepared statements then its never even getting into the extended query protocol and those clients should never receieve a |
I think something I could work on with ya at some point possibly is at least making this not a catastrophic uncatchable error which is likely crashing the node process, but instead makes the client emit an error you can handle. This way you could catch it and then (I would highly recommend) close the client out & throw it away as it somehow got its internal state machine poisoned w/ what is likely out of order messaging (which is v strange) |
OK I've already forked the repo and I've started down the path of making some edits to the lib which we can deploy to prod
About this we use pg.Pool in every service which uses pg, is it possible to tell the pool to destroy/evict a specific connection? Thanks!! |
OK I've modified lib/client.js to check that this.activeQuery is non-null before every access and I think this code is probably good to deploy to the SELECT-only prod service. About the "stalled connection" we use |
Note that “prepared statements” in the sense of the extended query protocol also includes any query with a parameter. Like the other times this error has come up in issues: I would definitely not suggest ignoring the problem by checking for
is absolutely a possibility.
|
Hey @charmander thanks for writing back! OK I was not aware that parameterized queries count towards prepared statements we do use 99% parameterized queries We have a support plan with Timescale and I can point them towards this issue to see if maybe they can help server-side So since we are in the context of the extended query protocol it does make sense to me that maybe the naive approach of skipping over the Like results/rows for one query are returned to the wrong caller If you have any recommendations of further instrumentation/debugging we can add to our stack to investigate this we'll do what we can to get it in place Thanks again!! |
Could it be of any use to trial the pg native feature with our SELECT only service ? |
You could try that out - it might not have the problem. It might throw away messages out of order internally or be more lax w/ them in some way? Note: node-libpq doesn't compile on node 20 yet - that's also on my list for next month!
Two ways to do this:
Sorry yeah I should have been more clear - any query which contains parameters goes through the "extended query protocol."
In side I gotta go to bed - got my eyes dialated today and man alive it sucks to read. I'll be back w/ more during the day tomorrow. Sucks so bad we can't reproduce this in any isolated way - would really love to figure out what this is and fix it. |
Hey @brianc oh my yes I remember the last time I had my eyes dilated and I had to find my way onto the subway after it was not easy! Last night just before I passed out something @charmander raised about the parameterized query made me remember about the new prepared statement support in pgbouncer.
https://www.pgbouncer.org/config.html#max_prepared_statements In prod we are using pgbouncer 1.21.0 so the max_prepared_statements config param is supported but we have not specified a value for this in our config and the docs say the default value is zero (disabled). For pgbouncer we use the "transaction" pooling mode and when it was first recommended to us to use pgbouncer we asked about this parameterized queries and it was explained to us that parameterized queries dont count towards prepared statements but it seems maybe that was not correct. It seems likely to me from both of your responses that we should probably give Just a question about this bit
I think what they are saying is if the client lib is using the binary protocol to prepare statements then pgbouncer will rewrite and support them but if the library is using text strings this pgbouncer feature will not work with transaction pooling mode, do you read this the same way and can you say if the library is using the binary protocol or text strings? About rolling this out I am going to spend the day doing some testing locally with Thanks!! |
yeah I am pretty sure what that's saying is if you put a As far as I know there's not a lot of difference between a prepared statement and a parameterized query. From my understanding every time you execute a parameterized query you technically create a prepared statement without a name so every parameterized query replaces the unnamed prepared statement. If you do this await client.query({
text: 'SELECT $1::text',
params: ['hello world'],
}) you end up telling postgres
However, if you do this await client.query({
text: 'SELECT $1::text',
params: ['hello world'],
name: 'hello world query'
}) you end up telling postgres
the next time you execute a query on the same client named For example this: await client.query({
text: 'SELECT $1::text',
params: ['goodbye for now'],
name: 'hello world query'
})
Effectively you skip the parse step (which is where the query is planned for the most part AFAIK). This saves 1 network roundtrip and perhaps a millisecond or two for query planning. I wouldn't recommend using this all the time, but I've used in the past when we had a couple extremely large query texts (like 1000 lines long) and we execute that query regularly, we can skip the network round-trip of sending a 100kb of query text over the network hundreds of time a second. In my experience its a micro-optimization that doesn't really do much...but anyways I'm just getting into the weeds here to explain how things work, and to explain why I often conflate the terms "prepared statement" and "parameterized query" - they're very, very similar from a protocol perspective. |
@brianc thanks a lot for this explanation and how you describe it makes good sense to me, I spent this morning trying to create a POC which would reproduce the bug locally however I have yet to reproduce it maybe you have some ideas on what can be changed to increase the chance of reproducing https://github.com/rhodey/pgbouncetest What the test does
The idea is to create a large number of different parameterized queries and exec them in parallel while connected to pgbouncer with max_prepared_statements disabled. What I think is happening in prod is that ConnectionA is getting prepared statement messages which are meant for a query on ConnectionB do you agree? My laptop is relatively low performance w/ 4 CPUs and I was unable to have any of the children exit with an error, I had a co-worker with a 24 CPU machine run this test for 5 minutes and they also had no children processes crash. Note that pgbouncer max_db_connections is set to 2 which I believe is what you'd want to maximize the chance of one connection seeing prepared statement messages from another connection. I would be real happy to reproduce this issue locally and then have the max_prepared_statements = 200 config prevent the problem and then it would be a no brainer to deploy this to prod but so far that hasn't happened I think if today goes by and we are still unable to reproduce the issue locally before tonight we may still end up deploying a modified pgbouncer with max_prepared_statements = 200 Actually in prod today we have yet to see a single instance of either of these errors turn up, not since last night March 20 22:26 Eastern Time Something about this test is that it is unlikely that any of the children get one or more rows returned from the random queries just based on how the code has been designed, I did not think it was important to have the queries returning data but perhaps you would know better and I could refactor, Thanks again!! |
We are also seeing this error / stack trace in production. In our case, I believe this started appearing after we introduced an endpoint that's using pg-query-stream. Our app is connecting to Cloud SQL on GCP through their proxy and using an ORM, but I've been trying to get a minimal repro. I can trigger this locally by issuing dozens of concurrent requests which abort after a short time ( In this case, I'm connecting over localhost to a pg server running in docker. The extended protocol doesn't seem to be required. This is selecting from our actual schema; I've not yet been able to reproduce with a single dummy db/table which looks similar, and I'm not sure why that might be. const pool = new pg.Pool({ ...clientConfig, allowExitOnIdle: true, max: 10 });
const server = createServer(async (req, res) => {
const client = await pool.connect();
const query = new QueryStream(`select id from ${tableName} limit 100`);
const stream = client.query(query);
res.writeHead(200, { "Content-Type": "application/json" });
await pipeline(stream, new Serializer(), res)
.then(() => {
client.release();
})
.catch((err) => {
client.release();
if ("code" in err && err.code === "ERR_STREAM_PREMATURE_CLOSE") {
// client abort
} else {
console.error(`stream error`, err);
}
});
});
server.listen(9999, () => {
console.log("Server listening on port 9999");
});
class Serializer extends Transform {
constructor() {
super({
objectMode: true,
});
}
_transform(newChunk, _, callback) {
this.push(JSON.stringify(newChunk));
callback(null, "\n");
}
} |
Hello all, returning here to share some new information So after last writing this issue continued to show up for us approx once every 3 weeks and when it did happen it would happen with like 6 errors all in the span of a minute and then not again for 3 weeks or we would get clusters of ~6 errors every 45-60min for 24 hours Given that this error seemed to be so deep into the guts of this library and given that we were not able to artificially reproduce it we decided the quickest sure fix for us was to write a minimal postgresql library ourselves in java where postgresql drivers have many years of reliable success So we wrote a psql library in java and we spawn this java jar from node as a child process and pipe data to and from stdin & stdout it took about a week of continuous work These errors showed up again this Thursday and continued into Friday and we deployed our new lib to approx 10% of our services in prod last night and no errors through the night or this morning Bad news for us... this class of errors started showing up again only about 30 minutes ago today (Saturday) and this time we have services that are using this pg library crashing and services that our using our new java based pg library crashing also within the same minute. Overall not a great result for our team but this does make me more confident in the reliability of this pg library We still very much would like to identify the root of the issue here and I gotta say we tried using the java I mentioned this before but our stack is pgbouncer + timescale which is built atop postgres I also mentioned before that we do not use any named prepared statements we are at most just using the parameter binding with $1, $2, $3 etc I understand debugging this may not be top priority for this lib since the evidence seems to point to this being an issue with servers and not an issue with the clients but if this project was to change to help with this I think that since Pool is an instance of EventEmitter it could emit some of these situations as error events instead of hitting a null ref and dying or maybe the error event is not the right place for these messages to go they could be emitted as an event with name debug @bryfox can you write back to say if you happen to be using pgbouncer in your stack or timescale? thanks all, |
I think what I would like to see from this library is when it shows up that something is null and it shouldn't be null the promise for the specific query rejects with a descriptive error If the user has a ref to a Client and the whole client is now in an invalid state because of the null then the client emits an error event and future attempts to client.query fail And if the user has a ref to a Pool and the whole pool is not in an invalid state because of the null then the pool emits an error event and future attempts to pool.query fail I would like to say that if a Client or Pool emits an error then it is time to re-create a new client or a new pool but I have seen that Pool will sometimes emit errors for when connections in its pool close unexpectedly even if the connection is not being utilized by any active client or query() so this rule cannot be easily applied, instead could probably do some string matching on the text in the error event message in order to know if the pool should be recreated |
I don’t see why there would be any need to recreate the pool. Emitting |
Thanks for writing @charmander, I'm glad to hear you like the This issue is a priority for us and we would like to know what the root cause is also but we are totally unable to reproduce this issue artificially so to be honest our focus first is on changing our situation so that these errors no longer lead to crashes and IMO at this time that means catching the errors and retrying the failed query. I would like to add to this library but just to be honest again I need to do whats going to have the best chance of working ASAP for the company I'm working for and that may mean adding a retry policy to this new java-based lib we started because in this java lib the error is already being thrown we just need to catch it. I'm trying to think of what I could do to patch this lib to emit the I'll be working on this issue by adding to our java lib or by adding to this node lib starting tomorrow morning, if you've got any ideas for how this could be patched into the node js lib safely and efficiently lmk and maybe that will make sense for our co and I can help. Thanks again! |
We are not. We're using Cloud SQL (GCP), through the google auth proxy. The auth proxy runs in a sidecar container; it can disconnect unexpectedly, but it rarely does. PG version 13.13. pg-pool on the client side. We have only seen this when streaming responses using Our current workaround is to close the client connection when the underlying stream encounters an error (including client abort — but so far aborts are the only errors we've seen), instead of trying to reuse it. We haven't observed this in production since then. |
thanks for writing back @bryfox ! I have some updates to share, so we are having early success with adding a retry policy to our node+java psql library, we are still experiencing these seem-to-be protocol level errors but for the 15% of our stack which we've deployed this new psql lib to the exceptions from the protocol errors are caught and the queries are retried with 600ms delay and they succeed on the first or second retry I added a debug logging feature to this node+java psql library so that we get the full java stack trace for the errors which are retried and so far we have two distinct stack traces showing up each come from within the default java postgresql library so again we are fairly certain at this point our applications are not at fault We plan to push this new lib out to another section of our stack tonight and we intend to collect more of the stack traces so we can better categorize the error, I will share the result of this here with this thread and we will probably publish our node+java library also so the stack traces can be better understood in context. This new node+java lib that we wrote is approx 5-15ms slower than pg and this is acceptable for a majority of our services but for some of our services which serve user-facing queries we cannot accept this performance hit so we are motivated to contribute to this library namely catching these null access errors and emitting them as I expect to begin the work to emit these Thanks! |
Hello all I have some updates to share we are able to share the source code for our java+node psql driver and we have been able to collect more detailed stack traces from this library versus what we have gotten from pg Here is the driver we wrote to attempt to reproduce the issue: https://github.com/rhodey/java-pipe-sql And here are what we have learned from the stack traces: https://github.com/rhodey/java-pipe-sql?tab=readme-ov-file#java-stack-traces The stack traces again point to protocol errors caused by the postgresql (timescaledb) server or caused by middleware such as pgproxy or pgbouncer |
Hey @brianc glad to report that we today finally caught this error in prod in one of the few services we deployed the patch to, the error was caught by an event listener on the Pool instance and it got logged to console like I wanted. I see that you have merged in the commit and done a release so we will be moving all our services to use the new release.
It makes sense to me that you can close this issue if you would like |
The following code triggers other unordered messages resulting in a similar scenario for local environments: Environment: Docker const client = await pool.connect();
const clientErrorListener = (error: Error) => {
console.error("Error in client (cursor): ", error.message);
};
client.on("error", clientErrorListener);
const firstCursor = client.query(
new Cursor("SELECT 1", undefined, { rowMode: "array" })
);
firstCursor.read(100, () => console.log("Read first cursor."));
await new Promise((res) =>
firstCursor.close(() => {
console.log("Close first cursor.");
res(true);
})
);
client.on("error", clientErrorListener);
const secondCursor = client.query(
new Cursor("SELECT 1", undefined, { rowMode: "array" })
);
secondCursor.read(100, () => console.log("Read second cursor."));
secondCursor.close();
// Extra time to get the fauly response.
expect(await new Promise((res) => setTimeout(() => res, 100))); Errors: 1. TypeError: Cannot read properties of null (reading 'handleRowDescription')
2. TypeError: Cannot read properties of null (reading 'handleDataRow') Fix (the same as in #3289): _handleRowDescription(msg) {
if (this.activeQuery == null) {
const error = new Error('Received unexpected _handleRowDescription message from backend.')
this._handleErrorEvent(error)
return
}
// delegate rowDescription to active query
this.activeQuery.handleRowDescription(msg)
}
_handleDataRow(msg) {
if (this.activeQuery == null) {
const error = new Error('Received unexpected _handleDataRow message from backend.')
this._handleErrorEvent(error)
return
}
// delegate dataRow to active query
this.activeQuery.handleDataRow(msg)
} |
@brianc et al, this may be a silly question -- apologies. Am I correct in thinking that these out-of-order messages -- which show up here and elsewhere -- are impossible to manage, and the best we're going to get is catch-able errors? If not, am I missing an issue / PR where a possible fix is being discussed? |
@andrewhoughton-sureify Right: in general, there’s no correct way of handling bad messages if the database is part of your application; it’s a critical bug that can result in silently dangerously incorrect behavior. The most commonly identified cases seem to have to do with (misuses of?) pg-cursor. I haven’t looked into exactly how broken things can get there, but for anyone else who hasn’t, it’s best to assume the worst and fix the bug (possibly by avoiding pg-cursor entirely) instead of deploying a patched version of pg that checks for and ignores this condition/just closes the offending client. It’s still a good idea for pg to expose these errors nicely (including less common use cases, like connecting to PostgreSQL servers that aren’t part of the app using pg), but there is no safe action to take when you catch one. It’s like catching a segfault. |
Hello All, thanks to all the contributors who keep this project going, I have an issue to report
Before today I ran into this issue in prod approx 6 times total in ~12 months, but today for some reason this issue has turned up six times.
We have approx 24 services with 200 tasks running on AWS ECS and 90% of these services make use of this pg library in one way or another, another maybe important note is that they all use the pg.Pool
So the stack trace we get in prod from multiple services is:
So most of that stack trace does not say a lot but I think what it is saying is to see this line here:
node-postgres/packages/pg/lib/client.js
Line 389 in 91de4b9
And it must be saying that
this.activeQuery
is nullNow I could open a PR and modify this function:
And simply change it to:
There is a chance that is all that is needed but someone more knowledgeable about the library would know if this is maybe a symptom of a larger problem. Anyone online care to comment ??
I am concerned about patching the library and testing it out on our prod stack because I don't want my edit to cause the lib to somehow be confused and make a serious error.
Can anyone comment on if this edit seems like a safe thing to try we are not able to reproduce this outside of prod.
I can say that the services effected today none of them use transactions maybe that is helpful info
pg version 8.10.0
Thanks!!
The text was updated successfully, but these errors were encountered: