Loadbalancing Irregularity #116
So I went ahead and blew away my retention.dat files on all three naemon machines and then truncated the report_data and notification tables to try and 'start over'.
Do I have a configuration error somewhere? I was thinking that having check_freshness enabled could be messing up Merlin. Or is that necessary for making the redundancy work?
I actually spoke too soon. I did find a particular service check that now appears to be de-synchronized:
node_b was the machine that was not running naemon/merlin for 12 hours. Its report_data table seems OK in comparison to the others:
But looking in the retention.dat files ....
I can see they aren't synchronized. There are some other metrics that aren't matching up either. These appear to be different no matter what:
(and obviously check_latency) However the others that differ on the machine that was off-line:
Merlin works (at least now; I can't speak for 6 years ago) by sending the results of checks to all nodes. This way all nodes should see the same checks, even if they are executed elsewhere. Naemon doesn't really know where the check results come from. So I would imagine it is normal behaviour for Naemon to log the check results/alerts across all nodes, although I have not spent any time really looking into this.
The issue with notifications being sent from the node that went offline is a bit strange. Do you have any logs for the notifications that were sent out?
We don't generally touch the retention data file, although some use Merlin's built-in file sync to sync it. That usually means that if nodes come up at different stages, it can be a little out of sync. Normally things like
This is the correct place for issues etc., so that's all good! Did you see references to op5.org anywhere? We should get rid of those.
Hi, so sorry for my delayed response. I appreciate your attention to my initial issue. I've been running in production for quite some time now - and things have been relatively stable. We had one 'split-brain' incident on our network where one of our data centers had some serious network problems (and thus one of our monitoring nodes did as well) while the other two monitoring nodes (and parent data centers) remained fully functional. The system weathered through that better than I would have expected. I've noticed that our merlin tables have just continued to increase in size. I wonder if these should be truncated occasionally?
node_b:
node_c:
I also encountered an odd notification logic issue the other night which has stumped me. There are two examples, first an incident where the host notification escalation kicked in as it should have:
Then in this case 'dcops_bogus' was never notified first, it simply jumped straight to paging:
Here is the relevant configuration:
And then the host escalation:
The logic behind the hostescalation dates back over a decade. I discussed it briefly with my counterpart who helped build this at the time, and we both suspect it was a workaround, likely from before 'first_notification_delay' existed as an option. So I may just simplify this scenario by removing the escalation logic entirely (it's confusing) and using 'first_notification_delay' instead.
Yes, this is a problem with Merlin's architecture, given that there are no consensus algorithms or anything like that. For now you either live with it and adjust the report data manually when required, or you put all peers as close together as possible on the network, preferably behind the same switch.
We don't make any assumptions about what kind of retention people want on their report data. I think it does make sense to truncate the tables once in a while, perhaps after generating a report over the period. Merlin only logs state changes, not every check, so usually it's fine to keep the report_data for a few years.
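As a rough illustration of pruning by age rather than truncating everything, here is a minimal sketch. It is not Merlin's own tooling. It uses an in-memory SQLite table as a stand-in (a real Merlin database is typically MySQL/MariaDB, where the parameter placeholder would differ), and the table layout is reduced to the columns mentioned in this thread ('id', 'timestamp', 'retry'). The function name `prune_report_data` and the sample rows are hypothetical.

```python
import sqlite3

def prune_report_data(conn, cutoff_epoch):
    """Delete report_data rows older than cutoff_epoch; return rows removed."""
    cur = conn.execute(
        "DELETE FROM report_data WHERE timestamp < ?", (cutoff_epoch,)
    )
    conn.commit()
    return cur.rowcount

# Stand-in table: assumed, simplified schema, not the real Merlin one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE report_data ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, timestamp INTEGER, retry INTEGER)"
)
conn.executemany(
    "INSERT INTO report_data (timestamp, retry) VALUES (?, ?)",
    [(1000, 1), (2000, 2), (3000, 3)],  # hypothetical rows
)

removed = prune_report_data(conn, 2500)
remaining = conn.execute("SELECT COUNT(*) FROM report_data").fetchone()[0]
print(removed, remaining)
```

Generating the report for the period first and backing up the table before deleting would be the cautious order of operations.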
Yes, that does look slightly weird. Perhaps worth investigating if you see similar behaviour without Merlin installed. I haven't heard of anyone with issues regarding
My apologies for posting this here - I was looking for the op5 mailing list (which I was subscribed to years ago) but it appears op5.org is gone ...
I am switching from a very old version of Nagios/Merlin ... probably 6 years old at least. I am seeing some behavior with the newer version that I do not understand. I am running Naemon 1.2.4-1 and pulled Merlin from git today.
In the past, the nagios node running a check would log the 'SERVICE ALERT' for that check. I had all three of our peered nagios/merlin machines syslogging to each other so I had a nice view of where checks were being run and what was happening.
With the new version it appears certain events are being "echoed" by all the naemon daemons like this:
However, on some occasions the behavior is much different:
In the second example only "node_b" and "node_c" appear to be echoing these events, but what is additionally concerning is that the retry values do not sync up correctly, i.e. at 21:41:37 node_b thought this was the 55th retry while node_c thought it was the 63rd.
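One way to spot this kind of disagreement systematically is to parse the alert lines from each node and compare the attempt number per event. The following is only a sketch: it assumes the usual Naemon log layout `[epoch] SERVICE ALERT: host;service;STATE;STATETYPE;attempt;output`, and the sample lines, timestamp, and service name `load` are made up for illustration (only the host name and the attempt numbers 55 and 63 come from the description above). The function name `attempt_mismatches` is hypothetical.

```python
import re
from collections import defaultdict

# Assumed line layout:
# [epoch] SERVICE ALERT: host;service;STATE;STATETYPE;attempt;output
ALERT_RE = re.compile(
    r"\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);"
)

def attempt_mismatches(lines_by_node):
    """Return events whose attempt number differs between nodes."""
    seen = defaultdict(dict)  # (ts, host, service) -> {node: attempt}
    for node, lines in lines_by_node.items():
        for line in lines:
            m = ALERT_RE.search(line)
            if m:
                key = (int(m["ts"]), m["host"], m["service"])
                seen[key][node] = int(m["attempt"])
    # Keep only events where the nodes logged different attempt numbers.
    return {k: v for k, v in seen.items() if len(set(v.values())) > 1}

# Hypothetical sample lines, one per node.
sample = {
    "node_b": ["[1500000000] SERVICE ALERT: pdu-f10;load;CRITICAL;SOFT;55;example output"],
    "node_c": ["[1500000000] SERVICE ALERT: pdu-f10;load;CRITICAL;SOFT;63;example output"],
}
print(attempt_mismatches(sample))
```

Events are matched on the exact timestamp here, so lines that nodes log a second apart would not be paired, and a node that logged nothing for an event is not reported. Both would need extra handling in practice.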
Here is a mon node status:
I looked through the troubleshooting notes and did get identical hashes for this command:
mon node ctrl --type=peer -- mon oconf hash
But my object.cache files do not have the same hash value. I looked through the files side-by-side (and with diff) and it looks like Naemon is listing the 'members' of certain host/service/etc groups in a randomized order, which throws the hashes off from each other.
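To check whether member ordering is the only difference, the files can be hashed after sorting each 'members' list. This is a minimal sketch and not how `mon oconf hash` works. It assumes 'members' lines hold a comma-separated list, and the helper name `normalized_hash`, the group name `pdus`, and the host `pdu-f11` are made up for the example. Note that service group members are host,service pairs, so sorting individual tokens is only a rough comparison there.

```python
import hashlib

def normalized_hash(text: str) -> str:
    """Hash object.cache-style text with every 'members' list sorted first."""
    lines = []
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[0] == "members":
            # Sort the comma-separated members so their order no longer matters.
            members = sorted(m.strip() for m in parts[1].split(","))
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}members\t{','.join(members)}"
        lines.append(line)
    return hashlib.sha1("\n".join(lines).encode("utf-8")).hexdigest()

# Hypothetical fragments: the same group with members in a different order.
node_a = "define hostgroup {\n\thostgroup_name\tpdus\n\tmembers\tpdu-f10,pdu-f11\n\t}\n"
node_b = "define hostgroup {\n\thostgroup_name\tpdus\n\tmembers\tpdu-f11,pdu-f10\n\t}\n"

print(normalized_hash(node_a) == normalized_hash(node_b))
```

If the normalized hashes match across nodes while the raw hashes differ, ordering is the likely cause. If they still differ, a plain diff of the files should show a real configuration difference.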
I looked at the merlin database and found that the 'report_data' table had wildly different rows of data. Could have been the result of testing the system. So I truncated that table and started over ... now all three have roughly the same number of rows ... 612, 619, 618. But that hasn't really resolved the problem with what's getting logged by Naemon.
I did a specific query for the device in question "pdu-f10" in the report_data table, and all three databases have the same exact info ... 'timestamp' and 'retry' are all consistent ... but the 'id' for each row is different (which I guess makes sense since it's auto-increment). So one Naemon node logs the right retry value, another one logs a retry value that's less, and the other one logs nothing at all.
Again my sincere apologies for posting this as an 'issue' and not a general support request elsewhere!