Solr Backups not running #447

Open
maxkadel opened this issue Jan 10, 2025 · 6 comments

@maxkadel
Contributor

Expected behavior

On the Solr 8 servers (e.g. lib-solr-prod7.princeton.edu), the directory /mnt/solr_backup/solr8/production should have one subdirectory per day, each containing .bk backup files.

Actual behavior

The /mnt/solr_backup/solr8/production directory contains one subdirectory per day, but they are all empty.

Steps to replicate

SSH into a Solr 8 box and ls the subdirectories in /mnt/solr_backup/solr8/production.
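
For checking many days at once, the manual ls can be scripted. This is a minimal sketch, assuming the layout described above (one subdirectory per day, with backups whose names end in .bk); the function name empty_backup_days is hypothetical and not part of the project.

```python
from pathlib import Path


def empty_backup_days(root):
    """Return the names of day subdirectories under root that hold no *.bk entries."""
    empty = []
    for day in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        # A day counts as empty when nothing inside it matches *.bk
        # (glob matches both files and directories).
        if not any(day.glob("*.bk")):
            empty.append(day.name)
    return empty


# Example call on a Solr box (path taken from the issue description):
# print(empty_backup_days("/mnt/solr_backup/solr8/production"))
```

An empty list would mean every day directory has at least one backup; in the state reported here, every day since the failures began would be listed.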

Impact of this bug

We cannot restore production data if there is an issue.

Implementation notes, if any

@maxkadel maxkadel added the bug label Jan 10, 2025
@hackartisan
Member

hackartisan commented Jan 13, 2025

If invoked directly, the backup runs on staging. (It looks like the schedule.rb configuration won't ever execute on the staging machine, since it's not in the :db group, so staging backups aren't running regularly. I assume this is on purpose.)

One difference I notice is that on staging the directories are group-writable, and on prod they are not. Perhaps that is the issue?

I double-checked the refactors I made in #440 and they seem to produce the same API URLs.

@hackartisan
Member

The async request status ids are written to /tmp/solr-backup.log

Here is an example of the output I get querying one of those request statuses:

deploy@lib-solr-prod7:~$ curl "http://localhost:8983/solr/admin/collections?action=REQUESTSTATUS&requestid=figgy-production-202501132037"
{
  "responseHeader":{
    "status":0,
    "QTime":11},
  "Operation backup caused exception:":"java.nio.file.AccessDeniedException:java.nio.file.AccessDeniedException: /mnt/solr_backup/solr8/production/20250113/figgy-production-20250113.bk",
  "exception":{
    "msg":"/mnt/solr_backup/solr8/production/20250113/figgy-production-20250113.bk",
    "rspCode":-1},
  "status":{
    "state":"failed",
    "msg":"found [figgy-production-202501132037] in failed tasks"}}
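
A response like the one above can also be checked from a script rather than read by eye. The following is a rough sketch, assuming the JSON shape shown in this output (a top-level "status" object with "state" and "msg", plus an optional "exception" object); other Solr versions may shape the response differently, and the function name summarize_request_status is hypothetical.

```python
import json


def summarize_request_status(body):
    """Return (state, detail) from a REQUESTSTATUS JSON response body."""
    data = json.loads(body)
    status = data.get("status", {})
    state = status.get("state", "unknown")
    detail = status.get("msg", "")
    # When the operation raised, the exception message is more useful
    # than the generic status message.
    exception = data.get("exception")
    if exception:
        detail = exception.get("msg", detail)
    return state, detail
```

For the response above this returns the state "failed" together with the path that triggered the AccessDeniedException.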

@hackartisan
Member

More observations:

  • The last successful prod backup was on 2024/12/30. The one on the 31st failed, and all of them since then have failed.

  • On prod boxes 8 and 9 the mount directories are owned by root:root. Maybe these mounts got messed up somehow on the 30th. The Solr backup requires all cloud machines to share the mounted directory, so the backup is probably distributed across them, and a permissions error on one machine may make the whole thing fail.
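
The owner and the group-writable bit mentioned in this thread can be compared across boxes with a small check. This is a sketch only: the function name describe_dir is hypothetical, and which owner and mode count as correct is defined by the Ansible role, not by this code.

```python
import grp
import os
import pwd
import stat


def describe_dir(path):
    """Report owner, group, permission bits, and group-writability of path."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid has no passwd entry on this machine
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid has no group entry on this machine
    return {
        "owner": owner,
        "group": group,
        "mode": oct(stat.S_IMODE(st.st_mode)),
        "group_writable": bool(st.st_mode & stat.S_IWGRP),
    }


# Example call on a Solr box (path taken from the issue description):
# print(describe_dir("/mnt/solr_backup/solr8/production"))
```

Running the same call on each box in the cloud and comparing the results would show whether boxes 8 and 9 differ from the others.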

@hackartisan
Member

I checked ansible-alerts for anything run on Dec 30th that could be relevant, but the closest thing I found was an Ansible run that updated the operating system packages on [lib-solr-prod8.princeton.edu](http://lib-solr-prod8.princeton.edu/) on the 31st. That ran at 6-something am Eastern; the backups would have run at midnight UTC, which is actually the evening of the 30th Eastern, so they would already have finished by then. So this doesn't seem to have caused the failure.
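
The time zone arithmetic behind that conclusion can be double-checked with a short snippet. It assumes a fixed UTC-5 offset for Eastern time, which holds in late December but not during daylight saving time; the name utc_to_est is illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Eastern Standard Time as a fixed UTC-5 offset (valid in winter only).
EST = timezone(timedelta(hours=-5), "EST")


def utc_to_est(dt_utc):
    """Convert a timezone-aware UTC datetime to Eastern Standard Time."""
    return dt_utc.astimezone(EST)


# The scheduled backup for the 31st, at midnight UTC.
backup_run = datetime(2024, 12, 31, 0, 0, tzinfo=timezone.utc)
print(utc_to_est(backup_run).isoformat())  # 2024-12-30T19:00:00-05:00
```

Midnight UTC on the 31st is 7 pm Eastern on the 30th, roughly eleven hours before the package update at 6-something am Eastern.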

@hackartisan
Member

hackartisan commented Jan 14, 2025

The solrcloud role should set the correct user when adding the mount:

https://github.com/pulibrary/princeton_ansible/blob/00513a948b432be32ddd00b3fafae651bb969b2b/roles/solrcloud/tasks/main.yml#L32

The id looks right on box 8. Maybe the role just needs to be run again for some reason.

@maxkadel
Contributor Author

It looks like we had a PR to fix the permissions issue previously - not clear why the permissions would have changed again.
