Significantly reduce load time on the Show page for large datasets #1989
Conversation
… line but rather pushing the calculation to happen in the AJAX call for the file list.
…and deletes by using a binary search rather than a plain include?
# can consume. The `data` element includes the work's file list; all other
# properties are used for displaying data elements that are related to, but not
# directly on, the DataTable object (e.g. the total file size).
def file_list_ajax_response(work)
Notice how we return the file list plus a bunch of other values that depend on the file list
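For context, a minimal sketch of what this helper might assemble (the hash keys come from the hunk later in this conversation; `work.file_list` and the method body are assumptions):

```ruby
# Sketch of the AJAX response built by file_list_ajax_response.
# `work.file_list` stands in for however the file list is actually fetched;
# only the data/total_size/total_size_display keys are taken from this PR's diff.
def file_list_ajax_response(work)
  files = work.file_list
  total_size = total_file_size_from_list(files)
  {
    data: files,
    total_size: total_size,
    total_size_display: ActiveSupport::NumberHelper.number_to_human_size(total_size)
  }
end
```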
// The JSON payload includes a few extra properties (fetched via AJAX because
// they can be slow for large datasets). Here we pick up those properties and
// update the display with them.
$("#total-file-size").text(json.total_size_display);
We update the page with the extra values that now come in the AJAX response.
# much faster when the list of files is large. Notice that the binary search
# requires that the list of files is sorted.
# See https://ruby-doc.org/3.3.6/bsearch_rdoc.html
if s3_filenames_sorted.bsearch { |s3_filename| filename <=> s3_filename }.nil?
This is the other optimization: notice the use of binary search (`bsearch`) instead of a plain `include?`.
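For readers unfamiliar with the idiom, here is a standalone sketch (not the PR's code) contrasting the two lookups. In find-any mode, Ruby's `Array#bsearch` takes a block that returns `target <=> element` and returns the matching element, or `nil` when absent; it requires a sorted array:

```ruby
filenames = ["a.tif", "b.tif", "c.tif"].sort # bsearch requires sorted input
target = "b.tif"

# Linear scan: O(n) per lookup
found_linear = filenames.include?(target)

# Binary search: O(log n) per lookup; the block returns -1/0/1 via <=>
found_binary = !filenames.bsearch { |f| target <=> f }.nil?

puts found_linear == found_binary # => true
```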
def log_performance(start, message)
  elapsed = Time.zone.now - start
  if elapsed > 20
We can tweak the threshold, but I thought it would be good to start logging slow requests as warnings.
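A minimal sketch of what the full helper might look like; the signature, the elapsed-time calculation, and the 20-second threshold come from the diff, while the `Rails.logger` calls and the message format are assumptions:

```ruby
# Logs how long an operation took; anything over 20 seconds is a warning.
def log_performance(start, message)
  elapsed = Time.zone.now - start
  if elapsed > 20
    Rails.logger.warn("#{message} (took #{elapsed.round(2)} s)")
  else
    Rails.logger.info("#{message} (took #{elapsed.round(2)} s)")
  end
end
```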
{
  data: files,
  total_size:,
  total_size_display: ActiveSupport::NumberHelper.number_to_human_size(total_size),
👍
💯
# Calculates the total file size from a given list of files.
# This is so that we don't fetch the list twice from AWS, since it can be
# expensive when there are thousands of files on the work.
def total_file_size_from_list(files)
👍
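A sketch of what such a helper might reduce to; only the method name and intent come from the diff, and the assumption that each entry responds to `#size` (e.g. an S3 object wrapper) is hypothetical:

```ruby
# Sums the sizes of an already-fetched file list so we never hit AWS twice.
def total_file_size_from_list(files)
  files.sum(&:size)
end
```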
Although we already loaded the file list via AJAX on the Show page, we still had several calls that fetched the file list as we rendered the page (e.g. to calculate and show the total file size). For large works (e.g. works with 60K files) this calculation takes 30+ seconds, which meant the page took 30+ seconds to load for the user.
In this PR I moved all the code that depends on the file list so that those values are displayed after the page has been rendered and the file list has been fetched (via AJAX).
While troubleshooting this issue I also noticed that our code to detect changes in the Upload Snapshots was incredibly slow on these kinds of large datasets. Comparing each of 60,000 files against a list of 60,000 files for deletes + adds/modifications is effectively O(n²), on the order of billions of comparisons, and it was very slow. Even though this work is done in the background, our implementation would take minutes to complete. So I updated the code that calculates the deletes + adds/modifications to use a binary search (instead of a regular Array#include?), which brings each lookup down to O(log n), and that improved performance significantly (3-4 seconds instead of 1+ minute). A sketch of the approach follows below.
Closes #1623
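To make the complexity argument concrete, here is a minimal sketch of the sort-then-bsearch diffing approach. The variable names (`s3_filenames`, `snapshot_filenames`) are hypothetical; only the technique (sorting once, then binary-searching each lookup) comes from this PR:

```ruby
# Detect deletes and adds between two filename lists in O(n log n)
# rather than the O(n^2) of repeated Array#include? calls.
s3_sorted = s3_filenames.sort
snapshot_sorted = snapshot_filenames.sort

# Files in the snapshot but no longer on S3 were deleted
# (reject keeps entries where bsearch returns nil, i.e. not found)...
deletes = snapshot_sorted.reject { |name| s3_sorted.bsearch { |f| name <=> f } }

# ...and files on S3 but not in the snapshot were added.
adds = s3_sorted.reject { |name| snapshot_sorted.bsearch { |f| name <=> f } }
```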
This has been deployed to Staging and can be tested with this work that has 40K files: https://pdc-describe-staging.princeton.edu/describe/works/650