Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significantly reduce load time on the Show page for large datasets #1989

Merged
merged 7 commits into from
Nov 20, 2024

Conversation

hectorcorrea
Copy link
Member

@hectorcorrea hectorcorrea commented Nov 15, 2024

Although we already loaded the file list via AJAX on the Show page, we still had several calls to fetch the file list as we rendered the page (e.g. to calculate and show the total file size.) For large works (e.g. works with 60K files) this calculation takes 30+ seconds which mean the page took 30+ seconds to load to the user.

In this PR I moved all the code that depends on the file list to display those values after the page has been rendered and the file list has been fetched (via AJAX).

While troubleshooting this issue I also noticed that our code to detect changes in the Upload Snapshots was incredibly slow on this kind of large datasets. Comparing 60,000 files 60,000 times for deletes + adds/modifications was very slow. Even thought this work is done in the background our implementation would take minutes to complete. So I updated the code that calculates the deletes + adds/modifications to use a binary search (instead of a regular Array include?) and that improved performance significantly (3-4 seconds instead of 1+ minute).

Closes #1623

This has been deployed so Staging and it can be tested with this work that has 40K files: https://pdc-describe-staging.princeton.edu/describe/works/650

@hectorcorrea hectorcorrea changed the title WIP Reduce the load time on the show page Nov 15, 2024
… line but rather pushing the calculation to happend in the AJAX call for the file list.
…and deletes by using a binary search rather than a plain include
@hectorcorrea hectorcorrea changed the title Reduce the load time on the show page Significantly reduce load time on the Show page for large datasets Nov 20, 2024
# can consume. The `data` elements includes the work's file list all other
# properties are used for displaying different data elements related but not
# directly on the DataTable object (e.g. the total file size)
def file_list_ajax_response(work)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice how we return the file list plus a bunch of other values that depend on the file list

// The JSON payload includes a few extra properties (fetched via AJAX because
// they be slow for large datasets). Here we pick up those properties and update
// the display with them.
$("#total-file-size").text(json.total_size_display);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We update the page with the extra values that now come on the AJAX response

# much faster when the list of files is large. Notice that the binary search
# requires that the list of files is sorted.
# See https://ruby-doc.org/3.3.6/bsearch_rdoc.html
if s3_filenames_sorted.bsearch { |s3_filename| filename <=> s3_filename }.nil?
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the other optimization, notice the use of Binary Search bsearch instead of a normal include?


def log_performance(start, message)
elapsed = Time.zone.now - start
if elapsed > 20
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can tweak the threshold, but I thought it would be good to start logging as warning slow requests

@hectorcorrea hectorcorrea marked this pull request as ready for review November 20, 2024 20:58
{
data: files,
total_size:,
total_size_display: ActiveSupport::NumberHelper.number_to_human_size(total_size),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

# much faster when the list of files is large. Notice that the binary search
# requires that the list of files is sorted.
# See https://ruby-doc.org/3.3.6/bsearch_rdoc.html
if s3_filenames_sorted.bsearch { |s3_filename| filename <=> s3_filename }.nil?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

# Calculates the total file size from a given list of files
# This is so that we don't fetch the list twice from AWS since it can be expensive when
# there are thousands of files on the work.
def total_file_size_from_list(files)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@hectorcorrea hectorcorrea merged commit 18bb510 into main Nov 20, 2024
5 checks passed
@hectorcorrea hectorcorrea deleted the 1623-slow-show-page branch November 20, 2024 21:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Works with more than 10,000 files
3 participants