Significantly reduce load time on the Show page for large datasets #1989

hectorcorrea · 2024-11-15T21:05:50Z

Although we already loaded the file list via AJAX on the Show page, we still made several calls to fetch the file list while rendering the page (e.g. to calculate and show the total file size). For large works (e.g. works with 60K files) this calculation takes 30+ seconds, which means the page took 30+ seconds to load for the user.

In this PR I moved all the code that depends on the file list to display those values after the page has been rendered and the file list has been fetched (via AJAX).

While troubleshooting this issue I also noticed that our code to detect changes in the Upload Snapshots was incredibly slow on this kind of large dataset: it compared 60,000 files 60,000 times to find deletes + adds/modifications. Even though this work is done in the background, our implementation would take minutes to complete. So I updated the code that calculates the deletes + adds/modifications to use a binary search (instead of a regular Array `include?`), and that improved performance significantly (3-4 seconds instead of 1+ minute).

Closes #1623

This has been deployed to Staging and can be tested with this work, which has 40K files: https://pdc-describe-staging.princeton.edu/describe/works/650

hectorcorrea · 2024-11-20T20:46:46Z

app/controllers/works_controller.rb

+    # can consume. The `data` element includes the work's file list; all other
+    # properties are used for displaying data elements that are related to, but not
+    # directly part of, the DataTable object (e.g. the total file size).
+    def file_list_ajax_response(work)


Notice how we return the file list plus several other values that depend on the file list (e.g. the total file size), so the list only has to be fetched once.
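To illustrate the idea, here is a minimal sketch in plain Ruby of building such a payload from a file list that has already been fetched. It is not the code in this PR: `FileEntry`, `file_list_payload`, and `human_size` are hypothetical names used only for this example, and `human_size` is a simplified stand-in for `ActiveSupport::NumberHelper.number_to_human_size` so the snippet runs without Rails.

```ruby
require "json"

# Hypothetical stand-in for the file records returned by the file listing.
FileEntry = Struct.new(:filename, :size)

# Sums the sizes from a list that was already fetched, so the list is not
# requested a second time.
def total_file_size_from_list(files)
  files.sum(&:size)
end

# Simplified stand-in for number_to_human_size (assumption: 1024-based units).
def human_size(bytes)
  units = %w[Bytes KB MB GB TB]
  value = bytes.to_f
  unit = units.first
  units.drop(1).each do |next_unit|
    break if value < 1024
    value /= 1024
    unit = next_unit
  end
  unit == "Bytes" ? "#{bytes} Bytes" : format("%.1f %s", value, unit)
end

# Builds one payload with the file list plus the values derived from it.
def file_list_payload(files)
  total_size = total_file_size_from_list(files)
  {
    data: files.map(&:to_h),
    total_size: total_size,
    total_size_display: human_size(total_size)
  }
end

files = [FileEntry.new("a.csv", 1024), FileEntry.new("b.csv", 512)]
payload = file_list_payload(files)
json = JSON.generate(payload)
```

Because everything is derived from the same `files` array, the expensive listing happens once per request and the page can fill in the totals when the AJAX response arrives.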

hectorcorrea · 2024-11-20T20:47:06Z

app/views/works/_s3_resources.html.erb

+          // The JSON payload includes a few extra properties (fetched via AJAX because
+          // they can be slow for large datasets). Here we pick up those properties and update
+          // the display with them.
+          $("#total-file-size").text(json.total_size_display);


We update the page with the extra values that now come in the AJAX response.

hectorcorrea · 2024-11-20T20:47:51Z

app/models/upload_snapshot.rb

+      # much faster when the list of files is large. Notice that the binary search
+      # requires that the list of files is sorted.
+      # See https://ruby-doc.org/3.3.6/bsearch_rdoc.html
+      if s3_filenames_sorted.bsearch { |s3_filename| filename <=> s3_filename }.nil?


This is the other optimization: notice the use of a binary search (`bsearch`) instead of a plain `include?`.
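For context, a rough sketch of the technique in plain Ruby follows. It is not the code in `upload_snapshot.rb`: `sorted_include?` and `snapshot_changes` are hypothetical names, and the sketch only compares filenames (it does not detect modified files). `Array#bsearch` is used in find-any mode, where the block returns the result of `target <=> element`, and both arrays are sorted first because the binary search requires it.

```ruby
# Returns true when `name` is present in `sorted_names`.
# Assumes `sorted_names` is sorted with the default String ordering.
def sorted_include?(sorted_names, name)
  !sorted_names.bsearch { |candidate| name <=> candidate }.nil?
end

# Compares two filename lists and reports which names were deleted or added.
# Each lookup is O(log n) instead of the O(n) scan done by Array#include?.
def snapshot_changes(previous_names, current_names)
  previous_sorted = previous_names.sort
  current_sorted = current_names.sort
  deleted = previous_sorted.reject { |name| sorted_include?(current_sorted, name) }
  added = current_sorted.reject { |name| sorted_include?(previous_sorted, name) }
  { deleted: deleted, added: added }
end

changes = snapshot_changes(%w[c.txt a.txt b.txt], %w[d.txt b.txt c.txt])
```

With n files on each side, the comparison costs roughly n log n lookups rather than n squared, which is consistent with the drop from over a minute to a few seconds reported in the description.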

hectorcorrea · 2024-11-20T20:48:35Z

app/models/work.rb

+
+    def log_performance(start, message)
+      elapsed = Time.zone.now - start
+      if elapsed > 20


We can tweak the threshold, but I thought it would be good to start logging slow requests as warnings.
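As a sketch of how a tweakable threshold could look, assuming plain Ruby rather than Rails: the version below takes the logger and the threshold as parameters and uses a monotonic clock instead of `Time.zone.now`. The `threshold:` keyword, the `SLOW_THRESHOLD_SECONDS` constant, the message format, and the `info` branch are assumptions for this example, not what the PR implements.

```ruby
require "logger"
require "stringio"

# Assumed default; the hunk above compares against 20 seconds.
SLOW_THRESHOLD_SECONDS = 20

# Logs a warning when the elapsed time exceeds the threshold, otherwise logs
# at info level. `start` is a monotonic clock reading taken before the work.
def log_performance(logger, start, message, threshold: SLOW_THRESHOLD_SECONDS)
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  if elapsed > threshold
    logger.warn("PERFORMANCE: #{message} took #{elapsed.round(2)} seconds")
  else
    logger.info("PERFORMANCE: #{message} took #{elapsed.round(2)} seconds")
  end
  elapsed
end

# Usage: simulate a request that started 25 seconds ago.
output = StringIO.new
logger = Logger.new(output)
simulated_start = Process.clock_gettime(Process::CLOCK_MONOTONIC) - 25
log_performance(logger, simulated_start, "Fetching the file list")
```

Passing the threshold as a keyword argument would let individual call sites use a stricter or looser limit without changing the method.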

jrgriffiniii · 2024-11-20T21:14:08Z

app/controllers/works_controller.rb

+      {
+        data: files,
+        total_size:,
+        total_size_display: ActiveSupport::NumberHelper.number_to_human_size(total_size),


jrgriffiniii · 2024-11-20T21:14:24Z

app/models/upload_snapshot.rb

+      # much faster when the list of files is large. Notice that the binary search
+      # requires that the list of files is sorted.
+      # See https://ruby-doc.org/3.3.6/bsearch_rdoc.html
+      if s3_filenames_sorted.bsearch { |s3_filename| filename <=> s3_filename }.nil?


jrgriffiniii · 2024-11-20T21:14:57Z

app/models/work.rb

+  # Calculates the total file size from a given list of files
+  # This is so that we don't fetch the list twice from AWS since it can be expensive when
+  # there are thousands of files on the work.
+  def total_file_size_from_list(files)


WIP

0856d92

hectorcorrea changed the title ~~WIP~~ Reduce the load time on the show page Nov 15, 2024

hectorcorrea added 3 commits November 20, 2024 10:46

Got the show page to load faster by not calculating the total size in…

e0a5464

… line but rather pushing the calculation to happen in the AJAX call for the file list.

Improve the performance of the jobs to detect snapshot modifications …

3964f14

…and deletes by using a binary search rather than a plain include

Updated the upload snapshot tests

89fcc49

hectorcorrea changed the title ~~Reduce the load time on the show page~~ Significantly reduce load time on the Show page for large datasets Nov 20, 2024

hectorcorrea commented Nov 20, 2024

hectorcorrea and others added 2 commits November 20, 2024 15:58

Adjust test to match new return value

b62c634

Merge branch 'main' into 1623-slow-show-page

4199222

hectorcorrea marked this pull request as ready for review November 20, 2024 20:58

Rubocop nitpicking

a91e481

pulbot temporarily deployed to staging November 20, 2024 21:06 Inactive

jrgriffiniii reviewed Nov 20, 2024

jrgriffiniii approved these changes Nov 20, 2024

hectorcorrea mentioned this pull request Nov 20, 2024

Works with more than 10,000 files #1623

Closed

hectorcorrea merged commit 18bb510 into main Nov 20, 2024
5 checks passed

hectorcorrea deleted the 1623-slow-show-page branch November 20, 2024 21:57

hectorcorrea mentioned this pull request Nov 21, 2024

Cache total file size/count #1992

Open
