
Update to new implementation of NPLD limitations #54

Open
anjackson opened this issue Jan 15, 2020 · 7 comments

@anjackson
Contributor

anjackson commented Jan 15, 2020

We are moving to a new, simpler implementation of the NPLD limitations. Rather than going through a remote desktop, clients will access Wayback directly. This means we need to do a few things:

  1. Modify the single-concurrent-use locking
  2. Limit how much text can be copied at once
  3. Use cache control headers to limit how much content gets stored on the client
  4. Prevent download of non-web content

Single-Concurrent-Use

There will be no login/logout hooks, so a simple alternative locking mechanism is proposed.

The default behaviour is that all 'top-level' URLs will be locked to a user's cookie session, with the lock set to time out at midnight that day. As before, transcluded items should not be locked. These locks are managed server-side.

To enable the lock to be released earlier, the Wayback JavaScript client can poll and repeatedly update the lock, setting its time-out to a few minutes in the future. While a page is being viewed it will remain locked to the current user, but once they move on it should time out within a few minutes, as the lock is no longer being updated.

This means files that get downloaded will be locked for the whole day, but most pages should be released promptly.
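
As a minimal sketch of this behaviour (assuming an in-memory store and hypothetical names such as LockStore, acquire and ping; this is not the actual pywb implementation), something like the following in Python:

    import time
    from datetime import datetime, timedelta

    # Illustrative sketch of the proposed lock behaviour; names and structure
    # are hypothetical, not the actual ukwa-pywb implementation.
    PING_EXTEND_SECS = 300  # lock lifetime after a client ping (a few minutes)


    def midnight_tonight():
        """Return a timestamp for midnight at the end of the current day."""
        tomorrow = datetime.now().date() + timedelta(days=1)
        return datetime.combine(tomorrow, datetime.min.time()).timestamp()


    class LockStore:
        """In-memory map of top-level URL -> (session id, expiry timestamp)."""

        def __init__(self):
            self.locks = {}

        def acquire(self, url, session_id):
            """Lock a top-level URL to a session; the default expiry is midnight."""
            owner, expires = self.locks.get(url, (None, 0))
            if owner not in (None, session_id) and expires > time.time():
                return False  # still locked by another session
            self.locks[url] = (session_id, midnight_tonight())
            return True

        def ping(self, url, session_id):
            """Client poll: shorten the lock so it lapses soon after viewing stops."""
            owner, expires = self.locks.get(url, (None, 0))
            if owner == session_id and expires > time.time():
                self.locks[url] = (session_id, time.time() + PING_EXTEND_SECS)
                return True
            return False

With this shape, a lock only lasts until midnight if the client never pings it (e.g. after a plain file download); pages that are viewed and then left should lapse a few minutes after the last ping.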

Limit cut-and-paste

The client-side JavaScript should intervene during cut/copy events and limit the text to a configurable amount.

Limit local caching

The server should add headers to limit local caching, as per https://stackoverflow.com/questions/9884513/avoid-caching-of-the-http-responses -- this may be better done via NGINX?
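
For reference, a sketch of the headers from that answer; whether they are added by the Python app or by NGINX (via add_header) is the open question above, and the helper name here is just illustrative:

    # Standard headers to disable client-side caching, as per the linked answer.
    NO_CACHE_HEADERS = {
        'Cache-Control': 'no-cache, no-store, must-revalidate',  # HTTP/1.1
        'Pragma': 'no-cache',                                    # HTTP/1.0
        'Expires': '0',                                          # proxies
    }


    def add_no_cache_headers(headers):
        """Merge the no-cache headers into a list of (name, value) pairs."""
        present = {name.lower() for name, _ in headers}
        extra = [(k, v) for k, v in NO_CACHE_HEADERS.items()
                 if k.lower() not in present]
        return headers + extra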

Prevent downloads of non-web content

We need to try to prevent content being downloaded to local machines, and use a secondary service for rendering some formats to HTML.

The first step is to intercept direct downloads of content other than HTML. These will then either be blocked (probably with a custom 451 error) or passed to an external service for rendering.

We will need some lookup table that maps Content Types to URL templates, e.g.

application/msword, http://service.things.com/url={url}

Or similar. When we hit a non-web type, we should open up the block page, and if there's a mapping, offer to redirect the user to that URL for access. For all types, we should ensure the Content-Disposition header is blocked so downloads can't be forced that way.

i.e. this is similar to the old Interject idea (source code & tech docs here).
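
As a rough sketch of that proposal (the helper name, the set of 'web' types and the exact-match-only table are all assumptions here; the eventual implementation is more general):

    # Illustrative mapping of non-web Content-Types to external viewer URL
    # templates, as proposed above; unmapped non-web types just get the block page.
    CONTENT_TYPE_VIEWERS = {
        'application/msword': 'http://service.things.com/url={url}',
    }

    WEB_TYPES = ('text/html', 'text/css', 'application/javascript',
                 'image/', 'audio/', 'video/')


    def handle_record(content_type, url, headers):
        """Decide how to serve a record: pass web content through, otherwise
        block it (e.g. custom 451 page), offering a viewer redirect if mapped."""
        # Always drop Content-Disposition so downloads can't be forced that way.
        headers = {k: v for k, v in headers.items()
                   if k.lower() != 'content-disposition'}

        if any(content_type.startswith(t) for t in WEB_TYPES):
            return ('serve', None, headers)

        viewer = CONTENT_TYPE_VIEWERS.get(content_type)
        if viewer:
            return ('block-with-viewer-link', viewer.format(url=url), headers)
        return ('block', None, headers)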

@anjackson
Contributor Author

Updated following clarification of download limits.

@ikreymer
Contributor

To clarify, the default behavior is that a resource remains locked, unless it is explicitly unlocked by the same client, right?
Otherwise, the default lock will only be for a few minutes, until it is no longer being polled.

E.g. client with cookie A locks https://example.com/, then visits https://example.com/foobar, moving its lock to that URL and unlocking https://example.com/.
If the client then closes the browser, https://example.com/foobar remains locked until midnight?

OR

https://example.com/ is locked by A, then the client moves to https://example.com/foobar and locks that. The lock for https://example.com/ expires after a few minutes. When the user closes the session, the lock for https://example.com/foobar expires as well. If a user attempts to download a file (which would trigger an interstitial behavior as outlined), the lock is acquired and then expires. But under this approach, the lock would never last until midnight, since it would expire once it is no longer being polled by the client?

Or perhaps I am missing something?

@ikreymer
Contributor

Regarding 4), one possible tricky edge case may be if a certain type of resource could be either downloaded or embedded in the page. I guess the only example is PDF, unless there is a custom viewer for ms-word somewhere... We know for certain that if Content-Disposition is present, it is a download; otherwise I think it is not possible to tell for sure...

And for PDFs, I think if you use the default PDF plugin the browser provides, it is still possible to download the PDF from there... I don't think there's a way to prevent that from the default PDF viewer.

@anjackson
Contributor Author

On locks, it's the second case. The idea was:

  • When visiting a URL, the lock initially has a timeout of midnight (set server-side)
  • If the polling from the client works, the lock timeout is shortened, but the client polling stops the lock from timing out, rather than explicitly releasing it.

So the lock-till-midnight should rarely happen; it is simply a fall-back in case the client-side locking protocol fails for some reason.

On (4), I was imagining we'd sniff the Content-Type from the WARC record on the server side and block/redirect if it's not HTML or an embed. If that makes sense?

ikreymer added a commit that referenced this issue Apr 16, 2020
- When pinged at /_locks/ping, the lock for the referring url is set to expire after LOCK_PING_EXTEND_TIME seconds
- Ping is set to happen every LOCK_PING_INTERVAL seconds
- Copy and/or selection of text is limited to SELECT_LIMIT_WORD words (if any) for single-use-lock collections
- config 'add_headers' option can be used to specify extra cache-control and other headers
- config 'content_type_redirects' option can be used to specify content-types for which to redirect to a custom viewer or block page,
and <any-download> also adds redirects for any response with 'content-disposition: attachment' set
- update docs/locks.md with new features
- update config.yaml with add_headers and content_type_redirects
- tests: update tests to test new features, add WARCs to test custom content-types and content-disposition headers
- selection limits via static/selection-limits.js
@ikreymer
Contributor

Yep, kept the initial lock mechanism, but ping shortens the lock for the referring url.
First pass in the above commit.

Updated docs for the new features:
https://github.com/ukwa/ukwa-pywb/blob/2.4.0-beta/docs/locks.md#ping-session-refresh

@anjackson
Contributor Author

Generally, this looks good. Unfortunately, I think the content-type block will need some way to act more like an allow list than a block list, e.g. unknown or unspecified formats should not be downloadable, so we'll need some way of saying 'web formats allowed' (html/jpg/css/js/png/etc.).

Which is unpleasant but necessary.

ikreymer added a commit that referenced this issue May 1, 2020
content type redirects: support default block list, with specific allow rules, per #54
@ikreymer
Contributor

ikreymer commented May 1, 2020

How about something like this?

        content_type_redirects:
          # allows
          'text/': 'allow'
          'image/': 'allow'
          'video/': 'allow'
          'audio/': 'allow'
          'application/javascript': 'allow'

          'text/rtf': 'https://example.com/viewer?{query}'
          'application/pdf': 'https://example.com/viewer?{query}'
          'application/': 'https://example.com/blocked?{query}'

          # default redirects
          '<any-download>': 'https://example.com/blocked?{query}'
          '*': 'https://example.com/blocked?{query}'

The content-disposition is checked first so it always takes precedence, then the exact match, followed by the mime prefix (e.g. text/) match, followed by the '*' wildcard. With '*' set to redirect, all unlisted mimes will be redirected.
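
A sketch of that resolution order (not the actual pywb code, and assuming content_type_redirects has been loaded from config.yaml into a plain dict):

    def resolve_rule(rules, content_type, is_download):
        """Return 'allow' or a redirect URL template for a response,
        following the precedence described above (sketch only)."""
        # 1. 'content-disposition: attachment' always wins
        if is_download and '<any-download>' in rules:
            return rules['<any-download>']

        # 2. exact mime match, e.g. 'application/pdf'
        if content_type in rules:
            return rules[content_type]

        # 3. mime prefix match, e.g. 'text/' or 'application/'
        prefix = content_type.split('/', 1)[0] + '/'
        if prefix in rules:
            return rules[prefix]

        # 4. '*' wildcard catches everything else
        return rules.get('*', 'allow')

With the example config above, resolve_rule(rules, 'text/rtf', False) would return the viewer URL, resolve_rule(rules, 'image/png', False) returns 'allow', and any response with content-disposition: attachment goes to the block page.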

If this makes sense, I can expand it with more mime types. It may be more convenient to move this to a separate file from config.yaml.
