
Update to new implementation of NPLD limitations #54

Open
anjackson opened this issue Jan 15, 2020 · 7 comments

@anjackson
Contributor

anjackson commented Jan 15, 2020

We are moving to a new, simpler implementation of the NPLD limitations. Rather than going through a remote desktop, clients will access Wayback directly. This means we need to do a few things:

  1. Modify the single-concurrent-use locking
  2. Limit how much text can be copied at once
  3. Use cache control headers to limit how much content gets stored on the client
  4. Prevent download of non-web content

Single-Concurrent-Use

There will be no login/logout hooks, so a simple alternative locking mechanism is proposed.

The default behaviour is that all 'top-level' URLs will be locked to a user's cookie session, with the lock set to time out at midnight that day. As before, transcluded items should not be locked. These locks are managed server-side.

To enable the lock to be released earlier, the Wayback JavaScript client can poll and repeatedly update the lock, setting its time-out to a few minutes in the future. While a page is being viewed it will remain locked to the current user, but once they move on it should time out within a few minutes, as the lock is no longer being updated.

This means files that get downloaded will be locked for the whole day, but most pages should be released promptly.
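
As a minimal sketch of this behaviour (assuming an in-memory store and hypothetical names such as LockStore, acquire and ping; this is not the actual pywb implementation), something like the following in Python:

    import time
    from datetime import datetime, timedelta

    # Illustrative sketch of the proposed lock behaviour; names and structure
    # are hypothetical, not the actual ukwa-pywb implementation.
    PING_EXTEND_SECS = 300  # lock lifetime after a client ping (a few minutes)


    def midnight_tonight():
        """Return a timestamp for midnight at the end of the current day."""
        tomorrow = datetime.now().date() + timedelta(days=1)
        return datetime.combine(tomorrow, datetime.min.time()).timestamp()


    class LockStore:
        """In-memory map of top-level URL -> (session id, expiry timestamp)."""

        def __init__(self):
            self.locks = {}

        def acquire(self, url, session_id):
            """Lock a top-level URL to a session; the default expiry is midnight."""
            owner, expires = self.locks.get(url, (None, 0))
            if owner not in (None, session_id) and expires > time.time():
                return False  # still locked by another session
            self.locks[url] = (session_id, midnight_tonight())
            return True

        def ping(self, url, session_id):
            """Client poll: shorten the lock so it lapses soon after viewing stops."""
            owner, expires = self.locks.get(url, (None, 0))
            if owner == session_id and expires > time.time():
                self.locks[url] = (session_id, time.time() + PING_EXTEND_SECS)
                return True
            return False

With this shape, a lock only lasts until midnight if the client never pings it (e.g. after a plain file download); pages that are viewed and then left should lapse a few minutes after the last ping.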

Limit cut-and-paste

The client-side JavaScript should intervene during cut/copy events and limit the text to a configurable amount.

Limit local caching

The server should add headers to limit local caching, as per https://stackoverflow.com/questions/9884513/avoid-caching-of-the-http-responses -- this may be better done via NGINX?
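
For reference, a sketch of the headers from that answer; whether they are added by the Python app or by NGINX (via add_header) is the open question above, and the helper name here is just illustrative:

    # Standard headers to disable client-side caching, as per the linked answer.
    NO_CACHE_HEADERS = {
        'Cache-Control': 'no-cache, no-store, must-revalidate',  # HTTP/1.1
        'Pragma': 'no-cache',                                    # HTTP/1.0
        'Expires': '0',                                          # proxies
    }


    def add_no_cache_headers(headers):
        """Merge the no-cache headers into a list of (name, value) pairs."""
        present = {name.lower() for name, _ in headers}
        extra = [(k, v) for k, v in NO_CACHE_HEADERS.items()
                 if k.lower() not in present]
        return headers + extra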

Prevent downloads of non-web content

We need to try to prevent content being downloaded to local machines, and use a secondary service for rendering some formats to HTML.

The first step is to intercept direct downloads of content other than HTML. These will then either be blocked (probably with a custom 451 error) or passed to an external service for rendering.

We will need some lookup table that maps Content Types to URL templates, e.g.

application/msword, http://service.things.com/url={url}

Or similar. When we hit a non-web type, we should open up the block page, and if there's a mapping, offer to redirect the user to that URL for access. For all types, we should ensure the Content-Disposition header is blocked so downloads can't be forced that way.

i.e. this is similar to the old Interject idea (source code & tech docs here).
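
As a rough sketch of that proposal (the helper name, the set of 'web' types and the exact-match-only table are all assumptions here; the eventual implementation is more general):

    # Illustrative mapping of non-web Content-Types to external viewer URL
    # templates, as proposed above; unmapped non-web types just get the block page.
    CONTENT_TYPE_VIEWERS = {
        'application/msword': 'http://service.things.com/url={url}',
    }

    WEB_TYPES = ('text/html', 'text/css', 'application/javascript',
                 'image/', 'audio/', 'video/')


    def handle_record(content_type, url, headers):
        """Decide how to serve a record: pass web content through, otherwise
        block it (e.g. custom 451 page), offering a viewer redirect if mapped."""
        # Always drop Content-Disposition so downloads can't be forced that way.
        headers = {k: v for k, v in headers.items()
                   if k.lower() != 'content-disposition'}

        if any(content_type.startswith(t) for t in WEB_TYPES):
            return ('serve', None, headers)

        viewer = CONTENT_TYPE_VIEWERS.get(content_type)
        if viewer:
            return ('block-with-viewer-link', viewer.format(url=url), headers)
        return ('block', None, headers)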

@anjackson
Contributor Author

Updated following clarification of download limits.

@ikreymer
Contributor

To clarify, the default behavior is that a resource remains locked, unless it is explicitly unlocked by the same client, right?
Otherwise, the default lock will only be for a few minutes, until it is no longer being polled.

E.g. client with cookie A locks https://example.com/, then visits https://example.com/foobar, moving its lock to that URL and unlocking https://example.com/.
If the client then closes the browser, https://example.com/foobar remains locked until midnight?

OR

https://example.com/ is locked by A, then the client moves to https://example.com/foobar and locks that. The lock for https://example.com/ expires after a few minutes. When the user closes the session, the lock for https://example.com/foobar expires as well. If a user attempts to download a file (which would trigger an interstitial behavior as outlined), the lock is acquired and then expires. But under this approach, the lock would never last until midnight, since it would expire once it is no longer being polled by the client?

Or perhaps I am missing something?

@ikreymer
Contributor

Regarding 4), one possible tricky edge case may be if a certain type of resource could be either downloaded or embedded in the page. I guess the only example is PDF, unless there is a custom viewer for ms-word somewhere... We know for certain that if Content-Disposition is present, it is a download; otherwise I think it is not possible to tell for sure...

And for PDFs, I think if you use the default PDF plugin the browser provides, it is still possible to download the PDF from there... I don't think there's a way to prevent that from the default PDF viewer.

@anjackson
Contributor Author

On locks, it's the second case. The idea was:

  • When visiting a URL, the lock initially has a timeout of midnight (set server-side)
  • If the polling from the client works, the lock timeout is shortened, but the client polling stops the lock from timing out, rather than explicitly releasing it.

So the lock-till-midnight should rarely happen; it is simply a fall-back in case the client-side locking protocol fails for some reason.

On (4), I was imagining we'd sniff the Content-Type from the WARC record on the server side and block/redirect if it's not HTML or an embed. If that makes sense?

ikreymer added a commit that referenced this issue Apr 16, 2020
- When pinged at /_locks/ping, the lock for the referring url is set to expire after LOCK_PING_EXTEND_TIME seconds
- Ping is set to happen every LOCK_PING_INTERVAL seconds
- Copy and/or selection of text is limited to SELECT_LIMIT_WORD words (if any) for single-use-lock collections
- config 'add_headers' option can be used to specify extra cache-control and other headers
- config 'content_type_redirects' option can be used to specify content-types for which to redirect to a custom viewer or block page,
and <any-download> also adds redirects for any response with 'content-disposition: attachment' set
- update docs/locks.md with new features
- update config.yaml with add_headers and content_type_redirects
- tests: update tests to test new features, add WARCs to test custom content-types and content-disposition headers
- selection limits via static/selection-limits.js
@ikreymer
Contributor

Yep, kept the initial lock mechanism, but ping shortens the lock for the referring url.
First pass in the above commit.

Updated docs for the new features:
https://github.com/ukwa/ukwa-pywb/blob/2.4.0-beta/docs/locks.md#ping-session-refresh

@anjackson
Contributor Author

Generally, this looks good. Unfortunately, I think the content-type block will need some way to act more like an allow list than a block list, e.g. unknown or unspecified formats should not be downloadable, so we'll need some way of saying 'web formats allowed' (html/jpg/css/js/png/etc.).

Which is unpleasant but necessary.

ikreymer added a commit that referenced this issue May 1, 2020
content type redirects: support default block list, with specific allow rules, per #54
@ikreymer
Contributor

ikreymer commented May 1, 2020

How about something like this?

        content_type_redirects:
          # allows
          'text/': 'allow'
          'image/': 'allow'
          'video/': 'allow'
          'audio/': 'allow'
          'application/javascript': 'allow'

          'text/rtf': 'https://example.com/viewer?{query}'
          'application/pdf': 'https://example.com/viewer?{query}'
          'application/': 'https://example.com/blocked?{query}'

          # default redirects
          '<any-download>': 'https://example.com/blocked?{query}'
          '*': 'https://example.com/blocked?{query}'

The content-disposition is checked first so it always takes precedence, then the exact match, followed by the mime prefix (e.g. text/) match, followed by the '*' wildcard. With '*' set to redirect, all unlisted mimes will be redirected.
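
A sketch of that resolution order (not the actual pywb code, and assuming content_type_redirects has been loaded from config.yaml into a plain dict):

    def resolve_rule(rules, content_type, is_download):
        """Return 'allow' or a redirect URL template for a response,
        following the precedence described above (sketch only)."""
        # 1. 'content-disposition: attachment' always wins
        if is_download and '<any-download>' in rules:
            return rules['<any-download>']

        # 2. exact mime match, e.g. 'application/pdf'
        if content_type in rules:
            return rules[content_type]

        # 3. mime prefix match, e.g. 'text/' or 'application/'
        prefix = content_type.split('/', 1)[0] + '/'
        if prefix in rules:
            return rules[prefix]

        # 4. '*' wildcard catches everything else
        return rules.get('*', 'allow')

With the example config above, resolve_rule(rules, 'text/rtf', False) would return the viewer URL, resolve_rule(rules, 'image/png', False) returns 'allow', and any response with content-disposition: attachment goes to the block page.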

If this makes sense, I can expand it with more mime types. It may be more convenient to move this to a separate file from config.yaml.
