Fetching document text from alternate text/plain rel link #364

Open
radomirml opened this issue Jul 4, 2017 · 3 comments
Comments

@radomirml

I have a case where a document (at the canonical URL) contains links to versions of the same document in alternate formats. For example:

<link rel="canonical" href="http://www.example.com/content/57/2/119" />
<link rel="alternate" type="application/pdf" title="Full Text (PDF)" href="/content/57/2/119.full.pdf" />
<link rel="alternate" type="text/plain" title="Full Text (Plain)" href="/content/57/2/119.full.txt" />

It would be useful to be able to fetch the "text/plain" version of the document and use the obtained text as the document content.

@essiembre
Contributor

Interesting. What about other types (e.g. application/rss+xml)? And what if more than one such link is provided (with the same type or different ones)? Which one should prevail?

Because use cases may vary, what about simply "following" all such links and letting users filter out those they do not want? Since the content is the same, I understand how crawling only one makes sense.

Maybe: if there are one or more links, we consider them in configurable order of preference (text/plain being first in your example). Would that work?
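To make the "configurable order of preference" idea concrete, here is a minimal sketch in plain Python using only the standard library. It is not the crawler's API: `AlternateLinkCollector`, `pick_alternate`, and the `preferred_types` parameter are hypothetical names made up for this illustration. The first MIME type in the list that has a matching alternate link wins.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinkCollector(HTMLParser):
    """Collects (type, href) pairs from <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link ... /> tags also end up here by default.
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("href"):
            self.alternates.append(((attr.get("type") or "").lower(), attr["href"]))


def pick_alternate(html, base_url, preferred_types):
    """Return the absolute URL of the first alternate matching the
    preference order (hypothetical setting), or None if nothing matches."""
    collector = AlternateLinkCollector()
    collector.feed(html)
    for wanted in preferred_types:
        for mime, href in collector.alternates:
            if mime == wanted:
                return urljoin(base_url, href)
    return None


page = """
<link rel="canonical" href="http://www.example.com/content/57/2/119" />
<link rel="alternate" type="application/pdf" title="Full Text (PDF)" href="/content/57/2/119.full.pdf" />
<link rel="alternate" type="text/plain" title="Full Text (Plain)" href="/content/57/2/119.full.txt" />
"""
base = "http://www.example.com/content/57/2/119"
print(pick_alternate(page, base, ["text/plain", "application/pdf"]))
```

With the links from the original example and `text/plain` listed first, this picks the `.full.txt` URL. If several links share the same type, the sketch takes the first one in document order, which is only one possible answer to the "which one should prevail" question.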

In the meantime, you can write a custom link extractor (or use the Regex one) to have those extracted/followed.
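As a rough illustration of what a regex-based extraction of these links could look like, here is a standalone Python sketch. It does not use or reproduce the crawler's own link extractor classes or configuration; `extract_alternate_hrefs` and `wanted_type` are hypothetical names, and the patterns assume double-quoted attribute values with no `>` inside them.

```python
import re

# Match whole <link ...> tags first, then read attributes one by one so
# that the order of attributes inside the tag does not matter.
LINK_TAG = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
ATTRIBUTE = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def extract_alternate_hrefs(html, wanted_type=None):
    """Return href values of rel="alternate" links, optionally limited
    to a single MIME type (hypothetical helper for illustration)."""
    hrefs = []
    for tag in LINK_TAG.findall(html):
        attrs = {name.lower(): value for name, value in ATTRIBUTE.findall(tag)}
        if "alternate" not in attrs.get("rel", "").lower().split():
            continue
        if wanted_type and attrs.get("type", "").lower() != wanted_type:
            continue
        if "href" in attrs:
            hrefs.append(attrs["href"])
    return hrefs


sample = """
<link rel="canonical" href="http://www.example.com/content/57/2/119" />
<link rel="alternate" type="application/pdf" title="Full Text (PDF)" href="/content/57/2/119.full.pdf" />
<link rel="alternate" type="text/plain" title="Full Text (Plain)" href="/content/57/2/119.full.txt" />
"""
print(extract_alternate_hrefs(sample, wanted_type="text/plain"))
```

The returned hrefs are left relative, as they appear in the page; resolving them against the page URL and queueing them for crawling would be up to whatever extractor hosts this logic.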

@radomirml
Author

Yes, I agree. Things get complicated with multiple alternate links as, for different use cases, one may want to do different things.

I actually worked around it by using the DOM tagger to extract the article text from the page, so this is not something I need at the moment. I just wanted to make a note of "alternate" elements as something to consider.

@danizen
danizen commented Mar 30, 2018

The use cases here seem very complicated. I think the existing behavior of treating it as a link to be crawled makes sense. With taggers and whatnot, you can mark documents that have an alternate and then handle them in a post-processing step that works with your committer.

3 participants