Fetching document text from alternate text/plain rel link #364

radomirml · 2017-07-04T20:43:06Z

I have a case where the document (at the canonical URL) contains links to versions of the document in alternate formats. For example:

<link rel="canonical" href="http://www.example.com/content/57/2/119" />
<link rel="alternate" type="application/pdf" title="Full Text (PDF)" href="/content/57/2/119.full.pdf" />
<link rel="alternate" type="text/plain" title="Full Text (Plain)" href="/content/57/2/119.full.txt" />

It would be useful to be able to fetch the "text/plain" version of the document and use the obtained text as the document content.


essiembre · 2017-07-04T23:54:02Z

Interesting. What about other types (e.g. application/rss+xml)? And what if more than one such link is provided (with the same type or different ones)? Which one should prevail?

Because use cases may vary, what about simply "following" all such links and letting users filter out those they do not want? Since the content is the same, I understand how crawling only one makes sense.

Maybe: if there are one or more links, we consider them in configurable order of preference (text/plain being first in your example). Would that work?
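As a rough illustration of the "configurable order of preference" idea, here is a minimal Python sketch. It is not Norconex code: `AlternateLinkParser`, `pick_alternate`, and the `preferred_types` parameter are hypothetical names made up for this example, and it only assumes that the page exposes `<link rel="alternate">` tags like the ones above.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AlternateLinkParser(HTMLParser):
    """Collects (type, href) pairs from <link rel="alternate"> tags."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        if "alternate" in rels and attr.get("href"):
            mime = (attr.get("type") or "").lower()
            self.alternates.append((mime, attr["href"]))


def pick_alternate(html, base_url, preferred_types):
    """Return the absolute URL of the first alternate link whose type
    appears in preferred_types (checked in that order), or None."""
    parser = AlternateLinkParser()
    parser.feed(html)
    for wanted in preferred_types:
        for mime, href in parser.alternates:
            if mime == wanted:
                return urljoin(base_url, href)
    return None
```

With the three `<link>` tags from the original post and `preferred_types=["text/plain", "application/pdf"]`, this would resolve the text/plain link first and fall back to the PDF only if no text/plain alternate exists. If several links share the same type, the first one in document order wins, which is just one possible tie-breaking choice.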

In the meantime, you can write a custom link extractor (or use the Regex one) to have those extracted/followed.
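To give an idea of what a regex-based extraction of those links might match, here is a small standalone Python sketch. It does not show the configuration of the Regex link extractor itself; `alternate_hrefs` and both patterns are hypothetical, and the attribute pattern only handles double-quoted values.

```python
import re

# Matches a whole <link ...> tag, whatever order its attributes are in.
LINK_TAG = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
# Matches name="value" pairs inside a tag (double-quoted values only).
ATTR = re.compile(r'([\w-]+)\s*=\s*"([^"]*)"')


def alternate_hrefs(html, mime_type="text/plain"):
    """Return the href values of <link rel="alternate"> tags whose
    type attribute equals mime_type, in document order."""
    hrefs = []
    for tag in LINK_TAG.findall(html):
        attrs = {name.lower(): value for name, value in ATTR.findall(tag)}
        if (
            attrs.get("rel", "").lower() == "alternate"
            and attrs.get("type", "").lower() == mime_type
            and "href" in attrs
        ):
            hrefs.append(attrs["href"])
    return hrefs
```

The returned values are left as written in the page (possibly relative), so they would still need to be resolved against the document URL before being followed.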

radomirml · 2017-07-05T11:41:20Z

Yes, I agree. Things get complicated with multiple alternate links as, for different use cases, one may want to do different things.

I actually worked around it by using the DOM tagger to extract the article text from the page, so this is not something I need at the moment. I just wanted to make a note of "alternate" elements as something to consider.

danizen · 2018-03-30T19:32:44Z

The use cases here seem very complicated. I think that the existing behavior of treating it as a link to be crawled makes sense. With taggers and whatnot, you can mark documents that have an alternate and then handle them with a post-process that works with your committer.

essiembre added the feature-request label Jul 4, 2017
