Interesting. What about other types (e.g. application/rss+xml)? And what if more than one such link is provided (with the same type or different ones)? Which one should prevail?
Because use cases may vary, what about simply "following" all such links and letting users filter out those they do not want? Since the content is the same, I understand how crawling only one makes sense.
Maybe: if there are one or more links, we consider them in configurable order of preference (text/plain being first in your example). Would that work?
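To make the order-of-preference idea concrete, here is a minimal sketch in plain Python. It is only an illustration of the selection logic, not the crawler's actual API: the names `AlternateLinkParser` and `pick_alternate` are hypothetical, and it assumes the alternates are declared as `<link rel="alternate" type="..." href="...">` elements in the page.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collects (type, href) pairs from <link rel="alternate"> elements.

    Hypothetical helper for illustration; not part of any crawler API.
    """

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values may be None (e.g. a bare attribute), so normalize.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        if "alternate" in rels and attr.get("href"):
            link_type = attr.get("type", "").strip().lower()
            self.alternates.append((link_type, attr["href"]))


def pick_alternate(html, preferred_types):
    """Return the href of the first alternate matching the preference order.

    preferred_types is the configurable list, most preferred first.
    Returns None when no alternate has one of the listed types.
    """
    parser = AlternateLinkParser()
    parser.feed(html)
    for wanted in preferred_types:
        for link_type, href in parser.alternates:
            if link_type == wanted.lower():
                return href
    return None
```

With a preference list such as `["text/plain", "application/rss+xml"]`, a page that declares both types would yield the `text/plain` link. When several links share the same type, this sketch simply takes the first one in document order, which is itself an assumption that a real implementation might want to make configurable.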
In the meantime, you can write a custom link extractor (or use the Regex one) to have those extracted/followed.
Yes, I agree. Things get complicated with multiple alternate links as, for different use cases, one may want to do different things.
I actually worked around it by using the DOM tagger to extract the article text from the page, so this is not something I need at the moment. I just wanted to make a note of "alternate" elements as something to consider.
The use cases here seem very complicated. I think the existing behavior of treating it as a link to be crawled makes sense. With taggers and whatnot, you can mark documents that have an alternate and then handle them in a post-process that works with your committer.
I have a case where the document (at the canonical URL) contains links to versions of the document in alternate formats. For example:
It would be useful to enable fetching the "text/plain" version of the document and to use the obtained text as the document content.