Add robots.txt setting. #5584

Open · wants to merge 5 commits into main

Changes from 3 commits
5 changes: 0 additions & 5 deletions docs/source/configuration/environmentvariables.md
@@ -68,11 +68,6 @@ $ VOLTO_ROBOTSTXT="User-agent: *
Disallow: /" yarn start
```

```{note}
If you want to use the `VOLTO_ROBOTSTXT` environment variable, make sure to
delete the file `public/robots.txt` from your project.
```

### DEBUG

It will enable several logging points scattered throughout the Volto code. It uses the `volto:` namespace.
1 change: 1 addition & 0 deletions packages/volto/news/5580.feature
@@ -0,0 +1 @@
Expose robots.txt setting in Volto control panel, and render robots.txt based on REST API call. @robgietema

Member:
Where is it exposed in the Volto control panel? I don't see it there. I expect it needs to be removed from


Member:
Done, but related to the other comment:

[screenshot]

The backend answers with the .gz, which we then override afterwards in a shady way in the code (by the way, we are already doing that right now).


Member:
@sneridagh @ericof Let's change the sitemap to use sitemap-index.xml in the backend setting in plone/plone.volto#183

2 changes: 0 additions & 2 deletions packages/volto/public/robots.txt

This file was deleted.

1 change: 0 additions & 1 deletion packages/volto/src/config/ControlPanels.js
@@ -79,7 +79,6 @@ export const filterControlPanelsSchema = (controlpanel) => {
'site_favicon_mimetype',
'exposeDCMetaTags',
'enable_sitemap',
'robots_txt',
'webstats_js',
],
editing: ['available_editors', 'default_editor', 'ext_editor'],
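For context, `filterControlPanelsSchema` keeps a per-panel list of field IDs that Volto strips from the schema returned by the backend, so dropping `robots_txt` from this list is what lets the setting appear in the Site control panel. The sketch below only illustrates that intent; the object shape and the filtering step are assumptions based on this diff, not the actual implementation.

```js
// Hypothetical illustration of the exclusion list above. The grouping and the
// filtering helper are assumptions, not the real filterControlPanelsSchema code.
const excludedFields = {
  site: [
    'site_favicon_mimetype',
    'exposeDCMetaTags',
    'enable_sitemap',
    // 'robots_txt' is no longer listed here, so the backend's
    // plone.robots_txt field is left in the Site control panel schema.
    'webstats_js',
  ],
  editing: ['available_editors', 'default_editor', 'ext_editor'],
};

// Assumed filtering step: keep only the fields that are not excluded.
const visibleFields = (panelId, schemaFields) =>
  schemaFields.filter(
    (field) => !(excludedFields[panelId] ?? []).includes(field),
  );
```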
9 changes: 4 additions & 5 deletions packages/volto/src/express-middleware/robotstxt.js
@@ -4,13 +4,12 @@ import { generateRobots } from '@plone/volto/helpers';
/*
robots.txt - priority order:

1) robots.txt in /public folder
2) VOLTO_ROBOTSTXT var in .env
3) default: plone robots.txt
1) VOLTO_ROBOTSTXT var in .env
2) robots.txt setting in the site control panel

*/

const ploneRobots = function (req, res, next) {
const siteRobots = function (req, res, next) {
generateRobots(req).then((robots) => {
res.set('Content-Type', 'text/plain');
res.send(robots);
@@ -27,7 +26,7 @@ export default function robotstxtMiddleware() {
if (process.env.VOLTO_ROBOTSTXT) {
middleware.all('**/robots.txt', envRobots);
} else {
middleware.all('**/robots.txt', ploneRobots);
middleware.all('**/robots.txt', siteRobots);
}
middleware.id = 'robots.txt';
return middleware;
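The `envRobots` handler referenced in this hunk is not shown here; a minimal sketch of what it is assumed to do (serve the `VOLTO_ROBOTSTXT` value verbatim), for comparison with `siteRobots`:

```js
// Hypothetical sketch of the env-var branch. The real envRobots handler
// lives elsewhere in robotstxt.js and may differ in detail.
const envRobots = function (req, res) {
  res.set('Content-Type', 'text/plain');
  // Serve the raw contents of the VOLTO_ROBOTSTXT environment variable.
  res.send(process.env.VOLTO_ROBOTSTXT);
};
```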
2 changes: 1 addition & 1 deletion packages/volto/src/helpers/Api/Api.js
@@ -17,7 +17,7 @@ const methods = ['get', 'post', 'put', 'patch', 'del'];
* @param {string} path Path (or URL) to be formatted.
* @returns {string} Formatted path.
*/
function formatUrl(path) {
export function formatUrl(path) {
const { settings } = config;
const APISUFIX = settings.legacyTraverse ? '' : '/++api++';

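Exporting `formatUrl` lets server-side helpers such as the robots helper below build API URLs the same way the API client does. An illustrative call, assuming `legacyTraverse` is false and an example `apiPath` (neither value comes from this PR):

```js
// Illustrative only: apiPath here is an example value, not a Volto default.
import { formatUrl } from '@plone/volto/helpers/Api/Api';

// With settings.apiPath = 'http://localhost:8080/Plone' and
// settings.legacyTraverse = false, this is expected to yield something like:
//   'http://localhost:8080/Plone/++api++/@site'
const siteEndpoint = formatUrl('@site');
```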
42 changes: 13 additions & 29 deletions packages/volto/src/helpers/Robots/Robots.js
@@ -1,10 +1,12 @@
/**
* Sitemap helper.
* @module helpers/Sitemap
* Robots helper.
* @module helpers/Robots
*/

import superagent from 'superagent';

import config from '@plone/volto/registry';
import { formatUrl } from '@plone/volto/helpers/Api/Api';
import { addHeadersFactory } from '@plone/volto/helpers/Proxy/Proxy';

/**
@@ -15,41 +17,23 @@ import { addHeadersFactory } from '@plone/volto/helpers/Proxy/Proxy';
*/
export const generateRobots = (req) =>
new Promise((resolve) => {
const internalUrl =
config.settings.internalApiPath ?? config.settings.apiPath;
const request = superagent.get(`${internalUrl}/robots.txt`);
request.set('Accept', 'text/plain');
const request = superagent.get(formatUrl('@site'));
request.set('Accept', 'application/json');
const authToken = req.universalCookies.get('auth_token');
if (authToken) {
request.set('Authorization', `Bearer ${authToken}`);
}
request.use(addHeadersFactory(req));
request.end((error, { text }) => {
request.end((error, { text, body }) => {
if (error) {
resolve(text || error);
} else {
// It appears that express does not take the x-forwarded headers into
// consideration, so we do it ourselves.
const {
'x-forwarded-proto': forwardedProto,
'x-forwarded-host': forwardedHost,
'x-forwarded-port': forwardedPort,
} = req.headers;
const proto = forwardedProto ?? req.protocol;
const host = forwardedHost ?? req.get('Host');
const portNum = forwardedPort ?? req.get('Port');
const port =
(proto === 'https' && '' + portNum === '443') ||
(proto === 'http' && '' + portNum === '80')
? ''
: `:${portNum}`;
// Plone has probably returned the sitemap link with the internal url.
// If so, let's replace it with the current one.
const url = `${proto}://${host}${port}`;
text = text.replace(internalUrl, url);
// Replace the sitemap with the sitemap index.
text = text.replace('sitemap.xml.gz', 'sitemap-index.xml');

Member:
@sneridagh @ericof Removing this replacement is a breaking change. We should keep it here, or handle it instead in plone/plone.volto#183


Member:
Fixed, however I think it has to go into plone.volto too. Although that could be a breaking change, since it may require people to take action in Google Search Console.


Member:
@sneridagh No, it's not a breaking change. It was getting replaced before, and it still will be.

resolve(text);
resolve(
body['plone.robots_txt'].replace(
'{portal_url}',
config.settings.publicURL,
),
);
}
});
});
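Putting it together, the new `generateRobots` asks the `@site` endpoint for the site settings and serves the `plone.robots_txt` value with its `{portal_url}` placeholder swapped for `config.settings.publicURL`. A sketch of the expected result, using an invented payload (the exact `@site` response shape is an assumption based on this diff):

```js
// Invented example payload, not a documented @site response.
const body = {
  'plone.robots_txt':
    'Sitemap: {portal_url}/sitemap-index.xml\nUser-agent: *\nDisallow: /search',
};

// With config.settings.publicURL set to e.g. 'https://example.org',
// the served robots.txt would read:
//
//   Sitemap: https://example.org/sitemap-index.xml
//   User-agent: *
//   Disallow: /search
const robots = body['plone.robots_txt'].replace(
  '{portal_url}',
  'https://example.org',
);
```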