
Documentation for Auth - OKTA SSO #421

Open
jacksonp2008 opened this issue Nov 5, 2017 · 16 comments

@jacksonp2008

Testing this product, so far so good. I will have a number of sites that use OKTA SSO which I will need to crawl. Any pointers on how to do this?

@essiembre
Contributor

I am not too familiar with OKTA SSO, but I can see they support a few authentication methods. Maybe one of them is supported by GenericHttpClientFactory. If not, you can extend this class to provide a custom authentication mechanism using OAuth 2.0, carrying your security token around. You can have a look at https://www.norconex.com/how-to-crawl-facebook/ which does something similar.
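
To illustrate, here is a rough, untested sketch of that idea. The class name and fetchToken() helper are invented, and the createHTTPClient(String userAgent) override point is assumed from the 2.x factory contract, so verify the exact signature in your version's Javadoc:

import java.util.Collections;

import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.HttpClientBuilder;
import org.apache.http.message.BasicHeader;

import com.norconex.collector.http.client.impl.GenericHttpClientFactory;

// Hypothetical sketch: attach an OAuth 2.0 bearer token to every request
// by building the underlying Apache HttpClient with a default header.
public class TokenHttpClientFactory extends GenericHttpClientFactory {

    @Override
    public HttpClient createHTTPClient(String userAgent) {
        return HttpClientBuilder.create()
                .setUserAgent(userAgent)
                .setDefaultHeaders(Collections.singletonList(
                        new BasicHeader("Authorization",
                                "Bearer " + fetchToken())))
                .build();
    }

    // Placeholder: obtain/refresh a token from your identity provider.
    private String fetchToken() {
        return "...";
    }
}

Note that building the client yourself like this bypasses the settings GenericHttpClientFactory normally applies from your XML, so treat it as a starting point only.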

If the authentication for your sites is standard/generic enough and you can share credentials for a sample site, maybe we'll be able to add built-in support for another authentication scheme.

@jacksonp2008
Author

Thanks. OKTA is not social auth; rather, it is an enterprise application that provides single sign-on via SAML for web applications. https://developer.okta.com/use_cases/integrate_with_okta/sso-with-saml

It may be possible to send username/password and 2FA. I'll do some research on this on the OKTA side.

Can you please point me to how I would extend GenericHttpClientFactory for this? Generally, how would I send a simple username/password to a site and then crawl it?

@essiembre
Contributor

I did not mean to suggest OKTA was a social auth. The link to the blog post was to point you to an example of extending the Collector. There is an example of a class extending GenericDocumentFetcher showing one way you can pass a token with every URL request. On second thought, that may be the best class to override, since that is where the HTTP requests actually happen.
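
To make that concrete, here is an untested sketch in the spirit of the blog post. The class name and getToken() helper are invented, and both the fetchDocument(HttpClient, HttpDocument) signature and the HttpDocument reference accessors are assumptions to verify against your version's Javadoc:

import org.apache.http.client.HttpClient;

import com.norconex.collector.http.doc.HttpDocument;
import com.norconex.collector.http.fetch.HttpFetchResponse;
import com.norconex.collector.http.fetch.impl.GenericDocumentFetcher;

// Hypothetical sketch: rewrite each URL to carry a security token
// before delegating to the default fetching logic.
public class TokenDocumentFetcher extends GenericDocumentFetcher {

    @Override
    public HttpFetchResponse fetchDocument(HttpClient httpClient, HttpDocument doc) {
        String url = doc.getReference();
        doc.setReference(url + (url.contains("?") ? "&" : "?")
                + "access_token=" + getToken());
        return super.fetchDocument(httpClient, doc);
    }

    // Placeholder: obtain/refresh the token however your provider requires.
    private String getToken() {
        return "...";
    }
}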

Extending GenericHttpClientFactory could be good if you know there is a way to add "default" SAML authentication (or other supported auth) on Apache HttpClient.

It seems like OKTA provides a Java API which should make your life easier: https://developer.okta.com/code/java/index

If you have a way to provide me with a protected URL with a temporary test account (sent by email), we could make adding SAML support a feature request if you like.

@jacksonp2008
Author

jacksonp2008 commented Nov 9, 2017

Very interested, and thank you! I need to do some work on my end.

OKTA does offer a free test account, which might help. I am totally open to getting you access for testing.

I've got some basic auth sites to figure out first. I tried one below; inside <httpcollector ...>, I have:

<httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
    <authUsername>super secret account</authUsername>
    <authPassword>super secret password</authPassword>
    <authUsernameField>super secret account</authUsernameField>
    <authPasswordField>super secret password</authPasswordField>
    <authMethod>form</authMethod>
</httpClientFactory>

Creds are good, but it returns:

INFO  [CrawlerEventManager]       REJECTED_BAD_STATUS: https://updates.forescout.com/ (HttpFetchResponse [crawlState=BAD_STATUS, statusCode=401, reasonPhrase=Authorization Required])

Must be missing something. I tried <authMethod> values of:

form
basic
digest

No love, but very close.

@essiembre
Contributor

Can you share test credentials for that one too? From a very quick look at the site, it seems to be using "basic" (or maybe "digest"). I do not know if that's related, but there is a new flag that fixes basic auth issues for some people, described here: #420 (comment)

Here is the flag:

<authPreemptive>true</authPreemptive>

Give it a try and let me know.

@essiembre
Contributor

Marking as a feature-request to support SAML.

@jacksonp2008
Author

Just getting back to this, thank you Pascal. I did add authPreemptive but it didn't fix it. I'm not sure how to debug further; is there a way to enable verbose logging so I can see what's going on behind the scenes? If I can't figure it out, I may take you up on your kind offer to log in.

@jacksonp2008
Author

jacksonp2008 commented Nov 14, 2017

Just FYI, I set up Postman and did a "basic" auth request with the credentials and it works fine, but with the crawler it fails.

Here is my config:

<httpcollector id="Minimum Config HTTP Collector">
  <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
      <authUsername>xxxxx</authUsername>
      <authPassword>xxxxxxx</authPassword>
      <authUsernameField>xxxxx</authUsernameField>
      <authPasswordField>xxxxxx</authPasswordField>
      <authPreemptive>true</authPreemptive>
      <trustAllSSLCertificates>true</trustAllSSLCertificates>
      <authMethod>basic</authMethod>
      <authURL>https://updates.forescout.com</authURL>
  </httpClientFactory>

and log:

INFO  [SitemapStore] Anonymous Coward: Initializing sitemap store...
INFO  [SitemapStore] Anonymous Coward: Done initializing sitemap store.
log4j:WARN No appenders could be found for logger (org.apache.http.client.protocol.ResponseProcessCookies).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
INFO  [StandardSitemapResolver] Resolving sitemap: https://updates.forescout.com/sitemap.xml
INFO  [StandardSitemapResolver]          Resolved: https://updates.forescout.com/sitemap.xml
INFO  [HttpCrawler] 1 start URLs identified.
INFO  [CrawlerEventManager]           CRAWLER_STARTED
INFO  [AbstractCrawler] Anonymous Coward: Crawling references...
INFO  [CrawlerEventManager]       REJECTED_BAD_STATUS: https://updates.forescout.com/ (HttpFetchResponse [crawlState=BAD_STATUS, statusCode=401, reasonPhrase=Authorization Required])

@essiembre
Contributor

essiembre commented Nov 15, 2017

To get maximum verbosity, set the following to TRACE in log4j.properties:

log4j.logger.org.apache.http=TRACE
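
Since your log also shows "No appenders could be found" warnings, here is a minimal log4j.properties sketch for context (the appender name and pattern are just examples):

log4j.rootLogger=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%-5p [%c{1}] %m%n

# Maximum HTTP client verbosity, including wire-level request/response dumps:
log4j.logger.org.apache.http=TRACE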

If you do not mind sending me temporary credentials via email, I will look into it when I get a chance.

@essiembre
Contributor

Having a second look at your log, I see it rejects the authentication instead of attempting it. Odd... maybe as a test you can try setting the following in your document fetcher:

<validStatusCodes>200,401</validStatusCodes>
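
In context, that setting belongs in your crawler's document fetcher section, something like the following (double-check the class name against your version's documentation):

<documentFetcher class="com.norconex.collector.http.fetch.impl.GenericDocumentFetcher">
  <validStatusCodes>200,401</validStatusCodes>
</documentFetcher>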

Let me know if that makes a difference.

@jacksonp2008
Author

jacksonp2008 commented Nov 15, 2017

Sorry, that didn't do it, although I do have a clue.

If I use curl:

curl -u user:pas$word https://updates.forescout.com

it fails with a 401 even though the creds are good.

Adding \ in front of the $ makes it work:

curl -u user:pas\$word https://updates.forescout.com

Is it possible that we are seeing an issue with special characters in the password?

I tried several variations; none worked:

'pas$word'
"pas$word"
pas$word

thanks

@essiembre
Contributor

Interesting... a possibility for sure. Are you storing your password in the Collector XML config? If so, did you try escaping the $ with a backslash in the config? It may be getting interpreted as a variable. If that's the case and escaping works, you have other options too: define the password in a variables/properties file and reference it as a variable in the config, or encrypt the password (see the online documentation).
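
As a sketch of the variables-file option (the file and variable names here are placeholders; see the Collector configuration documentation for how variable files are loaded), the config would reference:

<authPassword>${myPassword}</authPassword>

and the variables/properties file loaded with the config would define:

myPassword=pas$word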

If that does not make a difference, it may need to be escaped by the Collector somehow before sending it to your server and will require more investigation.

As a workaround solution, does it work if you change the password to one without $ in it? Just to 100% confirm the issue is with the $.

@essiembre
Contributor

essiembre commented Nov 17, 2017

Good news: it is working as expected and I was able to make it work without anything special. It turns out you had put <httpClientFactory> under <httpcollector>, while it goes under your <crawler> section (as per the documentation). Moving it there did it.
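
For anyone landing here later, the corrected shape is roughly as follows, with the factory nested in a crawler (ellipses stand for the rest of the configuration):

<httpcollector id="Minimum Config HTTP Collector">
  ...
  <crawlers>
    <crawler id="...">
      ...
      <httpClientFactory class="com.norconex.collector.http.client.impl.GenericHttpClientFactory">
        <authMethod>basic</authMethod>
        <authUsername>xxxxx</authUsername>
        <authPassword>xxxxxx</authPassword>
        <authPreemptive>true</authPreemptive>
        <trustAllSSLCertificates>true</trustAllSSLCertificates>
      </httpClientFactory>
    </crawler>
  </crawlers>
</httpcollector>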

Please give it a try and confirm.

@jacksonp2008
Author

Awesome news, it ran here as well. Thanks for all your help on this, Pascal!

@kristiWabion

Hello, any update on using SSO Auth with Norconex?

@jacksonp2008
Author

jacksonp2008 commented Nov 10, 2018 via email
