How does SharePoint 2013 crawl an external site?

Hello all,

We would like to have our web app crawled by SharePoint 2013. Within our web app I have already built a "site map creator" (it builds the common XML sitemap file), which external search engines (Google, Bing, Yahoo...) use to crawl the site and index the content. I have also built a "jump page" creator (the jump page is basically an HTML page that contains links to all the individual records), which Google Search Appliance uses to crawl the site.

I set up a new Content Source that points to my external site and the jump page. When I click to perform a full crawl, it indexes only that one page (the jump page). I figured that it would read the page and follow all the links to the other records on the site.

This doesn't seem to happen. Am I doing something wrong? Is this not how SharePoint crawling works?

I can't seem to find anything on the net that shows what the SharePoint crawler is expecting. Our site pages are all built dynamically, using IDs based on searching for specific values, which is why SharePoint can't just crawl and find the pages on its own. Hence the "jump page", which has all the indexable URLs built for the crawler.
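For readers who have not seen one, a jump page generator can be very small. The following is only a minimal sketch, not the actual creator used in our app: the function name `build_jump_page` is hypothetical, and the `details.aspx?recid=` URL pattern is assumed from the example site.

```python
from html import escape


def build_jump_page(base_url, record_ids, title="Jump page"):
    """Return a static HTML page with one absolute link per record.

    The URL pattern (details.aspx?recid=N) is an assumption for illustration.
    """
    links = []
    for rec_id in record_ids:
        url = f"{base_url}/details.aspx?recid={rec_id}"
        # Escape the URL so characters such as & stay valid inside the attribute.
        links.append(f'<a href="{escape(url, quote=True)}">Record {rec_id}</a>')
    body = "\n".join(links)
    return (
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    print(build_jump_page("http://www.acme.com", [1, 2, 3, 4]))
```

Every link is written as an absolute URL so that a crawler does not have to resolve anything relative to the jump page.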

Any help would be much appreciated!

Thanks very much in advance!

March 26th, 2015 3:22pm

Hi,

Based on your description, you are unable to crawl an external site with SharePoint 2013.

Please check the following:

1. Check whether the crawl log shows any errors for the external site. If there are errors, you can click the error count to see their details.

2. If your crawl rule has either of the following options enabled, uncheck it and test again: "Crawl SharePoint content as http pages" or "Follow links on the URL without crawling the URL itself".

3. If the external site is an SSL-enabled website, you can configure SharePoint to ignore SSL warnings: open SharePoint Central Administration -> select Manage service applications in the Application Management section -> select the Search Service Application (e.g. Search Service 1) -> select Farm Search Administration -> set Ignore SSL warnings to Yes.

The article below explains how to configure an external site as a content source in SharePoint search:

https://mohitvash.wordpress.com/2012/03/07/configure-external-site-as-content-sources-in-sharepoint-search/  


The article below covers the "could not establish trust relationship for the SSL/TLS secure channel" error that SharePoint reports when crawling SSL-enabled websites:

http://www.infotext.com/help/sharepoint-could-not-estabilish-trust-relationship-for-the-ssltls-secure-channel-when-crawling-ssl-enabled-websites/

There is a similar case:

https://social.technet.microsoft.com/Forums/sharepoint/en-US/c4aa91df-d2fd-4c34-9d3c-796666a3509e/sharepoint-2013-could-not-crawl-external-php-cms-website?forum=sharepointsearch

Sara Fan

Best regards

March 27th, 2015 9:58am

Hi Sara, thanks for the reply.

Basically the problem is that SharePoint is not following the links on the page.  There aren't any errors since it is only indexing the single simple html page.

The page I am using as a starting point is a simple HTML "jump page" that consists only of links to record pages. I figured SharePoint would "jump" to those links to get the other pages' content, but it is only indexing the content on the jump page itself.

After the crawl finishes, the logs show a successful crawl, but a count of only 1 page indexed.

I am not using SSL.

Am I missing something?

Thanks!

March 27th, 2015 12:36pm

Hi,

As I understand it, you want to crawl the pages linked from the jump page, and in the content source you use only the jump page as the start address.

You cannot crawl the external site that way. Add the external sites themselves to the content source, and then they will be crawled.


Best regards

Sara Fan

March 29th, 2015 10:10pm

Hi Sara,

There is only one external site, which I added as a content source. I reference the jump page as the start address.

The jump page basically looks like this (http://www.acme.com/jumpage.html):

******************************************************************

<html>

<head><title>Jump page</title></head>

<body>

<a href="http://www.acme.com/details.aspx?recid=1">Record 1</a>

<a href="http://www.acme.com/details.aspx?recid=2">Record 2</a>

<a href="http://www.acme.com/details.aspx?recid=3">Record 3</a>

<a href="http://www.acme.com/details.aspx?recid=4">Record 4</a>

</body>

</html>

**************************************************************************

All pages are on the same external site. The SharePoint crawler indexes only the content of the jump page itself; it does not follow the record links to index the other pages within the site.
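One way to confirm which links the jump page actually exposes, independently of SharePoint, is to parse it with a small script. This is only a diagnostic sketch using Python's standard library; it does not reproduce the SharePoint crawler's behaviour. The names `LinkCollector`, `discover_links`, and `has_query_string` are hypothetical helpers, and the sample HTML is a stand-in for the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links the way a crawler would have to.
                self.links.append(urljoin(self.page_url, value))


def discover_links(page_url, html_text):
    """Return the absolute URLs linked from html_text, in document order."""
    collector = LinkCollector(page_url)
    collector.feed(html_text)
    return collector.links


def has_query_string(url):
    """True when the URL carries a query string such as ?recid=1."""
    return bool(urlsplit(url).query)


if __name__ == "__main__":
    sample = '<a href="http://www.acme.com/details.aspx?recid=1">Record 1</a>'
    for link in discover_links("http://www.acme.com/jumpage.html", sample):
        note = "has a query string" if has_query_string(link) else "plain URL"
        print(link, "-", note)
```

If the printed URLs are not the record URLs you expect, the problem is in the page markup rather than in the crawl configuration. Since every record link here contains a question mark, it may also be worth checking whether a crawl rule that allows complex URLs (URLs containing a question mark) applies to the site.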

Does this make sense?
Thanks!

March 30th, 2015 10:04am

Hi,
 
I recommend adding the external site itself as a start address and selecting "Only crawl within the server of each start address" in the crawl settings, then checking how it works.
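Conceptually, that setting limits the crawl to URLs on the same server as the start address. The sketch below only illustrates the idea and is not how SharePoint implements it: it assumes "same server" means "same host name", and the helper names `same_server` and `in_scope` are hypothetical.

```python
from urllib.parse import urlsplit


def same_server(start_address, candidate_url):
    """True when both URLs share a host name (assumed meaning of 'same server')."""
    # urlsplit lowercases the hostname, so the comparison is case-insensitive.
    return urlsplit(start_address).hostname == urlsplit(candidate_url).hostname


def in_scope(start_address, urls):
    """Keep only the URLs that live on the start address's server."""
    return [url for url in urls if same_server(start_address, url)]


if __name__ == "__main__":
    found = [
        "http://www.acme.com/details.aspx?recid=1",
        "http://other.example.org/page.html",
    ]
    print(in_scope("http://www.acme.com/", found))
```

With the site's own address as the start address, the record pages stay in scope because they share its host name, while links to other hosts are dropped.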

 
Best regards

Sara Fan

March 30th, 2015 11:00pm
