Crawling the same site: SP crawler cannot follow hyperlinks; FAST crawler follows hyperlinks but reports error: Could not convert decimal number
Hi,
Recently we encountered an issue while crawling a simple internet site.

Background:
1. It is an anonymous-access site containing 1 home page and 10 individual pages. The home page contains 10 hyperlinks that point to these 10 individual pages.
2. The crawler server is able to access these 10 pages (tested with IE).
3. Below is the HTTP response from the home page:
HTTP/1.1 200 OK
Server: nginx
Date: Mon, 25 Mar 2013 02:21:06 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Keep-Alive: timeout=60
Vary: Accept-Encoding
Content-Language: en
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Cache-Control: no-cache
Cache-Control: private
Content-Length: 571

<html oty_id="1" ptl_id="198" name="Photo archive"><body><a href="com67_index.obt?obt_id=250186">250186</a><a href="com67_index.obt?obt_id=250190">250190</a><a href="com67_index.obt?obt_id=276298">276298</a><a href="com67_index.obt?obt_id=316266">316266</a><a href="com67_index.obt?obt_id=269604">269604</a><a href="com67_index.obt?obt_id=269606">269606</a><a href="com67_index.obt?obt_id=330751">330751</a><a href="com67_index.obt?obt_id=330745">330745</a><a href="com67_index.obt?obt_id=330743">330743</a><a href="com67_index.obt?obt_id=330744">330744</a></body></html>

4. For one individual page, the HTTP response is below:
HTTP/1.1 200 OK
Server: nginx
Date: Mon, 25 Mar 2013 02:24:52 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Keep-Alive: timeout=60
Vary: Accept-Encoding
Content-Language: en
Expires: Thu, 01 Jan 1970 00:00:01 GMT
Cache-Control: no-cache
Cache-Control: private
Content-Length: 595

<html object="330751"><head><meta name="Title" content="FirstNameAAA LastNameBBB"/><meta name="Description"/><meta name="ImageUrl" content="http://www.xxx.com/fileroot/gallery/73114a.jpg"/><meta name="DirectUrl" content="http://www.xxx.com/mars/search.search?id=198p_aun_obt_id=330751"/><meta name="Created" content="2011-10-28"/><meta name="Keywords"/><meta
name="author" content="aaa bbb"/><body>Object 330751</body></head></html>

Everything looks fine so far.
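
To check the home page links independently of either crawler, they can be extracted and resolved with a few lines of Python. This is only a sketch: the base URL is a hypothetical placeholder (the real start address is not given here), and the HTML is a shortened copy of the home page body above with the first three links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Shortened copy of the home page body quoted above (first three links only).
HOME_PAGE = (
    '<html oty_id="1" ptl_id="198" name="Photo archive"><body>'
    '<a href="com67_index.obt?obt_id=250186">250186</a>'
    '<a href="com67_index.obt?obt_id=250190">250190</a>'
    '<a href="com67_index.obt?obt_id=276298">276298</a>'
    "</body></html>"
)

# Hypothetical start address; the real URL is not part of the post.
BASE_URL = "http://www.example.com/archive/index.obt"


def extract_links(html, base_url):
    """Return absolute URLs for all links found in the HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return [urljoin(base_url, href) for href in parser.hrefs]


links = extract_links(HOME_PAGE, BASE_URL)
for link in links:
    print(link)
```

The links are relative and carry a query string, so a crawler has to resolve them against the start address and must be allowed to follow URLs that contain ?.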

Next we set up a content source in the content SSA:
Type: Web
Start address: <the url of home page>
Crawl setting: Only crawl within the server of each start address
We also created a crawl rule (e.g. with the option to crawl complex URLs, i.e. URLs that contain ?, checked) to make sure the individual URLs will be crawled.
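
As a rough way to reason about these two settings, the sketch below checks whether a URL is on the same server as the start address and whether it is a "complex" URL. It is a stand-in written for illustration, not SharePoint's actual scoping logic, and all URLs and the function name are hypothetical.

```python
from urllib.parse import urlsplit


def in_scope(url, start_address, crawl_complex_urls=True):
    """Approximate the settings above: same server, and '?' URLs allowed or not."""
    same_server = (
        urlsplit(url).netloc.lower() == urlsplit(start_address).netloc.lower()
    )
    is_complex = "?" in url
    return same_server and (crawl_complex_urls or not is_complex)


START = "http://www.example.com/archive/index.obt"  # hypothetical start address
PAGE = "http://www.example.com/archive/com67_index.obt?obt_id=250186"

print(in_scope(PAGE, START))                                # True
print(in_scope(PAGE, START, crawl_complex_urls=False))      # False
print(in_scope("http://other.example.org/x.html", START))   # False
```

Under these assumptions the individual pages are in scope only when complex URLs are allowed, which is why the crawl rule matters for this site.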

Then we have an issue with the SharePoint crawler:
1. After a full crawl, we got only one successfully crawled item, which is the home page. The SP crawler "refused" to follow the hyperlinks on the home page and crawl the 10 individual pages.
2. If we set the URL of one individual page as the start address, it is crawled successfully. So there is nothing wrong with, for example, the connection to the individual pages.
3. As long as the SP crawler reports a page as crawled, we can find it in the search results.

We also have a *different* issue with the FAST crawler:
1. We tried the FAST web crawler. In this case, the FAST web crawler is able to find all pages (1 home page + 10 individual pages), and I can see the document count in its brand new content collection go from 0 to 11.
2. However (again!), when I run indexerinfo, I find that these 11 "successfully crawled" documents are in the not_indexed count.
After further research, I found that FAST is able to extract crawled properties and generate FIXML files, but the documents fail to be indexed.
The log file D:\FASTSearch\data\data_fixml\doc_errors_TheNewSearch.dat contains the error below:
"ec507eaa1973814c40a7c4fefb8c32b1 272 Aborted document during indexing at fixml file line 2189 column 51. Reason: AddDecimalNumber(bi1, bconsortdate, 1363244311) failed: Could not convert decimal number '1363244311' to an integer using 0 digit decimal precision. Hindexing".

Line 2189, column 51 in the FIXML file is below:
    <context name="bconsortdate"><![CDATA[1363244311']]></context>
It belongs to the section "bi1" below:
  <catalog name="bi1">
    <context name="bconprocessingtime"><![CDATA[2013-03-14T06:59:10Z]]></context>
    <context name="bcondocdatetime"><![CDATA[2013-03-14T06:58:29Z]]></context>
    <context name="bconsize">617</context>
    <context name="bconhwboost">10000</context>
    <context name="bcondocrank"><![CDATA[0]]></context>
    <context name="bconsiterank"><![CDATA[0]]></context>
    <context name="bconurldepthrank"><![CDATA[500]]></context>
    <context name="bconwrite"><![CDATA[2013-03-14T06:58:29Z]]></context>
    <context name="bcondocumentsignature"><![CDATA[369874431937263284]]></context>
    <context name="bconsortdate"><![CDATA[1363244311']]></context>
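
The doc_errors entry has a regular shape, so the position, catalog, context, and value can be pulled out with a regular expression. The sketch below does that and then, as an assumption, interprets the value as Unix epoch seconds. The pattern and the function name are made up for illustration and only cover this one message format.

```python
import re
from datetime import datetime, timezone

# The error entry quoted above (trailing text after the final period omitted).
ERROR_LINE = (
    "ec507eaa1973814c40a7c4fefb8c32b1 272 Aborted document during indexing "
    "at fixml file line 2189 column 51. Reason: AddDecimalNumber(bi1, "
    "bconsortdate, 1363244311) failed: Could not convert decimal number "
    "'1363244311' to an integer using 0 digit decimal precision."
)

PATTERN = re.compile(
    r"fixml file line (?P<line>\d+) column (?P<column>\d+)\. "
    r"Reason: AddDecimalNumber\((?P<catalog>\w+), (?P<context>\w+), (?P<value>\d+)\)"
)


def parse_error(text):
    """Return the fields of an AddDecimalNumber failure, or None if no match."""
    match = PATTERN.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("line", "column", "value"):
        fields[key] = int(fields[key])
    return fields


info = parse_error(ERROR_LINE)
print(info)

# Assumption: the value is seconds since 1970-01-01 UTC.
as_utc = datetime.fromtimestamp(info["value"], tz=timezone.utc)
print(as_utc.isoformat())  # 2013-03-14T06:58:31+00:00
```

Read this way, the value is two seconds after bcondocdatetime (2013-03-14T06:58:29Z) in the same catalog, which suggests it is a crawl timestamp rather than a value taken from the page.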

To summarize:
1. The content of the pages looks correct, since the SP crawler is able to crawl and index the content of a page once it reaches it.
2. The hyperlinks look fine as well, since the FAST crawler is able to follow them and to find and download all 10 pages.
3. Something is wrong with the SP crawler: it cannot follow the hyperlinks the way the FAST crawler does.
4. Something is wrong with the indexer when using the FAST web crawler: it reports the "Could not convert decimal number" error. With the SP crawler we do not get this error.

Has anyone had a similar issue before?

Many thanks,
Feng
March 25th, 2013 6:04am

In addition, we ported all pages from this external site to a local IIS site, but changed the URLs to use the file extension *.html instead of "com67_index.obt?obt_id =***". The content of the pages remains the same.

When we tried the SP crawler with the URL of the home page on the local site, all pages were crawled and indexed successfully.

Maybe something is wrong with the HTTP response headers from the external internet site?

-Feng
March 25th, 2013 6:09am

OK, now I am answering my own question.

1. Issue of the SP web crawler not following hyperlinks: MS is able to reproduce it in SP2010 and SP2013. It is not yet clear whether it is caused by the way the web site was built.
This one is still in progress.

2. Issue of the FAST web crawler/indexer reporting an error:

- In our FAST ranking profile, we use a managed property sortdate for freshness. This managed property sortdate maps to a crawled property named sortdate.

- For the site that has the problem, the pipeline does not put any value into the crawled property sortdate (because it is a new site, we have not changed anything in the pipeline).

- The problem is that the FAST web crawler cannot get the pages from the site into the index (they just end up in not_indexed).

- After debugging, we found that for the FIXML file the indexer reports this error: Aborted document during indexing at fixml file line 2198 column 51. Reason: AddDecimalNumber(bi1, bconsortdate, 1368763127) failed: Could not convert decimal number '1368776044' to an integer using 0 digit decimal precision. Meanwhile, the spy file shows that the FAST crawler adds an OOTB crawled property named crawltime, which contains a value like #### ATTRIBUTE crawltime <type 'int'>: 1368763127

- Docpush is able to put the same page into the index. Please note that in the spy file for docpush there is no crawltime crawled property.

- After removing sortdate from the ranking profile, the problem is gone.

- If we create another managed property (e.g. sortdate2) and apply it to the ranking profile, then as long as sortdate2 contains NOTHING (it is not mapped to any crawled property, or it is mapped to an empty crawled property, e.g. sortdate), re-crawling the site produces the same error: Aborted document during indexing at fixml file line 2193 column 51. Reason: AddDecimalNumber(bi1, bconsortdate2, 1368777820) failed: Could not convert decimal number '1368777820' to an integer using 0 digit decimal precision.

- If we map sortdate2 to crawltime, which contains an integer value, the problem is gone.

Guessed root cause:

Since sortdate is used by the ranking profile, when it contains no value FAST fills in a value from somewhere else, and that value triggers the decimal-to-integer conversion error above.

Solution:

- Map sortdate to the crawled property crawltime as the 2nd mapping in order (after the crawled property sortdate). Even though crawltime contains the same integer value as in the error message above, it works. (Verified)

- Make sure the pipeline always generates a value for the managed property sortdate. (Verified)

- Add a meta tag named sortdate to the HTML sources in order to feed the desired value to the crawled property when it is read in the document processor pipeline. (Not verified yet)
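
The first fix relies on an ordered mapping: the managed property takes its value from the first mapped crawled property that has one. The plain Python below only illustrates that idea. It is not FAST or SharePoint API code, and the function name and the dictionary layout are hypothetical.

```python
def resolve_managed_property(crawled, mapping_order):
    """Return the first non-empty crawled property value, following mapping order."""
    for name in mapping_order:
        value = crawled.get(name)
        if value not in (None, ""):
            return value
    return None


# A page from the problem site: no sortdate, but the crawler supplies crawltime.
doc = {"crawltime": 1368763127}

print(resolve_managed_property(doc, ["sortdate"]))               # None
print(resolve_managed_property(doc, ["sortdate", "crawltime"]))  # 1368763127
```

With only sortdate mapped, the managed property ends up empty for this site. With crawltime as the second mapping, it always receives an integer value.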



June 27th, 2013 11:30am

This topic is archived. No further replies will be accepted.
