How does the crawler work in SearchBlox, and how does indexing take place?

The crawler first checks for a robots.txt file at the root URL. If one is available, URLs are crawled starting from the root URL, subject to the restrictions in robots.txt. The crawler then checks the other conditions set in the collection settings, such as remove duplicates, spider depth, and redirects, and indexes each URL that meets the criteria. Robots meta tags, if present in the HTML page, are also taken into account for indexing and crawling. Once the first page (the root URL) has been crawled, the process continues recursively with the links found on each page, up to the spider depth specified in the collection settings.
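The flow described above can be illustrated with a minimal Python sketch. This is not SearchBlox's implementation: the in-memory `SITE` map, the `crawl` function, and the `fetch_links` callback are hypothetical names used only for illustration, and the sketch assumes the root URL counts as depth 0. Only the standard-library `urllib.robotparser` is used to interpret robots.txt.

```python
from collections import deque
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory site: URL -> links found on that page.
SITE = {
    "https://example.com/": ["/a", "/private/x"],
    "https://example.com/a": ["/b"],
    "https://example.com/b": ["/c"],
    "https://example.com/c": [],
    "https://example.com/private/x": [],
}
ROBOTS_TXT = "User-agent: *\nDisallow: /private/\n"


def crawl(root_url, robots_txt, spider_depth, fetch_links):
    """Breadth-first crawl from root_url, limited by robots.txt and depth."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())

    indexed = []
    seen = {root_url}  # crude stand-in for a "remove duplicates" setting
    queue = deque([(root_url, 0)])  # assumption: the root URL is depth 0
    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch("*", url):
            continue  # disallowed by robots.txt: neither indexed nor followed
        indexed.append(url)
        if depth >= spider_depth:
            continue  # depth limit reached: index the page, do not follow links
        for href in fetch_links(url):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return indexed


result = crawl("https://example.com/", ROBOTS_TXT, 2, lambda u: SITE.get(u, []))
print(result)
```

With a spider depth of 2 in this sketch, the root page and the pages one and two links away are visited, the page under /private/ is skipped because of robots.txt, and /c is never reached.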

All the details related to these settings and to robots meta tags are available at:

https://developer.searchblox.com/docs/http-collection#section-collection-settings

https://developer.searchblox.com/docs/http-collection#section-metatags-customization 
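As a rough illustration of how a robots meta tag can influence indexing and link following, here is a small Python sketch using the standard-library `html.parser`. The class and function names (`RobotsMetaParser`, `robots_policy`) are hypothetical, and the sketch only handles the common `noindex`, `nofollow`, and `none` directives; SearchBlox's own handling may differ, so consult the links above for the supported values.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_policy(html):
    """Return whether a page may be indexed and whether its links may be followed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    blocked = "none" in parser.directives  # "none" means noindex + nofollow
    return {
        "index": not blocked and "noindex" not in parser.directives,
        "follow": not blocked and "nofollow" not in parser.directives,
    }


page = '<html><head><meta name="robots" content="noindex, follow"></head><body>Hi</body></html>'
policy = robots_policy(page)
print(policy)
```

A page without any robots meta tag is treated as indexable and followable in this sketch.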

While indexing, the content of the page, except for anything between the stopindex and startindex tags, is indexed under content.
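The exclusion described above can be pictured with a short Python sketch. It assumes the tags are written as HTML comments (`<!--stopindex-->` and `<!--startindex-->`); the function name `strip_excluded` is hypothetical, and an unmatched stopindex tag is simply left in place here, which may not match the product's behavior.

```python
import re

# Assumption: the exclusion markers are HTML comments.
STOP_START = re.compile(
    r"<!--\s*stopindex\s*-->.*?<!--\s*startindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_excluded(html):
    """Remove everything between each stopindex/startindex pair."""
    return STOP_START.sub("", html)


page = (
    "<p>Keep this.</p>"
    "<!--stopindex--><nav>Menu</nav><!--startindex-->"
    "<p>And this.</p>"
)
cleaned = strip_excluded(page)
print(cleaned)
```

The non-greedy `.*?` matters when a page contains several excluded regions: each stopindex is paired with the nearest following startindex rather than the last one on the page.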

Meta fields such as title, description, keywords, and URL are indexed as SearchBlox fields. They can be searched directly because they are included in the context field. Both the content and the SearchBlox fields mentioned above are added to context for search.

Other custom meta tags are indexed as well. These fields appear in your XML or JSON search response along with the other SearchBlox fields, and they can be searched using fielded search and filters. They can also be added as facet filters, or added to context search by following the steps in the third link below:

https://developer.searchblox.com/docs/cheat-sheet#section-fielded-search 

https://developer.searchblox.com/docs/filters 

https://developer.searchblox.com/docs/custom-fields-in-search 

Additional Pointers

  • Note that the JSON or XML response shows the SearchBlox fields and meta fields. Content does not appear in the response.
  • The content picked up by the indexer should be static content and should not be generated dynamically via JavaScript.
  • The search results can be tuned for relevancy as described in the help link below:

https://developer.searchblox.com/docs/relevancy-tuning-in-search 

Please review all the topics on that page to learn about boosting certain search results and tuning relevancy.
