The crawler first checks for a robots.txt file at the root URL; if one is available, the restrictions it defines determine which URLs are crawled, starting from the root URL. Next, the conditions set in the collection settings, such as remove duplicates, spider depth, and redirects, are checked, and URLs that match the criteria are indexed. Robots meta tags, if present in the HTML page, are also honored for indexing and crawling. Once the first page (the root URL) has been crawled, the process continues recursively with the links found on each crawled page, up to the spider depth specified in the collection settings.
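The steps above can be sketched as a breadth-first crawl that honors robots.txt rules, skips duplicate URLs, and stops at the configured spider depth. This is a minimal illustration, not SearchBlox's actual implementation; `fetch_links` is a hypothetical helper that returns the hyperlinks found on a page, and in real use the robots.txt text would be downloaded from the root URL rather than passed in.

```python
import urllib.robotparser
from urllib.parse import urljoin
from collections import deque

def crawl(root_url, robots_txt, spider_depth, fetch_links):
    # Parse the robots.txt rules; an empty rule set allows everything.
    robots = urllib.robotparser.RobotFileParser()
    robots.parse(robots_txt.splitlines())

    seen = {root_url}              # "remove duplicates": visit each URL once
    queue = deque([(root_url, 0)]) # (url, depth) pairs, root is depth 0
    indexed = []

    while queue:
        url, depth = queue.popleft()
        if not robots.can_fetch("*", url):
            continue               # skip URLs disallowed by robots.txt
        indexed.append(url)
        if depth < spider_depth:   # recurse only up to the spider depth
            for link in fetch_links(url):
                link = urljoin(url, link)
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
    return indexed
```

A page disallowed by robots.txt is neither indexed nor used as a source of further links, which mirrors the behavior described above.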
All the details related to the collection settings and robots meta tags are available at
While indexing, the content of the page is indexed under the content field, except for the elements between the stopindex and startindex tags.
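The exclusion works by dropping everything between a stopindex marker and the following startindex marker before the text is indexed. A small sketch of that behavior is below; the `<!-- stopindex -->` / `<!-- startindex -->` comment syntax is an assumption here, so check the SearchBlox documentation for the exact marker format.

```python
import re

# Text between a stopindex marker and the next startindex marker is removed
# before indexing. DOTALL lets the excluded block span multiple lines, and
# the non-greedy .*? stops at the first startindex marker.
STOP_BLOCK = re.compile(
    r"<!--\s*stopindex\s*-->.*?<!--\s*startindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)

def indexable_text(html):
    return STOP_BLOCK.sub(" ", html)

page = """<body>
  <p>This paragraph is indexed.</p>
  <!-- stopindex --><nav>Navigation links, not indexed.</nav><!-- startindex -->
  <p>Indexed again.</p>
</body>"""
```

Running `indexable_text(page)` keeps both paragraphs but removes the navigation block, so repeated boilerplate such as menus does not pollute the content field.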
Meta fields such as title, description, keywords, and URL are indexed as SearchBlox fields and can be searched directly, since they are included in the context field. Both the page content and the SearchBlox fields mentioned above are added to the context field for search.
Other custom meta tags are also indexed; those fields appear in your XML or JSON search response along with the other SearchBlox fields and can be searched using fielded search and filters. They can be added as facet filters or included in the context search by following the steps in the 3rd link below:
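To illustrate how a custom meta field can be used once it appears in the response, the sketch below filters a JSON search response by such a field on the client side. The response layout and the `department` field are placeholders invented for this example; the actual field names and structure depend on your SearchBlox version and page metadata.

```python
import json

# Hypothetical JSON search response: each result carries SearchBlox fields
# (title, url, description) plus a custom meta field ("department").
response_json = """{
  "results": [
    {"title": "Pricing", "url": "https://example.com/pricing",
     "description": "Plans and pricing", "department": "sales"},
    {"title": "Install guide", "url": "https://example.com/install",
     "description": "Setup steps", "department": "support"}
  ]
}"""

def filter_by_meta(raw, field, value):
    # Keep only results whose custom meta field matches the given value,
    # the same idea a facet filter applies on the server side.
    results = json.loads(raw)["results"]
    return [r for r in results if r.get(field) == value]
```

The same custom field could instead be passed as a facet filter in the query itself, which is usually preferable to client-side filtering for large result sets.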
- Kindly note that the JSON or XML response shows the SearchBlox and meta fields; the content field does not appear in the response.
- The search results can be tuned for relevancy as described in the help link below:
Please review all the topics on that page to learn about boosting certain search results and relevancy tuning.