The crawler for an HTTP Collection works as outlined in the steps below:
- The crawler first checks robots.txt. If a robots.txt file is available for the root URL, URLs are crawled starting from the root URL, subject to the limitations specified in robots.txt.
- Other conditions set in the collection settings, such as remove duplicates, spider depth, and redirects, are then checked, and the URLs that match the criteria are indexed.
- Additionally, robots meta tags, if present on an HTML page, are honored during crawling.
- Once crawling is done on the first page (the root URL), the process continues recursively with the URLs found on each web page, up to the spider depth specified in the collection settings.
- While indexing, the page content is indexed under the content field, except for the elements between the stopindex and startindex tags, which are excluded.
- Meta fields such as title, description, keywords, and URL are indexed as SearchBlox fields, which can be searched directly as they are included in the context field. The content and the SearchBlox fields mentioned above are used to generate the context for a search.
- Other custom meta tags are also indexed; these fields can be viewed in your XML or JSON search response along with the other SearchBlox fields, and can be searched using fielded search and filters. They can also be added as facet filters or included in the context search.
- Read: Fielded Search in SearchBlox
- Read: Custom Fields in Search
- Note that in the JSON or XML response you can view the SearchBlox fields as well as the meta fields; the content itself does not appear in the response.
- Search results can be tuned for relevancy as described in the help link below: https://developer.searchblox.com/docs/relevancy-tuning-in-search
- Check all the topics on that page to learn about boosting certain search results and relevancy tuning.
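The crawl steps above (robots.txt check, duplicate removal, recursive traversal up to the spider depth) can be sketched in Python. This is a minimal illustration, not SearchBlox's implementation: the in-memory `PAGES` site, the `ROBOTS_TXT` content, and the `crawl` helper are all hypothetical stand-ins for real HTTP fetches.

```python
from urllib import robotparser
from html.parser import HTMLParser

# Hypothetical in-memory "site" standing in for real HTTP fetches.
PAGES = {
    "https://example.com/": '<a href="https://example.com/a">A</a>'
                            '<a href="https://example.com/private/x">X</a>',
    "https://example.com/a": '<a href="https://example.com/b">B</a>',
    "https://example.com/b": '<a href="https://example.com/c">C</a>',
    "https://example.com/c": "",
    "https://example.com/private/x": "",
}

# Hypothetical robots.txt for the root URL.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(root, spider_depth):
    # Step 1: check robots.txt before crawling from the root URL.
    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    seen = set()  # models the "remove duplicates" collection setting

    def visit(url, depth):
        # Skip URLs beyond the spider depth, already-seen URLs,
        # and URLs disallowed by robots.txt.
        if depth > spider_depth or url in seen or not rp.can_fetch("*", url):
            return
        seen.add(url)
        extractor = LinkExtractor()
        extractor.feed(PAGES.get(url, ""))
        # Step 2: recurse into the links found on each page.
        for link in extractor.links:
            visit(link, depth + 1)

    visit(root, 0)
    return seen
```

With a spider depth of 2, `crawl("https://example.com/", 2)` reaches the root, `/a`, and `/b`, stops before `/c` (depth 3), and never touches `/private/x` because robots.txt disallows it.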
To learn more about HTTP collections, read: HTTP Collection in SearchBlox
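The stopindex/startindex exclusion mentioned above can be illustrated with a small sketch. The marker syntax is assumed here to be HTML comments; confirm the exact form against the SearchBlox documentation for your version.

```python
import re

# Assumed marker syntax -- verify against your SearchBlox version's docs.
STOP = "<!-- stopindex -->"
START = "<!-- startindex -->"

def indexable_content(html: str) -> str:
    """Drop everything between a stopindex marker and the next
    startindex marker, mirroring how such spans are excluded from
    the indexed content."""
    pattern = re.escape(STOP) + r".*?" + re.escape(START)
    return re.sub(pattern, "", html, flags=re.DOTALL)

page = ("<p>Indexed intro.</p>"
        "<!-- stopindex --><nav>Site menu</nav><!-- startindex -->"
        "<p>Indexed body.</p>")
```

Here `indexable_content(page)` keeps both paragraphs but drops the navigation block, which is the typical use: excluding repeated boilerplate (menus, footers) from the content field.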