Crawl Rules Tips in SharePoint 2010
Manage Crawl Rules in SharePoint
SharePoint administrators can include or exclude specific URLs during the content crawling stage. SharePoint crawls content periodically so that the search index stays up to date and users see the latest search results quickly. By adding inclusion or exclusion rules for particular URLs, administrators can in effect "modify" the search results, since the matching content is added to or left out of the index.
(* This sounded weird to me at first, because it seems to violate the idea that public assets on a collaboration platform should be findable. However, I came to accept it after understanding the user requirements and business scenario of a real-world customer.)
To manage crawl rules in SharePoint 2010, navigate to:
SharePoint 2010 Central Administration > Application Management > Service Applications > Manage Service Applications > Search Service Application > Crawling > Crawl Rules
Regular Expression (RegEx) in Crawl Rule
Administrators can enter a URL, a wildcard pattern, or a regular expression when managing a crawl rule. I had a requirement to exclude all URLs with a suffix of "AllItems.aspx". However, this page can exist in multiple places, e.g. DocLibA can have a page like "/DocLibA/Forms/AllItems.aspx" and DocLibB can also have a page like "/DocLibB/Forms/AllItems.aspx".
To exclude every URL with this suffix, I need a pattern that stands in for the DocLib and Forms parts of the path. However, a document library can sit at any number of levels, so the traditional asterisk pattern does not work: we have no clue how many levels users will create in the long run, and /*/AllItems.aspx is different from /*/*/AllItems.aspx.
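To make the depth problem concrete, here is a small Python sketch. It is not SharePoint's own matcher; it only models the behaviour described above, under the assumption that each asterisk stands for exactly one path segment. The helper name segment_wildcard_to_regex and the sample paths are made up for illustration.

```python
import re

def segment_wildcard_to_regex(pattern: str) -> "re.Pattern[str]":
    """Turn a path wildcard into a regex.

    Assumption for this sketch: each "*" matches exactly one path
    segment, i.e. one or more characters that are not "/".
    """
    parts = ["[^/]+" if part == "*" else re.escape(part)
             for part in pattern.split("/")]
    return re.compile("^" + "/".join(parts) + "$", re.IGNORECASE)

# Fixed-depth wildcard patterns, as in the text above.
one_level = segment_wildcard_to_regex("/*/AllItems.aspx")
two_levels = segment_wildcard_to_regex("/*/*/AllItems.aspx")

# A suffix-only regex that does not care how deep the page sits.
any_depth = re.compile(r"^/.+/AllItems\.aspx$", re.IGNORECASE)

# Hypothetical sample paths at three different depths.
paths = [
    "/Forms/AllItems.aspx",
    "/DocLibA/Forms/AllItems.aspx",
    "/Sub/DocLibB/Forms/AllItems.aspx",
]

def matching(rx: "re.Pattern[str]") -> list:
    """Return the sample paths that the given regex matches."""
    return [p for p in paths if rx.match(p)]

if __name__ == "__main__":
    print("one level :", matching(one_level))
    print("two levels:", matching(two_levels))
    print("any depth :", matching(any_depth))
```

Each fixed-depth pattern catches only the paths at its own depth, while the suffix-only expression catches all three, which is why a plain asterisk pattern would need one rule per possible depth.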
Therefore, RegEx came to mind immediately: I needed a wildcard-like RegEx pattern that pins down only the suffix, so I went looking for some URL pattern references. However, I hit another problem: the RegEx pattern did NOT work the way I wished, because every backslash "\" in it got distorted into a forward slash "/".
Distorted result: http://(/w+//forms//allitems.aspx)$
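The effect of that distortion can be reproduced outside SharePoint. The Python sketch below is only an illustration: the "intended" pattern is my guess at what was typed (it is the simplest pattern that turns into the distorted result above once every backslash becomes a forward slash), and the sample URL is made up.

```python
import re

# Hypothetical reconstruction of the pattern before it was distorted.
intended = r"http://(\w+\/forms\/allitems.aspx)$"

# Simulate the distortion: every backslash becomes a forward slash.
distorted = intended.replace("\\", "/")

# Made-up sample URL for the comparison.
url = "http://server/forms/allitems.aspx"

intended_hit = re.search(intended, url) is not None
distorted_hit = re.search(distorted, url) is not None

if __name__ == "__main__":
    print("distorted pattern:", distorted)
    print("intended matches :", intended_hit)
    print("distorted matches:", distorted_hit)
```

Once the escapes are lost, "/w+" means a literal slash followed by one or more "w" characters rather than "one or more word characters", so the distorted rule no longer matches the URL it was written for.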
Finally, I had to use a pattern like this to exclude every allitems.aspx page under any folder, at any level: