Regex fun
Update: Oliver Sturm made a few great suggestions to improve the expression and also fixed a bug with eager matching. His complete comment explains what eager matching is and how to fix it if you run into the same issue!
Today I was working on a new feature of the LLBLGen Pro code generator engines: user code region preservation. This feature allows template designers to specify areas in the generated code where developers can add their own code, which is then preserved when the code is generated again. An example is a custom property in a Customer class which returns the full name based on the existing FirstName and LastName properties. Using this technique avoids having to subclass generated classes to add functionality.
The obvious way to do this is by inserting a start marker and an end marker around the region that should be preserved. To be able to define those regions in different scopes, each region gets a name. When the template parser runs into a region statement in the template, placed there by the template author, it can look up the region by name in the current version of the generated code and copy its contents over to the new version of the generated code.
For the start marker I had __LLBLGENPRO_USER_CODE_REGION_START in mind, and for the end marker __LLBLGENPRO_USER_CODE_REGION_END. Pretty basic. Placed inside comments, these will be easy to find and are not very likely to match existing code, which is always the issue with markers in code. As the output is text (C# or VB.NET code or a code support file, like a .config file or any other output file the developer had in mind), it should be fairly easy to find the markers and the regions by doing some string search voodoo, right?
So I opened my parser source code and started working on the region finder code. As the current generated code isn't parsed by this parser, there is no token-to-nonterminal parser logic available for the generated code, and because I was raised on C, I thought "what the heck, just some string search routines will do fine." However, that's easier said than done. As the markers will be placed in C# or VB.NET code, the comment operator is unknown to the parser. Also, the full line on which the marker is placed has to be copied, so the search routine has to scan back to the first CRLF it runs into. When it finds the start marker, it has to scan further for the region name. This got out of hand pretty quickly.
As the parser itself is built with regular expressions, I knew what they could do. Looking at my string searcher code, I realized I had to do something drastic: try to do it with regexes. A feeling inside me said that it might even be possible to do it with one single regex. Well, let's see!
Consider this code snippet from the generated code which has a user code region and which should be preserved. It's from an OrderEntity class, which has an extra property for the customer name (also pay attention to the whitespace):
    // __LLBLGENPRO_USER_CODE_REGION_START customProperties
    /// <summary>
    /// Gets the company name of the related customer entity.
    /// </summary>
    public string CustomerCompanyName
    {
        get
        {
            if(this.Customer==null)
            {
                return string.Empty;
            }
            else
            {
                return this.Customer.CompanyName;
            }
        }
    }
    // __LLBLGENPRO_USER_CODE_REGION_END

How to find such regions in the code with one regex? Well, with this one (wrapped over multiple lines for readability):
"^[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_START (?<regionName>\w+)\r\n(.*\r\n)*?[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_END"It defines both VB.NET and C# comment operators, and uses a group match to find the region name back. It can handle empty regions and empty lines.
So what does my scanner look like now?
    private void FindUserCodeRegions()
    {
        // use the compiled regex to find all regions.
        MatchCollection matchesFound = _userCodeRegionRegExp.Matches(_originalFileContents);
        foreach(Match matchFound in matchesFound)
        {
            // a region was found. get the name of the region
            string regionName = matchFound.Groups["regionName"].Value;
            if(_userCodeRegions.ContainsKey(regionName))
            {
                // already there, skip.
                continue;
            }
            _userCodeRegions.Add(regionName, matchFound.Value);
        }
    }

That's it! It finds all regions and stores them by name in a hashtable, prior to the execution of the template.
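The other half of the feature, copying a preserved region into the newly generated code, could look roughly like the sketch below. This is not how the LLBLGen Pro template parser does it (there the region is emitted when the parser hits a region statement in the template); it is a standalone illustration that assumes the fresh output already contains the markers. `RegionPreserver`, `RestoreRegions` and the use of `Dictionary<string, string>` instead of a hashtable are all hypothetical, as is `RegexOptions.Multiline`.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegionPreserver
{
    // Same expression as in the post; Multiline is an assumption.
    private static readonly Regex RegionRegex = new Regex(
        @"^[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_START (?<regionName>\w+)\r\n" +
        @"(.*\r\n)*?" +
        @"[ \t]*('+|/{2,}) __LLBLGENPRO_USER_CODE_REGION_END",
        RegexOptions.Multiline);

    // Collects all regions by name; the first occurrence of a name wins.
    public static Dictionary<string, string> FindUserCodeRegions(string contents)
    {
        var regions = new Dictionary<string, string>();
        foreach (Match match in RegionRegex.Matches(contents))
        {
            string name = match.Groups["regionName"].Value;
            if (!regions.ContainsKey(name))
            {
                regions.Add(name, match.Value);
            }
        }
        return regions;
    }

    // Swaps every region in the fresh output for the preserved one with the
    // same name; regions without a preserved counterpart are left untouched.
    public static string RestoreRegions(string freshOutput, Dictionary<string, string> preserved)
    {
        return RegionRegex.Replace(freshOutput, match =>
        {
            string old;
            return preserved.TryGetValue(match.Groups["regionName"].Value, out old)
                ? old
                : match.Value;
        });
    }
}
```

Because the replacement comes from a match evaluator, the preserved text is inserted literally, so a `$` in user code is not interpreted as a substitution pattern.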
Moral of the story: if you have to do string searches, be sure to check out regular expressions and the .NET classes for regular expressions in the System.Text.RegularExpressions namespace. It's a little sad that the Group object doesn't have a 'Name' property, as you can give groups names in the expression itself, but that's minor.
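If you do need the group names, the Regex object itself knows them: `GetGroupNumbers()` and `GroupNameFromNumber()` can be combined to pair every group in a match with its name. A small sketch, where `GroupNameDemo` and `NamedCaptures` are made-up names:

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupNameDemo
{
    // Maps every group name of the regex to the value that group captured in the match.
    // Unnamed groups are reported under their number ("0" is the whole match).
    public static Dictionary<string, string> NamedCaptures(Regex regex, Match match)
    {
        var captures = new Dictionary<string, string>();
        foreach (int number in regex.GetGroupNumbers())
        {
            string name = regex.GroupNameFromNumber(number);
            captures[name] = match.Groups[number].Value;
        }
        return captures;
    }
}
```

For the region expression above, the resulting dictionary contains an entry keyed "regionName" next to the numbered entries for the unnamed groups.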
Oh, and before I forget: the hard part is often to write the expressions themselves. Use one of the various on-line regex tester sites, The Regulator or fire up Snippetcompiler and write a few lines to see if your expression does what it should do.