xlsgen > overview > Fuzzy string matching

Fuzzy string matching allows to compare content in cells and ignore not only the case, but also spaces as well as punctuation signs. The reason behind this is that real data may differ just by a few characters such as spaces or punctuations and it's sub-optimal that Excel regards such strings as different whereas they are really the same.

With fuzzy string matching, "AB" = "A B" = "A,B" = "a-b" and so on.

This mechanism for comparing strings is through out xlsgen, whether in cell formulas, or elsewhere such as conditional formatting formulas, autofilter expressions, sort and so on.

By default, string matching has the strict behavior, which means strings match if everything is the same except the case, so "AB" = "ab", but "AB" != "A B".

By activating the string matching algorithm with the new enumeration that is being introduced,

typedef enum
{
  [helpstring("String match, strict")]                                          stringmatch_strict       = 0,
  [helpstring("String match, ignore punctuation differences")]      stringmatch_punctuations = 1,
  [helpstring("String match, ignore punctuation and letter swaps")] stringmatch_letters      = 2
} enumStringMatch;
workbook.FuzzyStringMatch = stringmatch_punctuations;

"AB" = "ab" and "AB" = "A B".

So once the FuzzyStringMatch property is set as above, calculations whether direct or indirect involving comparison operators such as =, <=, >=, <, > and != behave accordingly. Also when you are using functions doing string matching internally such as MATCH() and LOOKUP().

 

Fuzzy string matching example for spaces and punctuations

Considering the following variants of string "AB" :


Sample data

 

Those variants differ in the addition of one or more space characters, case as well as punctuation characters as well as character substitutions which are arguably not variants of "AB".

If you are in Excel, select this range and create a conditional formatting on it where the condition to turn its background as yellow whenever the following condition applies : A1="AB", then only the first row gets highlighted because Excel does a strict match.


Strict string matching in conditional formattings : only one row is highlighted

 

If xlsgen is used to create such conditional formatting, that would yield :

xlsgen::IXlsConditionalFormattingPtr cf001s0 = worksheet->NewRange(L"R3C3:R18C3")->NewConditionalFormatting();
cf001s0->FormulaCondition->Formula = L"A1=\"AB\"";
xlsgen::IXlsStylePtr style001s0 = worksheet->NewStyle();
style001s0->Pattern->Pattern = xlsgen::pattern_solid;
style001s0->Pattern->BackgroundColor = 0xFFFF00;
cf001s0->Style = style001s0;

And the result would be the same than in Excel, for instance if you export the sheet as a PDF file to get the conditional formattings evaluated and rendered.

Now if all those variants are just regarded the same as "AB" itself, there is no reason to lose them at all. And if you are using Excel, you would have to write a custom VBA macro function whose calculation would implement fuzzy string matching. xlsgen does it for you. All you have to do is add the following line of code before the conditional formattings get evaluated :

workbook.FuzzyStringMatch = stringmatch_punctuations;

And then, something magic occurs :


Fuzzy string matching in conditional formattings : all variants are highlighted

 

All variants of "AB" get highlighted, but "AC" and other combinations aren't for not being variants after all : character "C" isn't character "B".

Since Excel does not support fuzzy string matching, the resulting file, next time it's opened in Excel, will not highlight those rows. It's therefore important to understand this context of calculations exists when xlsgen is used for calculations and rendering, not Excel.

 

Fuzzy string matching example for letters as well

What if we could match strings differing not only on spaces and punctuations but on some sort of letter combinations such as substitution, insertion and deletion? Let's see an example of that :

           
Sample data   Excel match   xlsgen Fuzzy punct   xlsgen Fuzzy letters  

In this example, we start with a number of variations of string "abcdef" in which we have a number of spaces and punctuations and also swapped letters, missing letters and additional letters.

If in Excel we create a conditional formatting meant to highlight strings matching "abcdef", only the two first rows get highlighted in Excel. And for a reason, they differ only on the case. Add any space, punctuation let alone letters and Excel sees all of these as different strings.

If you are using xlsgen and the punctations mode for fuzzy string matching, then xlsgen highlights many rows but far from all rows. Specifically, xlsgen does not highlight rows where there is at least one letter combination such as swapping, insertion or deletion.

And if you are using xlsgen with the letters mode for fuzzy string matching, then xlsgen all of a sudden highlights almost all rows. In fact, the only which doesn't is because a letter appears three times which is regarded as too different.

Here is how you do this in xlsgen,

workbook.FuzzyStringMatch = stringmatch_letters;
xlsgen::IXlsConditionalFormattingPtr cf001s0 = worksheet->NewRange(L"R3C3:R16C3")->NewConditionalFormatting();
cf001s0->CellCondition->EqualTo(L"\"abcdef\"");
xlsgen::IXlsStylePtr style001s0 = worksheet->NewStyle();
style001s0->Pattern->Pattern = xlsgen::pattern_solid;
style001s0->Pattern->BackgroundColor = 0xFFFF00;
cf001s0->Style = style001s0;

What is this good for in practice? It turns out this is really useful to have such a versatile fuzzy string matching because letter substitutions are exactly what happens with manually entered data in the real world. Indeed, often letters are simply swapped, or they are missing letters or too many letters. Combined with supporting spaces, punctuations and case, this looks like the perfect tool for getting your job done as fast and accurately as possible.

This is indeed an additional tool in your arsenal and we are looking forward to seeing customers out there taking advantage of it for improving how they handle real world data, which is often full of errors.

 

xlsgen documentation. © ARsT Design all rights reserved.