A Hardcore Declaration of Independence (cont'd)

Appendix A. The Dictionary Class

What exactly is a Dictionary? Well, according to what passes for documentation on the subject, "A Dictionary object (sic) is the equivalent of a PERL (sic) associative array." So why aren't we all using Perl instead of reading crap from someone who can't spell the name of the language and doesn't know the difference between a class and an object?

I tried to make sense out of the documentation, but alas there's no sense to be made. You should write the docs off as a total loss. They leave gaping holes in the description of the class members, and give no context to explain their purpose. The few examples are in VBScript, and are thus perfect examples of how not to write VB code. The only way you're going to figure out what the Dictionary class is all about is to learn how Perl associative arrays work. And once you learn that, you'll see that the Dictionary class is hardly an equivalent to real associative arrays--although its features are similar.

The problem is that Perl associative arrays work in the full context of the Perl language--which is optimized for text and array processing. Visual Basic has added some new Perl-like language features and tools, but they fall far short of offering Perl's power to process files and other text with a few lines of code. But once you get the idea, it's relatively easy to add new Perl-like tools that do what the built-in ones fail to do.

Note: Logically, Appendix A fits in Chapter 4 after the end of "The Collection Class" and before "What Makes a Collection Walk?" (page 196). It is located here because it is too long for the page-by-page corrections section.

Not Just a Better Collection
The first thing you should understand about the Dictionary class is that it's not just an attempt to fix the obvious problems of the Collection class (see Chapter 4, "The Collection Class," page 191). The specification for VB5 originally had a Dictionary class. Whenever I complained about the failings of the Collection class, I was told that those problems would be fixed in Dictionary, so there was no need to fix Collection. Then they ran out of time to complete Dictionary, and Collection didn't get fixed either.

So when Dictionary made its appearance in VB6, I was expecting a better Collection. Not so. There is little connection or overlap between the two classes. The people who wrote this class for VBScript apparently had something very different in mind. If you try to use a Dictionary like a Collection, you'll get nothing but frustration. But a little frustration never hurt anyone. Let's try using a Dictionary as if it were a Collection.

But first, before you even think about using Dictionary, you have to check Microsoft Scripting Runtime in the References dialog. The Dictionary class is not part of the VB6 run-time library. It's a part of the Microsoft Scripting Host that happens to be shipped with VB6. You can use it just as easily with VB5 or even with VB4. If you have Windows 98 or the Windows NT Option Pack, you have the Dictionary class (as well as the FileSystemObject class). If not, you can download the Scripting Host from http://msdn.microsoft.com/scripting/. This means that if you use Dictionary, you'll have to ship the Microsoft Scripting Runtime (SCRRUN.DLL) with your product. You can't count on all your users having it.

Let's start by putting some items in a Dictionary:

Dim animals As New Dictionary
With animals
    .Add "Lion", "fierce"
    .Add "Tiger", "cunning"
    .Add "Bear", "savage"
    .Add "Shrew", "bloodthirsty"
    .Add "Weasel", "ruthless"
End With

The syntax for Add looks kind of like Collection Add, but it's not. The first argument is the key, and the second is the item, which is the reverse of the Collection order. You'll see another difference when you iterate through a Dictionary:

Dim animal As Variant
For Each animal In animals
    Debug.Print animal
Next

This prints: Lion, Tiger, Bear, Shrew, Weasel. At first glance this might seem normal, but when you look closer, you'll see that you're printing the keys, not the items. A Collection would print: fierce, cunning, savage, bloodthirsty, ruthless. This seems a little weird. Why won't it print the items? Well, it will if you ask it the right way:

Dim character As Variant
For Each character In animals.Items
    Debug.Print character
Next

How does this work? Well, it turns out that a Dictionary has two methods that return arrays. The Keys method returns an array of all the keys, and the Items method returns an array of all the items. A Collection has only its Item method. You can't get at the keys of a Collection--one of the flaws that I flamed about in Chapter 4. Dictionary fixes that by providing a Keys method. For Each blocks iterate over the keys rather than the items because Keys is the method used by the hidden NewEnum property (see Chapter 4, "What Makes a Collection Walk," page 198). In other words this:

For Each animal In animals

Actually means:

For Each animal In animals.Keys

This seems like a weird change if you expect Dictionary to be like Collection, but the difference will eventually make sense.

You might wonder whether indexing in the Dictionary class is one-based as in the Collection class or zero-based as in the Forms or Controls collections. Dictionary has a Count property like Collection's, so maybe it's one-based:

Dim i As Long
For i = 1 To animals.Count
    Debug.Print animals(i)
Next

This code compiles and runs, but it doesn't print anything. Apparently the Dictionary class isn't one-based. In fact, it isn't indexed at all. If you look at the Locals window while you execute this code, you'll see that for each iteration a new element is added to the Dictionary with the keys 1, 2, 3, 4, and 5. The value for each is empty. Apparently any access of an Item (the default member) through a key adds that key to the Dictionary. You may not guess the intention of this behavior (and the documentation doesn't say), but it certainly isn't like a Collection. By the way, Dictionary has an Exists method so that you can check for a key first and avoid adding those empty Dictionary items.
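To see the trap in action, here's a minimal sketch (the key names are illustrative) of the phantom-key behavior and the Exists guard:

Dim animals As New Dictionary
animals.Add "Lion", "fierce"
' Merely reading a missing key silently creates it
Debug.Print animals("Unicorn")    ' Prints an empty line...
Debug.Print animals.Count         ' ...and Count is now 2, not 1
' Guard lookups with Exists to avoid phantom entries
If animals.Exists("Unicorn") Then Debug.Print animals("Unicorn")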

If you look at the Keys example in the documentation, it appears to iterate by index, but if you look more carefully you can see that the array returned by the Keys method is assigned to another array, and that array is in turn iterated by index. Why would you want to do that? Patience. We'll get to that.

Another difference from a Collection is that you can change an item with the assignment operator:

animals.Item("Lion") = "ferocious"  ' Full syntax
animals("Bear") = "wild"            ' Shortcut using default member

The reason this works with Dictionary is the same reason it ought to work with Collection. Dictionary implements Let and Set properties for its default Item property. Collection doesn't. This syntax can change an existing item, or create a new one. In other words, the Add method is redundant. I usually add new members by assignment:

animals("Mouse") = "timid"
' Same as animals.Add "Mouse", "timid"

To mention a few other differences, Dictionary has a CompareMode property with text and binary settings so that you can decide whether "Tiger" and "tiger" are the same key. It also has a RemoveAll method so that you can clear a Dictionary quickly--an operation that is surprisingly tricky with Collection. Finally, it has a Key property, which allows you to change the key of an existing item. Key deserves a special note as the first write-only property in history--or at least the first I've encountered. Of course, an indexed Property Get for Key would be meaningless, since you would have to supply the key to get the key, but I still think they should have supplied a Get to be orthogonal.
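Here's a quick sketch of those three members in action, assuming the TextCompare constant from the Scripting type library:

Dim d As New Dictionary
d.CompareMode = TextCompare    ' Must be set while the Dictionary is empty
d.Add "Tiger", "cunning"
Debug.Print d.Exists("TIGER")  ' True under text comparison
d.Key("Tiger") = "Tigress"     ' Rename an existing key with write-only Key
d.RemoveAll                    ' Clear the whole Dictionary in one call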

Given all these features, you can see that the Dictionary class is quite different from the Collection class as it now stands--although I doubt it would take much to add the same features to Collection. But knowing the Dictionary features doesn't give you much clue about what it's for. I happen to know what it's for because I spent several hours studying Perl associative arrays. It's a shame that you have to learn another language to figure out what the documentation should tell you, but that's how it is.

Associative Arrays and Dictionaries

A VBScript Dictionary object (or a Perl associative array) is simply a table of names (called keys) and values. If you take the name literally, a Dictionary data sample might look like this:

Keys        Items
a           the first letter of the English alphabet
aardvark    a large burrowing nocturnal African animal
Aaron       a brother of Moses and high priest of the Hebrews
aba         a fabric woven from the hair of camels or goats
abaca       a fiber obtained from the leafstalk of a banana

Surely, you think, there must be more to a Dictionary than that. But no. That's it. You tell it the key; it tells you the value.

In Perl, the items are always strings, but in Visual Basic items are Variants and can contain any Variant subtype including objects and, starting in version 6, UDTs. This means that a Dictionary can serve as a simple database for those occasions when a real database is overkill.

In Perl, associative arrays work closely with regular arrays, which work closely with strings. There's a whole set of operators and functions to process strings and arrays. The most useful with Dictionaries are filter functions that take some input--a string, a file name, an array--and return an array of strings (technically an array of Variants containing Strings). VB6 adds two of these, Filter and Split, but serious Dictionary work requires all kinds of filters. I had to write several others in order to illustrate a practical way of creating a useful mini-database in a Dictionary.

Here's the code to create a database of all the words in a text file with the number of times each word occurs. This code is from the TDictionary.vbp project, and it counts the words in the TDictionary.frm file:

Dim dicWordCount As New Dictionary
Dim avsWords() As Variant, i As Long, sKey As String, sSep As String
' Separate with all the symbols used in VB programming
sSep = sWhitePunct & sQuote2 & sQuote1 & "()#<>+-*/=&\:;"
' Read the words from a file into an array
avsWords = GetFileTokens("TDictionary.frm", sSep)
' Put all the words into a table, counting each
For i = 0 To UBound(avsWords)
    sKey = avsWords(i)
    ' Increment count when a word is encountered
    dicWordCount(sKey) = dicWordCount(sKey) + 1
Next

The sWhitePunct constant used here comes from the Windows API Type Library and includes white space characters and punctuation characters. The GetFileTokens function is from the Parse module in VBCore. We'll look at the code for it later.

After executing this code, you have a Dictionary full of data, but using that data for anything useful is a separate problem. For one thing the data will be unsorted. Perl documentation clearly states that associative arrays have an undefined order. VB Dictionary documentation doesn't say anything about order, but I have found by experience that the order when iterated with For Each is the order in which items were added. I wouldn't count on this order even if it were meaningful. In any case, you'll probably want to have sorted data for most reports. You might also like to apply other operations--such as filtering out all the numeric tokens.

In order to see different views of your data, you don't need to change the order or content of the Dictionary. You just need to get a different view of the keys, and then use the keys to look at the corresponding items. They tell me that this is what database programmers do with queries, but as the only Visual Basic programmer in the world who has never used a RecordSet, I wouldn't know. Here's the code to apply two filters to the keys of the word count Dictionary:

Dim vas As Variant
' Sort the keys
vas = SortKeys(dicWordCount.Keys, fDescend, cmp)
' Filter all numbers out of the list using a Like pattern
vas = FilterLike(vas, "#*", False)

The first function sorts the keys and returns them in an array. The fDescend and cmp arguments determine whether the sort is ascending or descending, and whether the sort comparisons are binary or text (vbCompareBinary or vbCompareText). These variables are linked to Descending and Text checkboxes on the test form. SortKeys is in the Sort module in VBCore.

The second filter is based on my FilterLike function from the Utility module of VBCore. FilterLike is a more powerful version of VB's new Filter function. Filter can be used to filter in or out all the strings in an array that contain or don't contain a given substring. For example, I could have applied Filter like this to get all the strings containing the substring "as" in either upper- or lowercase:

vas = Filter(vas, "as", True, vbTextCompare)

FilterLike does the same thing except that it uses the patterns recognized by Visual Basic's Like operator.
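For example, assuming the same argument order as the FilterLike call shown earlier (array, pattern, include flag), you could keep only the tokens that start with a capital letter:

' Keep tokens matching the Like pattern [A-Z]*
vas = FilterLike(vas, "[A-Z]*", True)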

I showed the two filters separately for clarity. It's actually more efficient to chain them together like this:

vas = FilterLike(SortKeys(dicWordCount.Keys, fDescend, cmp), _
                 "#*", False)

Once you have the keys you want in the order you want, it's easy to use the corresponding items:

' Output the table contents
Dim sOut As String
For i = 0 To UBound(vas)
    sKey = vas(i)
    sOut = sOut & sKey & " : " & dicWordCount(sKey) & sCrLf
Next

The items here are integers, but it's not hard to see that you could process Dictionary data full of objects or UDT variables in a similar way. The TDictionary project has another example in which file data (length, date, attributes, and so on) are stored in Variant UDTs in a Dictionary.

Incidentally, notice that we always use the Keys method to access the Items. We never access the items directly. That's probably why the internal enumerator that works with For Each iterates over the keys rather than the items. But frankly, there's not much reason to ever iterate through keys or items directly because their order is usually random.

Arrays, Strings, and Variants

All the filters shown so far--GetFileTokens, SortKeys, Filter, and FilterLike--return a Variant rather than an array of strings or even an array of Variants. This is a bad design choice, but one forced on us by the origin of the Dictionary class and the Filter and Split functions. They came from VBScript, which recognizes only one type--Variant. This is fine for VBScript, but an efficient VB6 Dictionary class would return a String array for Keys, a String for Key, a Variant array for Items, and so on.

But the more you check the actual types used by the Dictionary class, the more surprises and anomalies you'll find. Most methods and properties use Variants, even when that type seems inappropriate. My first thought was that this was a requirement for VBScript, but in fact there are a few specific types used in Dictionary. The Exists method returns a Boolean, the Count property returns a Long, and the CompareMode property returns a CompareMethod enum.

Even more surprising to me was what I found in some of the Variants. For example, the Keys method returns a Variant containing an array of Variants. If you're used to associative arrays in Perl or Dictionaries in the C++ Standard Template Library, you would probably assume that the Variants in the Keys array would contain Strings. Not so. You can actually add keys of any type--Date, Currency, whatever--and they will be stored internally in the entered type. This is a curious fact, but I don't think a very relevant one. Perhaps some reader can think of a good reason to use a type other than String for keys, but I can't. All of my filter functions assume String keys.

As an example of how to write a filter function, let's take a look at my Splits function from the Parse module:

Function Splits(sExpression As String, _
                Optional sDelimiters As String = sWhiteSpace, _
                Optional cMax As Long = -1) As Variant
    Dim sToken As String, asRet() As String, c As Long
    ' Error trap to resize on overflow
    On Error GoTo SplitResize
    ' Break into tokens and put in an array
    sToken = GetToken(sExpression, sDelimiters)
    Do While sToken <> sEmpty
        If cMax <> -1 Then If c >= cMax Then Exit Do
        asRet(c) = sToken
        c = c + 1
        sToken = GetToken(sEmpty, sDelimiters)
    Loop
    ' No tokens: return an Empty Variant rather than resizing to -1
    If c = 0 Then Exit Function
    ' Size is an estimate, so resize to counted number of tokens
    ReDim Preserve asRet(0 To c - 1)
    Splits = asRet
    Exit Function
    
SplitResize:
    ' Resize on overflow
    Const cChunk As Long = 20
    If Err.Number = eeOutOfBounds Then
        ReDim Preserve asRet(0 To c + cChunk) As String
        Resume              ' Try again
    End If
    ErrRaise Err.Number     ' Other VB error for client
End Function

First let's look at the signature. This code returns a Variant, even though the data being returned is strings. VB6 adds the ability to return arrays of any type from functions or properties, but we're ignoring that new feature. VB6 also adds the ability to assign one array to another, but we're not using that one either. You could return arrays in Variants and assign arrays to Variants in VB5, and since that's what the Dictionary does, it's what I do. This has the extra advantage of making my filter functions portable to VB5.
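For contrast, here's what a hypothetical VB6-only helper would look like using the new typed array return (the function name is made up for illustration):

' VB6 only--this signature won't compile under VB5
Function TwoBeasts() As String()
    Dim asRet(0 To 1) As String
    asRet(0) = "Lion"
    asRet(1) = "Tiger"
    TwoBeasts = asRet
End Function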

You might ask why I'm writing a Splits function instead of using the new Split function provided in VB6. At first glance, the signatures of Split and Splits look very similar. Unfortunately, Split can only "parse" using a single delimiter string rather than a list of separator characters. The "Split Sucks" section in Appendix B has a few more choice words about the uselessness of this function and its author.

The Splits code itself isn't very interesting. We've already seen the overflow resizing technique in Chapter 4, "Vectors as Resizeable Arrays," page 179, and the GetToken function in Chapter 5, "Code Review," page 253. What is interesting is how easy it is to write the GetFileTokens function once we have the Splits function:

Function GetFileTokens(sFile As String, _
                       Optional sDelimiters As String = sWhiteSpace _
                       ) As Variant
    ' File to string to token array
    GetFileTokens = Splits(MUtility.GetFileText(sFile), sDelimiters)
End Function

Here is a list of filter functions from VB and VBCore:

Function        Module    Description
GetFileTokens   Parse     Splits the contents of a text file into a token array
GetFileLines    Utility   Splits the contents of a text file into a line array
FileArray       Utility   Returns array of file names matching a file spec
Splits          Parse     Splits a string into a token array
QSplits         Parse     Splits a string into a quoted token array
Split           VBA       Fails to split a string into a token array
Filter          VBA       Filters elements in or out of array using substring compares
FilterLike      Utility   Filters elements in or out of array using the Like operator
SortKeys        Sort      Returns a sorted version of array
ShuffleKeys     Sort      Returns a shuffled version of array
ReverseKeys     Utility   Returns a reversed version of array

VB5 doesn't have the Filter and Split functions--or at least it didn't until now. They're defined in a conditional block (#If iVBVer <= 5) in the new VB6Funcs.bas module. In fact, VB5 users get a Split function that actually works. It uses code similar to the Splits function shown earlier. The new Join function is also there, as is an inefficient implementation of InStrRev.

The new array features in VB6 raise some interesting questions about the relationship between arrays and variants. Consider the following declarations:

Dim avsT() As Variant, asT() As String
Dim vasT As Variant, vavT As Variant
ReDim avsT(0 To 2) As Variant   ' Array of variants containing strings
ReDim asT(0 To 2) As String     ' Array of strings
ReDim vasT(0 To 2) As String    ' Variant containing array of strings
ReDim vavT(0 To 2) As Variant   ' Variant containing array of variants

They might all contain exactly the same data, but does that mean that any one can be assigned to any other? No, a test shows that some combinations are illegal:

'avsT = asT  ' Can't assign string array to variant array
'avsT = vasT ' Can't assign variant with string array to variant array
'asT = avsT  ' Can't assign variant array to string array
'asT = vavT  ' Can't assign variant with variant array to string array

When you think about this for a minute, the results make sense. You can't assign Variant arrays to String arrays or vice versa. But you can assign any kind of array to a Variant, because the Variant automatically resizes to accommodate it regardless of the prior contents. You can also assign one String or Variant array to another of the same kind, even if the array being assigned is stored in a Variant. This gets confusing, but only when assigning to arrays. If you assign to Variants you're always OK.
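A sketch of the legal combinations, following the rules just stated:

Dim asA() As String, asB() As String, v As Variant
ReDim asA(0 To 2) As String
asB = asA   ' String array to String array--legal in VB6
v = asA     ' Any array to a Variant--always legal
asB = v     ' Variant containing a String array to a String array--legal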

VB5 has even more limitations. You can't assign any array to another array, but you can assign arrays to Variants. The Sort module has some inefficient VB5 hacks in conditional blocks to work around these limitations.

What does all this unnecessary Variant conversion do to performance? Nothing good. I added a test to the TimeIt application to compare different kinds of array returns. Here are the results:

Return array of strings from function: 0.075 sec
Return array of variants from function: 0.0793 sec
Return variant with array of strings from function: 0.2534 sec
Return variant with array of variants from function: 0.2235 sec

The first shows how fast a typical filter operation would be if the Keys method returned an array of strings, as I think it should. The last shows the same filter operation when returning an array of variants in a variant, as the Keys method does now. The first operation is about three times faster. We're paying a heavy price for using a Dictionary component that was designed for VBScript rather than for VB.

Appendix B. VB Strings and Pointer Arithmetic

One of my favorite songs from the days when I watched Sesame Street with my daughter went like this:

Mama don't allow no guitar playing round here. 
Mama don't allow no guitar playing round here. 
We don't care what mama don't allow, 
we're gonna play guitar anyhow.
Mama don't allow no guitar playing round here. 

And that was the cue for the guitar solo--played dramatically by a big blue muppet. The next verse continued in the same vein:

Mama don't allow no bass playing round here...

Well, it turned out that Mama didn't much care for drums or harmonica or piano, but she had to put up with them anyway. The final verse went like this:

Mama don't allow no pointer arithmetic in Visual Basic...

Hmmm. I think it's about time for our solo. But before we start breaking rules, let me set the stage by talking about Visual Basic's String type.

Basic has always been known for its easy and powerful string handling. Unfortunately, the Visual Basic String type is showing its age. While processing strings in Visual Basic is still easier than in C, it doesn't have any advantages over competing languages like Delphi and C++. Although Visual Basic is becoming more object-oriented, its string capabilities remain purely function-based--unlike the string classes in Java and C++. Finally, the performance of Visual Basic strings is not particularly good, even with the native code compilation introduced in VB5 and continued in VB6.

This appendix takes an inside look at Visual Basic strings. By learning how the String type works, we can see why some operations are slow, and how we can speed them up by breaking the rules.

How String Data is Stored
If you do a hex dump of an EXE file created by Visual Basic, you're likely to come to a section of data that looks something like this:

00003950  00 00 00 00 4D 00 6F 00  76 00 65 00 00 00 00 00  ....M.o.v.e.....
00003960  52 00 65 00 66 00 72 00  65 00 73 00 68 00 00 00  R.e.f.r.e.s.h...
00003970  53 00 65 00 74 00 46 00  6F 00 63 00 75 00 73 00  S.e.t.F.o.c.u.s.
00003980  00 00 00 00 5A 00 4F 00  72 00 64 00 65 00 72 00  ....Z.O.r.d.e.r.

Looks like a whole lot of zeros going nowhere. If you wrote the code, you'll recognize it as your string data. All those extra zeros are there because your data is stored in Unicode format. When talking about Unicode, I like to quote the late David Kruglinski--author of Inside Visual C++, Microsoft Press. He said, "Microsoft took away segments and gave us Unicode." So even though 32-bit programming gets rid of all those stupid data size limitations, it gives us a new problem that often seems just as arbitrary and senseless.

I believe that someday Unicode will be worth the pain. We'll have one sensible character format that works for all the languages of the world. But that day is not now. In fact with Visual Basic 5 we have a very ironic situation. Even though strings are stored in Unicode format, you can't actually use Unicode as your character format for internationalization.

For now, let's just think about the different ways that Visual Basic and Windows store strings. The LPSTR, LPWSTR, and BSTR formats are shown in Chapter 2, "Strings Inside Out," page 73. We're going to keep coming back to these three formats throughout the rest of this discussion.

Toggling Case
I first got involved in using APIs to optimize string operations when I needed to search backward within a string. Although the InStrRev function added in VB6 makes this problem irrelevant, we're going to look at it anyway later on. But not yet. Any string searching algorithm other than brute force would be complicated and difficult quite apart from its use of strings. So let's start out with a simpler problem--one that will illustrate most of the optimization issues and solutions without confusing us.

We'll start by toggling the case of every character in a string. This isn't a particularly realistic problem. For one thing, we're much more likely to lowercase or uppercase a string, but Visual Basic already provides the LCase and UCase functions. For another, if we did want to toggle case, we probably wouldn't care about performance. Nevertheless we're going to test the performance of the TCase function. I actually use a TCase function written in another language for a professional text editor program because it's easier to remember one toggle command than to remember separate uppercase and lowercase commands. The toggle does what I want 95 percent of the time, and I can adjust the other 5 percent of the time. So we'll pretend that the Visual Basic IDE has keyboard macros like every other programmer's editor in the world and that we're writing TCase as a VB editor command.

Our first version is simple and obvious:

Function TCase1(ByVal s As String) As String
    Dim i As Long, ch As Integer
    For i = 1 To Len(s)
        ch = Asc(Mid$(s, i, 1))
        Select Case ch
        Case 65 To 90       ' A to Z
            ch = ch + 32
        Case 97 To 122      ' a to z
            ch = ch - 32
        ' Ignore non-characters
        End Select
        Mid$(s, i, 1) = Chr$(ch)
    Next
    TCase1 = s
End Function

We loop through each character of the string, extract it, change its case, and reinsert it at the same position. When we've processed every character, we return the modified string. Notice that we pass the string by value so that we're processing a temporary copy, not the original.

You can see this function and all the related TCase functions in the Test Strings project (TString.vbp). The program provides a means to test toggling of long and short strings, alphanumeric strings that are alpha-heavy and ones that are numeric-heavy, and strings in different languages.

At this point we could start asking optimization questions and start making minor changes. For example, we could ask whether a Select Case statement is a faster way to identify modifiable characters than an If Then block. We could ask whether receiving a string by value is faster than assigning to a temporary variable inside the function. But before we get into micro optimization, let's take a closer look at what's going on behind the scenes.

Programming With B--
In Chapter 10, "What You Don't Know," page 561, I invented a language called B-- to help illustrate all the wonderful things Visual Basic takes care of for us so that we can worry about programming problems. B-- has the syntax of Basic, but it lacks the high-level language features we've come to expect. For example, it doesn't know about automatic memory management or the Component Object Model (COM). Earlier I used B-- to show how tough programming would be if we had to create and manage objects by calling QueryInterface, AddRef, and Release ourselves. Here I'm going to use B-- to show what string allocation would look like if Visual Basic didn't manage it for us.

Here's how the TCase1 function shown above would be written in B--. The original Visual Basic lines being replaced are shown in comments for comparison. Since uncivilized languages like C++ and B-- don't know about the String type, they must use the underlying BSTR type and the BSTR system functions. There are nine of these functions, but for our purposes we're only going to use three--SysStringLen (returns string length), SysAllocStringLen (allocates a string of given length), and SysFreeString (frees an allocated string). The code looks something like this:

Function TCase1(ByVal s As BSTR) As BSTR
    Dim i As Long, ch As Integer, sT As BSTR
    ' For i = 1 To Len(s)
    For i = 1 To SysStringLen(s)
 
        ' ch = Asc(Mid$(s, i, 1))
        sT = SysAllocStringLen("", 1)
        sT = Mid$(s, i, 1)
        ch = Asc(sT)
        SysFreeString sT
 
        Select Case ch
        Case 65 To 90       ' A to Z
            ch = ch + 32
        Case 97 To 122      ' a to z
            ch = ch - 32
        ' Ignore non-characters
        End Select
 
        ' Mid$(s, i, 1) = Chr$(ch)
        sT = SysAllocStringLen("", 1)
        sT = Chr$(ch)
        Mid$(s, i, 1) = sT
        SysFreeString sT
 
    Next
    TCase1 = s
End Function

This is speculative code. If you compile TCase1 to binary code with symbolic information and no optimization, you can load it into a C debugger (such as Microsoft's Visual Studio) and step through the code. You'll see that this isn't exactly what happens. Instead Visual Basic calls internal functions such as __vbaLenBstr and __vbaFreeStr. But trust me. I've written similar code in C++, and this is essentially what happens behind the scenes. Besides, as the inventor and only user of B-- (with no compiler to check up on me), I'm entitled to a little poetic license.

You don't have to know much about memory allocation to figure out that this is really bad code. All we want is to test and modify some characters. We don't even need to extract them. We ought to be able to make the changes in place. If you look back at the hex dump showing how the String data is stored, you can see how tantalizingly close those characters are. We want to just reach out and touch them, but Visual Basic doesn't provide any syntax for that. Instead we have to create temporary strings just to hold one character. We wouldn't need those temporary strings in any other language. C and Delphi, for example, allow you to index into a string and touch a character just as if it were in an array rather than a string. In fact, that's what a string is in those languages: a character array.

This discovery has wide implications. Any algorithm that requires processing characters one at a time is going to hit the same problem. For example, the best-known string search algorithm is called the Boyer-Moore algorithm. But this algorithm depends on a character table and intelligent analysis of potentially matching characters--not necessarily compared in the order in which they appear. It's not difficult to figure out that any efficiencies from the algorithm would be eaten up by the cost of allocating and freeing temporary strings just to check one character. A dumb "brute force" search algorithm would probably come out faster than any smart algorithm.

So what are we going to do about Visual Basic's character processing problems? Before we get to that I want to pose a completely different question about our case toggling algorithm so far. What will TCase1 do if you pass the following string: "Sobre Visual Basic hay muchos libros, pero sólo conozco uno que pueda servirte para ir un paso más allá de lo aprendido aquí." You end up with "sOBRE vISUAL bASIC HAY MUCHOS LIBROS, PERO SóLO CONOZCO UNO QUE PUEDA SERVIRTE PARA IR UN PASO MáS ALLá DE LO APRENDIDO AQUí." Notice that the accented characters don't get uppercased.

By the way, this sentence comes from Manual Imprescindible De Visual Basic 5 (The Indispensable VB5 Book), Anaya Multimedia, 1997, by hardcore programmer Juan Diego Guiterrez Gallardo. In the back of his book, Juan Diego gives a very complimentary plug for my book. According to my Spanish translator (the same daughter I used to watch Sesame Street with), the sentence means "There are many books about Visual Basic, but only one that can serve you to step beyond what can be learned here." He goes on to identify my book with some phrases too flowery to repeat without looking even more like a shameless promoter than I do already.

Make It Right Before You Make It Fast

I've quoted Joe Hacker's favorite saying in other contexts, but let me repeat it here: "It doesn't matter how fast your code is if it doesn't work."

That's what we've got so far. The TCase1 function works for me as an American. It might work for you if you're English, Irish, Scottish, or Australian. But it's not much use if you speak Spanish, Danish, German, or Russian. And I hate to even think of what it might do in Arabic, Japanese, Chinese, or Sumerian. As far as I know, there may not even be such a thing as case in those languages.

So before we make it fast, we're going to have to make it right. That means correcting two bugs. The first bug is that we don't take national language issues into consideration when we test the case of a character. The second is that we don't consider national language when changing case. The second problem is easy. Visual Basic provides the UCase$ and LCase$ functions for changing case. These are smart functions that follow the rules and work correctly regardless of the language.

Unfortunately, Visual Basic doesn't have a solution to the first bug. Many languages have functions for testing character type. C, for example, has a variety of character analysis functions including isalpha, isdigit, isspace, ispunct, islower, and isupper. The closest Visual Basic has to any of these is IsNumeric. Fortunately, the designers of Windows anticipated that some languages might have incomplete libraries, and provided API functions to do whatever we want: IsCharUpper and IsCharLower.
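For comparison, here's what the same test-and-toggle looks like in portable C using the character classification functions just mentioned. This is a sketch only; the C ctype functions are locale-dependent, while the IsCharUpper and IsCharLower API functions follow the Windows national language rules:

```c
#include <ctype.h>

/* Toggle the case of each character in place, C style: classify
   with isupper/islower, convert with tolower/toupper. Characters
   with no case (digits, punctuation, spaces) pass through untouched. */
void toggle_case(char *s)
{
    for (; *s != '\0'; s++) {
        unsigned char ch = (unsigned char)*s;
        if (isupper(ch))
            *s = (char)tolower(ch);
        else if (islower(ch))
            *s = (char)toupper(ch);
    }
}
```

Calling toggle_case on "Make things simpler than possible" yields "mAKE THINGS SIMPLER THAN POSSIBLE". No temporary strings, no allocation: the characters are modified right where they sit.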

Here's how I handle the code in TCase2:

Function TCase2(ByVal s As String) As String
    Dim i As Long, sCh As String
    For i = 1 To Len(s)
        sCh = Mid$(s, i, 1)
        If IsCharUpper(Asc(sCh)) Then
            sCh = LCase$(sCh)
        Else
            sCh = UCase$(sCh)
        End If
        Mid$(s, i, 1) = sCh
    Next
    TCase2 = s
End Function

The complication here is that UCase$ and LCase$ work on strings, but IsCharUpper and IsCharLower work on characters. You have to use the Asc function to convert the character string into a character (of Integer size) before you can test it.

Keep in mind that there are actually two versions of IsCharUpper--IsCharUpperA (the ANSI version) and IsCharUpperW (the Unicode version). In this case we're using the ANSI version with a type library entry equivalent to the following Declare statement:

Declare Function IsCharUpper Lib "user32" Alias "IsCharUpperA" ( _
    ByVal cChar As Byte) As Long

I'd like to say that this code is definitely right, but unfortunately I have to qualify that claim. It wouldn't work for double-byte character languages such as Japanese. In those languages we would have to check to see if the character is a lead byte, and if it was, we'd have to skip to the next character before testing the case--again assuming that Japanese has uppercase ideograms or whatever they're called. I don't know enough about internationalization issues to pursue this line of inquiry further. Just keep in mind that the issues hang over our heads, even as we ignore them and pretend for the duration of this discussion that all languages work like European languages.

So assuming this code is right, let's see how fast it is. The String Test application produces the following timings for TCase1 and TCase2:

TCase1 time: 321
TCase2 time: 410

Well isn't that special? Do the right thing and end up almost 30 percent slower. Back to the drawing board.

Do It Faster with the API
If we're making one API call to test the case, why not make another to change the case? It turns out that there are API functions for changing case: CharLower, CharUpper, CharLowerBuff, and CharUpperBuff. The plain versions work on null-terminated strings while the Buff versions work on length-specified substrings within strings. Since we're going to be changing one character at a time, we'll need CharLowerBuff and CharUpperBuff.

Like most string-oriented API functions, these come in ANSI and Unicode versions. As we saw earlier, Visual Basic strings are stored in Unicode format, so why bother with ANSI versions that will convert back and forth between the two formats? If we use Unicode all the way, we can skip the translation. (If you have doubts about this strategy, hold your peace.)

One problem with Unicode APIs is that Visual Basic doesn't provide any way to write a Unicode Declare statement. I was hoping this stupid limitation would disappear in VB6, but no such luck. You can use a type library to work around the problem, but that wouldn't do us any good in this case because we're not going to be accessing whole strings. We're going to be accessing individual characters within strings. And the only way to do that is with pointers. But everybody knows Visual Basic doesn't allow pointer arithmetic.

Well, as I said at the beginning, we don't care what Mama don't allow. Here's the code:

Function TCase3(ByVal s As String) As String
    Dim i As Long, ch As Integer
    For i = 0 To Len(s) - 1
        ' Extract a Unicode character from current position
        CopyMemory ch, ByVal StrPtr(s) + (i * 2), 2
        ' Test case of character 
        If IsCharUpperW(ch) Then
            ' Change case of character at current position
            CharLowerBuffW StrPtr(s) + (i * 2), 1
        Else
            CharUpperBuffW StrPtr(s) + (i * 2), 1
        End If
    Next
    TCase3 = s
End Function

Dereferencing Pointers
We'll start by reviewing the CopyMemory statement that extracts a character from the string. Here's the Declare:

Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" ( _
    lpvDest As Any, lpvSource As Any, ByVal cbCopy As Long)

See the sidebar "CopyMemory: A Strange and Terrible Saga," Chapter 2, page 90, for the whole story. For now, all we care about is that you tell it where you want to move to, where you want to move from, and how many bytes to move. In this case we want to move a two-byte Unicode character from a variable location in a Unicode string into an Integer variable called ch.

Notice that CopyMemory is declared to take its first two arguments by reference. This gives us maximum flexibility because we can always override a ByRef default in the Declare by specifying ByVal in the call (but we couldn't override ByVal with ByRef). By reference means that we will use the address of the variable given rather than the value in the variable. For those who know C, a ByRef argument is equivalent to using the C address-of (&) operator.

Consider this statement:

CopyMemory ch, ByVal StrPtr(s) + (i * 2), 2

The first argument is the destination--the address that will be moved to. If putting the destination first and the source second seems strange, you're probably not an assembly language programmer. In any case, the first argument says that the source value should be moved to the address of the ch variable. In other words, it's an assignment to ch.

In the second argument the default ByRef is overridden with the ByVal keyword. This means that we're going to pass a value, not the address of a variable. CopyMemory expects its second argument to be an address, so we had better make sure that the value we pass is actually an address. An address passed as a value is called a pointer--that is, it's a value that holds the address of a variable. The undocumented StrPtr function returns a pointer to the Unicode string in its argument.

To that we add a calculation of the offset of the current character. The first time through the loop the second argument is a pointer to the first character of the string--StrPtr(s)--plus an offset of 0 (0 * 2 is 0). The second time through the argument is a pointer to the first character plus 2 (1 * 2 is 2)--which is the second character. And so on to the end.

The third argument is passed by value and is always two--the width in bytes of a Unicode character.
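In C the same extraction is explicit pointer-and-offset arithmetic. This hypothetical helper (the name char_at is mine) mirrors the CopyMemory call exactly: take the base address, add i * 2 bytes, and copy two bytes into a 16-bit variable, with memcpy standing in for RtlMoveMemory:

```c
#include <string.h>

/* Fetch the i-th 16-bit character from a wide string, the same way
   CopyMemory ch, ByVal StrPtr(s) + (i * 2), 2 does it. */
unsigned short char_at(const unsigned short *s, size_t i)
{
    unsigned short ch;
    /* base pointer + (i * 2) bytes, copy 2 bytes */
    memcpy(&ch, (const char *)s + (i * 2), 2);
    return ch;
}
```

So char_at(s, 0) returns the first character, char_at(s, 1) the second, and so on--each fetch being exactly the base-plus-offset computation described above.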

What we have done is called dereferencing a pointer. With the public introduction of AddressOf and the secret introduction of VarPtr, StrPtr, and ObjPtr, VB5 officially became a pointer language. And not just any pointer language. It's the worst pointer language ever created--the only one that doesn't support dereferencing. For example, in C we would dereference this pointer with the * operator like this:

ch = *s;

This is equivalent to the following VB statement:

CopyMemory ch, ByVal StrPtr(s), 2

In my opinion VB should add features that let you do the things you do with pointers in C without pointers. But it's too late for that. If they're going to add undocumented pointer functions like StrPtr, why not also add undocumented pointer dereferencing functions such as PtrToChar, PtrToStr, and PtrToLong?

Once you've got dereferencing down, the rest of the code is easy. First you test the case of the extracted character with IsCharUpperW. Then you change the current character with CharLowerBuffW or CharUpperBuffW. The Declares look like this:

Declare Function CharLowerBuffW Lib "user32" ( _
    ByVal lpsz As Long, ByVal cchLength As Long) As Long

Normally the first argument would be declared as a ByVal String, but since this is a Unicode function and since we're going to be accessing the middle of the string, we have to pass a pointer by value and make sure we pass the value as an address from StrPtr.

The statement

CharLowerBuffW StrPtr(s) + (i * 2), 1

works the same as the CopyMemory statement we saw earlier except that CharLowerBuffW deals with characters rather than bytes. When we tell it to process one character, it knows we really want two bytes.

We've jumped through the hoops as requested. Let's check the payoff. Here are the timings from the test program:

TCase1 time: 321
TCase2 time: 410
TCase3 time: 140

That's better. These timings were done on Windows NT and they show a 293 percent increase from TCase1. And it's even faster on Windows 98. My timings show it to be 433 percent faster.

Wait a minute! How can it be 293 percent faster on Windows NT and 433 percent faster on Windows 98? Remember what Joe Hacker says? "It doesn't matter how fast your code is if it doesn't work." And this doesn't work. The test program shows a faster speed, but it also shows that the case of the string doesn't get toggled. The reason is simple. Under Windows 98 the code for IsCharUpperW and CharUpperBuffW is stubbed out. If Windows 98 were written in Visual Basic, the implementation of CharUpperBuffW would look like this:

Function CharUpperBuffW(s As String, c As Long) As Long
End Function

A Short Flame About Windows 95 and 98
I try to be reasonable and positive about Microsoft. I own some of their stock. I want them to do the right thing and be successful. And that's why I have to flame at the C bigots who made this idiotic decision when designing Windows 95. They made a strategic decision not to do Unicode in Windows 95, and I can respect that and acknowledge the tradeoffs that went into the choice.

Supporting Unicode for every function in the general purpose API that happened to have a string argument would have taken a huge commitment of resources and a lot of extra processing down in the bowels of the operating system. I know why they chose not to support Unicode for miscellaneous API functions like GetCurrentDirectory and SetWindowText. But that's a very different decision than deciding not to support Unicode in utility functions like IsCharUpper, CharLowerBuff, and even lstrcpy. It's not as if they didn't know how. They did give ANSI and Unicode implementations of lstrlen in Windows 95, and they could have done it for other utility functions.

The designers of Windows 95 knew that they had to support COM. They knew that all COM-compliant strings are Unicode. They had access (from Windows NT) to source for general Unicode support functions. They knew that these are independent functions that don't require major changes throughout the operating system. In other words, they knew better. What they didn't seem to know is that some people program Windows in languages other than C and C++. In C-based languages if the operating system doesn't support some operation, the C library probably does. But Visual Basic programmers depend on the operating system to provide essential functionality. In this case, the operating system let us down. Probably this decision was made by the same person who created the CopyMemory/RtlMoveMemory mess.

As if the original crime weren't bad enough, they've repeated it for Windows 98. The decision not to provide general Unicode support might have been justifiable then, but it's extremely questionable now. The decision not to support Unicode in utility and national language functions is even more debatable--especially since Microsoft has proved in another context how easy it would have been. We'll get to that other context shortly.

What this means is that we must accept that maximum performance in portable API string operations is just out of our reach. We'll have to either assume all our customers have Windows NT (unlikely) or we'll have to deal with the ANSI version of API functions. That choice has a cost, not only in speed, but in correctness for all languages. Since this chapter isn't intended as a discussion of internationalization, we're not going to talk about string processing for languages that have double-byte characters except to note that it shouldn't be a problem. When TCase3 works, it works for any language you throw at it. Too bad we can't just stop here.

A First Try With ANSI APIs
In theory we ought to be able to do exactly the same thing with ANSI functions that we just did with Unicode functions. In reality, it's not quite the same. Let's look at the code before we discuss the problems.

Function TCase4(ByVal s As String) As String
    Dim i As Long, ch As Byte, c As Long
    For i = 0 To Len(s) - 1
        CopyMemory ch, ByVal StrPtr(s) + (i * 2), 1
        If IsCharUpper(ch) Then
            CharLowerBuffPtr StrPtr(s) + (i * 2), 1
        Else
            CharUpperBuffPtr StrPtr(s) + (i * 2), 1
        End If
    Next
    TCase4 = s
End Function

This looks a lot like the last version, but notice that the ch variable is a Byte rather than an Integer. The CopyMemory statement copies one character rather than two. But we know that StrPtr is returning the address of the actual Unicode characters being copied. How can we chop off half of a Unicode character and expect to get the right result? Well, we can't. Not always. This code certainly won't work for Unicode character 823 or 1458. It will work for most of the first 256 Unicode characters because they are the same as the first 256 characters for most European code pages. The other half of the character is zero, so there's no need to copy it over.

There are exceptions to the rule. Try the following code:

For i = 0 To 255
    Debug.Print i, Hex$(AscW(Chr$(i))) 
Next

Results may vary depending on your regional settings, but you'll find that some of the codes in the upper half (including 145 to 153 on all the machines I've tried) don't have zeros in the upper half. Fortunately, the upper half is zero for the codes that represent the accented characters frequently used in text in European countries. Still, any technique based on the assumption that Unicode characters are always half zero has a chance of failing for some languages.
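To see why those codes break the "upper half is zero" assumption, here's a sketch in C of a few Windows-1252 codes in the 145-153 range and the Unicode code points they actually map to. The function name cp1252_sample is mine, and the lookup covers only the handful of codes shown; it's an illustration, not a complete code-page table:

```c
/* A sample of Windows-1252 codes from the 145-153 range and the
   Unicode code points they map to (smart quotes, dashes, trademark
   sign). Every one has a nonzero high byte, so chopping the Unicode
   character in half loses data. */
unsigned short cp1252_sample(unsigned char ansi)
{
    switch (ansi) {
    case 145: return 0x2018;  /* left single quotation mark  */
    case 146: return 0x2019;  /* right single quotation mark */
    case 147: return 0x201C;  /* left double quotation mark  */
    case 148: return 0x201D;  /* right double quotation mark */
    case 150: return 0x2013;  /* en dash                     */
    case 151: return 0x2014;  /* em dash                     */
    case 153: return 0x2122;  /* trade mark sign             */
    default:  return 0;       /* not in this sample          */
    }
}
```

Every value returned is greater than 0xFF, which is exactly why the chop-off-the-high-byte trick fails for those characters.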

Now let's look at the second part of the processing:

CharUpperBuffPtr StrPtr(s) + (i * 2), 1

This statement is using a specialized API statement that exists in the ANSI Windows API Type Library and is equivalent to this Declare statement:

Declare Function CharUpperBuffPtr Lib "user32" _
    Alias "CharUpperBuffA" (ByVal lpsz As Long, _
    ByVal cchLength As Long) As Long

The key to writing Declare statements is to understand that Windows has an unvarying and non-negotiable expectation about the arguments it will receive. In this case the CharUpperBuffA function expects its first argument to be the address of a string of ANSI characters. Fortunately, there are numerous ways that Visual Basic can satisfy Windows' request, depending on how you write the Declare. This alias satisfies the request by requiring that the address be passed by value as a Long. Since the ByVal is in the Declare, there's no need to put it in the call.

The pointer arithmetic works for this statement as long as the Unicode character being pointed to is only 8 bits. As far as the ANSI version of the function knows, each character is only 8 bits wide. Since we're only looking at one character at a time, it works. But what if we looked at more than one character? Well, it would probably still work because "AbC" would be processed as A, null, b, null, C, null. Changing a null to uppercase has no effect--except wasted processing time. You would also have to double the buffer length passed in the second parameter.

Even if this technique works, it looks awkward and fragile. Let's check the performance:

TCase1 time: 321
TCase2 time: 410
TCase3 time: 140
TCase4 time: 220

Well, it's not the worst possible algorithm, but it is slower than the correct (but Windows-95-hostile) code in TCase3. This is discouraging. We're getting slower and less correct. Maybe it's time to try something completely different.

Working On Byte Arrays Rather Than Strings
An ANSI string is in reality just an array of bytes. So why not look at it as an array of bytes? Here's a new version of TCase that does just that:

Function TCase5(s As String) As String
    Dim i As Long, ab() As Byte, ch As Byte
    ab = StrConv(s, vbFromUnicode)
    For i = 0 To Len(s) - 1
        If IsCharUpper(ab(i)) Then
            CharLowerBuffB ab(i), 1
        Else
            CharUpperBuffB ab(i), 1
        End If
    Next
    TCase5 = StrConv(ab, vbUnicode)
End Function

The first thing to notice about this function is that it passes its string by reference. Passing a string by reference is faster--but it's wrong if you modify the original string unintentionally. In this case we'll be copying to a temporary byte array and working on that rather than on the original string. Unlike other arrays, byte arrays have a special relationship with strings. You can use the assignment operator to copy between the two types. For example:

ab = s   ' Copy string to byte array
s = ab   ' Copy byte array to string

These are legal statements, but of course the result is a byte array with a zero in every other byte. In this case we wanted a byte array consisting of the original Unicode converted to ANSI. That's what StrConv does in the code above.
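What StrConv does with vbFromUnicode and vbUnicode can be sketched in C as a narrow-then-widen round trip. These hypothetical helpers (the names are mine) are only safe when every character's high byte is zero--which is exactly the assumption discussed earlier:

```c
#include <stddef.h>

/* Like StrConv(s, vbFromUnicode): keep the low byte of each 16-bit
   character. Lossy whenever the high byte is nonzero. */
void narrow_chars(const unsigned short *src, unsigned char *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = (unsigned char)src[i];
}

/* Like StrConv(ab, vbUnicode): widen each byte back to 16 bits
   with a zero high byte. */
void widen_chars(const unsigned char *src, unsigned short *dst, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}
```

For plain ASCII text the round trip is lossless; for a character like the trademark sign (Unicode 0x2122) the narrow step would throw away the high byte and the round trip would return the wrong character.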

With a byte array, it's a simple matter to pass each character to IsCharUpper. Passing to CharUpperBuff is a little trickier. We need the following Declare statement (or a type library equivalent):

Declare Function CharUpperBuffB Lib "user32" _
    Alias "CharUpperBuffA" (lpsz As Byte, _
    ByVal cchLength As Long) As Long

Windows expects a pointer to the character data, and this Declare provides one as a ByRef Byte parameter. If we pass the current byte of the array by reference, Windows will see the address of that byte.

Once we've processed all the bytes, we have to convert back to a string. Processing should be fast here, but what's it going to cost to convert the strings back and forth between ANSI and Unicode?

TCase1 time: 321
TCase2 time: 410
TCase3 time: 140
TCase4 time: 220
TCase5 time: 251

At first glance this appears to be one more step backward, but think again. On a small string converting to ANSI and back to Unicode is going to be a significant part of the processing time. On a large string the conversion will take a much smaller percentage of the processing time. You can see this by running the test program on a much larger string.

TCase1 time: 2364
TCase2 time: 3064
TCase3 time: 982
TCase4 time: 1613
TCase5 time: 1502

This is much better. It's still not as fast as the Unicode version in TCase3, but it's better than any of the others. Of course speed isn't everything. This version takes a lot more memory. We don't really want to waste time or data space creating a temporary target string rather than working on the real thing. Imagine the wasted memory if the string were a one megabyte text file read from a disk.

Forget the String, Just Use the Byte Array
One way to avoid all the unnecessary conversion is to use Byte arrays all along. If you're reading a string from a file, read it directly into a byte array rather than a string. Of course this isn't always feasible. If the data comes from a TextBox or other control, it's going to be a String whether you like it or not. But you may be able to pass it around as a byte array for quite a while before you have to assign it to something that expects a string.

If you plan on doing this, it's not realistic to include the string conversion in the cost of the function.

What you really want is a TCase function that works directly on Byte arrays. You might want to do this in either of two ways. First, here's a Sub that processes a byte array in place:

Sub TCase6(ab() As Byte)
    Dim i As Long
    For i = LBound(ab) To UBound(ab)
        If IsCharUpper(ab(i)) Then
            CharLowerBuffB ab(i), 1
        Else
            CharUpperBuffB ab(i), 1
        End If
    Next
End Sub

VB6 allows you to return a byte array. Here's a Function version that returns a modified copy of an array in a function:

Function TCase7(ab() As Byte) As Byte()
    Dim i As Long, abRet() As Byte
    ' Copy the input array to a temporary value and process
    abRet = ab
    For i = LBound(abRet) To UBound(abRet)
        If IsCharUpper(abRet(i)) Then
            CharLowerBuffB abRet(i), 1
        Else
            CharUpperBuffB abRet(i), 1
        End If
    Next
    ' Return temporary
    TCase7 = abRet
End Function

VB won't let you pass an array by value, so you have to create a separate temporary array to hold the return value. That way you won't modify the input array.

These two perform respectably, although still not as well as the NT-only TCase3.

TCase1 time: 321
TCase2 time: 410
TCase3 time: 140
TCase4 time: 220
TCase5 time: 251
TCase6 time: 200
TCase7 time: 221

The byte array versions won't work for every situation, but sometimes one of them is a perfect fit.

What About the Algorithm?

So far we've been experimenting with different ways of testing and changing the characters. But what about the algorithm itself? You may have noticed that we always change the case--even if the character is numeric or punctuation. Of course CharUpperBuff isn't going to change the case of a comma or a zero. That's impossible by definition. But we could save a function call if we already knew that a particular character had no case. On the other hand, we'd have to do two tests instead of one. It's a tradeoff, and the results depend on the internals of IsCharUpper/Lower and CharUpper/LowerBuff. There's only one way to find out. Here's a slightly modified version of TCase6:

Sub TCase8(ab() As Byte)
    Dim i As Long
    For i = LBound(ab) To UBound(ab)
        If IsCharUpper(ab(i)) Then
            CharLowerBuffB ab(i), 1
        ElseIf IsCharLower(ab(i)) Then
            CharUpperBuffB ab(i), 1
        End If
    Next
End Sub

Here are the comparative timings:

TCase6 time: 200
TCase8 time: 210

We're going backward again, but if you look at the test program it's easy to see why. The initial test string--"Make things simpler than possible"--is all alphabetic. We're adding a test, but we're still changing the case for every character. To get a benefit from this change we'd have to test with a string containing many non-alphabetic characters. For example, with the string "See 3,987,654,456,429 bugs" we get these results:

TCase6 time: 160
TCase8 time: 121

The choice depends on the circumstances. In most cases where we want to toggle case, the text to be modified would be mostly alphabetic. The cost of a few unnecessary case changes would be worth it if we could avoid an extra case test on every character. So we'd probably stick with the first algorithm.

The point is, you have to consider the data you're most likely to be processing to determine the best way of processing it. Perhaps this lesson doesn't matter much when toggling case, but then toggling case isn't a very typical problem, which is what I said right up front. Let's get back to the original problem--searching strings backwards.

Using the SHLWAPI Library

Before we waste a lot of time figuring out search algorithms, let's try stealing some that are already available--or at least sort of available. Internet Explorer comes with a DLL called SHLWAPI, which stands for Shell Lightweight API. This DLL contains a bunch of handy functions including path functions, registry functions, and, most important to us, string functions.

Here's a list showing how the most useful string functions in SHLWAPI relate to similar Visual Basic string functions or statements:

Operation                 Visual Basic          SHLWAPI
Copy                      =                     StrCpyN
Concatenation             + or &                StrCatN
Find character            None                  StrChr, StrChrI
Find character reverse    None                  StrRChr, StrRChrI
Find string               InStr                 StrStr, StrStrI
Find string reverse       InStrRev              StrRStrI
Compare strings           StrComp or =          StrCmpN, StrCmpNI
Overwrite substring       Mid statement         StrCpyN
Trim blank space          Trim, LTrim, RTrim    StrTrim
Tokenize                  Split                 StrSpn, StrCSpn

This doesn't look too exciting at first glance. Many of these functions already have duplicates with similar names in the Windows API. Fortunately, the SHLWAPI functions have one thing that other Windows API string functions don't have. They come in Unicode and ANSI versions even on Windows 95 and 98.

Unfortunately, the SHLWAPI library has an ambiguous legal status. The DLL comes as part of Internet Explorer 3.0 or 4.0 or Windows 98. Since Internet Explorer was part of Windows NT 4.0, SHLWAPI is also part of that. It did not come with the original release of Windows 95. So if you have Windows 95 customers who prefer a browser other than Internet Explorer, they may not have this DLL.

So is SHLWAPI part of the operating system? Or is it part of Internet Explorer? If Internet Explorer is part of the operating system as Microsoft keeps telling the United States Justice Department, surely they wouldn't care if you helped them update their operating system by shipping SHLWAPI as part of your setup. Or maybe they would care. Do you have to ship all of Internet Explorer to get unrelated string processing power? This is the same issue mentioned earlier in a discussion of how to update the COMCTL32.DLL required by the Coolbar control.

We'll deal with this question again soon, but in the meantime here are a couple of other interesting questions about SHLWAPI: If Microsoft could put Unicode versions of string utility functions in this DLL, why couldn't they have updated the string utility functions in the Windows 98 kernel? Finally, what would go through the mind of a DLL designer who provided functions StrStr, StrStrI, and StrRStrI, but not StrRStr? Somebody thought that we would like to search forward and backward case-insensitive and search forward case-sensitive, but that we would never want to search backward case-sensitive. Explain that one.

Searching Strings Backward

Searching strings with the SHLWAPI functions takes some serious pointer arithmetic. I wrote three different versions: FindStr, FindStrI, and FindStrRI. There is no FindStrR because of the missing StrRStr. Let's start with the FindStrRI, which is the one that started me on this quest in the first place.

FindStrRI is basically a wrapper for the StrRStrI API function. Ideally this code would be defined in a type library, but using a type library for a DLL that may not exist on some machines is a dangerous business. There's no way to fail gracefully. If a DLL referenced through a type library doesn't exist, the program won't even load. Declare statements, however, are enabled dynamically at run time, and you can fail gracefully using techniques discussed in Chapter 6, "System-Specific API Functions," pages 322-24.

So we're stuck with Declares for the SHLWAPI functions, but the next problem is the lack of support for Unicode Declares. If VB6 had enabled a Unicode Declare syntax, we might use it like this:

Declare Unicode Function StrRStrI Lib "shlwapi" _
    Alias "StrRStrIW" (ByVal lpSource As String, _
    ByVal lpLast As Long, ByVal lpSrch As String) As Long

The lpSource parameter is the string to be searched, the lpLast parameter is where you start searching (not always the end), and the lpSrch parameter is the substring to be found. In C all three parameters have the same LPTSTR type. It would be theoretically possible to define lpLast as String in this imaginary Basic version, but that wouldn't fit in with the normal use of the function. When searching strings, you often have to walk through the string, finding all matches, not just the first one. That requires keeping track of where you are, and the only way to do it is to save your current position as a pointer. You can do that with a C LPTSTR, but not with a Basic String. The C version also returns an LPTSTR to indicate the position of the found string, but the Basic version must use a Long for the pointer return.
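The walk-through-the-string pattern is easy to show with C's strstr, which returns a pointer to each match; the saved pointer, advanced past the previous hit, is exactly the "where you are" state just described. This is a sketch of the pattern only (the helper name is mine), not of StrRStrI itself:

```c
#include <string.h>

/* Count every occurrence of find in text by resuming each search
   from just past the previous match. The current position lives
   in a pointer, as described above. */
int count_matches(const char *text, const char *find)
{
    const char *p = text;
    size_t flen = strlen(find);
    int count = 0;

    if (flen == 0)
        return 0;
    while ((p = strstr(p, find)) != NULL) {
        count++;
        p += flen;   /* saved position: skip past this match */
    }
    return count;
}
```

Searching "one fish two fish red fish" for "fish" finds three matches; without the pointer to resume from, you'd find the first match over and over forever.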

The type library definition that I experimented with for a while worked like the imaginary Declare shown above. The real Declare had to look a little different:

Declare Function StrRStrI Lib "SHLWAPI" Alias "StrRStrIW" ( _
    ByVal lpSource As Long, ByVal lpLast As Long, _
    ByVal lpSrch As Long) As Long

In this version all the strings are declared as pointers of type Long, not because they need to be, but because there's no other syntax. Here's the code wrapper that handles the pointers:

Function FindStrRI(sTarget As String, sFind As String, _
                   Optional ByVal iPos As Long = 0) As Long
    ' FindStrRI = 0
    Dim pTarget As Long, pRet As Long
    ' Default position is end of string
    If iPos = 0 Then iPos = Len(sTarget)
    ' Turn string into pointer
    pTarget = StrPtr(sTarget)
    ' Get a pointer to the last match (if any)
    pRet = StrRStrI(pTarget, (pTarget + ((iPos - 1) * 2)), _
                    StrPtr(sFind))
    ' If match, convert pointer to a string index
    If pRet Then FindStrRI = ((pRet - pTarget) / 2) + 1
End Function

This is messy, but not so bad as you might expect. Since StrRStrI is part of a Windows DLL, you can reasonably expect it to give good performance. Or can you?

The Test String program (TString.vbp) actually demonstrates four different methods of searching backwards. InStrRev is a new Visual Basic function that I didn't expect to ever enter the language when I started investigating this subject. FindStrRI is the function above based on StrRStrI from SHLWAPI. FindStringR and FindStringRI are brute force versions based on heavy use of StrRChr, StrRChrI, StrCmpN, and StrCmpNI from SHLWAPI. Finally, InStrR is the quick-and-dirty, Basic-only version I wrote for the second edition (borrowed from the Utility module). May we have the envelope please:

InStrRev time: 100
FindStrRI time: 301
FindStringRI time: 230
InStrR time: 381

Those are the times for the case-insensitive search. These are for the case-sensitive search (with no FindStrR because of the missing StrRStr):

InStrRev time: 10
FindStringR time: 20
InStrR time: 271

The Visual Basic library wins by a mile. This is encouraging. I've been critical of Visual Basic designers, but there are still plenty of good programmers on the VB team. One of them wrote some very tight code here. Unlike many of the new library features in VB6, this isn't some half-baked code from VBScript. A real programmer wrote it, and wrote it well. I was a little shocked at how far behind the code based on StrRStrI was, and even more shocked that my brute-force VB algorithm (based on StrRChrI and StrCmpNI) beat their C-based algorithm. Somebody in the Windows group wrote some pretty sloppy code (maybe the same jerk who forgot to add StrRStr).

I wasn't surprised that my all-Basic algorithm came in last. This timing actually disguises how bad it really is because the test uses a short string. It looks a lot worse on long strings. A comment in the original code admitted its weakness and hoped for a replacement--which I provide in the new version by mapping InStrR to the slightly different syntax of InStrRev. For those of you still using VB5, the VB6Funcs.bas module has a conditionally-defined InStrRev function based on my original InStrR code.

For those who want to experiment with other SHLWAPI functions, the Declares are available in the LightShell module (LightShell.bas). The module is not used by VBCore, but the Test String program uses it. It also has a HasShellWapi function that tells whether the required DLL is available.

So where are we? So far I've failed to show any interesting use for string processing based on pointer arithmetic. The case toggling code was interesting, but not very useful. The string searching techniques turned out to be losers. But now we're going to take a fresh look at a problem where pointer arithmetic really makes a difference. The problem is parsing.

Back To Code Review

I first looked at parsing in Chapter 5, "Code Review," page 253. This was a fictitious (mostly) tale of how I gradually fine-tuned some very old parsing code to achieve what I believed to be the most efficient, correct Visual Basic algorithm possible. And perhaps it was. But my new version taking advantage of parsing functions in SHLWAPI beats the heck out of the original.

If you've studied the new features of VB6, you may suspect that I'm hosed again. VB6 provides the InStrRev function to put my reverse string searching efforts to shame. It also provides a parsing function called Split that should crush my improved parsing routines. Bear with me. We'll talk about Split when the time is right.

My parsing code is based on a function called GetToken that you call in a loop until all the tokens of a string have been processed. For you C programmers, it's a lot like the strtok function. GetToken was in turn based on two functions called StrBreak and StrSpan. StrBreak steps through a string until it finds a separator character. StrSpan steps through separator characters until it finds a non-separator character. StrBreak and StrSpan are based on the C strcspn and strspn functions respectively.
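For C programmers, the relationship is easy to demonstrate with the standard library itself, since strspn and strcspn are exactly the primitives a strtok-style tokenizer is built on. A minimal sketch (it differs from real strtok in that it copies each token out to a caller-supplied buffer instead of writing NUL bytes into the source):

```c
#include <string.h>

/* strtok-style tokenizer built on strspn/strcspn, the same pair of
 * primitives GetToken uses (as StrSpn/StrCSpn from SHLWAPI). Pass
 * the target on the first call, NULL to continue; returns the token
 * copied into buf, or NULL when the tokens run out. */
const char *GetTokenC(const char *target, const char *seps,
                      char *buf, size_t bufsize)
{
    static const char *cur = "";     /* position saved between calls */
    if (target != NULL)
        cur = target;
    cur += strspn(cur, seps);        /* skip leading separators */
    size_t len = strcspn(cur, seps); /* measure token up to next separator */
    if (len == 0 || len + 1 > bufsize)
        return NULL;                 /* no more tokens (or buf too small) */
    memcpy(buf, cur, len);
    buf[len] = '\0';
    cur += len;                      /* save position for the next call */
    return buf;
}
```

Calling it in a loop with separators " .," on "This, is a test." yields This, is, a, test; runs of adjacent separators are skipped, which is exactly what a parser wants.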

Although SHLWAPI doesn't have anything comparable to strtok, it does have StrSpn and StrCSpn functions (strspn and strcspn with different case) from which you can build your own strtok (or GetToken, as I call it in Basic). This puts a slightly different spin on my original "Code Review" section. The subsection "Coding C in Basic" pointed out that translating C code to Basic literally would result in inefficient code because of the differences between the languages. Well, the following code does C in Basic deliberately. It doesn't look like Basic and it's not very readable. But because it calls C DLL functions the way C programmers would call them, it is more efficient than pure Basic.

Here's the code:

Function GetToken(sTarget As String, sSeps As String) As String
    
    ' If SHLWAPI.DLL not available, delegate to Basic-only GetToken
    If fNoShlWapi Then
        GetToken = GetTokenO(sTarget, sSeps)
        Exit Function
    End If
    
    ' Note that sSave, pSave, pCur, and cSave must be static between calls
    Static sSave As String, pSave As Long, pCur As Long, cSave As Long
    ' First time through save start and length of string
    If sTarget <> sEmpty Then
        ' Save in case sTarget is moveable string (Command$)
        sSave = sTarget
        pSave = StrPtr(sSave)
        pCur = pSave
        cSave = Len(sSave)
    Else
        ' Quit if past end (also catches null or empty target)
        If pCur >= pSave + (cSave * 2) Then Exit Function
    End If
    
    ' Find start of next token
    Dim pNew As Long, c As Long, pSeps As Long
    pSeps = StrPtr(sSeps)
    c = StrSpn(pCur, pSeps)
    ' Set position to start of token
    If c Then pCur = pCur + (c * 2)
    
    ' Find end of token
    c = StrCSpn(pCur, pSeps)
    ' If token length is zero, we're at end
    If c = 0 Then Exit Function
    
    ' Cut token out of target string
    GetToken = String$(c, 0)
    CopyMemory ByVal StrPtr(GetToken), ByVal pCur, c * 2
    ' Set new starting position
    pCur = pCur + (c * 2)
 
End Function

Let's delay discussion of the initial test for SHLWAPI and focus on the parsing logic. If you look back at the final GetToken in Chapter 5, "Code Review," page 261, this code looks familiar. It's doing the same thing, but with StrSpn and StrCSpn rather than with StrSpan and StrBreak. The signature of the function is the same, and this version simply replaces the old one in the Parse module of the VBCore component. Here's a review of how the code is called:

' Strip all the separator characters out of a command line
Dim sSeparator As String
sSeparator = " .," & sCrLf & sTab
sToken = GetToken(sCmdLine, sSeparator)
s = sToken
Do While sToken <> sEmpty
    sToken = GetToken(sEmpty, sSeparator)
    s = s & sToken
Loop

This strips out spaces, tabs, commas, periods, and line breaks. Now let's see how the VB6 Split function stacks up against my GetToken.

Split Sucks
The Split function apparently comes from VBScript. If you look it up in help, you'll find a VB version and a VBScript version with similar descriptions in a slightly different format. The main difference is that the VBScript documentation has the common decency to provide a sample, albeit a bad one. The other difference is that the VB version has the four standard compare options (the same as for InStr or StrComp) rather than just binary and text.

Rather than being called in a loop, Split parses an entire string and puts the result in a Variant containing an array. It's easy to use Split to write code equivalent to the GetToken example shown earlier:

' Strip all the separator characters out of a command line
Dim av As Variant, iTok As Long
av = Split(sCmdLine, " .," & sCrLf & sTab)
For iTok = 0 To UBound(av)
    s = s & av(iTok)
Next

This is simpler than the GetToken version, and some may prefer to give up a few milliseconds of performance to make it even easier with a For Each block:

av = Split(sCmdLine, " .," & sCrLf & sTab)
For Each v In av
    s = s & v
Next

Furthermore, if you time it, you'll see that it's a lot faster than my GetToken. That's because it doesn't work. What you get here is an array of one element with the whole string in it. You can get a clue about the problem from the documentation, which calls the second parameter delimiter. Notice that it says delimiter, not delimiters.

The problem here is those damn users. Why can't they just use one delimiter? Why would they ever want to use commas or periods in addition to spaces to separate their tokens? Why would they use tabs or break their text into separate lines? Split works fine if you just give it one delimiter the way God intended:

sCmdLine = "This is a test"
av = Split(sCmdLine, " ")

Users who toe the line as in this example get just what they expect: four tokens in their array. But users are clever in their obnoxious attempts to flout rules. Some of them will go so far as to separate their tokens with multiple delimiters:

sCmdLine = "This  is  a  test"
av = Split(sCmdLine, " ")

Anyone foolish enough to do this will get seven tokens in their array. If you look carefully at the input string, you can see the three empty string tokens between the adjacent spaces.
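The empty-token behavior is easy to verify by counting: with a single-character delimiter, the number of tokens Split returns is always one more than the number of delimiter characters in the string, empty tokens and all. A C sketch of that counting rule:

```c
#include <stddef.h>

/* Count the tokens a single-delimiter Split would produce: one more
 * than the number of delimiter characters, because adjacent
 * delimiters yield empty tokens rather than being merged. */
size_t SplitCount(const char *s, char delim)
{
    size_t count = 1;            /* even "" yields one (empty) token */
    for (; *s != '\0'; s++)
        if (*s == delim)
            count++;
    return count;
}
```

SplitCount("This is a test", ' ') gives 4, but doubling the spaces gives 7, the three extras being the empty strings between adjacent spaces.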

Let's go back to the separators. Let's say that you wrote a Split function that accepted only one delimiter, and you carefully spelled it out singular, not plural, in the documentation. What would you do if some foolish user ignored your instructions and gave you a multi-character string for the delimiter? Here are some possibilities:

  1. Raise an error telling the users in no uncertain terms to follow directions or see their failures displayed in error boxes for the whole world to see.
  2. Just use the first character, ignoring all those worthless extra characters.
  3. Check the length of the delimiter string and if it's not the magic number one, return the whole string as one big token.

Well, of course you can't do 1 because (unfortunately) users can turn off error trapping. And 2 just doesn't fully express your contempt for those who disregard your instructions. So 3 is the correct answer because it's so deliciously unpredictable.

Enough sarcasm. It's hard to fully express my contempt for the VBScript developer who wrote this crap or for the VB designers who accepted it into their library without checking to see that it meets minimum standards of competence. Perhaps defenders will argue that the documentation never claims Split is a parsing function, and it's therefore unfair to criticize it for not being one. It does a perfectly adequate job of doing exactly what it claims to do. Furthermore, the Join procedure does a perfectly good job of undoing what Split does. The problem is, the job being done isn't worth doing or undoing. In short, their Split function is hardly competition for my GetToken function. Nor is it competition for my Splits function, which simply wraps GetToken with a Split-like syntax.

Hacking for Portability
We've got a much faster GetToken, but do we dare use it? What's going to happen if one of your customers bought the original Windows 95 and has never installed Internet Explorer? Your unpleasant choices about redistributing SHLWAPI.DLL were discussed earlier. The Parse module is designed to work regardless of your choice, but it may be slower on some systems.

We already have working Basic code to parse text. The enhanced version using SHLWAPI is no more accurate than the original, just faster. So I've written the code to use SHLWAPI if available or to use the old Basic code if not. The old GetToken is renamed to GetTokenO and made private in the Parse module. Then I check for the existence of the DLL and delegate to the old slow version if the new fast one isn't available:

' New GetToken uses faster StrSpn and StrCSpn from SHLWAPI.DLL
Function GetToken(sTarget As String, sSeps As String) As String
    
    ' If SHLWAPI.DLL not available, delegate to Basic-only GetToken
    If fNoShlWapi Then
        GetToken = GetTokenO(sTarget, sSeps)
        Exit Function
    End If
    . . .

I use similar code for the GetQToken function (which parses quoted strings as separate tokens). This looks simple, but there's more ugly code than you might expect behind the test. It's not so bad in the global module version. You test in Class_Initialize to see whether the DLL exists and set the variable accordingly. But in the standard module version, fNoShlWapi is actually a Property Get procedure that has to jump through hoops to make sure the time-consuming DLL test code gets called only once:

#If fComponent Then
' In component fNoShlWapi is a variable initialized in Class_Initialize
Private fNoShlWapi As Boolean ' = False
 
Private Sub Class_Initialize()
    On Error GoTo Fail
    ' Dummy call to any SHLWAPI function fails if no DLL
    Call StrSpn(StrPtr("a"), StrPtr("a"))
    Exit Sub
Fail:
    fNoShlWapi = True
End Sub
 
#Else
' In standard module fNoShlWapi is a Property Get that checks for DLL
Private fNotFirstTime As Boolean, fNoShlWapiI As Boolean ' = False
 
Private Property Get fNoShlWapi() As Boolean
    If fNotFirstTime = False Then
        fNotFirstTime = True
        On Error Resume Next
        Call StrSpn(StrPtr("a"), StrPtr("a"))
        If Err Then fNoShlWapiI = True
    End If
    fNoShlWapi = fNoShlWapiI
End Property
#End If

What a mess!
Fortunately, in real life most string operations aren't time critical and we can do what comes naturally with the familiar Basic syntax. But when you have to do intense processing on large chunks of text, it may be worth your while to think about the ugly truths of pointer arithmetic.
