Sunday, July 1, 2007

Escaping Strings for MSFT RTL Parsing

Today's topic is how to convert any string into one that can be parsed by a program that uses the Microsoft RTL. Note, however, that not all strings can be so converted -- in particular, any string with an embedded NUL character is not convertible.

Why do this? Because sometimes you have to generate output that other programs will read in through their command line. When you do, you have to escape your output so that the other program will handle it correctly. For instance, suppose I want to generate a string with a space in it (like "C:\Program Files\myprog\foo.txt") -- how can I write it out so that the other program gets exactly what I wrote?

This is the inverse of the problem in the last post, which answered the question: what rules does the Microsoft Run-Time Library use to parse command line arguments? Note that today's solution is not complete: the various shells will ALSO do their own parsing (to expand environment variables) before handing off the command line.

The hard case, and therefore the one to tackle first, is to convert a string so that it can be put into Double-Quotes. The rules from the last entry say that the only character that has to be escaped is the Double-Quote, but that it has very funny rules about multiple Double-Quotes and about whether the number of preceding backslashes is odd or even. It also depends on whether we're already in 'inquote' mode or not.

To make life easy, let's decide to always start off with a Double-Quote and always stay in 'inquote' mode. Further, we have to figure out how to deal with the following-Double-Quote rule (number 3.1.1 in 'inquote' mode). We could write fancy code to figure out if there was going to be a following Double-Quote, but instead I'll take the easy way out and always add in the following Double-Quote myself. That way the next Double-Quote doesn't have to be handled in an extra-special way.

The parsing rules say that backslashes aren't handled specially unless they precede a Double-Quote. I'll have to keep track of how many there are in a row, so that the run can be doubled whenever a Double-Quote follows it.
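To illustrate the run-counting idea on its own, here is a minimal sketch that only computes how long the escaped string will be. `EscapedLength` is a hypothetical helper, not part of my string library, and it uses `std::wstring` purely for illustration. It assumes the always-quoted strategy described above: each Double-Quote is written twice, and any run of backslashes sitting directly in front of a Double-Quote (including the closing one) is doubled.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the string library used in this post):
// returns the length of the escaped form of src, assuming the string is
// always wrapped in Double-Quotes and every embedded quote is written twice.
std::size_t EscapedLength(const std::wstring& src)
{
    // Original characters plus the opening and closing Double-Quote.
    std::size_t length = src.size() + 2;

    // Length of the run of backslashes seen immediately before this point.
    std::size_t nslash = 0;

    for (wchar_t c : src) {
        if (c == L'"') {
            // The preceding run is doubled (nslash extra characters) and
            // the quote itself is written twice (one extra character).
            length += nslash + 1;
        }
        nslash = (c == L'\\') ? nslash + 1 : 0;
    }

    // A run at the very end sits in front of the closing quote,
    // so it has to be doubled as well.
    return length + nslash;
}
```

For the three-character input a"b this returns 6, which is the length of "a""b".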

And finally we get the following code:


Buffer BORString::Convert::AnyToMSFTCArg (const Pointer& Src_In)
{
    Pointer Src (Src_In);
    Buffer Retval;

    // Pre-size the return buffer once. It will
    // automatically resize as needed.
    Retval.Resize (Src.StrLen() + 2 + 4);

    // p is the temporary pointer into Retval
    Pointer p = Retval.p();

    // Keep track of how many backslashes we've
    // seen in a row; they only matter if they
    // precede a Double-Quote.
    int NSlash = 0;

    // Temp. character. Everything is wide-char,
    // but should trivially convert to narrow.
    wchar_t c;

    // Always slip straight into inquote mode.
    p.Append ('"');
    while ((c=Src.GetW()) != (wchar_t)-1)
    {
        if (c=='"')
        {
            // Double the backslashes: the run has
            // already been copied once, so append
            // NSlash more.
            for (int i=0; i<NSlash; i++)
            {
                p.Append ('\\');
            }
            // Always write out the Double-Quote
            // twice just in case the next char
            // is also a quote.
            p.Append ('"');
            p.Append ('"');
        }
        else
        {
            p.Append (c);
        }
        NSlash = (c=='\\') ? NSlash+1 : 0;
    }

    // Trailing backslashes, if any, precede the
    // closing Double-Quote, so they also have to
    // be doubled.
    for (int i=0; i<NSlash; i++)
    {
        p.Append ('\\');
    }
    p.Append ('"');

    return Retval;
}



The 'Buffer' and 'Pointer' classes are part of a string library I use -- they are used so little here that I think you can figure out what they do.
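For readers without that library, here is a minimal sketch of the same routine written against `std::wstring`. The name `QuoteForMsftArg` is made up for this example, and the translation of 'Buffer' and 'Pointer' into standard-library calls is my assumption about what those classes do. It follows exactly the same strategy as the routine above, so it inherits the same assumptions about how the runtime treats a doubled Double-Quote in 'inquote' mode.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-alone equivalent of AnyToMSFTCArg (assumed name).
// The caller is expected to reject input with an embedded NUL beforehand,
// since such a string cannot be passed on a command line at all.
std::wstring QuoteForMsftArg(const std::wstring& src)
{
    std::wstring out;
    out.reserve(src.size() + 2 + 4);  // same initial size guess as above

    // Number of backslashes seen in a row just before the current character.
    std::size_t nslash = 0;

    // Always slip straight into inquote mode.
    out.push_back(L'"');

    for (wchar_t c : src) {
        if (c == L'"') {
            // The run was already copied once; append it again to double it.
            out.append(nslash, L'\\');
            // Write the Double-Quote twice so it is taken literally.
            out.append(2, L'"');
        } else {
            out.push_back(c);
        }
        nslash = (c == L'\\') ? nslash + 1 : 0;
    }

    // Trailing backslashes precede the closing quote, so double them too.
    out.append(nslash, L'\\');
    out.push_back(L'"');
    return out;
}
```

With this sketch, the path C:\Program Files\myprog\foo.txt is simply wrapped in Double-Quotes, a"b becomes "a""b", and dir\ becomes "dir\\".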

The next topic: how CMD.EXE parses the command line before your program ever sees it.
