Wednesday, July 18, 2007

Command-line parsing: more details

In the June 26, 2007 post "CMD.EXE Parsing - Splitting into Arguments" I listed the rules that Microsoft's C RTL uses when console apps split their command line into arguments. Rule number two was "2. The first argument (the program name) is parsed specially".

The actual rule for the first argument isn't terribly complex. As with regular arguments, you're either in 'inquote' mode or not; when in 'inquote' mode the only end for the argument is the NUL character (but you can switch out of 'inquote' mode). Unlike the regular rules, there are no escape characters -- a backslash is just a backslash, and there's no way to embed a double-quote. The double-quotes that switch in and out of 'inquote' mode are, of course, not considered part of the argument.

When not in 'inquote' mode a space or tab exits the parsing.
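
The first-argument rule is small enough to sketch in a few lines. This is only my reading of the rules above, not a copy of the RTL source; the function name ParseProgramName and the optional 'rest' parameter are made up for the illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of the Microsoft RTL): pulls the program
// name out of a raw command line using the rules described above.
// If 'rest' is non-null it receives the index where parsing stopped.
std::wstring ParseProgramName(const std::wstring& cmdline,
                              std::size_t* rest = nullptr)
{
    std::wstring name;
    bool inquote = false;
    std::size_t i = 0;

    for (; i < cmdline.size() && cmdline[i] != L'\0'; ++i)
    {
        const wchar_t c = cmdline[i];
        if (c == L'"')
        {
            // A double-quote only toggles the mode; it is never copied.
            inquote = !inquote;
        }
        else if (!inquote && (c == L' ' || c == L'\t'))
        {
            // Outside 'inquote' mode a space or tab ends the argument.
            break;
        }
        else
        {
            // No escape characters: a backslash is just a backslash.
            name += c;
        }
    }

    if (rest != nullptr)
    {
        *rest = i;
    }
    return name;
}
```

Note that quotes in the middle of the name simply vanish: under these rules a"b c"d comes out as the single name ab cd.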

By the way -- Microsoft actually publishes the underlying source code for all of this. When you compile in DEBUG mode, you can even set breakpoints on the C RTL startup part of your code. In the "Visual Studio 8" directory (which is what they call 'Visual Studio 2005'), look for VC\crt\src\stdargv.c

Sunday, July 1, 2007

Escaping Strings for MSFT RTL Parsing

Today's topic is how to convert any string into one that can be parsed by a program that uses the Microsoft RTL. Note, however, that not all strings can be so converted -- in particular, any string with an embedded NUL character is not convertible.

Why do this? Because sometimes you have to generate output that other programs will read in through their command line. When you do, you have to escape your output so that the other program will handle it correctly. For instance, suppose I want to generate a string with a space in it (like "C:\Program Files\myprog\foo.txt") -- how can I write it out so that the other program gets exactly what I wrote?

This is the inverse problem of the last post. In the last post, the question answered was: what rules does the Microsoft Run-Time Library use to parse command-line arguments? Note that today's solution is not complete: the various shells will ALSO do their own parsing (to expand environment variables, for example) before handing off the command line.

The hard case, and therefore the one to tackle first, is to convert a string so that it can be put into Double-Quotes. The rules from the last entry say that the only character that has to be escaped is the double-quote, but that it has very funny rules about multiple double-quotes, and about whether the number of preceding backslashes is odd or even. It also depends on whether we're already in 'inquote' mode or not.

To make life easy, let's decide to always start off with a Double-Quote and always stay in 'inquote' mode. Further, we have to figure out how to deal with the following-Double-Quote rule (number 3.1.1 in 'inquote' mode). We could write fancy code to figure out if there was going to be a following Double-Quote, but instead I'll take the easy way out and always add in the following Double-Quote myself. That way the next Double-Quote doesn't have to be handled in an extra-special way.

The parsing rules say that backslashes aren't handled specially unless they immediately precede a Double-Quote. I'll have to keep track of how many there are in a row.

And finally we get the following code:


Buffer BORString::Convert::AnyToMSFTCArg (const Pointer& Src_In)
{
    Pointer Src (Src_In);
    Buffer Retval;

    // Pre-size the return buffer once. It will
    // automatically resize as needed.
    Retval.Resize (Src.StrLen() + 2 + 4);

    // p is the temporary pointer into Retval
    Pointer p = Retval.p();

    // Keep track of how many backslashes in a row
    // we've seen, in case they precede a Double-Quote.
    int NSlash = 0;

    // Temp. character. Everything is wide-char,
    // but should trivially convert to narrow.
    wchar_t c;

    // Always slip straight into inquote mode.
    p.Append ('"');
    while ((c=Src.GetW()) != (wchar_t)-1)
    {
        if (c=='"')
        {
            // Double the backslashes: the originals were
            // already written, so write the same number again.
            for (int i=0; i<NSlash; i++)
            {
                p.Append ('\\');
            }
            // Always write out the Double-Quote
            // twice just in case the next char
            // is also a quote.
            p.Append ('"');
            p.Append ('"');
        }
        else
        {
            p.Append (c);
        }
        NSlash = (c=='\\') ? NSlash+1 : 0;
    }

    // Trailing backslashes, if any, also have to be doubled
    // so that they don't escape the closing Double-Quote.
    for (int i=0; i<NSlash; i++)
    {
        p.Append ('\\');
    }
    p.Append ('"');

    return Retval;
}



The 'Buffer' and 'Pointer' classes are part of a string library I use -- they are used so little here that I think you can figure out what they do.
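
If you don't have that library handy, here is a rough port of the same steps to std::wstring. Treat it as a sketch: the name EscapeForMsftArg is made up for this example, and it does not check for an embedded NUL (which, as noted at the top, can't be converted anyway).

```cpp
#include <cstddef>
#include <string>

// Hypothetical std::wstring port of AnyToMSFTCArg above.
// Same steps: open a quote, double any backslashes that sit in front
// of a double-quote, write each double-quote twice, double any
// trailing backslashes, close the quote.
std::wstring EscapeForMsftArg(const std::wstring& src)
{
    std::wstring out;
    out.reserve(src.size() + 2 + 4);

    // Number of consecutive backslashes seen just before this point.
    std::size_t nslash = 0;

    // Always slip straight into inquote mode.
    out += L'"';
    for (const wchar_t c : src)
    {
        if (c == L'"')
        {
            // The backslashes were already copied once; copy them again.
            out.append(nslash, L'\\');
            // Write the double-quote twice.
            out += L"\"\"";
        }
        else
        {
            out += c;
        }
        nslash = (c == L'\\') ? nslash + 1 : 0;
    }

    // Trailing backslashes must not escape the closing quote.
    out.append(nslash, L'\\');
    out += L'"';
    return out;
}
```

With this, C:\Program Files\myprog\foo.txt just gets wrapped in quotes, a"b becomes "a""b", and a trailing backslash as in dir\ becomes "dir\\".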

The next topic: how CMD.EXE parses the command line before your program ever sees it.