Wednesday, July 18, 2007

Command-line parsing: more details




In the June 26, 2007 post "CMD.EXE Parsing - Splitting into Arguments" I listed the rules that Microsoft's C RTL uses when console apps split their command line into arguments. Rule number two was "2. The first argument (the program name) is parsed specially".

The actual rule for the first argument isn't terribly complex. Like regular arguments, you're either in 'inquote' mode or not; when in 'inquote' mode the only end for the argument is the NUL character (but you can switch out of 'inquote' mode). Unlike the regular rules, there are no escape characters -- a backslash is just a backslash, and there's no way to embed a double-quote. The double-quotes that switch in and out of 'inquote' mode are, of course, not considered part of the argument.

When not in 'inquote' mode a space or tab exits the parsing.
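
The first-argument rule is small enough to sketch directly. This is a minimal illustration of the rule as described above, not Microsoft's code; the name ParseProgramName and its 'rest' out-parameter are made up for the example.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: pulls the program name (argv[0]) off the front of a
// raw command line. If 'rest' is non-null it receives the index of the first
// character that was not consumed.
std::string ParseProgramName(const std::string& cmdline,
                             std::size_t* rest = nullptr)
{
    std::string name;
    bool inquote = false;
    std::size_t i = 0;
    for (; i < cmdline.size() && cmdline[i] != '\0'; ++i)
    {
        const char c = cmdline[i];
        if (c == '"')
        {
            // A double-quote only toggles the mode; it is never copied.
            inquote = !inquote;
        }
        else if (!inquote && (c == ' ' || c == '\t'))
        {
            // Outside 'inquote' mode, a space or tab ends the program name.
            break;
        }
        else
        {
            // Everything else, backslashes included, is copied verbatim.
            name += c;
        }
    }
    if (rest) *rest = i;
    return name;
}
```

Given the command line "texts"tat -enum cl, this sketch returns textstat: the two double-quotes toggle 'inquote' mode on and off and are dropped, and the first space outside quotes ends the name.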

By the way -- Microsoft actually publishes the underlying source code for all of this. When you compile in DEBUG mode, you can even set breakpoints on the C RTL startup part of your code. In the "Visual Studio 8" directory (which is what they call 'Visual Studio 2005'), look for VC\crt\src\stdargv.c

Sunday, July 1, 2007

Escaping Strings for MSFT RTL Parsing




Today's topic is how to convert any string into one that can be parsed by a Microsoft RTL-using program. Note, however, that not all strings can be so converted -- in particular, any string with an embedded NUL character is not convertible.

Why do this? Because sometimes you have to generate output that other programs will read in through their command line. When you do, you have to escape your output so that the other program will handle it correctly. For instance, suppose I want to generate a string with a space in it (like "C:\Program Files\myprog\foo.txt") -- how can I write it out so that the other program gets exactly what I wrote?

This is the inverse problem of the last post. In the last post, the question answered was: what are the rules that the Microsoft Run-Time-Library has for parsing command line arguments? Note that today's solution is not complete: the various shells will ALSO do their own parsing (to expand environment variables) before handing off the command line.

The hard case, and therefore the one to tackle first, is to convert a string so that it can be put into Double-Quotes. The rules from the last entry say that the only character that has to be escaped is the Double-Quote, but that it has very funny rules about multiple Double-Quotes, and about whether the number of preceding backslashes is odd or even. It also depends on whether we're already in 'inquote' mode or not.

To make life easy, let's decide to always start off with a Double-Quote and always stay in 'inquote' mode. Further, we have to figure out how to deal with the following-Double-Quote rule (number 3.1.1 in 'inquote' mode). We could write fancy code to figure out if there was going to be a following Double-Quote, but instead I'll take the easy way out and always add in the following Double-Quote myself. That way the next Double-Quote doesn't have to be handled in an extra-special way.

The parsing rules say that backslashes aren't handled specially unless they precede a Double-Quote. I'll have to keep track of how many there are in a row.

And finally we get the following code:


Buffer BORString::Convert::AnyToMSFTCArg (const Pointer& Src_In)
{
    Pointer Src (Src_In);
    Buffer Retval;
    // Pre-size the return buffer once. It will
    // automatically resize as needed
    Retval.Resize (Src.StrLen() + 2 + 4);

    // p is the temporary pointer into Retval
    Pointer p = Retval.p();

    // Keep track of how many backslashes we've
    // seen that precede a Double-Quote.
    int NSlash = 0;

    // Temp. character. Everything is wide-char,
    // but should trivially convert to narrow.
    wchar_t c;

    // Always slip straight into inquote mode.
    p.Append ('"');
    while ((c=Src.GetW()) != (wchar_t)-1)
    {
        if (c=='"')
        {
            // Double the backslashes
            for (int i=0; i<NSlash; i++)
            {
                p.Append ('\\');
            }
            // Always write out the Double-Quote
            // twice just in case the next char
            // is also a quote.
            p.Append ('"');
            p.Append ('"');
        }
        else
        {
            p.Append (c);
        }
        NSlash = (c=='\\') ? NSlash+1 : 0;
    }

    // Any trailing slashes also have to be doubled.
    for (int i=0; i<NSlash; i++)
    {
        p.Append ('\\');
    }
    p.Append ('"');

    return Retval;
}



The 'Buffer' and 'Pointer' classes are part of a string library I use -- they are used so little here that I think you can figure out what they do.
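
For readers without that library, here is a rough port of the same idea to std::string. It is only a sketch: the name EscapeForMsftCArg is made up, it works on narrow characters rather than wide ones, and it does not check for embedded NUL characters (which, as noted above, can't be converted anyway).

```cpp
#include <cstddef>
#include <string>

// Hypothetical std::string port of the routine above. Wraps 'src' in
// Double-Quotes and escapes it so that it should parse back to the
// original string under the rules from the last post.
std::string EscapeForMsftCArg(const std::string& src)
{
    std::string out;
    out.reserve(src.size() + 2 + 4);

    std::size_t nslash = 0;   // backslashes seen in a row so far

    out += '"';               // always slip straight into 'inquote' mode
    for (char c : src)
    {
        if (c == '"')
        {
            // Double the backslashes that precede the Double-Quote...
            out.append(nslash, '\\');
            // ...then write the Double-Quote twice so that the parser
            // keeps exactly one literal quote.
            out += "\"\"";
        }
        else
        {
            out += c;
        }
        nslash = (c == '\\') ? nslash + 1 : 0;
    }

    // Trailing backslashes are doubled so they don't escape the closing quote.
    out.append(nslash, '\\');
    out += '"';
    return out;
}
```

With this sketch the three-character string C:\ comes out as "C:\\" (quotes included), and a"b comes out as "a""b". Backslashes that aren't followed by a Double-Quote, as in C:\Program Files\myprog\foo.txt, pass through untouched.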

The next topic: how CMD.EXE parses the command line before your program ever sees it.

Tuesday, June 26, 2007

CMD.EXE and C Run Time Library Argument Parsing

CMD.EXE Parsing - Splitting Into Arguments



Microsoft's CMD.EXE parses command lines into arguments in perhaps the least intuitive and most aggravating manner possible. You may think that it's documented "somewhere" -- but it's not. Even worse, each command line is actually parsed twice. The first parse by CMD.EXE replaces environment variables and pseudo-environment variables and does caret-fiddling. CMD.EXE has also, by this time, split the command line into arguments. CMD.EXE then calls your program, passing in not the carefully split-up arguments but just the straight command line. The C Run Time Library's 'parse_cmdline' routine (in 'stdargv.c') then takes the command line and splits it up into arguments again.

Today I'm just going to talk about the C Run-Time Library parsing. The CMD.EXE parsing is a whole different topic, also worthy of study.

Here is a trivial test batch file that simply calls batch file 'echop.bat' and C program 'textstat'; these programs in turn just echo their arguments (textstat can actually do rather a lot more). Here's a sample of the two when they're given normal arguments:


REM
REM try a batch program
REM
call echop first second third

REM
REM and a regular C program
REM
textstat -enum cl first second third



The results of this are straightforward:


echop results:

C:\tmp>call echop first second third
all_args=<<first second third>>
arg_1 <<first>>
arg_2 <<second>>
arg_3 <<third>>
arg_4 <<>>
arg_5 <<>>

textstat -enum cl results:
;type value
GetCommandLine <<textstat -enum cl first second third>>
_acmdln <<textstat -enum cl first second third>>
argv_0 <<textstat>>
argv_1 <<-enum>>
argv_2 <<cl>>
argv_3 <<first>>
argv_4 <<second>>
argv_5 <<third>>




But now let's look at the basic C Run Time Library rules for parsing the command line:


  1. Special characters are Space, Tab, Double-Quote ("), Backslash, and Nul (the terminating 'NUL', or zero, character)
  2. The first argument (the program name) is parsed specially
  3. There are two modes: 'inquote' mode and 'regular' mode. Generally speaking, in 'regular' mode arguments are terminated by Spaces, Tabs and of course Nul. The parser starts in 'regular' mode. The full set of 'regular' rules is:

    1. Whitespace rule: All initial Spaces and Tabs are skipped over
    2. Ending rule: A Space, Tab, or Nul will end the current argument
    3. Done rule: A Nul will also end all parsing
    4. Quoting " \" \\" \\"" rule: Double-Quotes are preceded by a number of Backslashes (the number might be zero).

      1. Even rule: If the number of Backslashes is even (zero, two, four, etc), then switch into 'inquote' mode and add in half of the Backslashes (none if there were none, one if there were two, etc). The Double-Quote is not added to the current argument.
      2. Odd rule: If the number of Backslashes is odd (one, three, five, etc) then stay in 'regular' mode and add to the current argument half of the Backslashes (rounding down so that one Backslash becomes none in the argument, three Backslashes become one, five become two, etc) and the Double-Quote.

    5. Backslashes-not-escape rule: Backslashes are not otherwise escape characters. They are only special if they come before a Double-Quote.

    The full set of rules for 'inquote' mode are:

    1. Ending rule: A Nul will end the current argument. A Double-Quote, although commonly put at the end of an argument, does not technically end the argument (instead it switches back to 'regular' mode and the next Space or Tab ends the argument)
    2. Done rule: A Nul will also end all parsing
    3. Quoting " \" \\" \\"" rule: Double-Quotes are preceded by a number of Backslashes (the number might be zero) and possibly followed by another Double-Quote.

      1. Even rule: If the number of Backslashes is even (zero, two, four, etc), then

        1. If there is a following Double-Quote, then stay in 'inquote' mode, add to the current argument half of the Backslashes (none if there were none, one if there were two, etc), and the following Double-Quote only.
        2. If there is not a following Double-Quote, then switch back to 'regular' mode and add in half of the Backslashes (none if there were none, one if there were two, etc). The Double-Quote is not added to the current argument.

      2. Odd rule: If the number of Backslashes is odd (one, three, five, etc) then stay in 'inquote' mode and add to the current argument half of the Backslashes (rounding down so that one Backslash becomes none in the argument, three Backslashes become one, five become two, etc) and the Double-Quote.

    4. Backslashes-not-escape rule: Backslashes are not otherwise escape characters. They are only special if they come before a Double-Quote.
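
The two rule sets above can be sketched as one small splitter. This is an illustration of the rules as listed in this post, not a copy of 'parse_cmdline'; the name SplitArgs is made up, it expects the part of the command line that follows the program name, and a real runtime may differ in corner cases (the doubled Double-Quote in particular).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: splits the tail of a command line (everything after
// the program name) into arguments, following the rules listed above.
std::vector<std::string> SplitArgs(const std::string& tail)
{
    std::vector<std::string> args;
    const std::size_t n = tail.size();
    std::size_t i = 0;
    for (;;)
    {
        // Whitespace rule: skip initial Spaces and Tabs.
        while (i < n && (tail[i] == ' ' || tail[i] == '\t')) ++i;
        // Done rule: Nul (or end of string) ends all parsing.
        if (i >= n || tail[i] == '\0') break;

        std::string arg;
        bool inquote = false;
        while (i < n && tail[i] != '\0')
        {
            // Count the run of Backslashes at the current position.
            std::size_t nslash = 0;
            while (i < n && tail[i] == '\\') { ++nslash; ++i; }

            if (i < n && tail[i] == '"')
            {
                // Quoting rule: half the Backslashes (rounded down) are kept.
                arg.append(nslash / 2, '\\');
                if (nslash % 2 == 1)
                    arg += '"';          // Odd rule: literal quote, same mode
                else if (inquote && i + 1 < n && tail[i + 1] == '"')
                {
                    arg += '"';          // Even rule with a following quote
                    ++i;                 // skip the first quote of the pair
                }
                else
                    inquote = !inquote;  // Even rule: switch modes
                ++i;
                continue;
            }

            // Backslashes-not-escape rule: keep them all.
            arg.append(nslash, '\\');
            if (i >= n || tail[i] == '\0') break;
            // Ending rule: whitespace ends the argument in 'regular' mode only.
            if (!inquote && (tail[i] == ' ' || tail[i] == '\t')) break;
            arg += tail[i++];
        }
        args.push_back(arg);
    }
    return args;
}
```

Under these rules the input "C:\" second third comes back as the single argument C:" second third, while "C:\\" second comes back as C:\ and second.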



The least intuitive result of these rules is that if you pass in a directory path that ends in a Backslash (a common example would be 'C:\'), and you surround it with Double-Quotes (like this: '"C:\"'), then you will be surprised.


  1. As expected, the first Double-Quote puts you into 'inquote' mode
  2. Also as expected the C: part of the argument is added to the argument
  3. The last Double-Quote is preceded by an odd number (one is an odd number) of Backslashes. This means that we stay in 'inquote' mode, the Backslash is not added to the argument and the Double-Quote is added to the argument. This also means that any extra arguments are slapped onto the first.


Here's an example of what that looks like. Once again, using a trivial batch file and program to show what the arguments are:


@echo echop results:
call echop "C:\" second third
@echocr
@echo textstat -enum cl results:
"texts"tat -enum cl "C:\" second third


The results are odd but, thanks to the rules, they make sense:


C:\tmp>dbg
echop results:

C:\tmp>call echop "C:\" second third
all_args=<<"C:\" second third>>
arg_1 <<"C:\">>
arg_2 <<second>>
arg_3 <<third>>
arg_4 <<>>
arg_5 <<>>

textstat -enum cl results:
;type value
GetCommandLine <<"texts"tat -enum cl "C:\" second third>>
_acmdln <<"texts"tat -enum cl "C:\" second third>>
argv_0 <<textstat>>
argv_1 <<-enum>>
argv_2 <<cl>>
argv_3 <<C:" second third>>



Note particularly that argv_3, which you might have expected to be just C:\, is instead C:" second third -- all of the remaining arguments are whacked together and the backslash is gone.

CMD.EXE, when it ran the 'ECHO' command, parsed the command line one way, but the C Run Time Library parsed it a very different way. In particular, CMD.EXE doesn't strip the quotes off of the arguments and doesn't do the backslash expansion. One could make many snarky comments about this.


You might be wondering what 'echop.bat' looks like -- here it is in all its glory:


@echo off
echo all_args=^<^<%*^>^>
echo arg_1 ^<^<%1^>^>
echo arg_2 ^<^<%2^>^>
echo arg_3 ^<^<%3^>^>
echo arg_4 ^<^<%4^>^>
echo arg_5 ^<^<%5^>^>


You might also be wondering about 'textstat' -- well, that's a much bigger program. Not that echoing command parameters is all that hard, but 'textstat' does rather a lot more than that.

Saturday, June 23, 2007

Smart Module, Dumb Module



A rarely considered part of designing modules and interfaces is: should the module be dumb, and do exactly what it's told, or should it be complex (+)? An interface like the classic "C" "strcpy" routine is a good example of simple:


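char* strcpy (char* Dest, const char* Src)
{
    // Remember the start of Dest; that is what strcpy returns.
    char* Retval = Dest;
    for (const char* p = Src; *p; p++)
    {
        *Dest++ = *p;
    }
    // Copy the terminating NUL as well.
    *Dest = '\0';
    return Retval;
}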


It does exactly and precisely what it's told: it doesn't cache the value, or decide to escape some characters in the Dest string++, or skip over quotes, or decide to stop if Dest is getting "too full".

An example of a complex module would be -- well, there aren't any good examples. Good examples should be short; complex modules aren't short. A complex version of string copying, though, would include escaping the incoming string to handle quoting rules, or adding line numbers, or converting HTML syntax to regular text.

SP Rule of Complex Modules:

Never interface two complex modules


Why not? Because you'll spend all of your time trying to convince module 'a' that it should convince module 'b' to actually do what you want and not something clever that you don't want at all.

Here is a classic example of complex modules interacting:

In the early days of the IBM PC, there were just a few kinds of graphic cards: the "monochrome" card that could just display text+++, but was very fast and very crisp, and the "cga" card which was slower and fuzzy but which could display color and bitmaps.

The first programs would be written for one or the other.

Thus, we see two dumb modules connecting. Not ideal, because you have to buy the right program, but not bad. Note that the 'dumb' modules are in fact pretty full of stuff -- but their interactions with each other are dumb.

Next came two different phases. In one phase, the program writers wanted to sell their programs on both kinds of IBM PCs -- the ones with CGA cards, and the ones with Monochrome cards. They figured out that if they queried each card for some value, they could figure out what kind of card you had attached++++. And thus the programs became complex.

In the other phase, clever hardware people figured out that it would be keen to support both the CGA and Monochrome interfaces -- that way people would buy the fancy new graphics card and be able to use programs written for either CGA or for Monochrome cards. The new video cards would listen in at both interfaces, and report back correct information on both. And thus the video card became complex.

And now for the disaster: the complex program would probe for a Monochrome graphics card. The complex video card would detect the probe and respond correctly, switching into Monochrome mode. The program would detect the response, think that it was dealing with a Monochrome card, and switch into Monochrome mode. The two sophisticated systems would mutually agree to switch into the worse of the two modes.

And it's all because there were two complex systems talking to each other.

----------------------------------------------------

+My boss says, "call it sophisticated". I think that words like "sophisticated" should be reserved for Cary Grant and Audrey Hepburn - debonair, suave, and mouthing witty sayings over martinis. This doesn't describe any of the programs I've ever worked on.

++A pox on Microsoft Vista's competing disk and registry virtualization.

+++Young readers might not actually believe this, but it was true. Memory was too expensive in those days to waste on a bitmap display unless there was a very good reason. Instead the video card had about 4k of text memory; the internal circuitry included a "character generator ROM" that would expand the individual characters into the bitmap. You can still purchase these ROMs.

++++Of course, you could have both attached. This was actually common in the Microsoft Windows 3.1 programming world because you would run Microsoft Windows on the graphics card and your debugger would run its interface on the monochrome card. The monochrome card was nice because once you sent data to it, the data would be displayed: the operating system didn't have to drive it or interpret the data.

Thursday, June 21, 2007

Palm Pilots: Could Do Better




I first played with a small computer in an EE class twenty years ago -- a tiny little SBC (Single Board Computer) with a 6801 on it. My partners and I programmed it with the help of a Unix machine which converted assembler into hex; we had to type the hex into the board ourselves. It was total fun, and I've been leaning towards little computers ever since.

My current small computer is a little Palm Tungsten E. By "little" I mean that it's more powerful than the VAX that me and ten other people used to program on professionally. There are more languages for the Palm, and it's tons easier to program -- except that Palm doesn't really want me to program it. Instead, there are lots of languages that individual programmers have made, gotten to partly work, and then abandoned, and some giganto-thing to program it on a real machine and then download -- but wow! What a sea of mediocrity, and me stuck here without a sextant. Or map.

Looking at the Palm website, they have a ton of documents -- but half are about the shiny new way of programming for Palm (but which doesn't work on mine because it's only for the 'new' Palm OS that it's pretty clear isn't ever going to be released), and half are for really old Palms (which aren't super useful for me).

What's missing is simplicity. My time for poking at this is measured in hours and half hours -- a little more if I wake up extra early, or if the kids sleep late. If, in that hour, I can't figure out what to download, have the download just work and just install, and then write a simple "hello world" program -- well then, it's too late. (I looked at one Palm program that, for "convenience", divided itself into fifty pieces so you could add this feature, or not that feature, or something to support really old Palms, or oddball ones. I certainly wasn't about to try to figure out what was what, and didn't even extract it from the zip file.)

Something that's done well, on the other hand, is 'ant' -- a build system that works on Windows (and probably others). Lots of decent features, reasonable documentation, and when something fails, I always know it's my fault :-)