Tuesday, June 26, 2007

CMD.EXE and C Run Time Library Argument Parsing

CMD.EXE Parsing - Splitting Into Arguments



Microsoft's CMD.EXE parses command lines into arguments in perhaps the least intuitive and most aggravating manner possible. You may think that it's documented "somewhere" -- but it's not. Even worse, each command line is actually parsed twice. The first parse, by CMD.EXE, replaces environment variables and pseudo-environment variables and does caret-fiddling. CMD.EXE has also, by this time, split the command line into arguments. CMD.EXE then calls your program, passing in not the carefully split-up arguments but just the straight command line. The C Run Time Library's 'parse_cmdline' routine (in 'stdargv.c') then takes the command line and splits it up into arguments again.

Today I'm just going to talk about the C Run-Time Library parsing. The CMD.EXE parsing is a whole different topic, also worthy of study.

Here is a trivial test batch file that simply calls batch file 'echop.bat' and C program 'textstat'; these programs in turn just echo their arguments (textstat can actually do rather a lot more). Here's a sample of the two when they're given normal arguments:


REM
REM try a batch program
REM
call echop first second third

REM
REM and a regular C program
REM
textstat -enum cl first second third



The results of this are straightforward:


echop results:

C:\tmp>call echop first second third
all_args=<<first second third>>
arg_1 <<first>>
arg_2 <<second>>
arg_3 <<third>>
arg_4 <<>>
arg_5 <<>>

textstat -enum cl results:
;type value
GetCommandLine <<textstat -enum cl first second third>>
_acmdln <<textstat -enum cl first second third>>
argv_0 <<textstat>>
argv_1 <<-enum>>
argv_2 <<cl>>
argv_3 <<first>>
argv_4 <<second>>
argv_5 <<third>>




But now let's look at the basic C Run Time Library rules for parsing the command line:


  1. Special characters are Space, Tab, Double-Quote ("), Backslash, and Nul (the terminating 'NUL', or zero, character)
  2. The first argument (the program name) is parsed specially
  3. There are two modes: 'inquote' mode and regular. Generally speaking, in regular mode arguments are terminated by Spaces, Tabs and of course Nul. The parser starts in 'regular' mode. The full set of 'regular' rules is:

    1. Whitespace rule: All initial Spaces and Tabs are skipped over
    2. Ending rule: A Space, Tab, or Nul will end the current argument
    3. Done rule: A Nul will also end all parsing
    4. Quoting " \" \\" \\"" rule: Double-Quotes are preceded by a number of Backslashes (the number might be zero).

      1. Even rule: If the number of Backslashes is even (zero, two, four, etc), then switch into 'inquote' mode and add in half of the Backslashes (none if there were none, one if there were two, etc). The Double-Quote is not added to the current argument.
      2. Odd rule: If the number of Backslashes is odd (one, three, five, etc) then stay in 'regular' mode and add to the current argument half of the Backslashes (rounding down so that one Backslash becomes none in the argument, three Backslashes become one, five become two, etc) and the Double-Quote.

    5. Backslashes-not-escape rule: Backslashes are not otherwise escape characters. They are only special if they come before a Double-Quote.

    The full set of rules for 'inquote' mode are:

    1. Ending rule: A Nul will end the current argument. A Double-Quote, although commonly put at the end of an argument, does not technically end the argument (instead it switches back to 'regular' mode and the next Space or Tab ends the argument)
    2. Done rule: A Nul will also end all parsing
    3. Quoting " \" \\" \\"" rule: Double-Quotes are preceded by a number of Backslashes (the number might be zero) and possibly followed by another Double-Quote.

      1. Even rule: If the number of Backslashes is even (zero, two, four, etc), then

        1. If there is a following Double-Quote, then stay in 'inquote' mode, add to the current argument half of the Backslashes (none if there were none, one if there were two, etc), and the following Double-Quote only.
        2. If there is not a following Double-Quote, then switch back to 'regular' mode and add in half of the Backslashes (none if there were none, one if there were two, etc). The Double-Quote is not added to the current argument.

      2. Odd rule: If the number of Backslashes is odd (one, three, five, etc) then stay in 'inquote' mode and add to the current argument half of the Backslashes (rounding down so that one Backslash becomes none in the argument, three Backslashes become one, five become two, etc) and the Double-Quote.

    4. Backslashes-not-escape rule: Backslashes are not otherwise escape characters. They are only special if they come before a Double-Quote.
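
The rules above can be sketched in C. This is a minimal illustration written from the rules as listed here, not the actual 'parse_cmdline' source. The names 'split_args', 'put' and 'MAX_LEN' are made up for the sketch, and it assumes the program name (which is parsed specially) has already been removed from the command line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_LEN 256   /* assumed per-argument buffer size for this sketch */

/* Append `count` copies of `c` to out without overflowing; returns the new length. */
size_t put(char *out, size_t len, char c, size_t count)
{
    while (count-- > 0 && len < MAX_LEN - 1)
        out[len++] = c;
    return len;
}

/* Split `cmd` (the text after the program name) into args[]; returns the count. */
int split_args(const char *cmd, char args[][MAX_LEN], int max_args)
{
    const char *p = cmd;
    int argc = 0;

    assert(cmd != NULL && args != NULL);
    while (argc < max_args) {
        char *out = args[argc];
        size_t len = 0;
        int inquote = 0;                 /* each argument starts in 'regular' mode */

        p += strspn(p, " \t");           /* Whitespace rule */
        if (*p == '\0')
            break;                       /* Done rule */

        for (;;) {
            size_t slashes = 0;
            while (*p == '\\') { slashes++; p++; }

            if (*p == '"') {             /* Quoting rule: add half the Backslashes */
                len = put(out, len, '\\', slashes / 2);
                if (slashes % 2 == 1) {
                    len = put(out, len, '"', 1);  /* Odd rule: literal Double-Quote */
                } else if (inquote && p[1] == '"') {
                    len = put(out, len, '"', 1);  /* Even rule, doubled quote: stay in 'inquote' */
                    p++;
                } else {
                    inquote = !inquote;           /* Even rule: switch modes */
                }
                p++;
                continue;
            }

            len = put(out, len, '\\', slashes);   /* Backslashes-not-escape rule */
            if (*p == '\0' || (!inquote && (*p == ' ' || *p == '\t')))
                break;                            /* Ending rule */
            len = put(out, len, *p++, 1);
        }
        out[len] = '\0';
        argc++;
    }
    return argc;
}
```

Given the text '"C:\" second third', this sketch returns one argument, 'C:" second third'. Given 'first second third' it returns the three arguments you would expect.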



The least intuitive result of these rules is that if you pass in a directory path that ends in a Backslash (a common example would be 'C:\'), and you surround it with Double-Quotes (like this: '"C:\"'), then you will be surprised.


  1. As expected, the first Double-Quote puts you into 'inquote' mode
  2. Also as expected the C: part of the argument is added to the argument
  3. The last Double-Quote is preceded by an odd number (one is an odd number) of Backslashes. This means that we stay in 'inquote' mode, the Backslash is not added to the argument and the Double-Quote is added to the argument. This also means that any extra arguments are slapped onto the first.


Here's an example of what that looks like. Once again, using a trivial batch file and program to show what the arguments are:


@echo echop results:
call echop "C:\" second third
@echocr
@echo textstat -enum cl results:
"texts"tat -enum cl "C:\" second third


The results are odd, but thanks to the rules will make sense:


C:\tmp>dbg
echop results:

C:\tmp>call echop "C:\" second third
all_args=<<"C:\" second third>>
arg_1 <<"C:\">>
arg_2 <<second>>
arg_3 <<third>>
arg_4 <<>>
arg_5 <<>>

textstat -enum cl results:
;type value
GetCommandLine <<"texts"tat -enum cl "C:\" second third>>
_acmdln <<"texts"tat -enum cl "C:\" second third>>
argv_0 <<textstat>>
argv_1 <<-enum>>
argv_2 <<cl>>
argv_3 <<C:" second third>>



Note particularly that argv_3, which you might have expected to be just C:\, is instead C:" second third -- all of the remaining arguments are whacked together and the backslash is gone.

CMD.EXE, when it ran the 'ECHO' command, parsed the command line one way, but the C Run Time Library parsed it a very different way. In particular, CMD.EXE doesn't strip the quotes off of the arguments and doesn't do the backslash expansion. One could make many snarky comments about this.
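
One way to dodge the surprise follows from the Even rule: if the trailing Backslashes are doubled, the closing Double-Quote really does close the quote, so '"C:\\"' comes through as 'C:\'. Here is a small sketch of that idea. 'quote_path' is a hypothetical helper, not part of any library, and it assumes the path contains no Double-Quotes of its own.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: wrap `path` in Double-Quotes, doubling any trailing
   Backslashes so the closing quote is preceded by an even number of them.
   Returns 0 on success, -1 if `path` contains a quote or `out` is too small. */
int quote_path(const char *path, char *out, size_t out_size)
{
    size_t len, trailing = 0, i = 0;

    assert(path != NULL && out != NULL);
    if (strchr(path, '"') != NULL)
        return -1;                       /* embedded quotes are out of scope here */

    len = strlen(path);
    while (trailing < len && path[len - 1 - trailing] == '\\')
        trailing++;                      /* count the Backslashes at the very end */

    if (len + trailing + 3 > out_size)   /* two quotes plus the terminating Nul */
        return -1;

    out[i++] = '"';
    memcpy(out + i, path, len);          /* interior Backslashes are left alone */
    i += len;
    memset(out + i, '\\', trailing);     /* double only the trailing ones */
    i += trailing;
    out[i++] = '"';
    out[i] = '\0';
    return 0;
}
```

Interior Backslashes need no treatment because of the Backslashes-not-escape rule: only the ones immediately before a Double-Quote matter.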


You might be wondering what 'echop.bat' looks like -- here it is in all its glory:


@echo off
echo all_args=^<^<%*^>^>
echo arg_1 ^<^<%1^>^>
echo arg_2 ^<^<%2^>^>
echo arg_3 ^<^<%3^>^>
echo arg_4 ^<^<%4^>^>
echo arg_5 ^<^<%5^>^>


You might also be wondering about 'textstat' -- well, that's a much bigger program. Not that echoing command parameters is all that hard, but 'textstat' does rather a lot more than that.
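
For the curious, the argument-echoing part really is tiny. This is a sketch, not the 'textstat' source: 'format_arg' and 'echo_args' are made-up names, and it only covers the portable 'argv' lines, not the Windows-specific 'GetCommandLine' and '_acmdln' lines shown earlier.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format one line in the "argv_N <<value>>" style; returns what snprintf returns. */
int format_arg(char *buf, size_t size, int index, const char *value)
{
    assert(buf != NULL && value != NULL);
    return snprintf(buf, size, "argv_%d <<%s>>", index, value);
}

/* Print every argument, one per line; call as echo_args(argc, argv) from main(). */
void echo_args(int argc, char **argv)
{
    char line[1024];   /* assumed to be long enough for this sketch */
    int i;

    for (i = 0; i < argc; i++) {
        format_arg(line, sizeof line, i, argv[i]);
        puts(line);
    }
}
```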

Saturday, June 23, 2007

Smart Module, Dumb Module



A rarely considered part of designing modules and interfaces is: should the module be dumb, and do exactly what it's told, or should it be complex (+)? An interface like the classic "C" "strcpy" routine is a good example of simple:


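char* strcpy (char* Dest, const char* Src)
{
    char* const Start = Dest;           /* strcpy returns the original Dest */
    for (const char* p = Src; *p; p++)  /* walk Src up to its terminating NUL */
    {
        *Dest++ = *p;
    }
    *Dest = '\0';                       /* the terminator gets copied too */
    return Start;
}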


It does exactly and precisely what it's told: it doesn't cache the value, or decide to escape some characters in the Dest string++, or skip over quotes, or decide to stop if Dest is getting "too full".

An example of a complex module would be -- well, there aren't any good examples. Good examples should be short; complex modules aren't short. A complex version of string copying, though, would include escaping the incoming string to handle quoting rules, or adding line numbers, or converting HTML syntax to regular text.

SP Rule of Complex Modules:

Never interface two complex modules


Why not? Because you'll spend all of your time trying to convince module 'a' that it should convince module 'b' to actually do what you want and not something clever that you don't want at all.

Here is a classic example of complex modules interacting:

In the early days of the IBM PC, there were just a few kinds of graphics cards: the "monochrome" card that could just display text+++, but was very fast and very crisp, and the "cga" card which was slower and fuzzy but which could display color and bitmaps.

The first programs would be written for one or the other.

Thus, we see two dumb modules connecting. Not ideal, because you have to buy the right program, but not bad. Note that the 'dumb' modules are in fact pretty full of stuff -- but their interactions with each other are dumb.

Next came two different phases. In one phase, the program writers wanted to sell their programs on both kinds of IBM PCs -- the ones with CGA cards, and the ones with Monochrome cards. They figured out that if they queried each card for some value, they could figure out what kind of card you had attached++++. And thus the programs became complex.

In the other phase, clever hardware people figured out that it would be keen to support both the CGA and Monochrome interfaces -- that way people would buy the fancy new graphics card and be able to use programs written for either CGA or for Monochrome cards. The new video cards would listen in at both interfaces, and report back correct information on both. And thus the video card became complex.

And now for the disaster: the complex program would probe for a Monochrome graphics card. The complex video card would detect the probe and respond correctly, switching into Monochrome mode. The program would detect the response, think that it was dealing with a Monochrome card, and switch into Monochrome mode. The two sophisticated systems would mutually agree to switch into the worse of the two modes.

And it's all because there were two complex systems talking to each other.

----------------------------------------------------

+My boss says, "call it sophisticated". I think that words like "sophisticated" should be reserved for Cary Grant and Audrey Hepburn - debonair, suave, and mouthing witty sayings over martinis. This doesn't describe any of the programs I've ever worked on.

++A pox on Microsoft Vista's competing disk and registry virtualization.

+++Young readers might not actually believe this, but it was true. Memory was too expensive in those days to waste on a bitmap display unless there was a very good reason. Instead the video card had about 4k of text memory; the internal circuitry included a "character generator ROM" that would expand the individual characters into the bitmap. You can still purchase these ROMs.

++++Of course, you could have both attached. This was actually common in the Microsoft Windows 3.1 programming world because you would run Microsoft Windows on the graphics card and your debugger would run its interface on the monochrome card. The monochrome card was nice because once you sent data to it, the data would be displayed: the operating system didn't have to drive it or interpret the data.

Thursday, June 21, 2007

Palm Pilots: Could Do Better




I first played with a small computer in an EE class twenty years ago -- a tiny little SBC (Single Board Computer) with a 6801 on it. My partners and I programmed it with the help of a Unix machine which converted assembler into hex; we had to type the hex into the board ourselves. It was total fun, and I've been leaning towards little computers ever since.

My current small computer is a little Palm Tungsten E. By "little" I mean that it's more powerful than the VAX that me and ten other people used to program on professionally. There are more languages for the Palm, and it's tons easier to program -- except that Palm doesn't really want me to program it. Instead, there are lots of languages that individual programmers have made, gotten to partly work, and then abandoned, and some giganto-thing to program it on a real machine and then download -- but wow! What a sea of mediocrity, and me stuck here without a sextant. Or map.

Looking at the Palm website, they have a ton of documents -- but half are about the shiny new way of programming for Palm (but which doesn't work on mine because it's only for the 'new' Palm OS that it's pretty clear isn't ever going to be released), and half are for really old Palms (which aren't super useful for me).

What's missing is simplicity. My time for poking at this is measured in hours and half hours -- a little more if I wake up extra early, or if the kids sleep late. If, in that hour, I can't figure out what to download, have the download just work and just install, and then write a simple "hello world" program -- well then, it's too late. (I looked at one Palm program that, for "convenience", divided itself into fifty pieces so you could add this feature, or not that feature, or something to support really old Palms, or oddball ones. I certainly wasn't about to try to figure out what was what, and didn't even extract it from the zip file.)

Something that's done well, on the other hand, is 'ant' -- a build system that works on Windows (and probably others). Lots of decent features, reasonable documentation, and when something fails, I always know it's my fault :-)