Tuesday, June 26, 2007

CMD.EXE and C Run Time Library Argument Parsing

CMD.EXE Parsing - Splitting Into Arguments



Microsoft's CMD.EXE parses command lines into arguments in perhaps the least intuitive and most aggravating manner possible. You may think that it's documented "somewhere" -- but it's not. Even worse, each command line is actually parsed twice. The first parse, by CMD.EXE, replaces environment variables and pseudo-environment variables and does caret-fiddling. CMD.EXE has also, by this time, split the command line into arguments. CMD.EXE then calls your program, passing in not the carefully split-up arguments but just the raw command line. The C Run Time Library's 'parse_cmdline' routine (in 'stdargv.c') then takes the command line and splits it up into arguments again.

Today I'm just going to talk about the C Run-Time Library parsing. The CMD.EXE parsing is a whole different topic, also worthy of study.

Here is a trivial test batch file that simply calls batch file 'echop.bat' and C program 'textstat'; these programs in turn just echo their arguments (textstat can actually do rather a lot more). Here's a sample of the two when they're given normal arguments:


REM
REM try a batch program
REM
call echop first second third

REM
REM and a regular C program
REM
textstat -enum cl first second third



The results of this are straightforward:


echop results:

C:\tmp>call echop first second third
all_args=<<first second third>>
arg_1 <<first>>
arg_2 <<second>>
arg_3 <<third>>
arg_4 <<>>
arg_5 <<>>

textstat -enum cl results:
;type value
GetCommandLine <<textstat -enum cl first second third>>
_acmdln <<textstat -enum cl first second third>>
argv_0 <<textstat>>
argv_1 <<-enum>>
argv_2 <<cl>>
argv_3 <<first>>
argv_4 <<second>>
argv_5 <<third>>




But now let's look at the basic C Run Time Library rules for parsing the command line:


  1. Special characters are Space, Tab, Double-Quote ("), Backslash, and Nul (the terminating 'NUL', or zero, character)
  2. The first argument (the program name) is parsed specially
  3. There are two modes: 'inquote' mode and regular. Generally speaking, in regular mode arguments are terminated by Spaces, Tabs and of course Nul. The parser starts in 'regular' mode. The full set of rules for 'regular' mode are:

    1. Whitespace rule: All initial Spaces and Tabs are skipped over
    2. Ending rule: A Space, Tab, or Nul will end the current argument
    3. Done rule: A Nul will also end all parsing
    4. Quoting " \" \\" \\"" rule: Double-Quotes are preceded by a number of Backslashes (the number might be zero).

      1. Even rule: If the number of Backslashes is even (zero, two, four, etc), then switch into 'inquote' mode and add in half of the Backslashes (none if there were none, one if there were two, etc). The Double-Quote is not added to the current argument.
      2. Odd rule: If the number of Backslashes is odd (one, three, five, etc), then stay in 'regular' mode and add to the current argument half of the Backslashes (rounding down, so that one Backslash becomes none in the argument, three Backslashes become one, five become two, etc) and the Double-Quote.

    5. Backslashes-not-escape rule: Backslashes are not otherwise escape characters. They are only special if they come before a Double-Quote.

    The full set of rules for 'inquote' mode are:

    1. Ending rule: A Nul will end the current argument. A Double-Quote, although commonly put at the end of an argument, does not technically end the argument (instead it switches back to 'regular' mode and the next Space or Tab ends the argument)
    2. Done rule: A Nul will also end all parsing
    3. Quoting " \" \\" \\"" rule: Double-Quotes are preceded by a number of Backslashes (the number might be zero) and possibly followed by another Double-Quote.

      1. Even rule: If the number of Backslashes is even (zero, two, four, etc), then

        1. If there is a following Double-Quote, then stay in 'inquote' mode, add to the current argument half of the Backslashes (none if there were none, one if there were two, etc), and the following Double-Quote only.
        2. If there is not a following Double-Quote, then switch back to 'regular' mode and add in half of the Backslashes (none if there were none, one if there were two, etc). The Double-Quote is not added to the current argument.

      2. Odd rule: If the number of Backslashes is odd (one, three, five, etc), then stay in 'inquote' mode and add to the current argument half of the Backslashes (rounding down, so that one Backslash becomes none in the argument, three Backslashes become one, five become two, etc) and the Double-Quote.

    4. Backslashes-not-escape rule: Backslashes are not otherwise escape characters. They are only special if they come before a Double-Quote.
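
To make the two rule sets concrete, here is a small Python sketch of them (Python rather than C purely for brevity). It is not the code from 'stdargv.c'; it only follows the rules as written above, and the name 'split_args' is made up for this example. It also leaves out the special handling of the program name (rule 2) and treats the end of the string as the Nul, so take it as a model of the rules rather than of any particular Run Time Library version.

```python
def split_args(cmdline):
    """Split a command line into arguments using the rules listed above.

    Illustrative only: the special parsing of the program name is omitted,
    and the end of the string plays the role of the terminating Nul.
    """
    args = []
    i, n = 0, len(cmdline)
    while True:
        # Whitespace rule: skip leading Spaces and Tabs.
        while i < n and cmdline[i] in " \t":
            i += 1
        if i >= n:  # Done rule: nothing left to parse.
            break
        current = []
        inquote = False  # every argument starts in 'regular' mode
        while i < n:
            # Count a run of Backslashes.
            slashes = 0
            while i < n and cmdline[i] == "\\":
                slashes += 1
                i += 1
            if i < n and cmdline[i] == '"':
                # Quoting rule: keep half of the Backslashes, rounding down.
                current.append("\\" * (slashes // 2))
                if slashes % 2 == 1:
                    # Odd rule: literal Double-Quote, mode unchanged.
                    current.append('"')
                elif inquote and i + 1 < n and cmdline[i + 1] == '"':
                    # Even rule, 'inquote', followed by another Double-Quote:
                    # add one Double-Quote and stay in 'inquote' mode.
                    current.append('"')
                    i += 1
                else:
                    # Even rule otherwise: switch modes, add nothing.
                    inquote = not inquote
                i += 1
                continue
            # Backslashes-not-escape rule: copy them through unchanged.
            current.append("\\" * slashes)
            if i >= n:
                break
            ch = cmdline[i]
            if not inquote and ch in " \t":
                break  # Ending rule ('regular' mode only)
            current.append(ch)
            i += 1
        args.append("".join(current))
    return args


print(split_args("first second third"))  # ['first', 'second', 'third']
```

Because the mode is just a flag that each even-count Double-Quote flips, a Double-Quote in the middle of a word is perfectly legal and simply vanishes from the argument.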



The least intuitive result of these rules is that if you pass in a directory path that ends in a Backslash (a common example would be 'C:\'), and you surround it with Double-Quotes (like this: '"C:\"'), then you will be surprised.


  1. As expected, the first Double-Quote puts you into 'inquote' mode
  2. Also as expected, the 'C:' part is added to the argument
  3. The last Double-Quote is preceded by an odd number (one is an odd number) of Backslashes. This means that we stay in 'inquote' mode, the Backslash is not added to the argument and the Double-Quote is added to the argument. This also means that any following arguments are slapped onto this one.


Here's an example of what that looks like. Once again, using a trivial batch file and program to show what the arguments are:


@echo echop results:
call echop "C:\" second third
@echocr
@echo textstat -enum cl results:
"texts"tat -enum cl "C:\" second third


The results are odd but, given the rules, they make sense:


C:\tmp>dbg
echop results:

C:\tmp>call echop "C:\" second third
all_args=<<"C:\" second third>>
arg_1 <<"C:\">>
arg_2 <<second>>
arg_3 <<third>>
arg_4 <<>>
arg_5 <<>>

textstat -enum cl results:
;type value
GetCommandLine <<"texts"tat -enum cl "C:\" second third>>
_acmdln <<"texts"tat -enum cl "C:\" second third>>
argv_0 <<textstat>>
argv_1 <<-enum>>
argv_2 <<cl>>
argv_3 <<C:" second third>>



Note particularly that argv_3, which you might have expected to be just C:\, is instead C:" second third -- all of the remaining arguments are whacked together and the backslash is gone.
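
If you control how the command line is built, the even rule suggests a way around this: double any Backslashes that sit right before a Double-Quote, so that '"C:\\"' comes out as C:\. Here is a hedged Python sketch of a quoting helper built on that idea. The name 'quote_arg' is hypothetical, it targets only the C Run Time Library rules described above, and it does nothing about CMD.EXE's own special characters (carets, percent signs and friends).

```python
def quote_arg(arg):
    """Wrap one argument in Double-Quotes so the rules above return it unchanged.

    Illustrative sketch only; it does not protect against CMD.EXE's own parsing.
    """
    out = ['"']
    slashes = 0  # length of the current run of Backslashes
    for ch in arg:
        if ch == "\\":
            slashes += 1
            continue
        if ch == '"':
            # Odd rule: 2n+1 Backslashes plus a Double-Quote are read back
            # as n Backslashes and a literal Double-Quote.
            out.append("\\" * (2 * slashes + 1) + '"')
        else:
            # Backslashes not in front of a Double-Quote are left alone.
            out.append("\\" * slashes + ch)
        slashes = 0
    # Even rule: 2n Backslashes before the closing Double-Quote are read
    # back as n Backslashes, and the Double-Quote leaves 'inquote' mode.
    out.append("\\" * (2 * slashes) + '"')
    return "".join(out)


print(quote_arg("C:\\"))  # prints "C:\\"
```

Backslashes in the middle of a path are untouched, so only the trailing Backslash of something like C:\ actually gets doubled.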

CMD.EXE, when it ran the 'ECHO' command, parsed the command line one way, but the C Run Time Library parsed it a very different way. In particular, CMD.EXE doesn't strip the quotes off of the arguments and doesn't do the backslash expansion. One could make many snarky comments about this.


You might be wondering what 'echop.bat' looks like -- here it is in all its glory:


@echo off
echo all_args=^<^<%*^>^>
echo arg_1 ^<^<%1^>^>
echo arg_2 ^<^<%2^>^>
echo arg_3 ^<^<%3^>^>
echo arg_4 ^<^<%4^>^>
echo arg_5 ^<^<%5^>^>


You might also be wondering about 'textstat' -- well, that's a much bigger program. Not that echoing command parameters is all that hard, but 'textstat' does rather a lot more than that.
