Phases of translation (annotated)

C99 defines eight translation phases. They describe how C source code is processed into a program image.

Compiled from https://en.cppreference.com/w/c/language/translation_phases and the ISO C99 standard.

The C source file is processed by the compiler as if the following phases take place, in this exact order. Actual implementation may combine these actions or process them differently as long as the behavior is the same.

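In other words, the translation phases only require that a compiler's observable behavior be equivalent to the steps described here; an implementation does not have to carry them out literally.

phase 1 character set mapping and end-of-line (<EOL>) conversion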

Physical source file multibyte characters are mapped, in an implementation defined manner, ① to the source character set(introducing ② new-line characters for end-of-line indicators) if necessary. ③ Trigraph sequences are replaced by corresponding single-character internal representations.

The source character set mentioned here refers to how the source file is interpreted (its encoding). It is described in section 5.2.1 of the standard:

Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set).

The following links explain the source character set in more detail:
https://stackoverflow.com/questions/27872517/what-are-the-different-character-sets-used-for
https://stackoverflow.com/questions/15558977/characters-defined-using-uxxxx-format-display-the-wrong-character

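The gcc cpp (preprocessor) documentation also has an explanation worth reading (see: https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html#Character-sets ). Note, however, that the gcc cpp documentation and the standard use the terms slightly differently. The source character set in the standard should correspond to what gcc cpp calls the input charset (the -finput-charset option), whereas "source character set" in the cpp documentation means the character set used for internal processing once the actual C source file has been read in at this phase.

C99 does not specify the compiler's internal representation here. The C99 Rationale (page 20) discusses some differences between C and C++ in this area: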
http://www.open-std.org/jtc1/sc22/wg14/www/C99RationaleV5.10.pdf

By contrast, phase 1 in C++ converts the input to basic source characters plus UCNs (universal character names). Excerpt from https://en.cppreference.com/w/cpp/language/translation_phases :
Any source file character that cannot be mapped to a character in the basic source character set is replaced by its universal character name (escaped with \u or \U) or by some implementation-defined form that is handled equivalently.

Besides mapping to the source character set, phase 1 also replaces each <EOL> with <LF>. The new-line characters mentioned in the later phases come from this step.

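phase 2 line splicing: remove the \ at the end of a line to join lines

Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines.

Note: the new-line character here is the one converted from <EOL> in phase 1.

The rule only looks for \ + <LF> when splicing. So what happens with \ + whitespace + <LF>? According to the text above, the lines should not be spliced. gcc's implementation relaxes this restriction, however, and only issues a warning, as its documentation explains: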

The preprocessor treatment of escaped newlines is more relaxed than that specified by the C90 standard, which requires the newline to immediately follow a backslash.

In addition, the standard requires at this phase that (1) a non-empty file ends with <LF>, and (2) the file does not end with \ + <LF>.

A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such splicing takes place

Note that the standard uses "shall" here, so a violation is undefined behavior. gcc merely issues a warning.

phase 3 tokenization (for preprocessing)

The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments)

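In this phase the source file is split into white space (including comments, each of which is replaced by one space) and preprocessing tokens. Preprocessing tokens are described in detail in section 6.4: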

A preprocessing token is the minimal lexical element of the language in translation phases 3 through 6.

That is, they are the tokens the preprocessor sees before compilation proper (phase 7).

Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined

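Each comment is replaced by a single <SPACE>. New-line characters must be retained; any other run of one or more white-space characters may either be kept as is or be replaced by a single <SPACE>, at the implementation's choice.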

phase 4 preprocessing

This phase executes all preprocessing directives and performs macro expansion.

The description of #include is worth noting:

A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively

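#include is processed recursively: the included file itself goes through phases 1 to 4. This explains why the compiler reports an error when a.h and b.h include each other, since the inclusion never terminates unless it is cut off with #pragma once or #ifndef include guards. (P.S. #pragma once is not part of the standard.)

phase 5 mapping to execution character set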

Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set;

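This step mainly converts the characters and escape sequences in character constants and string literals to the execution character set, ready for the compiler proper.

Reference: https://stackoverflow.com/questions/3768363/character-sets-not-clear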

phase 6 string concatenation

Adjacent string literal tokens are concatenated
This is plain string literal concatenation; by this point the literals are already in the execution character set. "str1" "str2" becomes "str1str2".

phase 7 compile

Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.

This step is compilation proper: the tokens are analyzed and translated as a translation unit.

phase 8 link

All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment

The final step is linking, which produces the program image.
