116: 一生一芯 (One Student One Chip) E4 (2)

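Preprocessing

Preprocessing is the step that runs before compilation proper; in essence it is text processing. It typically does the following:

Header file inclusion
Macro expansion
Comment removal
Joining lines that were split with a line-continuation character (a trailing \)
Handling conditional compilation: #ifdef/#else/#endif
Handling the stringification operator #
Handling the token-pasting operator ##

Passing -E to the compiler prints the preprocessing result, for example gcc -E a.c.

E4.1.1 Observing the preprocessing result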

#include <stdio.h>
#define MSG "Hello \
World!\n"
#define _str(x) #x
#define _concat(a, b) a##b
int main() {
  printf(MSG /* "hi!\n" */);
#ifdef __riscv
  printf("Hello RISC-V!\n");
#endif
  _concat(pr, intf)(_str(RISC-V));
  return 0;
}

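Run the gcc command above (gcc -E on this file), then compare the preprocessing result with the source file.

The macros have been expanded, and the contents of stdio.h have been pulled in.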
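One detail of the # operator is easy to miss: it stringifies its argument exactly as written, without expanding it first. The sketch below illustrates this with a two-level macro. The names STR_RAW, STR, CONCAT, ISA and answer42 are made up for this illustration and are not part of the exercise.

```c
#define STR_RAW(x)   #x          /* stringify the argument exactly as written */
#define STR(x)       STR_RAW(x)  /* expand the argument first, then stringify */
#define CONCAT(a, b) a##b        /* paste two tokens into a single token */

#define ISA RISC-V               /* hypothetical macro, for illustration only */

static int answer42 = 42;        /* CONCAT(answer, 42) names this variable */

/* # is applied before ISA is expanded, so this yields "ISA". */
const char *raw_name(void) { return STR_RAW(ISA); }

/* The extra macro level expands ISA first, so this yields "RISC-V". */
const char *expanded_name(void) { return STR(ISA); }

/* answer##42 becomes the identifier answer42. */
int pasted_value(void) { return CONCAT(answer, 42); }
```

Running gcc -E on this file shows the two string literals and the pasted identifier directly in the output.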

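E4.1.2 How header files are found

Run gcc -E a.c --verbose > /dev/null and look for the header-related parts of the output.

Search man gcc for the -I option and read its description.

Then create a stdio.h file of your own and use -I to make gcc include it instead of the standard library's stdio.h. Check with -E that the preprocessing result matches your expectation.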

[20:38:07] Ylin@Ylin /home/Ylin/programs/C
> gcc -E code.c --verbose > /dev/null
...
 /usr/libexec/gcc/x86_64-linux-gnu/14/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu code.c -mtune=generic -march=x86-64 -fasynchronous-unwind-tables -dumpbase code.c -dumpbase-ext .c
# The line above is the command that gcc -E code.c --verbose actually runs: it invokes cc1 (the C compiler front end) to do the work
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/14/include-fixed/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/14/include-fixed"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/14/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /usr/lib/gcc/x86_64-linux-gnu/14/include    # headers shipped with gcc
 /usr/local/include                          # locally installed headers
 /usr/include/x86_64-linux-gnu               # architecture-specific headers
 /usr/include                                # system headers
End of search list.
# Below are the compiler search path and the library search path
COMPILER_PATH=/usr/libexec/gcc/x86_64-linux-gnu/14/:/usr/libexec/gcc/x86_64-linux-gnu/14/:/usr/libexec/gcc/x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/14/:/usr/lib/gcc/x86_64-linux-gnu/
LIBRARY_PATH=/usr/lib/gcc/x86_64-linux-gnu/14/:/usr/lib/gcc/x86_64-linux-gnu/14/../../../x86_64-linux-gnu/:/usr/lib/gcc/x86_64-linux-gnu/14/../../../../lib/:/lib/x86_64-linux-gnu/:/lib/../lib/:/usr/lib/x86_64-linux-gnu/:/usr/lib/../lib/:/usr/lib/gcc/x86_64-linux-gnu/14/../../../:/lib/:/usr/lib/
COLLECT_GCC_OPTIONS='-E' '-v' '-mtune=generic' '-march=x86-64'

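Directories passed with -I are searched before the system directories, and for #include "..." the directory of the including file is searched first. Either way, our own stdio.h can shadow the system stdio.h: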

[20:46:33] Ylin@Ylin /home/Ylin/programs/C
> cat stdio.h
#define MSG "This is a fake stdio.h\n"
[20:46:41] Ylin@Ylin /home/Ylin/programs/C
> gcc -E code.c
# 0 "code.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "code.c"
# 1 "stdio.h" 1
# 2 "code.c" 2

int main() {
  printf("This is a fake stdio.h\n" );
  printf("RISC-V");
  return 0;
}

E4.1.2 Observing the preprocessing result (2)

Install gcc for the RISC-V architecture and preprocess with it:
apt-get install g++-riscv64-linux-gnu
riscv64-linux-gnu-gcc -E a.c
What has changed in the preprocessing result?

The key change is here:

# 5 "code.c"
int main() {
  printf("Hello World\n" );

  printf("Hello RISC-V!\n");

  printf("RISC-V");
  return 0;
}

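There is an extra Hello RISC-V line, which shows that the #ifdef __riscv branch is now taken. The macros predefined by a compiler can be listed with echo | gcc -dM -E - | sort.

E4.1.3 Comparing the predefined macros of gcc and riscv64-linux-gnu-gcc

Compare the predefined macros of gcc and riscv64-linux-gnu-gcc to see how the two differ during preprocessing. A rough idea of the differences is enough; there is no need to understand every macro.

Hint:

diff or a similar command will quickly show where two files differ.

To learn what a particular macro means, consult the gcc manual.

Result (heavily abridged):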

0a1,2
> #define __amd64 1
> #define __amd64__ 1
3a6,7
> #define __ATOMIC_HLE_ACQUIRE 65536
> #define __ATOMIC_HLE_RELEASE 131072
...
< #define __riscv 1
< #define __riscv_a 2001000
< #define __riscv_arch_test 1
< #define __riscv_atomic 1
< #define __riscv_c 2000000
< #define __riscv_cmodel_medany 1
< #define __riscv_compressed 1
< #define __riscv_d 2002000
< #define __riscv_div 1
286a312,313
> #define __SEG_FS 1
> #define __SEG_GS 1
294a322
> #define __SIZEOF_FLOAT128__ 16
295a324
> #define __SIZEOF_FLOAT80__ 16
308a338,341
> #define __SSE__ 1
> #define __SSE2__ 1
> #define __SSE2_MATH__ 1
> #define __SSE_MATH__ 1
365a399,400
> #define __x86_64 1
> #define __x86_64__ 1

As expected, riscv64-linux-gnu-gcc predefines #define __riscv 1 (the < lines), while the native gcc predefines the x86-64 macros instead (the > lines).
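These predefined macros are what make target-specific code possible. As a minimal sketch, the hypothetical function target_arch below selects a string at preprocessing time, using only macros that appear in the diff above (__riscv and __x86_64__):

```c
/* Report the target architecture chosen by the compiler's predefined macros. */
const char *target_arch(void) {
#if defined(__riscv)
    return "riscv";      /* defined by riscv64-linux-gnu-gcc */
#elif defined(__x86_64__)
    return "x86_64";     /* defined by the native gcc on an x86-64 machine */
#else
    return "other";      /* any target not covered above */
#endif
}
```

Preprocessing this with each compiler (-E) leaves exactly one return statement in the function body.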

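Compilation

Compilation translates one language into another; that is the compiler's job. For example, gcc translates C into x86 assembly, while riscv64-linux-gnu-gcc translates C into riscv64 assembly.

To understand the process better we use clang, and study compilation with this program: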

#include <stdio.h>
int main() { // compute 10 + 20
  int x = 10, y = 20;
  int z = x + y;
  printf("z = %d\n", z);
  return 0;
}

E4.2.1 Understanding the compilation process

Consult man clang and read its description of the compilation stages to get a rough picture of the process.

       Driver
              The clang executable is actually a small driver which controls the overall execution of other tools such as the compiler, assembler and linker. Typically you do not need to interact with the driver, but you transparently use it to run the other tools.

       Preprocessing
              This stage handles tokenization of the input source file, macro expansion, #include expansion and handling of other preprocessor directives. The output of this stage is typically called a “.i” (for C), “.ii” (for C++), “.mi” (for Objective-C), or “.mii” (for Objective-C++) file.

       Parsing and Semantic Analysis
              This stage parses the input file, translating preprocessor tokens into a parse tree. Once in the form of a parse tree, it applies semantic analysis to compute types for expressions as well and determine whether the code is well formed. This stage is responsible for generating most of the compiler warnings as well as parse errors. The output of this stage is an “Abstract Syntax Tree” (AST).

       Code Generation and Optimization
              This stage translates an AST into low-level intermediate code (known as “LLVM IR”) and ultimately to machine code. This phase is responsible for optimizing the generated code and handling target-specific code generation. The output of this stage is typically called a “.s” file or “assembly” file.

              Clang also supports the use of an integrated assembler, in which the code generator produces object files directly. This avoids the overhead of generating the “.s” file and of calling the target assembler.

       Assembler
              This stage runs the target assembler to translate the output of the compiler into a target object file. The output of this stage is typically called a “.o” file or “object” file.

       Linker
              This stage runs the target linker to merge multiple object files into an executable or dynamic library. The output of this stage is typically called an “a.out”, “.dylib” or “.so” file.

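In layers, the stages are:

Driver: gcc and clang are really driver programs that orchestrate the other tools involved in compilation.
Preprocessing: tokenizes the source file, expands macros and #include directives, and handles the other preprocessor directives.
Parsing and semantic analysis: turns preprocessor tokens into a parse tree, then computes expression types and checks that the code is well formed. Most compiler warnings and parse errors come from this stage.
Code generation and optimization: lowers the AST to low-level intermediate code (LLVM IR) and finally to machine code, optimizing along the way and handling target-specific code generation.
Assembler: runs the target assembler to turn the compiler's output into an object file.
Linker: runs the target linker to merge object files into an executable or dynamic library.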

Lexical analysis

Lexical analysis identifies and records every token in the source file. The result can be viewed with:

clang -fsyntax-only -Xclang -dump-tokens a.c

In essence the lexer is a string matcher. I find this one of the more tedious parts of compilation:

int 'int'        [StartOfLine]  Loc=<code.c:3:1>
identifier 'main'        [LeadingSpace] Loc=<code.c:3:5>
l_paren '('             Loc=<code.c:3:9>
r_paren ')'             Loc=<code.c:3:10>
l_brace '{'      [LeadingSpace] Loc=<code.c:3:12>
int 'int'        [StartOfLine] [LeadingSpace]   Loc=<code.c:4:3>
identifier 'x'   [LeadingSpace] Loc=<code.c:4:7>
equal '='        [LeadingSpace] Loc=<code.c:4:9>
numeric_constant '10'    [LeadingSpace] Loc=<code.c:4:11>
comma ','               Loc=<code.c:4:13>
identifier 'y'   [LeadingSpace] Loc=<code.c:4:15>
equal '='        [LeadingSpace] Loc=<code.c:4:17>
numeric_constant '20'    [LeadingSpace] Loc=<code.c:4:19>
semi ';'                Loc=<code.c:4:21>
int 'int'        [StartOfLine] [LeadingSpace]   Loc=<code.c:5:3>
identifier 'z'   [LeadingSpace] Loc=<code.c:5:7>
equal '='        [LeadingSpace] Loc=<code.c:5:9>
identifier 'x'   [LeadingSpace] Loc=<code.c:5:11>
plus '+'         [LeadingSpace] Loc=<code.c:5:13>
identifier 'y'   [LeadingSpace] Loc=<code.c:5:15>
semi ';'                Loc=<code.c:5:16>
identifier 'printf'      [StartOfLine] [LeadingSpace]   Loc=<code.c:6:3>
l_paren '('             Loc=<code.c:6:9>
string_literal '"z = %d\n"'             Loc=<code.c:6:10>
comma ','               Loc=<code.c:6:20>
identifier 'z'   [LeadingSpace] Loc=<code.c:6:22>
r_paren ')'             Loc=<code.c:6:23>
semi ';'                Loc=<code.c:6:24>
return 'return'  [StartOfLine] [LeadingSpace]   Loc=<code.c:7:3>
numeric_constant '0'     [LeadingSpace] Loc=<code.c:7:10>
semi ';'                Loc=<code.c:7:11>
r_brace '}'      [StartOfLine]  Loc=<code.c:8:1>
eof ''          Loc=<code.c:8:2>

This is the lexical analysis result for the program above.
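To make the "string matcher" idea concrete, here is a minimal sketch of a lexer. It is not how clang is implemented: it only distinguishes identifiers, decimal numbers and single-character punctuation, and it treats keywords such as int as plain identifiers. All names (tok_kind, token, next_token, count_tokens) are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_PUNCT, TOK_EOF } tok_kind;

typedef struct {
    tok_kind kind;
    const char *start;  /* first character of the token */
    size_t len;         /* token length in bytes */
} token;

/* Scan one token starting at *p and advance *p past it. */
token next_token(const char **p) {
    const char *s = *p;
    while (isspace((unsigned char)*s)) s++;     /* skip leading whitespace */

    token t = { TOK_EOF, s, 0 };
    if (*s == '\0') {                           /* end of input */
        *p = s;
        return t;
    }
    if (isalpha((unsigned char)*s) || *s == '_') {
        t.kind = TOK_IDENT;                     /* [A-Za-z_][A-Za-z0-9_]* */
        while (isalnum((unsigned char)*s) || *s == '_') s++;
    } else if (isdigit((unsigned char)*s)) {
        t.kind = TOK_NUMBER;                    /* [0-9]+ */
        while (isdigit((unsigned char)*s)) s++;
    } else {
        t.kind = TOK_PUNCT;                     /* any other single character */
        s++;
    }
    t.len = (size_t)(s - t.start);
    *p = s;
    return t;
}

/* Count the tokens in src, not including the end-of-input marker. */
size_t count_tokens(const char *src) {
    size_t n = 0;
    while (next_token(&src).kind != TOK_EOF) n++;
    return n;
}
```

For the input int z = x + y; this sketch produces seven tokens, matching the seven entries clang prints for that line in the dump above.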

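Syntax analysis

Syntax analysis (parsing) organizes the recognized tokens into a tree according to the C grammar, exposing the hierarchical structure of the source program. Its result is usually presented as an Abstract Syntax Tree (AST), which can be viewed with: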

clang -fsyntax-only -Xclang -ast-dump a.c

Output:

...
`-FunctionDecl 0x64d7882cede8 <code.c:3:1, line:8:1> line:3:5 main 'int ()'
  `-CompoundStmt 0x64d7882cf2d0 <col:12, line:8:1>
    |-DeclStmt 0x64d7882cefe8 <line:4:3, col:21>
    | |-VarDecl 0x64d7882ceea8 <col:3, col:11> col:7 used x 'int' cinit
    | | `-IntegerLiteral 0x64d7882cef10 <col:11> 'int' 10
    | `-VarDecl 0x64d7882cef48 <col:3, col:19> col:15 used y 'int' cinit
    |   `-IntegerLiteral 0x64d7882cefb0 <col:19> 'int' 20
    |-DeclStmt 0x64d7882cf110 <line:5:3, col:16>
    | `-VarDecl 0x64d7882cf018 <col:3, col:15> col:7 used z 'int' cinit
    |   `-BinaryOperator 0x64d7882cf0f0 <col:11, col:15> 'int' '+'
    |     |-ImplicitCastExpr 0x64d7882cf0c0 <col:11> 'int' <LValueToRValue>
    |     | `-DeclRefExpr 0x64d7882cf080 <col:11> 'int' lvalue Var 0x64d7882ceea8 'x' 'int'
    |     `-ImplicitCastExpr 0x64d7882cf0d8 <col:15> 'int' <LValueToRValue>
    |       `-DeclRefExpr 0x64d7882cf0a0 <col:15> 'int' lvalue Var 0x64d7882cef48 'y' 'int'
    |-CallExpr 0x64d7882cf228 <line:6:3, col:23> 'int'
    | |-ImplicitCastExpr 0x64d7882cf210 <col:3> 'int (*)(const char *, ...)' <FunctionToPointerDecay>
    | | `-DeclRefExpr 0x64d7882cf128 <col:3> 'int (const char *, ...)' Function 0x64d7882ad7e8 'printf' 'int (const char *, ...)'
    | |-ImplicitCastExpr 0x64d7882cf270 <col:10> 'const char *' <NoOp>
    | | `-ImplicitCastExpr 0x64d7882cf258 <col:10> 'char *' <ArrayToPointerDecay>
    | |   `-StringLiteral 0x64d7882cf180 <col:10> 'char[8]' lvalue "z = %d\n"
    | `-ImplicitCastExpr 0x64d7882cf288 <col:22> 'int' <LValueToRValue>
    |   `-DeclRefExpr 0x64d7882cf1a0 <col:22> 'int' lvalue Var 0x64d7882cf018 'z' 'int'
    `-ReturnStmt 0x64d7882cf2c0 <line:7:3, col:10>
      `-IntegerLiteral 0x64d7882cf2a0 <col:10> 'int' 0

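Semantic analysis

Semantic analysis determines the type of every expression in the AST according to the semantics of C. Compatible types are converted as the C standard prescribes (for example arithmetic promotions), and code that violates the semantic rules is reported as an error.

Note the difference between syntax analysis and semantic analysis:

Syntax analysis: does the code follow the grammar of the language (balanced parentheses, statement structure, ...)?
Semantic analysis: does the code make sense (type checking, scope checking, expression checking, ...)?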
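The type conversions mentioned above can be observed from inside a program. The sketch below uses C11 _Generic to name the type that semantic analysis assigns to an expression; the macro TYPE_NAME and the three functions are hypothetical helpers written for this illustration, and a C11 compiler is assumed.

```c
/* Map the static type of an expression to a string (C11 _Generic). */
#define TYPE_NAME(e) _Generic((e),           \
    char: "char", int: "int",                \
    unsigned int: "unsigned int",            \
    long: "long", float: "float",            \
    double: "double", default: "other")

/* Both operands are promoted to int before the addition. */
const char *char_plus_char(void) {
    char a = 1, b = 2;
    return TYPE_NAME(a + b);      /* "int" */
}

/* The int operand is converted to double. */
const char *int_plus_double(void) {
    int i = 1;
    double d = 2.0;
    return TYPE_NAME(i + d);      /* "double" */
}

/* The int operand is converted to unsigned int. */
const char *int_plus_unsigned(void) {
    int i = -1;
    unsigned int u = 1;
    return TYPE_NAME(i + u);      /* "unsigned int" */
}
```

These are the same conversions that show up as ImplicitCastExpr nodes in the AST dump.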

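A program can likewise be checked from the command line:

clang a.c --analyze -Xanalyzer -analyzer-output=text

If nothing is wrong, nothing is reported, much like -Wall in gcc. Note that --analyze runs Clang's static analyzer, which goes beyond the semantic checks of an ordinary compile; that is how it finds the use-after-free below: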

[21:53:33] Ylin@Ylin /home/Ylin/programs/C
> clang test.c --analyze -Xanalyzer -analyzer-output=text
test.c:5:6: warning: Use of memory after it is freed [unix.Malloc]
    5 |   *p = 0;
      |   ~~ ^
test.c:3:12: note: Memory is allocated
    3 |   int *p = malloc(sizeof(*p) * 10);
      |            ^~~~~~~~~~~~~~~~~~~~~~~
test.c:4:3: note: Memory is released
    4 |   free(p);
      |   ^~~~~~~
test.c:5:6: note: Use of memory after it is freed
    5 |   *p = 0;
      |   ~~ ^
1 warning generated.

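Intermediate code generation

Intermediate code is an ISA defined by the compiler for its own purposes, also called an intermediate representation (IR) or intermediate language. The intermediate code generated by clang can be viewed with: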

clang -S -emit-llvm a.c
cat a.ll

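At first I did not see why intermediate code is needed. I came to two reasons:

First, every source language can be translated to the intermediate code, and the intermediate code can be translated to every ISA. With M languages and N ISAs this takes M front ends plus N back ends instead of M × N separate compilers.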

             frontend                              backend
           +----------+                        +------------+
      C -> |  Clang   | -+                 +-> |  llvm-x86  | -> x86
           +----------+  |                 |   +------------+
           +----------+  +-> +----------+ -+   +------------+
Fortran -> | llvm-gcc | ---> | llvm-opt | ---> |  llvm-arm  | -> ARM
           +----------+  +-> +----------+ -+   +------------+
           +----------+  |                 |   +------------+
Haskell -> |    GHC   | -+                 +-> | llvm-riscv | -> RISC-V
           +----------+  LLVM IR      LLVM IR  +------------+

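Second, optimizations can be implemented once at the IR level, before the ISA-specific instructions are generated, which lightens the optimization burden on the rest of the compiler.

Here is the intermediate code for code.c: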

[21:58:31] Ylin@Ylin /home/Ylin/programs/C
> cat code.ll
; ModuleID = 'code.c'
source_filename = "code.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

@.str = private unnamed_addr constant [8 x i8] c"z = %d\0A\00", align 1

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @main() #0 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  %3 = alloca i32, align 4
  %4 = alloca i32, align 4
  store i32 0, ptr %1, align 4
  store i32 10, ptr %2, align 4
  store i32 20, ptr %3, align 4
  %5 = load i32, ptr %2, align 4
  %6 = load i32, ptr %3, align 4
  %7 = add nsw i32 %5, %6
  store i32 %7, ptr %4, align 4
  %8 = load i32, ptr %4, align 4
  %9 = call i32 (ptr, ...) @printf(ptr noundef @.str, i32 noundef %8)
  ret i32 0
}

declare i32 @printf(ptr noundef, ...) #1

attributes #0 = { noinline nounwind optnone uwtable "frame-pointer"="all" "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { "frame-pointer"="all" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cmov,+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }

!llvm.module.flags = !{!0, !1, !2, !3, !4}
!llvm.ident = !{!5}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{i32 8, !"PIC Level", i32 2}
!2 = !{i32 7, !"PIE Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{i32 7, !"frame-pointer", i32 2}
!5 = !{!"Debian clang version 19.1.7 (3+b1)"}

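It is hard to read at first, but the core of main is short: the four alloca instructions reserve stack slots (for the return value, x, y and z), the store/load/add sequence computes z = x + y through memory, and the call passes z to printf.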

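Compiler optimization

This stage removes the useless parts of a program and avoids unnecessary run-time cost. An optimization is only correct if the program's observable behavior is the same before and after it.

Common optimization techniques include:

Constant propagation

When a program contains constant expressions, their results can be computed at compile time, avoiding the cost of computing them at run time:

//          Before             |            After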
  int a = 1;                   |    int a = 1;
  int b = a + 2;               |    int b = 3;
  printf("%d\n", b * 3);       |    printf("%d\n", 9);

Dead code elimination

Unreachable statements and unused variables can be removed to reduce the instruction count.

//          Before             |            After
  #define DEBUG 0              |    #define DEBUG 0
  int fun(int x) {             |    int fun(int x) {
    int a = x + 3;             |      return x / 2;
    if (DEBUG) {               |    }
      printf("a = %d\n", a);   |
    }                          |
    return x / 2;              |
  }                            |

Redundant operation elimination

Writes that are never read can be removed.

//          Before             |            After
  int a;                       |    int a;
  a = 3;                       |    f();
  a = f();                     |    a = 10;
  a = 7;                       |
  a = 10;                      |

Strength reduction

Expensive operations can be replaced with cheaper ones to improve performance.

//          Before           |            After
int x = a[i * 4];            |    int x = a[i << 2];
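Strength reduction still has to preserve behavior, and signed division is a case where the obvious replacement does not. The sketch below contrasts three hypothetical functions. It assumes that >> on a negative int is an arithmetic shift, which is implementation-defined in C but is what gcc and clang do.

```c
/* Reference: C division truncates toward zero, so -3 / 2 is -1. */
int div2_plain(int x) { return x / 2; }

/* Naive replacement: an arithmetic shift rounds toward negative infinity,
 * so -3 >> 1 is -2 (assumes arithmetic right shift, as in gcc and clang). */
int div2_naive_shift(int x) { return x >> 1; }

/* Correct replacement: add 1 to negative inputs before shifting,
 * which restores rounding toward zero. */
int div2_biased_shift(int x) { return (x + (x < 0)) >> 1; }
```

Compilers emit a sequence equivalent to the biased version when they turn a signed division by a power of two into shifts.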

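Common subexpression elimination

A subexpression that is computed several times can be stored in a temporary, reducing the cost of recomputation.

//          Before             |            After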
  int x = a * b - 1;           |    int temp = a * b;
  int y = a * b * 2;           |    int x = temp - 1;
                               |    int y = temp * 2;

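Loop-invariant code motion

Code that produces the same result in every iteration can be hoisted out of the loop and computed once beforehand.

//          Before             |            After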
  int a = f1();                |    int x = f1() + 2;
  for (i = 0; i < 10; i ++) {  |    for (i = 0; i < 10; i ++) {
    int x = a + 2;             |      int y = f2(x);
    int y = f2(x);             |      sum += y + i;
    sum += y + i;              |    }
  }                            |
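The fragment above can be turned into two complete functions whose results are compared, which is a simple way to check that the transformation keeps the observable behavior. The bodies of f1 and f2 are placeholders invented for this sketch; the transformation is only valid because they have no side effects.

```c
/* Hypothetical helpers; both are pure (no side effects). */
static int f1(void)  { return 5; }
static int f2(int x) { return x * 3; }

/* Before: x = a + 2 is recomputed in every iteration. */
int sum_before(void) {
    int sum = 0;
    int a = f1();
    for (int i = 0; i < 10; i++) {
        int x = a + 2;
        int y = f2(x);
        sum += y + i;
    }
    return sum;
}

/* After: the loop-invariant computation of x is hoisted out of the loop. */
int sum_after(void) {
    int sum = 0;
    int x = f1() + 2;
    for (int i = 0; i < 10; i++) {
        int y = f2(x);
        sum += y + i;
    }
    return sum;
}
```

Because f2 is pure here, a compiler could hoist the call to f2 as well; that would not be legal if f2 printed something or modified global state.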

Function inlining

A small function can be expanded directly at its call site, saving the cost of the call.

//          Before             |            After
  int f1(int x, int y) {       |    int f1(int x, int y) {
    return x + y;              |      return x + y;
  }                            |    }
  int f2(int x) {              |    int f2(int x) {
    return f1(x, 3);           |      return x + 3;
  }                            |    }

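There are many more...

Optimization is enabled in clang with the -O options. For gcc the levels, from most to least aggressive, are -Ofast > -O3 > -O2 > -O1 > -Og > -O0 (the default). The optimizations enabled at a given level can also be queried, for example with gcc -Q -O2 --help=optimizers.

E4.2.2 Comparing the results of optimization

Compare the intermediate code generated with and without -O1. How does the intermediate code differ once -O1 is added?

The key part is this: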

; Function Attrs: nofree nounwind uwtable
define dso_local noundef i32 @main() local_unnamed_addr #0 {
  %1 = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef 30)
  ret i32 0
}

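The result 30 is passed to printf directly: x + y was evaluated at compile time through constant propagation, so the alloca, store, load and add instructions are gone.

E4.2.3 Comparing the results of optimization (2)

Add the volatile keyword in front of int x = 10, y = 20; and regenerate the -O1 intermediate code. How does it differ from the previous one?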

; Function Attrs: nofree nounwind uwtable
define dso_local noundef i32 @main() local_unnamed_addr #0 {
  %1 = alloca i32, align 4
  %2 = alloca i32, align 4
  call void @llvm.lifetime.start.p0(i64 4, ptr nonnull %1)
  store volatile i32 10, ptr %1, align 4, !tbaa !5
  call void @llvm.lifetime.start.p0(i64 4, ptr nonnull %2)
  store volatile i32 20, ptr %2, align 4, !tbaa !5
  %3 = load volatile i32, ptr %1, align 4, !tbaa !5
  %4 = load volatile i32, ptr %2, align 4, !tbaa !5
  %5 = add nsw i32 %4, %3
  %6 = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str, i32 noundef %5)
  call void @llvm.lifetime.end.p0(i64 4, ptr nonnull %2)
  call void @llvm.lifetime.end.p0(i64 4, ptr nonnull %1)
  ret i32 0
}

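The stores, loads and add in the middle are kept. volatile tells the compiler that accesses to the variable must not be optimized away: every access to a volatile object has to be performed strictly as the abstract machine specifies, without being omitted, merged or reordered.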
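The two variants can be kept side by side in one file so that their IR is easy to compare. This is a minimal sketch; the function names sum_plain and sum_volatile are hypothetical.

```c
/* Without volatile: at -O1 the compiler may fold this to "return 30". */
int sum_plain(void) {
    int x = 10, y = 20;
    return x + y;
}

/* With volatile: each store and load of x and y must actually be performed,
 * so the addition happens at run time even at -O1. */
int sum_volatile(void) {
    volatile int x = 10, y = 20;
    return x + y;
}
```

Both functions return 30; only the generated code differs. Compiling the file with clang -O1 -S -emit-llvm shows the two bodies in the same .ll file.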

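Target code generation

Target code generation translates the optimized intermediate code into target code, that is, the ISA of a particular processor: IR variables are mapped to the registers or stack slots of the ISA, and IR instructions to ISA instructions.

Target code can be generated as follows: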

clang -S a.c
cat a.s

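By default clang generates assembly for the ISA of the local machine. Passing --target=xxx makes it generate assembly for another target instead, which is called cross compilation.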

clang -S a.c --target=riscv64-linux-gnu

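E4.2.4 Relating C code to riscv instruction sequences

Read the riscv64 assembly produced by cross compiling with clang, and try to identify which assembly fragment comes from which piece of C code.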

        .text
        .attribute      4, 16
        .attribute      5, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zmmul1p0"
        .file   "code.c"
        .globl  main                            # -- Begin function main
        .p2align        1
        .type   main,@function
main:                                   # @main
        .cfi_startproc
# %bb.0:
        addi    sp, sp, -48				
        .cfi_def_cfa_offset 48
        sd      ra, 40(sp)                      # 8-byte Folded Spill
        sd      s0, 32(sp)                      # 8-byte Folded Spill
        .cfi_offset ra, -8
        .cfi_offset s0, -16
        addi    s0, sp, 48
        .cfi_def_cfa s0, 0
        li      a0, 0
        sd      a0, -40(s0)                     # 8-byte Folded Spill
        sw      a0, -20(s0)
        li      a0, 10						# int x = 10;
        sw      a0, -24(s0)
        li      a0, 20						# int y = 20;
        sw      a0, -28(s0)
        lw      a0, -24(s0)
        lw      a1, -28(s0)
        addw    a0, a0, a1					# int z = x+y;
        sw      a0, -32(s0)
        lw      a1, -32(s0)
.Lpcrel_hi0:
        auipc   a0, %pcrel_hi(.L.str)
        addi    a0, a0, %pcrel_lo(.Lpcrel_hi0)
        call    printf						# printf()
                                        # kill: def $x11 killed $x10
        ld      a0, -40(s0)                     # 8-byte Folded Reload
        ld      ra, 40(sp)                      # 8-byte Folded Reload
        ld      s0, 32(sp)                      # 8-byte Folded Reload
        addi    sp, sp, 48
        ret
.Lfunc_end0:
        .size   main, .Lfunc_end0-main
        .cfi_endproc
                                        # -- End function
        .type   .L.str,@object                  # @.str
        .section        .rodata.str1.1,"aMS",@progbits,1
.L.str:
        .asciz  "z = %d\n"
        .size   .L.str, 8

        .ident  "Debian clang version 19.1.7 (3+b1)"
        .section        ".note.GNU-stack","",@progbits
        .addrsig
        .addrsig_sym printf

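The .cfi directives are call frame information, used for stack unwinding and debugging. The code feels quite different from x86, especially the stack handling; it looks awkward at first and takes some getting used to.

E4.2.5 Relating C code to riscv instruction sequences (2)

Add -O1 and recompile to riscv64 assembly. How does the generated assembly differ, and how does it relate to the C code?

The key parts changed as follows (unoptimized above, optimized below; note that these listings are x86-64 output rather than riscv64):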

main:                                   # @main
        .cfi_startproc
# %bb.0:
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset %rbp, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register %rbp
        subq    $16, %rsp
        movl    $0, -4(%rbp)
        movl    $10, -8(%rbp)
        movl    $20, -12(%rbp)
        movl    -8(%rbp), %eax
        movl    -12(%rbp), %ecx
        addl    %ecx, %eax
        movl    %eax, -16(%rbp)
        movl    -16(%rbp), %esi
        leaq    .L.str(%rip), %rdi
        movb    $0, %al
        callq   printf@PLT
        xorl    %eax, %eax
        addq    $16, %rsp
        popq    %rbp
        .cfi_def_cfa %rsp, 8
        retq
# ============================================ #
main:                                   # @main
        .cfi_startproc
# %bb.0:
        pushq   %rax
        .cfi_def_cfa_offset 16
        movl    $10, 4(%rsp)
        movl    $20, (%rsp)
        movl    4(%rsp), %esi
        addl    (%rsp), %esi
        leaq    .L.str(%rip), %rdi
        xorl    %eax, %eax
        callq   printf@PLT
        xorl    %eax, %eax
        popq    %rcx
        .cfi_def_cfa_offset 8
        retq

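With optimization enabled, the temporary z is eliminated and the sum is computed directly in the register that carries the printf argument, which reduces memory accesses. The frame pointer setup is gone as well, so the stack frame is more compact. The stores of 10 and 20 are still present, which suggests that this listing was produced from the volatile version of E4.2.3.

Generating and running binaries

Assembly

The result of compilation is assembly code. Assembly is a text file meant for humans to read; for the processor to execute it, it has to be converted into the binary encoding of the instructions, which is the assembler's job.

The object file is generated with: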

clang -c a.c
ls a.o

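The object file is binary. To read it we need a tool that decodes it back into readable text, a step called disassembly. For a cross-compiled object file, the disassembler from the matching toolchain must be used: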

objdump -d a.o
riscv64-linux-gnu-objdump -d a.o

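E4.3.1 Disassembling a riscv64 object file

Using the commands above, disassemble the riscv64 object file and compare the result with the assembly file generated by the compiler.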

code.o:     file format elf64-littleriscv


Disassembly of section .text:

0000000000000000 <main>:
   0:   1101                    addi    sp,sp,-32
   2:   ec06                    sd      ra,24(sp)
   4:   e822                    sd      s0,16(sp)
   6:   1000                    addi    s0,sp,32
   8:   47a9                    li      a5,10
   a:   fef42423                sw      a5,-24(s0)
   e:   47d1                    li      a5,20
  10:   fef42223                sw      a5,-28(s0)
  14:   fe842783                lw      a5,-24(s0)
  18:   0007871b                sext.w  a4,a5
  1c:   fe442783                lw      a5,-28(s0)
  20:   2781                    sext.w  a5,a5
  22:   9fb9                    addw    a5,a5,a4
  24:   fef42623                sw      a5,-20(s0)
  28:   fec42783                lw      a5,-20(s0)
  2c:   85be                    mv      a1,a5
  2e:   00000517                auipc   a0,0x0
  32:   00050513                mv      a0,a0
  36:   00000097                auipc   ra,0x0
  3a:   000080e7                jalr    ra # 36 <main+0x36>
  3e:   4781                    li      a5,0
  40:   853e                    mv      a0,a5
  42:   60e2                    ld      ra,24(sp)
  44:   6442                    ld      s0,16(sp)
  46:   6105                    addi    sp,sp,32
  48:   8082                    ret

Below is the assembly generated by the compiler (riscv64-linux-gnu-gcc, according to the .ident line):

        .file   "code.c"
        .option pic
        .attribute arch, "rv64i2p1_m2p0_a2p1_f2p2_d2p2_c2p0_zicsr2p0_zifencei2p0"
        .attribute unaligned_access, 0
        .attribute stack_align, 16
        .text
        .section        .rodata
        .align  3
.LC0:
        .string "z = %d\n"
        .text
        .align  1
        .globl  main
        .type   main, @function
main:
.LFB0:
        .cfi_startproc
        addi    sp,sp,-32
        .cfi_def_cfa_offset 32
        sd      ra,24(sp)
        sd      s0,16(sp)
        .cfi_offset 1, -8
        .cfi_offset 8, -16
        addi    s0,sp,32
        .cfi_def_cfa 8, 0
        li      a5,10
        sw      a5,-24(s0)
        li      a5,20
        sw      a5,-28(s0)
        lw      a5,-24(s0)
        sext.w  a4,a5
        lw      a5,-28(s0)
        sext.w  a5,a5
        addw    a5,a4,a5
        sw      a5,-20(s0)
        lw      a5,-20(s0)
        mv      a1,a5
        lla     a0,.LC0
        call    printf@plt
        li      a5,0
        mv      a0,a5
        ld      ra,24(sp)
        .cfi_restore 1
        ld      s0,16(sp)
        .cfi_restore 8
        .cfi_def_cfa 2, 32
        addi    sp,sp,32
        .cfi_def_cfa_offset 0
        jr      ra
        .cfi_endproc
.LFE0:
        .size   main, .-main
        .ident  "GCC: (Debian 14.2.0-19) 14.2.0"
        .section        .note.GNU-stack,"",@progbits

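The instructions of main are essentially the same. The differences are that the disassembly shows the pseudo-instructions lla and call as auipc-based pairs whose address fields are still zero (they are filled in at link time), and that many instructions appear in their 2-byte compressed encodings.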
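The hexadecimal column of the disassembly is the binary encoding produced by the assembler. As a sketch of how such an encoding is laid out, the function below decodes the fields of an RV32/RV64 S-type (store) instruction, following the field positions of the RISC-V base ISA. The type stype_t and the function decode_stype are hypothetical names. Applied to fef42423 from offset a above, it recovers sw a5,-24(s0): opcode 0x23, funct3 2, rs1 = x8 (s0), rs2 = x15 (a5), immediate -24.

```c
#include <stdint.h>

typedef struct {
    unsigned opcode;   /* bits 6:0   */
    unsigned funct3;   /* bits 14:12 */
    unsigned rs1;      /* bits 19:15, base address register */
    unsigned rs2;      /* bits 24:20, register being stored */
    int32_t  imm;      /* sign-extended 12-bit offset */
} stype_t;

/* Decode a 32-bit S-type instruction word into its fields. */
stype_t decode_stype(uint32_t insn) {
    stype_t d;
    d.opcode = insn & 0x7f;
    d.funct3 = (insn >> 12) & 0x7;
    d.rs1    = (insn >> 15) & 0x1f;
    d.rs2    = (insn >> 20) & 0x1f;

    /* imm[11:5] lives in bits 31:25 and imm[4:0] in bits 11:7. */
    uint32_t imm12 = ((insn >> 25) << 5) | ((insn >> 7) & 0x1f);

    /* Sign-extend from 12 bits: flip the sign bit, then subtract its weight. */
    d.imm = (int32_t)(imm12 ^ 0x800) - 0x800;
    return d;
}
```

The split immediate looks odd, but it keeps rs1 and rs2 at the same bit positions in every instruction format.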

Linking

Linking merges multiple object files into the final executable. The linked executable can be generated with:

clang a.c
ls a.out

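An executable can be disassembled too, since it is essentially the same kind of file as an object file. Its disassembly contains much more than main; the extra code comes from other object files brought in during linking. clang a.c -v shows what happens in the link step: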

 "/usr/bin/ld" --hash-style=gnu --build-id --eh-frame-hdr -m elf_x86_64 -pie -dynamic-linker /lib64/ld-linux-x86-64.so.2 -o a.out /lib/x86_64-linux-gnu/Scrt1.o /lib/x86_64-linux-gnu/crti.o /usr/lib/gcc/x86_64-linux-gnu/14/crtbeginS.o -L/usr/lib/gcc/x86_64-linux-gnu/14 -L/usr/lib/gcc/x86_64-linux-gnu/14/../../../../lib64 -L/lib/x86_64-linux-gnu -L/lib/../lib64 -L/usr/lib/x86_64-linux-gnu -L/usr/lib/../lib64 -L/lib -L/usr/lib /tmp/code-e5f4fe.o -lgcc --as-needed -lgcc_s --no-as-needed -lc -lgcc --as-needed -lgcc_s --no-as-needed /usr/lib/gcc/x86_64-linux-gnu/14/crtendS.o /lib/x86_64-linux-gnu/crtn.o

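The crt*.o files (crt stands for C runtime) provide the support an executable needs in order to run.

A riscv64 executable can likewise be produced by cross compilation.

E4.3.2 Disassembling a riscv64 executable

Generate a riscv64 executable, disassemble it, and compare the result with the object file before linking.

Besides main, there are many more functions:

<.plt>: the procedure linkage table, used to bind dynamic library functions (here printf)
<__libc_start_main@plt>: used to call libc's __libc_start_main
<printf@plt>: used to call printf
<_start>: the program entry point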
…

Execution

The compiled executable can be run from the command line:

./a.out

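At this point the program is stored on disk, but to execute it must be loaded into memory. That is the job of the loader: it loads the program into memory and then jumps the PC to the program's entry point.

The rest belongs to the operating system, so I will not go further here.

E4.3.3 Comparing performance before and after optimization

We introduced various optimization options earlier; now you can experience their power. Taking the earlier series-sum program as an example, measure its run time at different optimization levels to see how they affect performance.

The last term of that series was 10, so to make the difference measurable you need to increase the last term and thereby the run time. If the last term is large, change the type of sum to long long. The time command measures how long a command takes; for example, time ls reports the run time of ls.

After adjusting the last term, compile at -O0, -O1 and -O2 and measure the run time of each.

If you are interested, disassemble the programs and try to work out from the assembly why the performance improves and which optimizations the compiler may have applied. To answer these questions you may need to RTFM or STFW to learn what some instructions do.

First, write a program that takes a while to run: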

#include <stdio.h>

int main(){

    int i=10000000;
    while(i>0){
        i-=1*1*1;
    }
    return 0;
}

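Then build it at each optimization level and compare. Since the loop has no observable effect, the compiler is free to remove it entirely at -O1 and above, so those timings are likely dominated by process startup: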

[15:36:49] Ylin@Ylin /home/Ylin/programs/C
> time ./O0

________________________________________________________
Executed in    9.24 millis    fish           external
   usr time    7.66 millis    0.00 millis    7.66 millis
   sys time    1.62 millis    1.62 millis    0.00 millis

[15:37:20] Ylin@Ylin /home/Ylin/programs/C
> time ./O1

________________________________________________________
Executed in    3.08 millis    fish           external
   usr time    2.45 millis  414.00 micros    2.03 millis
   sys time    0.77 millis  774.00 micros    0.00 millis

[15:37:22] Ylin@Ylin /home/Ylin/programs/C
> time ./O2

________________________________________________________
Executed in    2.12 millis    fish           external
   usr time    2.13 millis    1.02 millis    1.11 millis
   sys time    0.00 millis    0.00 millis    0.00 millis
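The series-sum program that the exercise refers to is not shown in these notes. A minimal sketch of its core, under the assumption that it simply adds 1 through n, could look like this (series_sum is a hypothetical name):

```c
/* Sum 1 + 2 + ... + n with a loop; long long avoids overflow for large n. */
long long series_sum(long long n) {
    long long sum = 0;
    for (long long i = 1; i <= n; i++) {
        sum += i;
    }
    return sum;
}
```

Printing the result from main, for example with printf("%lld\n", series_sum(1000000000LL)), makes the loop's result observable, so the compiler cannot simply delete the loop as dead code. At higher optimization levels it may still replace the loop with a closed-form expression, which is worth checking in the disassembly.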

E4.3.4 Does a program really start executing at main()?

Use strace or gdb to check your hypothesis.

Hint: gdb's starti command stops the program at its first instruction.

(gdb) starti
Starting program: /home/Ylin/programs/C/O0

Program stopped.
0x00007ffff7fe4280 in _start () from /lib64/ld-linux-x86-64.so.2

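The very first instruction executed is in _start, not main. Here it is the _start of the dynamic linker /lib64/ld-linux-x86-64.so.2, which runs first and then hands control to the program's own _start.

Likewise, main() is not where the program ends. strace shows that the last system call before the process exits is exit_group; catching that system call in gdb gives the call stack at that moment: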

#0  __GI__exit (status=status@entry=0) at ../sysdeps/unix/sysv/linux/_exit.c:30
#1  0x00007ffff7e04236 in __run_exit_handlers (status=0, listp=0x7ffff7fa8680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true,
    run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:141
#2  0x00007ffff7e0437a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:148
#3  0x00007ffff7debcaf in __libc_start_call_main (main=main@entry=0x555555555130 <main>, argc=argc@entry=1,
    argv=argv@entry=0x7fffffffdbb8) at ../sysdeps/nptl/libc_start_call_main.h:74
#4  0x00007ffff7debd65 in __libc_start_main_impl (main=0x555555555130 <main>, argc=1, argv=0x7fffffffdbb8, init=<optimized out>,
    fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdba8) at ../csu/libc-start.c:360
#5  0x0000555555555061 in _start ()

From this we can infer the execution flow of the program:

_start
    ↓
__libc_start_main_impl
    ↓
__libc_start_call_main
    ↓
    main (your code)
    ↓
    exit
    ↓
__run_exit_handlers ←─── functions registered with atexit
    ↓
    _exit
    ↓
    syscall  ←─── enters the kernel
    ↓
[process terminated]
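Both ends of this flow can be observed from C without a debugger. The sketch below assumes GCC or Clang, because __attribute__((constructor)) is a compiler extension; atexit is standard C. All function names are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

static int before_main_ran = 0;

/* GCC/Clang extension: called by the startup code before main() runs. */
__attribute__((constructor))
static void before_main(void) {
    before_main_ran = 1;
}

/* Called from the exit handlers after main() has returned. */
static void after_main(void) {
    puts("after_main: called from the exit handlers");
}

/* Returns 1 if the constructor has already run. */
int ran_before_main(void) {
    return before_main_ran;
}

/* Register after_main with atexit; returns 0 on success. */
int install_exit_handler(void) {
    return atexit(after_main);
}
```

If main calls install_exit_handler() and then returns, the message is printed after main has finished, and ran_before_main() already returns 1 on the first line of main.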