replace_function_name

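replace_function_name is a complete clang compiler that renames functions during CodeGen by modifying the flow of CodeGenModule::getMangledName.

Project: https://github.com/penguin-wwy/replace_function_name

See the GitHub repository for detailed usage and a before/after comparison.

The rest of this post walks through the implementation.

First, look at how IR is generated.

clang::ParseAST is the main entry point of clang's parser: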

void clang::ParseAST(Sema &S, bool PrintStats, bool SkipFunctionBodies) {
  ...
  ASTConsumer *Consumer = &S.getASTConsumer();

  std::unique_ptr<Parser> ParseOP(
      new Parser(S.getPreprocessor(), S, SkipFunctionBodies));
  Parser &P = *ParseOP.get();

  ...

  S.getPreprocessor().EnterMainSourceFile();
  P.Initialize();

  Parser::DeclGroupPtrTy ADecl;
  ExternalASTSource *External = S.getASTContext().getExternalSource();
  if (External)
    External->StartTranslationUnit(Consumer);

  for (bool AtEOF = P.ParseFirstTopLevelDecl(ADecl); !AtEOF;
       AtEOF = P.ParseTopLevelDecl(ADecl)) {
    // If we got a null return and something *was* parsed, ignore it.  This
    // is due to a top-level semicolon, an action override, or a parse error
    // skipping something.
    if (ADecl && !Consumer->HandleTopLevelDecl(ADecl.get()))
      return;
  }

  // Process any TopLevelDecls generated by #pragma weak.
  for (Decl *D : S.WeakTopLevelDecls())
    Consumer->HandleTopLevelDecl(DeclGroupRef(D));
  ...

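Parsing the source into an AST starts through the ParseTopLevelDecl interface. Because clang's lexer (Lex) and parser are integrated, with the parser pulling tokens from the lexer on demand, this is also the entry point for lexical analysis.

Each top-level declaration obtained here is handed to HandleTopLevelDecl for processing. The consumer is BackendConsumer, whose HandleTopLevelDecl does little more than timing; what matters is that it forwards to CodeGeneratorImpl::HandleTopLevelDecl.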

bool HandleTopLevelDecl(DeclGroupRef D) override {
      PrettyStackTraceDecl CrashInfo(*D.begin(), SourceLocation(),
                                     Context->getSourceManager(),
                                     "LLVM IR generation of declaration");

      // Recurse.
      if (llvm::TimePassesIsEnabled) {
        LLVMIRGenerationRefCount += 1;
        if (LLVMIRGenerationRefCount == 1)
          LLVMIRGeneration.startTimer();
      }

      Gen->HandleTopLevelDecl(D);

      if (llvm::TimePassesIsEnabled) {
        LLVMIRGenerationRefCount -= 1;
        if (LLVMIRGenerationRefCount == 0)
          LLVMIRGeneration.stopTimer();
      }

      return true;
    }

Next, each Decl in the DeclGroupRef is processed in turn; EmitTopLevelDecl is the entry point for lowering an individual Decl.


// Make sure to emit all elements of a Decl.
for (DeclGroupRef::iterator I = DG.begin(), E = DG.end(); I != E; ++I)
  Builder->EmitTopLevelDecl(*I);

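At this point we are squarely inside the CodeGen module, and ParseAST is no longer involved. Different functionality is plugged in through the Consumer, which shows how clean and well-factored clang's design is.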
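The hand-off described above can be pictured with a small toy model. This is only a sketch of the pattern, not clang's code: ToyDecl, ToyConsumer, ToyCodeGen, ToyBackend, and toyParse are hypothetical stand-ins for Decl, ASTConsumer, CodeGeneratorImpl, BackendConsumer, and the ParseAST loop.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed top-level declaration.
struct ToyDecl {
  std::string name;
};

// Hypothetical stand-in for the consumer interface the parser talks to.
struct ToyConsumer {
  virtual ~ToyConsumer() = default;
  // Returning false asks the driver loop to stop early.
  virtual bool handleTopLevelDecl(const ToyDecl &d) = 0;
};

// Plays the role of the code generator: records what it would emit.
struct ToyCodeGen : ToyConsumer {
  std::vector<std::string> emitted;
  bool handleTopLevelDecl(const ToyDecl &d) override {
    emitted.push_back("define @" + d.name);
    return true;
  }
};

// Plays the role of the backend wrapper: bookkeeping, then forwarding.
struct ToyBackend : ToyConsumer {
  ToyCodeGen gen;
  int handled = 0;  // stands in for the timing bookkeeping
  bool handleTopLevelDecl(const ToyDecl &d) override {
    ++handled;
    return gen.handleTopLevelDecl(d);
  }
};

// Plays the role of the parse loop: it only knows the consumer interface.
void toyParse(const std::vector<ToyDecl> &decls, ToyConsumer &consumer) {
  for (const ToyDecl &d : decls) {
    if (!consumer.handleTopLevelDecl(d))
      return;
  }
}
```

Passing a ToyBackend to toyParse with two declarations leaves two entries in its gen.emitted. The driver loop never needs to know which consumer it is feeding, which is the separation the paragraph above praises.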

void CodeGenModule::EmitTopLevelDecl(Decl *D) {
	...
	switch (D->getKind()) {
  case Decl::CXXConversion:
  case Decl::CXXMethod:
  case Decl::Function:
    // Skip function templates
    if (cast<FunctionDecl>(D)->getDescribedFunctionTemplate() ||
        cast<FunctionDecl>(D)->isLateTemplateParsed())
      return;

    EmitGlobal(cast<FunctionDecl>(D));
    // Always provide some coverage mapping
    // even for the functions that aren't emitted.
    AddDeferredUnusedCoverageMapping(D);
    break;
    ...

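We care mainly about how function names are generated, so we follow the handling of FunctionDecl.

From EmitTopLevelDecl, the EmitGlobal call takes us straight to the handling of the function body: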

// Defer code generation to first use when possible, e.g. if this is an inline
// function. If the global must always be emitted, do it eagerly if possible
// to benefit from cache locality.
if (MustBeEmitted(Global) && MayBeEmittedEagerly(Global)) {
  // Emit the definition if it can't be deferred.
  EmitGlobalDefinition(GD);
  return;
}

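Functions in IR come in two forms, define and declare. The former is a definition; the latter is a declaration, that is, a function referenced from another translation unit whose body is absent but which still has to be declared. The define path is taken here; a declared function is lowered when its own translation unit is compiled.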

void CodeGenModule::EmitGlobalDefinition(GlobalDecl GD, llvm::GlobalValue *GV) {
  ...

    if (const auto *Method = dyn_cast<CXXMethodDecl>(D)) {
      // Make sure to emit the definition(s) before we emit the thunks.
      // This is necessary for the generation of certain thunks.
      if (const auto *CD = dyn_cast<CXXConstructorDecl>(Method))
        ABI->emitCXXStructor(CD, getFromCtorType(GD.getCtorType()));
      else if (const auto *DD = dyn_cast<CXXDestructorDecl>(Method))
        ABI->emitCXXStructor(DD, getFromDtorType(GD.getDtorType()));
      else
        EmitGlobalFunctionDefinition(GD, GV);

      if (Method->isVirtual())
        getVTables().EmitThunks(GD);

      return;
    }
  ...
}

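C++ constructors and destructors are handled separately. All other functions, whether member functions or C-style functions, go through EmitGlobalFunctionDefinition.

EmitGlobalFunctionDefinition obtains the function prototype by calling GetAddrOfFunction: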

llvm::Constant *CodeGenModule::GetAddrOfFunction(GlobalDecl GD,
                                                 llvm::Type *Ty,
                                                 bool ForVTable,
                                                 bool DontDefer,
                                              ForDefinition_t IsForDefinition) {
  // If there was no specific requested type, just convert it now.
  if (!Ty) {
    const auto *FD = cast<FunctionDecl>(GD.getDecl());
    auto CanonTy = Context.getCanonicalType(FD->getType());
    Ty = getTypes().ConvertFunctionType(CanonTy, FD);
  }

  StringRef MangledName = getMangledName(GD);
  return GetOrCreateLLVMFunction(MangledName, Ty, GD, ForVTable, DontDefer,
                                 /*IsThunk=*/false, llvm::AttributeList(),
                                 IsForDefinition);
}

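Following the mangling rules, getMangledName combines the function's class name, function name, return type, parameters, calling convention (who cleans up the stack), and attributes to produce a name unique to that function; exactly which of these are encoded depends on the target's mangling scheme. At this point high-level features such as polymorphism and overloading are fully lowered into unique function declarations and calls.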
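To see what such unique names look like, mangled names can be turned back into readable signatures. The sketch below assumes a toolchain that follows the Itanium C++ ABI (GCC or Clang on Linux or macOS) and uses abi::__cxa_demangle from <cxxabi.h>; the helper name demangle is ours, not part of clang.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <string>

// Demangle an Itanium-ABI symbol name.
// Returns the input unchanged if it cannot be demangled.
std::string demangle(const std::string &mangled) {
  int status = 0;
  // With a null output buffer, __cxa_demangle mallocs the result.
  char *out = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
  if (status != 0 || out == nullptr)
    return mangled;
  std::string result(out);
  std::free(out);  // the caller owns the malloc'ed buffer
  return result;
}
```

For example, demangle("_Z3addii") gives "add(int, int)" and demangle("_Z3adddd") gives "add(double, double)", so two overloads of add end up as two distinct symbols. A member function such as Foo::bar(int) is encoded as "_ZN3Foo3barEi". Under this ABI the return type of an ordinary function does not appear in the name.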

Note that getMangledName first looks in MangledDeclNames to see whether a name has already been generated for this declaration:


StringRef &FoundStr = MangledDeclNames[CanonicalGD];
  if (!FoundStr.empty())
    return FoundStr;

At the end of the function, the mapping between the newly generated name and its GlobalDecl is also saved for later lookups:

auto Result = Manglings.insert(std::make_pair(Str, GD));

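The Str variable here is the name that finally gets saved, that is, the name the IR uses for this function. If we change it at this point, the function is renamed.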

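// obfMap maps a mangled name to its replacement name.
for (auto I = obfMap.begin(); I != obfMap.end(); I++) {
    if (StringRef(I->first) == Str) {
        // Swap in the replacement before the name is stored in Manglings.
        Str = StringRef(I->second);
    }
}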

A dictionary holds the MangledName of every function to be renamed together with its corresponding md5, so the md5 can be substituted for the function name.
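The replacement step can also be written as a keyed lookup instead of a linear scan. The sketch below is an illustration under stated assumptions, not the project's code: RenameMap and renameIfListed are hypothetical names, and the replacement value is a placeholder string rather than a real md5 digest.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical table: original mangled name -> replacement name.
using RenameMap = std::unordered_map<std::string, std::string>;

// Returns the replacement if the name is listed, otherwise the name itself.
std::string renameIfListed(const RenameMap &table, const std::string &name) {
  auto it = table.find(name);
  if (it == table.end())
    return name;      // not listed: keep the compiler's name
  return it->second;  // listed: substitute the precomputed replacement
}
```

With a table containing {"_Z3foov", "placeholder_digest"}, renameIfListed returns "placeholder_digest" for "_Z3foov" and leaves "_Z3barv" unchanged. This version returns std::string by value. The patch assigns a StringRef, which does not own its characters, so the table has to outlive every name handed out from it.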


Unless otherwise stated, all articles on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.
