This year I have been working on thread-related performance optimization, such as thread convergence, thread stack optimization, and OOM problems caused by threads. Recently, while going through the crash dashboard, I found some native crashes caused by thread suspension timeouts. The problem has existed for a long time, but the volume is small, which makes it a long-tail problem. I spent some time studying it and came up with a solution, which I share here.

Stack analysis

Case 1.

// Crash thread
signal:6 (SIGABRT),code:-1 (SI_QUEUE),fault addr:--------
Abort message:
Thread suspension timed out: 0x6f2e45d888:OkHttp https://dummy.global.com/...
backtrace:
// ignore more data

java stacktrace:
at dalvik.system.VMStack.getThreadStackTrace(VMStack.java)
at java.lang.Thread.getStackTrace(Thread.java:1841)
at java.lang.Thread.getAllStackTraces(Thread.java:1909)
at com.appsflyer.internal.AFa1xSDK$23740.AFInAppEventType(AFa1xSDK.java:113)
at com.appsflyer.internal.AFa1xSDK$23740.values(AFa1xSDK.java:168)
at com.appsflyer.internal.AFa1xSDK$23740.AFInAppEventParameterName(AFa1xSDK.java:73)
at com.appsflyer.internal.AFa1tSDK$28986.AFKeystoreWrapper(AFa1tSDK.java:38)
at java.lang.reflect.Method.invoke(Method.java)
at com.appsflyer.internal.AFc1oSDK.AFKeystoreWrapper(AFc1oSDK.java:159)
at com.appsflyer.internal.AFd1hSDK.values(AFd1hSDK.java:88)
at com.appsflyer.internal.AFd1oSDK.valueOf(AFd1oSDK.java:144)
at com.appsflyer.internal.AFd1zSDK.afErrorLog(AFd1zSDK.java:207)
at com.appsflyer.internal.AFc1bSDK.run(AFc1bSDK.java:4184)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:487)
at java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:644)
at java.lang.Thread.run(Thread.java:1012)

Case 2.

// Crash thread
signal:6 (SIGABRT),code:-1 (SI_QUEUE),fault addr:--------
Abort message:
Thread suspension timed out: 0x70a383f4d8:DefaultDispatcher-worker-3
backtrace:
#00 pc 00000000000896fc  /apex/com.android.runtime/lib64/bionic/libc.so (abort+180)
#01 pc 000000000076fc20  /apex/com.android.art/lib64/libart.so (art::Runtime::Abort(char const*)+904)
#02 pc 00000000000357d0  /apex/com.android.art/lib64/libbase.so (android::base::SetAborter(std::__1::function<void (char const*)>&&)::$_0::__invoke(char const*)+80)
#03 pc 0000000000034d58  /apex/com.android.art/lib64/libbase.so (android::base::LogMessage::~LogMessage()+352)
#04 pc 000000000079bac0  /apex/com.android.art/lib64/libart.so (art::ThreadSuspendByPeerWarning(art::ScopedObjectAccess&, android::base::LogSeverity, char const*, _jobject*).__uniq.215660552210357940630679712151551015321+288)
#05 pc 000000000024c838  /apex/com.android.art/lib64/libart.so (art::ThreadList::SuspendThreadByPeer(_jobject*, art::SuspendReason, bool*)+3236)
#06 pc 00000000005949e8  /apex/com.android.art/lib64/libart.so (art::Thread_setNativeName(_JNIEnv*, _jobject*, _jstring*).__uniq.300150332875289415499171563183413458937+744)
#07 pc 0000000000439460  /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot.oat (art_jni_trampoline+128)

// ignore more data

java stacktrace:
at java.lang.Thread.setNativeName(Thread.java)
at java.lang.Thread.setName(Thread.java:1383)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.setIndexInArray(CoroutineScheduler.java:588)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.tryTerminateWorker(CoroutineScheduler.java:842)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.park(CoroutineScheduler.java:800)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.tryPark(CoroutineScheduler.java:740)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.java:711)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.java:664)

The above are the logs dumped when the crash happens. They include the Java stack, so it is relatively easy to work out when the crash occurs. All the crashes caused by thread suspension fall into two categories:

AppsFlyer: VMStack.getThreadStackTrace()
Coroutines: Thread.setName()

Each of these two method calls ended in abort(), which raised SIGABRT and crashed the app. Next, let's analyze how each path reaches abort().

Thread.setName()

Based on the stack above, the thread rename was triggered by a coroutine worker thread, so let's start there. First, let's trace what the coroutine scheduler does while it executes tasks.

Coroutine execution process

In Kotlin, a coroutine and a thread are two different concepts. Coroutines are executed on the JVM by threads, but they are not bound to any particular thread. Multiple coroutines can run on a single thread, and a coroutine can switch between threads. This design allows a coroutine to suspend while waiting for, say, an I/O operation to complete without blocking the thread it is on, so that other coroutines can continue to execute on that thread.

The relationship between a coroutine and a thread is more of an abstraction than a direct dependency. Coroutines are controlled by dispatchers, which decide which thread or thread pool they execute on. For example, Dispatchers.Default is intended for CPU-intensive tasks and uses a shared thread pool, while Dispatchers.IO is optimized for I/O operations and also runs on a shared thread pool. In the coroutine scheduler, a thread is called a Worker.

`Worker` creation

When we use the following coroutine code, we dispatch work to the IO dispatcher for network requests and similar tasks, which triggers the creation of a Worker.

fun doSomething(){
    viewModelScope.launch(Dispatchers.IO) {
        //  do something...
    }
}

Both Dispatchers.IO and Dispatchers.Default use CoroutineScheduler internally as their thread pool implementation.

Worker threads are created inside CoroutineScheduler.

private fun createNewWorker(): Int {
    synchronized(workers) {
        // Make sure we're not trying to resurrect terminated scheduler
        if (isTerminated) return -1
        val state = controlState.value
        val created = createdWorkers(state)
        val blocking = blockingTasks(state)
        val cpuWorkers = (created - blocking).coerceAtLeast(0)
        // Double check for overprovision
        if (cpuWorkers >= corePoolSize) return 0
        if (created >= maxPoolSize) return 0
        // start & register new worker, commit index only after successful creation
        val newIndex = createdWorkers + 1
        require(newIndex > 0 && workers[newIndex] == null)
        /*
         * 1) Claim the slot (under a lock) by the newly created worker
         * 2) Make it observable by increment created workers count
         * 3) Only then start the worker, otherwise it may miss its own creation
         */
        val worker = Worker(newIndex)
        workers.setSynchronized(newIndex, worker)
        require(newIndex == incrementCreatedWorkers())
        worker.start()
        return cpuWorkers + 1
    }
}

The code above performs a series of count checks. If a Worker needs to be created, it constructs the Worker object and then calls Thread.start().

internal inner class Worker private constructor() : Thread() {
    init {
        isDaemon = true
    }

    // guarded by scheduler lock, index in workers array, 0 when not in array (terminated)
    @Volatile // volatile for push/pop operation into parkedWorkersStack
    var indexInArray = 0
        set(index) {
            name = "$schedulerName-worker-${if (index == 0) "TERMINATED" else index.toString()}"
            field = index
        }

    constructor(index: Int) : this() {
        indexInArray = index
    }

 
}

OK, here we can see Worker. It has a member variable indexInArray, and its set() method changes the thread name. You can check the issues in the coroutines GitHub repository to see why the thread name is changed here.

In what scenarios is `Thread.setName()` called frequently?

We already know that the coroutine scheduler implements its own thread pool logic, and that thread creation is encapsulated in the CoroutineScheduler class. Now let's go back to the crash log and look at the order of execution that leads to Thread.setName().

The flow can be understood as follows. When we launch several coroutine tasks, each task looks for a Worker in the thread pool, and if none is available, a new one is created. Each Worker spins, checking the state of the tasks and the worker count; if there are no tasks and the number of workers exceeds the core pool size, idle threads are reclaimed. That is why park() can end up in tryTerminateWorker(), as the Case 2 stack shows. When a Worker is terminated, other workers' names have to be updated as well: the workers are stored in an array (an AtomicReferenceArray), so the freed slot is refilled by another Worker, whose indexInArray, and therefore its thread name, is changed by the terminating thread.
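The slot reuse described above can be modelled in a few lines of C++. This is only a sketch based on my reading of tryTerminateWorker(): the names FakeWorker, WorkerName and TerminateWorker are hypothetical, and the assumption that the last worker in the array is the one moved into the freed slot should be checked against the kotlinx.coroutines version you ship.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for CoroutineScheduler.Worker: only the two fields
// that matter for the rename are modelled.
struct FakeWorker {
  int index_in_array;
  std::string name;
};

// Mirrors the naming rule in the indexInArray setter shown earlier.
std::string WorkerName(int index) {
  return "DefaultDispatcher-worker-" +
         (index == 0 ? std::string("TERMINATED") : std::to_string(index));
}

// Terminates the worker stored at old_index. Slot 0 is unused; slots
// 1..created hold live workers. Returns the previous index of the OTHER
// worker that had to be renamed, or 0 if no other worker was touched.
int TerminateWorker(std::vector<FakeWorker*>& workers, int& created, int old_index) {
  assert(old_index >= 1 && old_index <= created);
  FakeWorker* self = workers[old_index];
  self->index_in_array = 0;
  self->name = WorkerName(0);  // a thread renaming itself: no suspension needed

  const int last_index = created--;
  int renamed_other = 0;
  if (last_index != old_index) {
    // Assumption: the last worker is moved into the freed slot.
    FakeWorker* last = workers[last_index];
    workers[old_index] = last;
    last->index_in_array = old_index;
    // In the real scheduler this is thread A renaming thread B, which is the
    // path that goes through SuspendThreadByPeer() in ART.
    last->name = WorkerName(old_index);
    renamed_other = last_index;
  }
  workers[last_index] = nullptr;
  return renamed_other;
}
```

With three workers, terminating worker 1 renames worker 3 to "DefaultDispatcher-worker-1", while terminating the last worker renames nobody else. Only the first case involves one thread changing another thread's name.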

Why is a thread suspended when I change its name or get its stack?

Now that we understand the scenario in which the problem occurs, let's look at why the thread gets suspended.

static void Thread_setNativeName(JNIEnv* env, jobject peer, jstring java_name) {
  ScopedUtfChars name(env, java_name);
  {
    ScopedObjectAccess soa(env);
    if (soa.Decode<mirror::Object>(peer) == soa.Self()->GetPeer()) {
      // 1.
      soa.Self()->SetThreadName(name.c_str());
      return;
    }
  }
  // Suspend thread to avoid it from killing itself while we set its name. We don't just hold the
  // thread list lock to avoid this, as setting the thread name causes mutator to lock/unlock
  // in the DDMS send code.
  ThreadList* thread_list = Runtime::Current()->GetThreadList();
  // Take suspend thread lock to avoid races with threads trying to suspend this one.
  // 2.
  Thread* thread = thread_list->SuspendThreadByPeer(peer, SuspendReason::kInternal);
  if (thread != nullptr) {
    {
      ScopedObjectAccess soa(env);
      thread->SetThreadName(name.c_str());
    }
    bool resumed = thread_list->Resume(thread, SuspendReason::kInternal);
    DCHECK(resumed);
  }
}

The code above is the native implementation that Thread.setNativeName() in Java eventually reaches via JNI.

Code 1 checks whether the thread is changing its own name. If so, it changes the name directly, with no suspension needed.
In code 2, if thread A changes the name of thread B, B must be suspended first and then renamed.

The comment in the source explains why the thread is suspended first: the target thread must not be able to exit while its name is being set, and simply holding the thread list lock is not an option, because setting the name locks and unlocks the mutator lock in the DDMS send code. In other words, suspending the target, changing its name, and then resuming it is how ART keeps the rename thread-safe.

How does the ART VM suspend threads?

Thread suspension check

Next, let's look at how the SuspendThreadByPeer() function is implemented.

static constexpr useconds_t kThreadSuspendInitialSleepUs = 0;
static constexpr useconds_t kThreadSuspendMaxYieldUs = 3000;
static constexpr useconds_t kThreadSuspendMaxSleepUs = 5000;

Thread* ThreadList::SuspendThreadByPeer(jobject peer,
                                        SuspendReason reason,
                                        bool* timed_out) {
  bool request_suspension = true; 
  const uint64_t start_time = NanoTime();
  int self_suspend_count = 0; 
  useconds_t sleep_us = kThreadSuspendInitialSleepUs; 
  *timed_out = false; 
  Thread* const self = Thread::Current(); 
  Thread* suspended_thread = nullptr; 
  VLOG(threads) << "SuspendThreadByPeer starting";
  while (true) {
    Thread* thread;
    {
      ScopedObjectAccess soa(self);
      MutexLock thread_list_mu(self, *Locks::thread_list_lock_); 
      thread = Thread::FromManagedThread(soa, peer); 
      if (thread == nullptr) {

        if (suspended_thread != nullptr) {
          MutexLock suspend_count_mu(self, *Locks::thread_suspend_count_lock_);

          bool updated = suspended_thread->ModifySuspendCount(soa.Self(),
                                                              -1,
                                                              nullptr,
                                                              reason);
          DCHECK(updated);
        }

        ThreadSuspendByPeerWarning(soa,
                                   ::android::base::WARNING,
                                    "No such thread for suspend",
                                    peer);
        return nullptr;
      }

      if (!Contains(thread)) {
        CHECK(suspended_thread == nullptr);

        VLOG(threads) << "SuspendThreadByPeer failed for unattached thread: "
            << reinterpret_cast<void*>(thread);
        return nullptr;
      }
      VLOG(threads) << "SuspendThreadByPeer found thread: " << *thread;
      {
        MutexLock suspend_count_mu(self, *Locks::thread_suspend_count_lock_);
        if (request_suspension) {

          if (self->GetSuspendCount() > 0) {

            ++self_suspend_count;
            continue;
          }
          CHECK(suspended_thread == nullptr);

          suspended_thread = thread;

          bool updated = suspended_thread->ModifySuspendCount(self, +1, nullptr, reason);
          DCHECK(updated);
          request_suspension = false;
        } else {

          CHECK_GT(thread->GetSuspendCount(), 0);
        }
        CHECK_NE(thread, self) << "Attempt to suspend the current thread for the debugger";
        if (thread->IsSuspended()) {

          VLOG(threads) << "SuspendThreadByPeer thread suspended: " << *thread;
          if (ATraceEnabled()) {
            std::string name;
            thread->GetThreadName(name);
            ATraceBegin(StringPrintf("SuspendThreadByPeer suspended %s for peer=%p", name.c_str(),
                                      peer).c_str());
          }
          return thread;
        }

        const uint64_t total_delay = NanoTime() - start_time;
        if (total_delay >= thread_suspend_timeout_ns_) {
          // Timed out: log at FATAL severity, which aborts the process.
          if (suspended_thread == nullptr) {
            ThreadSuspendByPeerWarning(soa,
                                       ::android::base::FATAL,
                                       "Failed to issue suspend request",
                                       peer);
          } else {
            CHECK_EQ(suspended_thread, thread);
            LOG(WARNING) << "Suspended thread state_and_flags: "
                         << suspended_thread->StateAndFlagsAsHexString()
                         << ", self_suspend_count = " << self_suspend_count;

            Locks::thread_suspend_count_lock_->Unlock(self);
            ThreadSuspendByPeerWarning(soa,
                                       ::android::base::FATAL,
                                       "Thread suspension timed out",
                                       peer);
          }

          UNREACHABLE();
        } else if (sleep_us == 0 &&
            total_delay > static_cast<uint64_t>(kThreadSuspendMaxYieldUs) * 1000) {
          // We have been yielding for kThreadSuspendMaxYieldUs; switch to sleeping.
          sleep_us = kThreadSuspendMaxYieldUs / 2;
        }
      }
    }
    VLOG(threads) << "SuspendThreadByPeer waiting to allow thread chance to suspend";

    ThreadSuspendSleep(sleep_us);
    // Double the delay, capped at kThreadSuspendMaxSleepUs (it stays 0 while still yielding).
    sleep_us = std::min(sleep_us * 2, kThreadSuspendMaxSleepUs);
  }
}

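The core logic of the code above is: look up the target thread from its peer, increment its suspend count once to request suspension, then loop until IsSuspended() reports true. If the total wait reaches thread_suspend_timeout_ns_, the function logs at FATAL severity.

The spin wait resembles the Handler + Looper mechanism in that it combines an infinite loop with sleeping. Between iterations it pauses by calling ThreadSuspendSleep(sleep_us).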

static void ThreadSuspendSleep(useconds_t delay_us) {
  if (delay_us == 0) {
    sched_yield(); 
  } else {
    usleep(delay_us); 
  }
}
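To make the yield-then-sleep schedule of the loop concrete, here is a small simulation. It is a sketch, not ART code: SimulateBackoff is a hypothetical helper, the elapsed times are supplied by the caller instead of being measured, and the timeout branch is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Constants copied from the ART listing above (all in microseconds).
constexpr uint32_t kThreadSuspendInitialSleepUs = 0;
constexpr uint32_t kThreadSuspendMaxYieldUs = 3000;
constexpr uint32_t kThreadSuspendMaxSleepUs = 5000;

// For each loop iteration, takes the total time already spent waiting and
// returns the delay that would be handed to ThreadSuspendSleep() on that
// iteration. A delay of 0 means sched_yield(); anything else means usleep().
std::vector<uint32_t> SimulateBackoff(const std::vector<uint64_t>& elapsed_us) {
  std::vector<uint32_t> delays;
  uint32_t sleep_us = kThreadSuspendInitialSleepUs;
  for (uint64_t total_delay_us : elapsed_us) {
    // After yielding for more than kThreadSuspendMaxYieldUs, start sleeping.
    if (sleep_us == 0 && total_delay_us > kThreadSuspendMaxYieldUs) {
      sleep_us = kThreadSuspendMaxYieldUs / 2;
    }
    delays.push_back(sleep_us);
    // Exponential backoff with a cap; 0 * 2 stays 0 during the yield phase.
    sleep_us = std::min(sleep_us * 2, kThreadSuspendMaxSleepUs);
    assert(sleep_us <= kThreadSuspendMaxSleepUs);
  }
  return delays;
}
```

Feeding it elapsed times that cross the 3000 µs mark shows the three phases: pure yielding, a first sleep of 1500 µs, then doubling until the 5000 µs cap is reached.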

sched_yield()

#include <sched.h>
#include "syscall.h"
int sched_yield()
{
	return syscall(SYS_sched_yield);
}

sched_yield() makes the calling thread give up the remainder of its time slice, so the CPU is immediately available to other threads. It does not change the state of the thread: the current thread remains in the ready state.

syscall is a low-level function, common on Linux and other UNIX-like operating systems, for issuing system calls directly from user space.

usleep(delay_us)

#include <time.h>
#include "syscall.h"
int nanosleep(const struct timespec *req, struct timespec *rem)
{
  return syscall_cp(SYS_nanosleep, req, rem);
}

usleep() is implemented on top of nanosleep(). Unlike sched_yield(), this kind of sleep changes the thread state, and it also gives up the CPU.

Why not always use `usleep()`?

Resource utilization and responsiveness: sched_yield() lets the current thread voluntarily give up the CPU without leaving the ready state, so it can continue as soon as it is scheduled again. When the target thread is expected to suspend quickly, this keeps the wait short.
Avoid unnecessary delays: with usleep(), the current thread must wait for at least the specified time before it can continue, even if no other thread in the system needs to run, which can add unnecessary delay.

Setting the suspend flag

The suspension check code above contains two key calls, and both point to the same function.

suspended_thread->ModifySuspendCount(self, +1, nullptr, reason);

suspended_thread is a Thread object, whose implementation lives in thread.cc. Let's check it out:

bool Thread::ModifySuspendCountInternal(Thread* self,
                                        int delta,
                                        AtomicInteger* suspend_barrier,
                                        SuspendReason reason) {

  if (kIsDebugBuild) {
    DCHECK(delta == -1 || delta == +1)
          << reason << " " << delta << " " << this;

    Locks::thread_suspend_count_lock_->AssertHeld(self);

    if (this != self && !IsSuspended()) {
      Locks::thread_list_lock_->AssertHeld(self);
    }
  }

  if (UNLIKELY(reason == SuspendReason::kForUserCode)) {

    Locks::user_code_suspension_lock_->AssertHeld(self);

    if (UNLIKELY(delta + tls32_.user_code_suspend_count < 0)) {
      LOG(ERROR) << "attempting to modify suspend count in an illegal way.";
      return false;
    }
  }

  if (UNLIKELY(delta < 0 && tls32_.suspend_count <= 0)) {
    UnsafeLogFatalForSuspendCount(self, this);
    return false;
  }


  if (delta > 0 && this != self && tlsPtr_.flip_function != nullptr) {
    return false;
  }

  uint32_t flags = enum_cast<uint32_t>(ThreadFlag::kSuspendRequest);

  if (delta > 0 && suspend_barrier != nullptr) {
    uint32_t available_barrier = kMaxSuspendBarriers;

    for (uint32_t i = 0; i < kMaxSuspendBarriers; ++i) {
      if (tlsPtr_.active_suspend_barriers[i] == nullptr) {
        available_barrier = i;
        break;
      }
    }

    if (available_barrier == kMaxSuspendBarriers) {
      return false;
    }

    tlsPtr_.active_suspend_barriers[available_barrier] = suspend_barrier;
    flags |= enum_cast<uint32_t>(ThreadFlag::kActiveSuspendBarrier);
  }

  tls32_.suspend_count += delta;
  switch (reason) {
    case SuspendReason::kForUserCode:

      tls32_.user_code_suspend_count += delta;
      break;
    case SuspendReason::kInternal:

      break;
  }


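  if (tls32_.suspend_count == 0) {
    // No suspension is pending any more: clear the request flag.
    AtomicClearFlag(ThreadFlag::kSuspendRequest);
  } else {
    // Publish kSuspendRequest (and possibly kActiveSuspendBarrier) to the target thread.
    tls32_.state_and_flags.fetch_or(flags, std::memory_order_seq_cst);
    TriggerSuspend();
  }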
  return true; 
}

ModifySuspendCount() eventually executes ModifySuspendCountInternal(). Its core is the code that records the suspend request in the target thread's own state: the suspend count is incremented, the kSuspendRequest flag is set and, optionally, a suspend barrier is registered. A suspend_count greater than 0 means the thread needs to be suspended, but at this point only a flag has been set and the thread has not actually stopped. Isn't that a lot like the Handler mechanism?
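Stripped of locks, barriers and user-code accounting, the bookkeeping reduces to a counter plus one flag bit. The sketch below shows only that reduced form; FakeThreadState and the free function ModifySuspendCount are hypothetical names, not the ART API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, heavily reduced stand-in for the suspend-related fields of
// art::Thread.
struct FakeThreadState {
  int suspend_count = 0;
  uint32_t flags = 0;
};

constexpr uint32_t kSuspendRequestBit = 1u << 0;

// Applies a +1 / -1 change to the suspend count and keeps the request flag
// in sync with it. Returns false if the change would be illegal.
bool ModifySuspendCount(FakeThreadState& t, int delta) {
  assert(delta == +1 || delta == -1);
  if (delta < 0 && t.suspend_count <= 0) {
    return false;  // cannot resume a thread that nobody asked to suspend
  }
  t.suspend_count += delta;
  if (t.suspend_count == 0) {
    t.flags &= ~kSuspendRequestBit;  // last request withdrawn
  } else {
    t.flags |= kSuspendRequestBit;   // at least one request pending
  }
  return true;
}
```

Because it is a count rather than a boolean, two independent requesters can each ask for suspension, and the flag is cleared only after both have resumed the thread.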

When is the suspension actually performed?

Here we have to talk about the checkpoint mechanism in ART. Remember the GC process? For example, when we call System.gc(), is a GC triggered immediately? Interview veterans know the answer is no: a Stop-the-World phase can only run once all threads have reached a safe point (of course, ART's concurrent GC does not keep all threads suspended for the whole collection). This process also relies on the checkpoint mechanism. To stay on topic, a simplified understanding is enough for now.

This part of the code is fairly complex, and a separate article will cover it.

To summarize, we only set a flag on the target thread and then wait for that thread to reach a checkpoint in its own execution. There it checks the flag, and if kSuspendRequest is set, it suspends itself.
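The flag-plus-checkpoint idea can be demonstrated with two ordinary threads. This is a minimal sketch with hypothetical names (FakeThread, Checkpoint, SuspendAndWait); ART parks the target on a condition variable, whereas the sketch simply sleeps in a loop to stay short.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical stand-in for the per-thread state ART keeps in state_and_flags.
struct FakeThread {
  std::atomic<bool> suspend_request{false};  // plays the role of kSuspendRequest
  std::atomic<bool> suspended{false};        // what IsSuspended() would report
  std::atomic<bool> stop{false};
  std::atomic<long> iterations{0};
};

// Executed by the target thread itself at a safe location in its own code.
void Checkpoint(FakeThread& t) {
  if (!t.suspend_request.load()) return;
  t.suspended.store(true);  // acknowledge the request
  while (t.suspend_request.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));  // parked
  }
  t.suspended.store(false);
}

void WorkerLoop(FakeThread& t) {
  while (!t.stop.load()) {
    t.iterations.fetch_add(1);  // "useful work"
    Checkpoint(t);              // e.g. at the end of a loop iteration
  }
}

// Executed by the requesting thread: set the flag, then wait for the
// acknowledgement. Nothing here forces the target to stop.
void SuspendAndWait(FakeThread& t) {
  t.suspend_request.store(true);
  while (!t.suspended.load()) std::this_thread::yield();
}

void Resume(FakeThread& t) { t.suspend_request.store(false); }
```

Once SuspendAndWait() returns, the iterations counter stops changing until Resume() is called. The requester never stops the worker directly; it only waits for the worker to notice the flag.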

In addition, here is the code path that actually performs the suspension. It may not connect smoothly with the rest of this section, but I include it for completeness.

void ConditionVariable::WaitHoldingLocks(Thread* self) {
  DCHECK(self == nullptr || self == Thread::Current());  
  guard_.AssertExclusiveHeld(self); 
  unsigned int old_recursion_count = guard_.recursion_count_;

#if ART_USE_FUTEXES 
  num_waiters_++;  
  guard_.increment_contenders(); 
  guard_.recursion_count_ = 1; 
  int32_t cur_sequence = sequence_.load(std::memory_order_relaxed); 
  guard_.ExclusiveUnlock(self);  

  if (futex(sequence_.Address(), FUTEX_WAIT_PRIVATE, cur_sequence, nullptr, nullptr, 0) != 0) {
    // The futex wait failed; EINTR and EAGAIN are the only expected errors.
    if ((errno != EINTR) && (errno != EAGAIN)) {
      PLOG(FATAL) << "futex wait failed for " << name_;  
    }
  }
  SleepIfRuntimeDeleted(self);  
  guard_.ExclusiveLock(self);  
  CHECK_GT(num_waiters_, 0);  
  num_waiters_--;
  CHECK_GT(guard_.get_contenders(), 0); 
  guard_.decrement_contenders();
#else
  // Fallback without futexes: wait on the pthread condition variable.
  pid_t old_owner = guard_.GetExclusiveOwnerTid(); 
  guard_.exclusive_owner_.store(0 /* pid */, std::memory_order_relaxed); 
  guard_.recursion_count_ = 0;  
  CHECK_MUTEX_CALL(pthread_cond_wait, (&cond_, &guard_.mutex_)); 
  guard_.exclusive_owner_.store(old_owner, std::memory_order_relaxed); 
#endif
  guard_.recursion_count_ = old_recursion_count; 
}

Reason for the suspension timeout

OK, we now know that calling ModifySuspendCount() does not actually suspend the thread. The suspension only happens when the target thread reaches a checkpoint and detects the kSuspendRequest flag. The timeout therefore means that the target thread took too long to reach a checkpoint.

These checks are placed at locations that do not affect the state of the program, such as method calls, the end of a loop iteration, or just before a return. If the thread is slow to reach one of these locations, the checkpoint check is delayed with it.
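The effect of checkpoint spacing on the wait can be worked through with a tiny model. Everything in it is illustrative: WaitUntilSuspended is a hypothetical function, the checkpoint times are made up, and the timeout is whatever value the caller passes in.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Models a target thread by the times (in ms, ascending) at which it passes a
// checkpoint. Returns how long a requester that sets the suspend flag at
// request_time_ms has to wait, or -1 if the wait reaches timeout_ms, which is
// the case where ART would log "Thread suspension timed out".
int64_t WaitUntilSuspended(const std::vector<int64_t>& checkpoint_times_ms,
                           int64_t request_time_ms,
                           int64_t timeout_ms) {
  assert(timeout_ms > 0);
  for (int64_t checkpoint : checkpoint_times_ms) {
    if (checkpoint >= request_time_ms) {
      // First checkpoint at or after the request: the flag is seen here.
      const int64_t waited = checkpoint - request_time_ms;
      return waited < timeout_ms ? waited : -1;
    }
  }
  return -1;  // the thread never reached another checkpoint
}
```

With checkpoints at 0, 5, 10 and then nothing until 20000 ms, a request issued at 7 ms waits only 3 ms, while one issued at 11 ms falls into the long gap and times out against a 10000 ms limit. The request itself is identical in both cases; only the target's distance from its next checkpoint differs.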

How do I fix the crash?

When the timeout occurs, the runtime writes a log:

ThreadSuspendByPeerWarning(soa, ::android::base::FATAL, "Thread suspension timed out", peer);

static void ThreadSuspendByPeerWarning(ScopedObjectAccess& soa,
                                       LogSeverity severity,
                                       const char* message,
                                       jobject peer) REQUIRES_SHARED(Locks::mutator_lock_) {
  ObjPtr<mirror::Object> name =
      WellKnownClasses::java_lang_Thread_name->GetObject(soa.Decode<mirror::Object>(peer));
  if (name == nullptr) {
    LOG(severity) << message << ": " << peer;
  } else {
    LOG(severity) << message << ": " << peer << ":" << name->AsString()->ToModifiedUtf8();
  }
}

This log level is ::android::base::FATAL, which eventually calls abort() and terminates the process.

Since it is impractical to investigate, case by case, why the target thread is late in reaching a checkpoint, I had to find another way. The final solution is to hook the ThreadSuspendByPeerWarning() function and change the LogSeverity from FATAL to INFO or WARNING before calling the original.
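The core of the idea fits in a few lines once the hooking library is taken out of the picture. The sketch below replaces the real ART function with a recording stub, so every name in it (Severity, FakeOriginalWarning, HookedWarning, g_original) is hypothetical.

```cpp
#include <cassert>
#include <cstring>
#include <string>

// Simplified, hypothetical severity scale; the real code uses LogSeverity.
enum class Severity { kInfo, kWarning, kFatal };

// Stand-in for the original ART function. Instead of logging (and aborting
// on kFatal), it records the arguments it finally received.
struct RecordedCall {
  Severity severity = Severity::kInfo;
  std::string message;
};
RecordedCall g_last_call;

void FakeOriginalWarning(Severity severity, const char* message) {
  g_last_call.severity = severity;
  g_last_call.message = message;
}

using WarningFn = void (*)(Severity, const char*);
// A hooking library would store the address of the original function here.
WarningFn g_original = FakeOriginalWarning;

constexpr const char* kTimeoutMessage = "Thread suspension timed out";

// Replacement function: downgrade only the fatal timeout log and forward
// everything else to the original unchanged.
void HookedWarning(Severity severity, const char* message) {
  assert(message != nullptr);
  if (severity == Severity::kFatal && std::strcmp(message, kTimeoutMessage) == 0) {
    severity = Severity::kWarning;  // no longer a severity that aborts
  }
  g_original(severity, message);
}
```

Matching on both the severity and the exact message keeps the change narrow: other fatal logs that pass through the same function, such as "Failed to issue suspend request", keep their original severity.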

Sample code

sys_stub.h

#define SUSPEND_LOG_MSG "Thread suspension timed out"

enum LogSeverity {
    VERBOSE,
    DEBUG,
    INFO,
    WARNING,
    ERROR,
    FATAL_WITHOUT_ABORT,  // For loggability tests, this is considered identical to FATAL.
    FATAL,
};


LogSeverity ToLogSeverity(int logLevel);

const char* getThreadSuspendByPeerWarningFunctionName();

sys_stub.cpp

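#include <jni.h>
#include <android/api-level.h>  // android_get_device_api_level()
#include "sys_stub.h"

// Mangled names of art::ThreadSuspendByPeerWarning() for each range of Android versions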
#define SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_14 "_ZN3artL26ThreadSuspendByPeerWarningERNS_18ScopedObjectAccessEN7android4base11LogSeverityEPKcP8_jobject"
#define SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_8_13 "_ZN3artL26ThreadSuspendByPeerWarningEPNS_6ThreadEN7android4base11LogSeverityEPKcP8_jobject"
#define SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_6_7 "_ZN3artL26ThreadSuspendByPeerWarningEPNS_6ThreadENS_11LogSeverityEPKcP8_jobject"
#define SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_5 "_ZN3artL26ThreadSuspendByPeerWarningEPNS_6ThreadEiPKcP8_jobject"


LogSeverity ToLogSeverity(int logLevel) {
    switch (logLevel) {
        case 0:
            return VERBOSE;
        case 1:
            return DEBUG;
        case 2:
            return INFO;
        case 3:
            return WARNING;
        case 4:
            return ERROR;
        case 5:
            return FATAL_WITHOUT_ABORT;
        case 6:
            return FATAL;
        default:
            return INFO;
    }
}

const char *getThreadSuspendByPeerWarningFunctionName() {
    int apiLevel = android_get_device_api_level();
    // Pick the mangled name that matches the running Android version.
    if (apiLevel < 23) {
        // Android 5.x (API 21-22)
        return SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_5;
    } else if (apiLevel < 26) {
        // Android 6-7 (API 23-25)
        return SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_6_7;
    } else if (apiLevel < 34) {
        // Android 8-13 (API 26-33)
        return SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_8_13;
    } else {
        // Android 14+ (API 34+)
        return SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_14;
    }
}

com_thread_suspend_hook.cpp

#include <jni.h>
#include <atomic>   // std::atomic
#include <cstring>  // strcmp
#include <string>
#include <shadowhook.h>
#include <android/log.h>
#include <pthread.h>
#include "sys_stub.h"
#include <android/api-level.h>

#define TARGET_ART_LIB "libart.so"
#define LOG_TAG "thread_suspend_hook"

namespace hookThreadSuspendAbort {
    JavaVM *gVm = nullptr; 
    jobject callbackObj = nullptr; 

    std::atomic<LogSeverity> m_severity{INFO}; 

    void *originalFunction = nullptr; 
    void *stubFunction = nullptr; 

    typedef void (*ThreadSuspendByPeerWarning)(void *self, LogSeverity severity,
                                               const char *message, jobject peer);

    void triggerSuspendTimeout();

    JNIEnv *getJNIEnv(); 

    void hookPointFailed(const char *msg); 

    void cleanup(JNIEnv *env);


    void threadSuspendByPeerWarning(void *self, LogSeverity severity, const char *message,
                                    jobject peer) {
        __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hooked point success : %s", message);
        if (severity == FATAL && strcmp(message, SUSPEND_LOG_MSG) == 0) {

            severity = m_severity.load();
            triggerSuspendTimeout();
        }
        ((ThreadSuspendByPeerWarning) originalFunction)(self, severity, message, peer);
    }

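    void maskThreadSuspendTimeout(void *self, LogSeverity severity, const char *message, jobject peer) {
        __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hooked point success : %s", message);
        if (severity == FATAL && strcmp(message, SUSPEND_LOG_MSG) == 0) {
            // Swallow the fatal timeout log entirely and only notify the Java callback.
            triggerSuspendTimeout();
            return;
        }
        // Forward every other message to the original function unchanged.
        ((ThreadSuspendByPeerWarning) originalFunction)(self, severity, message, peer);
    }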

    void setLogLevel(LogSeverity severity) {
        m_severity.store(severity);
    }

    void releaseHook(); 

    void prepareSetSuspendTimeoutLevel() {
        releaseHook();
        stubFunction = shadowhook_hook_sym_name(TARGET_ART_LIB,
                                                getThreadSuspendByPeerWarningFunctionName(),
                                                (void *) threadSuspendByPeerWarning,
                                                (void **) &originalFunction);
        if (stubFunction == nullptr) {
            const int err_num = shadowhook_get_errno();
            const char *errMsg = shadowhook_to_errmsg(err_num);
            if (errMsg == nullptr || callbackObj == nullptr) {
                return;
            }
            __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hook setup failed: %s", errMsg);
            hookPointFailed(errMsg);
            // errMsg was not allocated with new here, so it must not be deleted.
        } else {
            __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hook setup success");
        }
    }

    void preparedMaskThreadTimeoutAbort() {
        releaseHook();
        stubFunction = shadowhook_hook_sym_name(TARGET_ART_LIB,
                                                getThreadSuspendByPeerWarningFunctionName(),
                                                (void *) maskThreadSuspendTimeout,
                                                (void **) &originalFunction);
        if (stubFunction == nullptr) {
            const int err_num = shadowhook_get_errno();
            const char *errMsg = shadowhook_to_errmsg(err_num);
            if (errMsg == nullptr || callbackObj == nullptr) {
                return;
            }
            __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hook setup failed: %s", errMsg);
            hookPointFailed(errMsg);
            // errMsg was not allocated with new here, so it must not be deleted.
        } else {
            __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hook setup success");
        }
    }

    void releaseHook() { 

        if (stubFunction != nullptr) {
            shadowhook_unhook(stubFunction);
            stubFunction = nullptr;
        }
    }

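    // True when getJNIEnv() attached the current thread itself, so that cleanup()
    // knows whether it is allowed to detach it.
    thread_local bool attachedByHook = false;

    void cleanup(JNIEnv *env) {
        if (callbackObj) {
            env->DeleteGlobalRef(callbackObj);
            callbackObj = nullptr;
        }
        // Only detach threads that we attached ourselves; a thread that is still
        // running Java code must not be detached.
        if (attachedByHook) {
            if (gVm->DetachCurrentThread() != JNI_OK) {
                __android_log_print(ANDROID_LOG_ERROR, LOG_TAG, "Could not detach current thread.");
            }
            attachedByHook = false;
        }
    }

    JNIEnv *getJNIEnv() {
        JNIEnv *env = nullptr;
        if (gVm == nullptr) {
            return nullptr;
        }
        jint result = gVm->GetEnv(reinterpret_cast<void **>(&env), JNI_VERSION_1_6);
        if (result == JNI_EDETACHED) {
            // The current thread is not attached to the VM yet: attach it.
            if (gVm->AttachCurrentThread(&env, nullptr) != JNI_OK) {
                return nullptr;
            }
            attachedByHook = true;
        } else if (result != JNI_OK) {
            return nullptr;
        }
        return env;
    }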

    void hookPointFailed(const char *errMsg) { 

        JNIEnv *pEnv = getJNIEnv();
        if (pEnv == nullptr) {
            return;
        }
        jclass jThreadHookClass = pEnv->FindClass(
                "com/thread_hook/ThreadSuspendTimeoutCallback");
        if (jThreadHookClass != nullptr) {
            jmethodID jMethodId = pEnv->GetMethodID(jThreadHookClass, "onError",
                                                    "(Ljava/lang/String;)V");
            if (jMethodId != nullptr) {
                pEnv->CallVoidMethod(callbackObj, jMethodId, pEnv->NewStringUTF(errMsg));
            }
        }
        cleanup(pEnv);
    }

    void triggerSuspendTimeout() {
        JNIEnv *pEnv = getJNIEnv();
        // Nothing to notify without a JNI environment or a registered callback.
        if (pEnv == nullptr || callbackObj == nullptr) {
            return;
        }
        jclass jThreadHookClass = pEnv->FindClass(
                "com/thread_hook/ThreadSuspendTimeoutCallback");
        if (jThreadHookClass != nullptr) {
            jmethodID jMethodId = pEnv->GetMethodID(jThreadHookClass, "triggerSuspendTimeout",
                                                    "()V");
            if (jMethodId != nullptr) {
                pEnv->CallVoidMethod(callbackObj, jMethodId);
            }
        }
    }
}

JNIEXPORT jint JNI_OnLoad(JavaVM *vm, void *) {
    using namespace hookThreadSuspendAbort;
    gVm = vm;
    return JNI_VERSION_1_6;
}

extern "C" JNIEXPORT void JNICALL
Java_com_thread_1hook_ThreadHook_setNativeThreadSuspendTimeoutLogLevel(JNIEnv *env,
                                                                       jobject /*this*/,
                                                                       jint logLevel,
                                                                       jobject callback) {
    using namespace hookThreadSuspendAbort;
    // Replace any previously registered callback.
    if (callbackObj != nullptr) {
        env->DeleteGlobalRef(callbackObj);
    }
    callbackObj = env->NewGlobalRef(callback);
    setLogLevel(ToLogSeverity(logLevel));
    prepareSetSuspendTimeoutLevel();
}


extern "C" JNIEXPORT void JNICALL
Java_com_thread_1hook_ThreadHook_maskNativeThreadSuspendTimeoutAbort(JNIEnv *env,
                                                                     jobject /*this*/,
                                                                     jobject callback) {
    using namespace hookThreadSuspendAbort;
    if (callbackObj != nullptr) {
        env->DeleteGlobalRef(callbackObj);
    }
    callbackObj = env->NewGlobalRef(callback);
    preparedMaskThreadTimeoutAbort();
}

A further complication is multi-version compatibility: the mangled name of the function to hook differs between Android versions, so each version has to be adapted and tested separately.
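One way to organize this per-version adaptation is a small table keyed on the API level. The following is only a sketch of that idea: SymbolEntry, kSymbolTable and selectSymbolName are hypothetical names, the API ranges are illustrative, and the strings are placeholders that must be replaced with the mangled names actually found in each version's libart.so.

```cpp
// Hypothetical table entry: an inclusive API-level range and the symbol to hook.
struct SymbolEntry {
    int minApi;
    int maxApi;
    const char *mangledName;
};

// Placeholder strings only; they are not real symbol names.
static const SymbolEntry kSymbolTable[] = {
        {23, 31, "<mangled name found for API 23-31>"},
        {32, 34, "<mangled name found for API 32-34>"},
};

// Returns the symbol name for the given API level, or nullptr if unsupported.
const char *selectSymbolName(int apiLevel) {
    for (const SymbolEntry &entry : kSymbolTable) {
        if (apiLevel >= entry.minApi && apiLevel <= entry.maxApi) {
            return entry.mangledName;
        }
    }
    return nullptr;  // unknown version: better not to install the hook at all
}
```

Returning nullptr for an unknown version keeps the hook from being installed on a system where the symbol has not been verified.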

To find the mangled name for a given version, run readelf -Ws against that version's libart.so and grep for the function name; we will not go into more detail here.

->readelf -Ws libart_android_5_1.so | grep ThreadSuspendByPeerWarning

How do I test for effectiveness?

Since the problem is hard to reproduce naturally, we resort to mock code that calls the ThreadSuspendByPeerWarning() function directly at a chosen moment.

#include <jni.h>
#include <shadowhook.h>
#include <dlfcn.h>
#include <android/log.h>
#include "sys_stub.h"

#define TARGET_ART_LIB "libart.so"
#define LOG_TAG "suspend_hook_test"

namespace suspend_hook_test {

    // Signature of ThreadSuspendByPeerWarning() as called by this test.
    typedef void (*ThreadSuspendByPeerWarning)(void *self,
                                               enum LogSeverity severity,
                                               const char *message,
                                               jobject peer);

    extern "C" JNIEXPORT void JNICALL
    Java_com_thread_1hook_ThreadHook_callNativeThreadSuspendTimeout(JNIEnv *env,
                                                                    jobject /* this */,
                                                                    jlong nativePeer,
                                                                    jobject peer) {
        void *handle = shadowhook_dlopen(TARGET_ART_LIB);
        if (handle == nullptr) {
            __android_log_print(ANDROID_LOG_ERROR, LOG_TAG, "Could not open %s", TARGET_ART_LIB);
            return;
        }
        auto hookPointFunc = (ThreadSuspendByPeerWarning) shadowhook_dlsym(handle,
                                                                           getThreadSuspendByPeerWarningFunctionName());
        if (hookPointFunc != nullptr) {
            // nativePeer holds the address of the native thread object.
            void *child_thread = reinterpret_cast<void *>(nativePeer);
            // Only verified on Android 14.
            __android_log_print(ANDROID_LOG_INFO, LOG_TAG, "thread_point : %p", child_thread);
            hookPointFunc(child_thread, FATAL, SUSPEND_LOG_MSG, peer);
        } else {
            __android_log_print(ANDROID_LOG_ERROR, LOG_TAG, "ELF symbol not found!");
        }
        shadowhook_dlclose(handle);
    }
}

As the code above shows, we open libart.so, look up the symbol with shadowhook_dlsym, and call the function directly. One thing to note:

When the mock function is triggered from the application side, it needs the value of the nativePeer field of the Java Thread object, which holds the address of the corresponding native Thread object (thread.cc). This value can be read through reflection.

import java.lang.reflect.Field

object Utils {
    // Reads the private Thread.nativePeer field through reflection.
    // Returns null if the field cannot be found or accessed.
    fun getNativePeer(thread: Thread): Long? {
        try {
            val nativePeerField: Field = Thread::class.java.getDeclaredField("nativePeer")
            nativePeerField.isAccessible = true
            return nativePeerField.getLong(thread)
        } catch (e: NoSuchFieldException) {
            e.printStackTrace()
        } catch (e: IllegalAccessException) {
            e.printStackTrace()
        }
        return null
    }
}

thread {
    myThread = thread(name = "EdisonLi-init-name") {
        // Trigger the mocked suspend timeout from the thread that gets renamed.
        callThreadSuspendTimeout(myThread!!)
        while (true) {
            // Keep the thread alive.
        }
    }
    // Rename the child thread from another thread after one second.
    Thread.sleep(1000)
    myThread?.name = "Thread-${Random.nextLong(1, 1000)}"
}

Note that callThreadSuspendTimeout(myThread!!) must be called by the very thread whose name is being changed; otherwise an error is reported. In our test on Android 14, the process is no longer terminated by the abort() signal after this function is executed.

As for other versions, this method cannot be called in the same way, because the first argument, the native-side thread pointer, has to be obtained directly.

Other reproduction programs

Alternatively, we can hook the FromManagedThread() function and sleep for about 5 seconds in the proxy function. The subsequent timeout check then detects the timeout and triggers the ThreadSuspendByPeerWarning() function.
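A minimal sketch of such a delaying proxy is shown here, detached from any hooking library so that it can be tried on its own. The signature is a simplified assumption (the real FromManagedThread() takes ART-internal types), and gOrigFromManagedThread, gInjectedDelay and proxyFromManagedThread are hypothetical names; in the real project the original pointer would be filled in by the hooking library when the proxy is installed.

```cpp
#include <chrono>
#include <thread>

// Assumed, simplified signature; not the real ART declaration.
using FromManagedThreadFn = void *(*)(void *soa, void *peer);

// Hypothetical globals: the original function and the delay to inject.
static FromManagedThreadFn gOrigFromManagedThread = nullptr;
static std::chrono::milliseconds gInjectedDelay{5000};

// Proxy that runs in place of the original: stall first, then forward the call.
void *proxyFromManagedThread(void *soa, void *peer) {
    std::this_thread::sleep_for(gInjectedDelay);
    if (gOrigFromManagedThread == nullptr) {
        return nullptr;  // hook not fully installed, nothing to forward to
    }
    return gOrigFromManagedThread(soa, peer);
}
```

Keeping the delay in a variable makes it easy to shorten it in tests and to switch the stall off once the reproduction is no longer needed.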

This also demonstrates that the solution works.

Expected risk

To recap: by hooking the ThreadSuspendByPeerWarning function, we prevent it from logging at ::android::base::FATAL severity, which is what would otherwise cause the process to exit. What happens afterwards depends on the version. There are two cases here.

On Android 6 to 12, the function simply breaks out of the spin and returns nullptr to the caller, which expected to get back the thread it had just suspended (hung).
On Android 12.1 to 14, it does not return nullptr .

Case A: Spin detection ends and nullptr is returned.
Case B: The spin is not affected and continues until the target thread has been suspended successfully.
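The difference between the two cases can be modeled with a small polling loop. This is a simplified model written for illustration, not ART's code: TimeoutPolicy, waitForSuspend and the isSuspended callback are hypothetical names, and the 1 ms poll interval is arbitrary.

```cpp
#include <chrono>
#include <functional>
#include <thread>

enum class TimeoutPolicy { ReturnNull, KeepWaiting };  // Case A / Case B

// Polls until the target reports "suspended".
// Returns the target on success, or nullptr when Case A gives up at the timeout.
void *waitForSuspend(void *target,
                     const std::function<bool()> &isSuspended,
                     std::chrono::milliseconds timeout,
                     TimeoutPolicy policy) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!isSuspended()) {
        if (policy == TimeoutPolicy::ReturnNull &&
            std::chrono::steady_clock::now() >= deadline) {
            // With the warning downgraded by the hook, nothing aborts here.
            return nullptr;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return target;
}
```

In this model, Case A hands the caller a nullptr it has to cope with, while Case B blocks the caller for as long as the target takes to suspend.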

So, for now, we need to consider two things:

A failed suspension causes setName or VMStack.getThreadStackTrace() to return an empty object to Java. In Kotlin, this only affects the debug feature that reports the name of the thread the current coroutine is attached to, so it has no real effect for now.
Does it cause an ANR if the caller keeps spinning while it waits for the suspension?

When setName or VMStack.getThreadStackTrace() performs the suspend operation, it first checks whether the target is the calling thread itself. If so, no suspension takes place and spin detection is not triggered; spin detection only comes into play when one thread operates on another, for example when two threads change each other's names. That leaves the case where the main thread changes the name of a child thread or calls VMStack.getThreadStackTrace() on it: if the wait lasts too long, an ANR may occur, but that is still better than a crash. To summarize, since such suspension timeouts are not very frequent, the probability of hitting the cases above is low.

So far, our own testing with the process above has shown no problems, but the solution is still in the testing phase. We are sharing it early so that everyone can think about its feasibility together; if you have a better solution, or if you spot a problem in the above, please let us know. Thank you very much.

While the current solution reduces the Native Crashes caused by thread suspension timeouts, we still need to look further into our thread and coroutine management strategies (we are not sure whether we are simply using them the wrong way >_<) in order to solve the problem once and for all and improve the stability and performance of the system.

Through this in-depth analysis, we have not only solved a long-standing problem, but also enhanced our understanding of Android’s underlying thread management mechanism, which will help us to better deal with similar problems in the future.

Android Native Crash – Thread Hanging Timeout Issue

By hbb
