This year I have been working on thread-related performance optimization: thread convergence, thread stack optimization, and some OOM problems caused by threads. Recently, while going through the crash dashboard, I found some native crashes caused by thread suspension timeouts. The problem has existed for a long time, but the volume is small, so it is a long-tail issue. I spent some time studying it and came up with a solution, which I share and discuss below.
Stack analysis
- Case 1.
// Crash thread
signal:6 (SIGABRT),code:-1 (SI_QUEUE),fault addr:--------
Abort message:
Thread suspension timed out: 0x6f2e45d888:OkHttp https://dummy.global.com/...
backtrace:
// ignore more data
java stacktrace:
at dalvik.system.VMStack.getThreadStackTrace(VMStack.java)
at java.lang.Thread.getStackTrace(Thread.java:1841)
at java.lang.Thread.getAllStackTraces(Thread.java:1909)
at com.appsflyer.internal.AFa1xSDK$23740.AFInAppEventType(AFa1xSDK.java:113)
at com.appsflyer.internal.AFa1xSDK$23740.values(AFa1xSDK.java:168)
at com.appsflyer.internal.AFa1xSDK$23740.AFInAppEventParameterName(AFa1xSDK.java:73)
at com.appsflyer.internal.AFa1tSDK$28986.AFKeystoreWrapper(AFa1tSDK.java:38)
at java.lang.reflect.Method.invoke(Method.java)
at com.appsflyer.internal.AFc1oSDK.AFKeystoreWrapper(AFc1oSDK.java:159)
at com.appsflyer.internal.AFd1hSDK.values(AFd1hSDK.java:88)
at com.appsflyer.internal.AFd1oSDK.valueOf(AFd1oSDK.java:144)
at com.appsflyer.internal.AFd1zSDK.afErrorLog(AFd1zSDK.java:207)
at com.appsflyer.internal.AFc1bSDK.run(AFc1bSDK.java:4184)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:487)
at java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:644)
at java.lang.Thread.run(Thread.java:1012)
- Case 2.
// Crash thread
signal:6 (SIGABRT),code:-1 (SI_QUEUE),fault addr:--------
Abort message:
Thread suspension timed out: 0x70a383f4d8:DefaultDispatcher-worker-3
backtrace:
#00 pc 00000000000896fc /apex/com.android.runtime/lib64/bionic/libc.so (abort+180)
#01 pc 000000000076fc20 /apex/com.android.art/lib64/libart.so (art::Runtime::Abort(char const*)+904)
#02 pc 00000000000357d0 /apex/com.android.art/lib64/libbase.so (android::base::SetAborter(std::__1::function<void (char const*)>&&)::$_0::__invoke(char const*)+80)
#03 pc 0000000000034d58 /apex/com.android.art/lib64/libbase.so (android::base::LogMessage::~LogMessage()+352)
#04 pc 000000000079bac0 /apex/com.android.art/lib64/libart.so (art::ThreadSuspendByPeerWarning(art::ScopedObjectAccess&, android::base::LogSeverity, char const*, _jobject*).__uniq.215660552210357940630679712151551015321+288)
#05 pc 000000000024c838 /apex/com.android.art/lib64/libart.so (art::ThreadList::SuspendThreadByPeer(_jobject*, art::SuspendReason, bool*)+3236)
#06 pc 00000000005949e8 /apex/com.android.art/lib64/libart.so (art::Thread_setNativeName(_JNIEnv*, _jobject*, _jstring*).__uniq.300150332875289415499171563183413458937+744)
#07 pc 0000000000439460 /data/misc/apexdata/com.android.art/dalvik-cache/arm64/boot.oat (art_jni_trampoline+128)
// ignore more data
java stacktrace:
at java.lang.Thread.setNativeName(Thread.java)
at java.lang.Thread.setName(Thread.java:1383)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.setIndexInArray(CoroutineScheduler.java:588)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.tryTerminateWorker(CoroutineScheduler.java:842)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.park(CoroutineScheduler.java:800)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.tryPark(CoroutineScheduler.java:740)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.runWorker(CoroutineScheduler.java:711)
at kotlinx.coroutines.scheduling.CoroutineScheduler$Worker.run(CoroutineScheduler.java:664)
The logs above were dumped when the crash occurred. They include the Java stack, which makes it fairly easy to see what was running at the time of the crash. All of the crashes caused by thread suspension timeouts fall into two categories:
- AppsFlyer: VMStack.getThreadStackTrace()
- Coroutines: Thread.setName()
Each of these two calls ended in abort(), which raised SIGABRT and crashed the app. Next, let’s analyze in turn how each call ends up in abort().
Thread.setName()
The stack above shows that the thread name was changed by a coroutine worker thread, so let’s start there. First, let’s trace what the coroutine machinery does when it dispatches a task to a scheduler.
How coroutines execute
In Kotlin, a coroutine and a thread are two different concepts. Coroutines run on threads on the JVM, but they are not bound to any particular thread. Multiple coroutines can run on a single thread, and a coroutine can move between threads. This design lets a coroutine suspend while waiting for, say, an I/O operation to complete without blocking the thread it is on, so other coroutines can continue to run on that thread.
The relationship between a coroutine and a thread in Kotlin is an abstraction rather than a direct dependency. Coroutines are controlled by dispatchers, which decide which thread or thread pool they run on. For example, Dispatchers.Default is intended for CPU-intensive tasks and uses a shared thread pool, while Dispatchers.IO is optimized for I/O operations and also runs on a shared thread pool. Inside kotlinx.coroutines, a thread in this pool is called a Worker.
Worker creation
When we write coroutine code like the following, we use the IO dispatcher for network requests and similar work, which triggers the creation of a Worker.
fun doSomething(){
viewModelScope.launch(Dispatchers.IO) {
// do something...
}
}
Both Dispatchers.IO and Dispatchers.Default use CoroutineScheduler internally as their thread pool implementation.
Worker threads are created inside CoroutineScheduler.
private fun createNewWorker(): Int {
synchronized(workers) {
// Make sure we're not trying to resurrect terminated scheduler
if (isTerminated) return -1
val state = controlState.value
val created = createdWorkers(state)
val blocking = blockingTasks(state)
val cpuWorkers = (created - blocking).coerceAtLeast(0)
// Double check for overprovision
if (cpuWorkers >= corePoolSize) return 0
if (created >= maxPoolSize) return 0
// start & register new worker, commit index only after successful creation
val newIndex = createdWorkers + 1
require(newIndex > 0 && workers[newIndex] == null)
/*
* 1) Claim the slot (under a lock) by the newly created worker
* 2) Make it observable by increment created workers count
* 3) Only then start the worker, otherwise it may miss its own creation
*/
val worker = Worker(newIndex)
workers.setSynchronized(newIndex, worker)
require(newIndex == incrementCreatedWorkers())
worker.start()
return cpuWorkers + 1
}
}
The code above performs a series of count checks. If a new Worker is needed, it constructs the Worker object and then calls Thread.start() on it.
internal inner class Worker private constructor() : Thread() {
init {
isDaemon = true
}
// guarded by scheduler lock, index in workers array, 0 when not in array (terminated)
@Volatile // volatile for push/pop operation into parkedWorkersStack
var indexInArray = 0
set(index) {
name = "$schedulerName-worker-${if (index == 0) "TERMINATED" else index.toString()}"
field = index
}
constructor(index: Int) : this() {
indexInArray = index
}
}
Here we can see that Worker has a member variable indexInArray, and that its setter also changes the thread name. The issues in the kotlinx.coroutines GitHub repository explain why the thread name is changed here.
In what scenarios is Thread.setName() called frequently?
We already know that the coroutine scheduler implements its own thread pool logic, and that thread creation is encapsulated in the CoroutineScheduler class. With that in mind, let’s go back to the crash log and look at when Thread.setName() is executed.
The flowchart above can be understood as follows. When we launch several tasks, each task looks for an available worker in the thread pool; if there is none, a new worker is created. Each worker spins, checking the state of the task queue and the number of workers. If there are no tasks and the number of workers exceeds the core pool size, idle threads are reclaimed: tryPark() eventually leads to the worker being terminated. When a worker terminates, the names of the affected workers have to be updated at the same time, because the workers are stored in an array (an AtomicReferenceArray) and each thread name is derived from the worker’s index in that array. As slots are vacated, the indices of the remaining workers are adjusted, and their names change with them.
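The renaming step is easier to see in a small model. The sketch below is my own simplification, based on my reading of tryTerminateWorker() in kotlinx.coroutines, not the library code: the types Worker and Pool and the counter renamed_by_other are hypothetical. It assumes that a terminating worker moves the last worker into its vacated slot and updates that worker’s index, which is what makes one thread rename another.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for CoroutineScheduler.Worker (hypothetical model).
struct Worker {
  int index_in_array = 0;    // 0 means "not in the array" (terminated)
  std::string name;
  int renamed_by_other = 0;  // how often a different worker changed this name

  // Mirrors the indexInArray setter: the name is derived from the index.
  void SetIndex(int index, const Worker* caller) {
    name = "DefaultDispatcher-worker-" +
           (index == 0 ? std::string("TERMINATED") : std::to_string(index));
    index_in_array = index;
    if (caller != this) ++renamed_by_other;
  }
};

// Slots are 1-based, as in the scheduler; slot 0 is unused.
struct Pool {
  std::vector<Worker*> slots{nullptr};

  void Add(Worker* w) {
    slots.push_back(w);
    // Simplification: treat the initial naming as done by the worker itself.
    w->SetIndex(static_cast<int>(slots.size()) - 1, w);
  }

  // Called by `self` when it decides to terminate.
  void Terminate(Worker* self) {
    const int old_index = self->index_in_array;
    assert(old_index > 0 && slots[old_index] == self);
    const int last_index = static_cast<int>(slots.size()) - 1;
    self->SetIndex(0, self);            // renames itself: no suspension needed
    if (last_index != old_index) {
      Worker* last = slots[last_index];
      slots[old_index] = last;          // compact the array
      last->SetIndex(old_index, self);  // renames ANOTHER thread
    }
    slots.pop_back();
  }
};
```

With three workers, terminating worker 1 moves worker 3 into slot 1 and renames it from the terminating thread. Under the assumption above, that cross-thread rename is the call that ends up in SuspendThreadByPeer() in the second crash stack.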
Why does changing a thread’s name or getting its stack suspend the thread?
Now that we understand the scenario in which the problem occurs, let’s look at why the thread is suspended.
static void Thread_setNativeName(JNIEnv* env, jobject peer, jstring java_name) {
ScopedUtfChars name(env, java_name);
{
ScopedObjectAccess soa(env);
if (soa.Decode<mirror::Object>(peer) == soa.Self()->GetPeer()) {
// 1.
soa.Self()->SetThreadName(name.c_str());
return;
}
}
// Suspend thread to avoid it from killing itself while we set its name. We don't just hold the
// thread list lock to avoid this, as setting the thread name causes mutator to lock/unlock
// in the DDMS send code.
ThreadList* thread_list = Runtime::Current()->GetThreadList();
// Take suspend thread lock to avoid races with threads trying to suspend this one.
// 2.
Thread* thread = thread_list->SuspendThreadByPeer(peer, SuspendReason::kInternal);
if (thread != nullptr) {
{
ScopedObjectAccess soa(env);
thread->SetThreadName(name.c_str());
}
bool resumed = thread_list->Resume(thread, SuspendReason::kInternal);
DCHECK(resumed);
}
}
The code above is the native implementation that Thread.setNativeName() in Java eventually reaches through JNI.
At marker 1, the code checks whether the thread is changing its own name. If so, it changes the name directly; no suspension is needed.
At marker 2, if thread A changes the name of thread B, B must be suspended before its name is changed.
A comment in the source explains why the thread is suspended first: it keeps the target thread from exiting ("killing itself") while its name is being set, and simply holding the thread list lock is not an option because setting the name locks and unlocks the mutator lock in the DDMS send code. In short, modifying the name of a running thread could race with that thread’s own state changes, so the thread is suspended, renamed, and then resumed.
How does the ART VM suspend threads?
Suspension check
Next, let’s look at how SuspendThreadByPeer() is implemented.
static constexpr useconds_t kThreadSuspendInitialSleepUs = 0;
static constexpr useconds_t kThreadSuspendMaxYieldUs = 3000;
static constexpr useconds_t kThreadSuspendMaxSleepUs = 5000;
Thread* ThreadList::SuspendThreadByPeer(jobject peer,
SuspendReason reason,
bool* timed_out) {
bool request_suspension = true;
const uint64_t start_time = NanoTime();
int self_suspend_count = 0;
useconds_t sleep_us = kThreadSuspendInitialSleepUs;
*timed_out = false;
Thread* const self = Thread::Current();
Thread* suspended_thread = nullptr;
VLOG(threads) << "SuspendThreadByPeer starting";
while (true) {
Thread* thread;
{
ScopedObjectAccess soa(self);
MutexLock thread_list_mu(self, *Locks::thread_list_lock_);
thread = Thread::FromManagedThread(soa, peer);
if (thread == nullptr) {
if (suspended_thread != nullptr) {
MutexLock suspend_count_mu(self, *Locks::thread_suspend_count_lock_);
bool updated = suspended_thread->ModifySuspendCount(soa.Self(),
-1,
nullptr,
reason);
DCHECK(updated);
}
ThreadSuspendByPeerWarning(soa,
::android::base::WARNING,
"No such thread for suspend",
peer);
return nullptr;
}
if (!Contains(thread)) {
CHECK(suspended_thread == nullptr);
VLOG(threads) << "SuspendThreadByPeer failed for unattached thread: "
<< reinterpret_cast<void*>(thread);
return nullptr;
}
VLOG(threads) << "SuspendThreadByPeer found thread: " << *thread;
{
MutexLock suspend_count_mu(self, *Locks::thread_suspend_count_lock_);
if (request_suspension) {
if (self->GetSuspendCount() > 0) {
++self_suspend_count;
continue;
}
CHECK(suspended_thread == nullptr);
suspended_thread = thread;
bool updated = suspended_thread->ModifySuspendCount(self, +1, nullptr, reason);
DCHECK(updated);
request_suspension = false;
} else {
CHECK_GT(thread->GetSuspendCount(), 0);
}
CHECK_NE(thread, self) << "Attempt to suspend the current thread for the debugger";
if (thread->IsSuspended()) {
VLOG(threads) << "SuspendThreadByPeer thread suspended: " << *thread;
if (ATraceEnabled()) {
std::string name;
thread->GetThreadName(name);
ATraceBegin(StringPrintf("SuspendThreadByPeer suspended %s for peer=%p", name.c_str(),
peer).c_str());
}
return thread;
}
const uint64_t total_delay = NanoTime() - start_time;
if (total_delay >= thread_suspend_timeout_ns_) {
if (suspended_thread == nullptr) {
ThreadSuspendByPeerWarning(soa,
::android::base::FATAL,
"Failed to issue suspend request",
peer);
} else {
CHECK_EQ(suspended_thread, thread);
LOG(WARNING) << "Suspended thread state_and_flags: "
<< suspended_thread->StateAndFlagsAsHexString()
<< ", self_suspend_count = " << self_suspend_count;
Locks::thread_suspend_count_lock_->Unlock(self);
ThreadSuspendByPeerWarning(soa,
::android::base::FATAL,
"Thread suspension timed out",
peer);
}
UNREACHABLE();
} else if (sleep_us == 0 &&
total_delay > static_cast<uint64_t>(kThreadSuspendMaxYieldUs) * 1000) {
// Spun for kThreadSuspendMaxYieldUs already: switch from yielding to sleeping.
sleep_us = kThreadSuspendMaxYieldUs / 2;
}
}
}
VLOG(threads) << "SuspendThreadByPeer waiting to allow thread chance to suspend";
ThreadSuspendSleep(sleep_us);
sleep_us = std::min(sleep_us * 2, kThreadSuspendMaxSleepUs);
}
}
The core logic of the code above is easier to follow once summarized: it requests the suspension once, then spins until the target thread is suspended or the timeout expires.
The spin wait resembles the Handler + Looper mechanism in that it combines an infinite loop with sleeping. The sleeping is done by ThreadSuspendSleep(sleep_us).
static void ThreadSuspendSleep(useconds_t delay_us) {
if (delay_us == 0) {
sched_yield();
} else {
usleep(delay_us);
}
}
sched_yield()
#include <sched.h>
#include "syscall.h"
int sched_yield()
{
return syscall(SYS_sched_yield);
}
sched_yield() makes the current thread give up the rest of its time slice without changing the thread’s state: the CPU becomes immediately available to other threads, and the current thread stays in the ready state.
syscall is a low-level function on Linux and other UNIX-like operating systems for making system calls directly from user space.
usleep(delay_us)
#include <time.h>
#include "syscall.h"
int nanosleep(const struct timespec *req, struct timespec *rem)
{
return syscall_cp(SYS_nanosleep, req, rem);
}
usleep() is implemented on top of nanosleep(), shown above. This kind of sleep changes the thread state and also gives up the CPU.
Why not always use usleep()?
Resource utilization and responsiveness: sched_yield() lets the current thread voluntarily yield the CPU without leaving the ready state, so it can continue as soon as it is scheduled again. This helps in highly concurrent code, where it reduces wait time and improves throughput.
Avoiding unnecessary delay: with usleep(), the current thread must wait for the specified time before it can continue, even if no other thread needs to run, which can introduce unnecessary delay.
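To make the yield-then-sleep schedule concrete, here is a small simulation of how the delay evolves across loop iterations. It is a sketch, not ART code: the function BackoffSchedule and the parameter yield_cost_us (how long one yield round is assumed to take) are my own, and only the three constants come from the listing above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint32_t kInitialSleepUs = 0;  // kThreadSuspendInitialSleepUs
constexpr uint32_t kMaxYieldUs = 3000;   // kThreadSuspendMaxYieldUs
constexpr uint32_t kMaxSleepUs = 5000;   // kThreadSuspendMaxSleepUs

// Returns the delay used in each loop iteration until the simulated clock
// reaches timeout_us. A delay of 0 stands for sched_yield().
// yield_cost_us is an assumed cost of one yield round (hypothetical).
std::vector<uint32_t> BackoffSchedule(uint64_t timeout_us, uint32_t yield_cost_us) {
  assert(yield_cost_us > 0);  // otherwise the simulated clock never advances
  std::vector<uint32_t> delays;
  uint64_t elapsed_us = 0;
  uint32_t sleep_us = kInitialSleepUs;
  while (elapsed_us < timeout_us) {
    // After yielding for more than kMaxYieldUs, switch to real sleeps.
    if (sleep_us == 0 && elapsed_us > kMaxYieldUs) {
      sleep_us = kMaxYieldUs / 2;
    }
    delays.push_back(sleep_us);
    elapsed_us += (sleep_us == 0) ? yield_cost_us : sleep_us;
    // Double the sleep each round, capped at kMaxSleepUs (0 stays 0).
    sleep_us = std::min(sleep_us * 2, kMaxSleepUs);
  }
  return delays;
}
```

With an assumed 1000 µs per yield round and a 20000 µs budget, the schedule is four yields, then sleeps of 1500, 3000 and 5000 µs, after which it stays at 5000 µs. The loop starts cheap and responsive and only later backs off to save CPU.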
Setting the suspend flag
The suspension check above has two key call sites, and both call the same function.
suspended_thread->ModifySuspendCount(self, +1, nullptr, reason);
suspended_thread is a Thread object, whose implementation lives in thread.cc. Let’s take a look:
bool Thread::ModifySuspendCountInternal(Thread* self,
int delta,
AtomicInteger* suspend_barrier,
SuspendReason reason) {
if (kIsDebugBuild) {
DCHECK(delta == -1 || delta == +1)
<< reason << " " << delta << " " << this;
Locks::thread_suspend_count_lock_->AssertHeld(self);
if (this != self && !IsSuspended()) {
Locks::thread_list_lock_->AssertHeld(self);
}
}
if (UNLIKELY(reason == SuspendReason::kForUserCode)) {
Locks::user_code_suspension_lock_->AssertHeld(self);
if (UNLIKELY(delta + tls32_.user_code_suspend_count < 0)) {
LOG(ERROR) << "attempting to modify suspend count in an illegal way.";
return false;
}
}
if (UNLIKELY(delta < 0 && tls32_.suspend_count <= 0)) {
UnsafeLogFatalForSuspendCount(self, this);
return false;
}
if (delta > 0 && this != self && tlsPtr_.flip_function != nullptr) {
return false;
}
uint32_t flags = enum_cast<uint32_t>(ThreadFlag::kSuspendRequest);
if (delta > 0 && suspend_barrier != nullptr) {
uint32_t available_barrier = kMaxSuspendBarriers;
for (uint32_t i = 0; i < kMaxSuspendBarriers; ++i) {
if (tlsPtr_.active_suspend_barriers[i] == nullptr) {
available_barrier = i;
break;
}
}
if (available_barrier == kMaxSuspendBarriers) {
return false;
}
tlsPtr_.active_suspend_barriers[available_barrier] = suspend_barrier;
flags |= enum_cast<uint32_t>(ThreadFlag::kActiveSuspendBarrier);
}
tls32_.suspend_count += delta;
switch (reason) {
case SuspendReason::kForUserCode:
tls32_.user_code_suspend_count += delta;
break;
case SuspendReason::kInternal:
break;
}
if (tls32_.suspend_count == 0) {
AtomicClearFlag(ThreadFlag::kSuspendRequest);
} else {
tls32_.state_and_flags.fetch_or(flags, std::memory_order_seq_cst);
TriggerSuspend();
}
return true;
}
ModifySuspendCount() eventually calls ModifySuspendCountInternal(). Its core is the bookkeeping: it registers the suspend barrier in tlsPtr_ and updates the suspend count. When suspend_count > 0, the thread needs to be suspended, but at this point only a flag has been set. Isn’t that a lot like the Handler mechanism?
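The relationship between the counter and the flag can be shown in isolation. This is a minimal model, not the ART implementation: ThreadModel, SuspendRequested() and the bit position of kSuspendRequest are assumptions made for the example, and barriers and user-code suspension are left out.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kSuspendRequest = 1u << 0;  // hypothetical bit position

struct ThreadModel {
  int suspend_count = 0;
  std::atomic<uint32_t> flags{0};

  // Same shape as ModifySuspendCountInternal: delta is +1 or -1.
  bool ModifySuspendCount(int delta) {
    assert(delta == 1 || delta == -1);
    if (delta < 0 && suspend_count <= 0) {
      return false;  // unbalanced resume; this model just rejects it
    }
    suspend_count += delta;
    if (suspend_count == 0) {
      // Last requester is gone: clear the flag.
      flags.fetch_and(~kSuspendRequest, std::memory_order_seq_cst);
    } else {
      // At least one requester: make sure the flag is set.
      flags.fetch_or(kSuspendRequest, std::memory_order_seq_cst);
    }
    return true;
  }

  bool SuspendRequested() const {
    return (flags.load(std::memory_order_seq_cst) & kSuspendRequest) != 0;
  }
};
```

Two requesters can ask for suspension at the same time; the flag stays set until the second one resumes the thread. Nothing in this function stops the target thread, it only leaves a note for it.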
When is the suspension actually performed?
Here we have to talk about ART’s checkpoint mechanism. Remember the GC process? For example, does calling System.gc() trigger a GC immediately? Interview veterans know the answer is no: the runtime has to wait until all threads have reached a safepoint, and only then can it run the part of the GC that requires stopping the world. (ART uses a concurrent GC, so threads do not have to stay suspended for the whole collection.) This process also relies on the checkpoint mechanism. To stay on topic, we will not go deeper into it here.
That part of the code is also fairly complex, and a separate article will cover it.
To summarize: we only set a flag on the target thread, then wait for that thread’s own execution to reach a checkpoint, where it checks the flag; if kSuspendRequest is set, it suspends itself.
For completeness, here is the code path that actually performs the suspension. It may feel slightly disconnected from the rest, but it is included for reference.
void ConditionVariable::WaitHoldingLocks(Thread* self) {
DCHECK(self == nullptr || self == Thread::Current());
guard_.AssertExclusiveHeld(self);
unsigned int old_recursion_count = guard_.recursion_count_;
#if ART_USE_FUTEXES
num_waiters_++;
guard_.increment_contenders();
guard_.recursion_count_ = 1;
int32_t cur_sequence = sequence_.load(std::memory_order_relaxed);
guard_.ExclusiveUnlock(self);
if (futex(sequence_.Address(), FUTEX_WAIT_PRIVATE, cur_sequence, nullptr, nullptr, 0) != 0) {
// EINTR (a signal arrived) and EAGAIN are expected; anything else is fatal.
if ((errno != EINTR) && (errno != EAGAIN)) {
PLOG(FATAL) << "futex wait failed for " << name_;
}
}
SleepIfRuntimeDeleted(self);
guard_.ExclusiveLock(self);
CHECK_GT(num_waiters_, 0);
num_waiters_--;
CHECK_GT(guard_.get_contenders(), 0);
guard_.decrement_contenders();
#else
pid_t old_owner = guard_.GetExclusiveOwnerTid();
guard_.exclusive_owner_.store(0 /* pid */, std::memory_order_relaxed);
guard_.recursion_count_ = 0;
CHECK_MUTEX_CALL(pthread_cond_wait, (&cond_, &guard_.mutex_));
guard_.exclusive_owner_.store(old_owner, std::memory_order_relaxed);
#endif
guard_.recursion_count_ = old_recursion_count;
}
Reason for the suspension timeout
We now know that ModifySuspendCount() has run, but the suspension itself has not happened yet: it happens only when the target thread reaches a checkpoint and sees the kSuspendRequest flag. The timeout therefore means that the target thread took too long to reach a checkpoint.
These checks are placed at locations where suspending does not affect program state, such as method calls, the end of a loop iteration, or just before a return. If the thread takes a long time to reach one of these locations, the checkpoint check is delayed.
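The interaction between the requester and the target can be modelled with a simulated clock. The sketch below is deliberately single-threaded and is not how ART is structured: MutatorModel, Tick() and RequestSuspend() are hypothetical names, and one tick is an arbitrary unit of time.

```cpp
#include <cassert>

// A thread that only looks at its suspend flag when it reaches a checkpoint.
struct MutatorModel {
  bool suspend_requested = false;
  bool suspended = false;
  int work_until_checkpoint;  // ticks of work left before the next checkpoint

  explicit MutatorModel(int ticks) : work_until_checkpoint(ticks) {}

  // Advance the target by one tick of simulated time.
  void Tick() {
    if (suspended) return;
    if (work_until_checkpoint > 0) {
      --work_until_checkpoint;  // still in code that contains no checkpoint
      return;
    }
    if (suspend_requested) suspended = true;  // checkpoint: honour the flag
  }
};

// Returns the tick at which the target was seen suspended, or -1 on timeout.
int RequestSuspend(MutatorModel& target, int timeout_ticks) {
  assert(timeout_ticks >= 0);
  target.suspend_requested = true;  // step 1: only a flag is set
  for (int tick = 0; tick < timeout_ticks; ++tick) {
    target.Tick();                  // step 2: the target runs on its own
    if (target.suspended) return tick + 1;
  }
  return -1;  // this is where ART logs at FATAL and aborts
}
```

A target that is two ticks away from a checkpoint suspends on the third tick. A target that is fifty ticks away cannot be suspended within a ten-tick budget, however often the requester polls, which is exactly the timeout seen in the crash logs.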
How do I fix the crash?
When the timeout occurs, the runtime writes a log:
ThreadSuspendByPeerWarning(soa, ::android::base::FATAL, "Thread suspension timed out", peer);
static void ThreadSuspendByPeerWarning(ScopedObjectAccess& soa,
LogSeverity severity,
const char* message,
jobject peer) REQUIRES_SHARED(Locks::mutator_lock_) {
ObjPtr<mirror::Object> name =
WellKnownClasses::java_lang_Thread_name->GetObject(soa.Decode<mirror::Object>(peer));
if (name == nullptr) {
LOG(severity) << message << ": " << peer;
} else {
LOG(severity) << message << ": " << peer << ":" << name->AsString()->ToModifiedUtf8();
}
}
The log level here is ::android::base::FATAL, which ends in a call to abort() and terminates the process.
Since it is impractical to investigate, case by case, why each crashing thread was slow to reach a checkpoint, I had to find another way. The final solution is to hook ThreadSuspendByPeerWarning() and change the LogSeverity from FATAL to INFO or WARNING before calling the original function.
Sample code
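Stripped of the hooking library and JNI, the idea reduces to a proxy function that rewrites one argument before forwarding the call. The following is a simplified model under my own assumptions: the Severity enum, OriginalWarning, ProxyWarning and the g_ variables are hypothetical stand-ins, and the real function takes more parameters.

```cpp
#include <cassert>
#include <cstring>

enum class Severity { kInfo, kWarning, kError, kFatal };

using WarnFn = void (*)(Severity severity, const char* message);

// Records what the "original" function was asked to log.
struct Recorded {
  Severity severity = Severity::kInfo;
  const char* message = nullptr;
  int calls = 0;
};
Recorded g_last;

// Stand-in for the function inside libart.so.
void OriginalWarning(Severity severity, const char* message) {
  g_last.severity = severity;
  g_last.message = message;
  ++g_last.calls;
}

WarnFn g_original = OriginalWarning;          // a hook library would fill this in
Severity g_replacement = Severity::kWarning;  // level to downgrade to

constexpr const char* kTimeoutMessage = "Thread suspension timed out";

// The proxy installed in place of the original function.
void ProxyWarning(Severity severity, const char* message) {
  assert(message != nullptr);
  // Downgrade only the one fatal message we want to survive.
  if (severity == Severity::kFatal && std::strcmp(message, kTimeoutMessage) == 0) {
    severity = g_replacement;
  }
  g_original(severity, message);
}
```

Matching on both the severity and the exact message keeps the change narrow: other fatal logs that pass through the same function, such as "Failed to issue suspend request", still abort as before.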
sys_stub.h
#define SUSPEND_LOG_MSG "Thread suspension timed out"
enum LogSeverity {
VERBOSE,
DEBUG,
INFO,
WARNING,
ERROR,
FATAL_WITHOUT_ABORT, // For loggability tests, this is considered identical to FATAL.
FATAL,
};
LogSeverity ToLogSeverity(int logLevel);
const char* getThreadSuspendByPeerWarningFunctionName();
sys_stub.cpp
#include <jni.h>
#include <android/api-level.h> // android_get_device_api_level()
#include "sys_stub.h"
// Function signatures updated for readability
#define SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_14 "_ZN3artL26ThreadSuspendByPeerWarningERNS_18ScopedObjectAccessEN7android4base11LogSeverityEPKcP8_jobject"
#define SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_8_13 "_ZN3artL26ThreadSuspendByPeerWarningEPNS_6ThreadEN7android4base11LogSeverityEPKcP8_jobject"
#define SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_6_7 "_ZN3artL26ThreadSuspendByPeerWarningEPNS_6ThreadENS_11LogSeverityEPKcP8_jobject"
#define SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_5 "_ZN3artL26ThreadSuspendByPeerWarningEPNS_6ThreadEiPKcP8_jobject"
LogSeverity ToLogSeverity(int logLevel) {
switch (logLevel) {
case 0:
return VERBOSE;
case 1:
return DEBUG;
case 2:
return INFO;
case 3:
return WARNING;
case 4:
return ERROR;
case 5:
return FATAL_WITHOUT_ABORT;
case 6:
return FATAL;
default:
return INFO;
}
}
const char *getThreadSuspendByPeerWarningFunctionName() {
int apiLevel = android_get_device_api_level();
// Simplified logic based on Android API levels
if (apiLevel < 23){
return SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_5;
} else if (apiLevel < 26) {
// below android 8
return SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_6_7;
} else if (apiLevel < 34) {
// above android 8 and below android 14
return SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_8_13;
} else {
// android 14+
return SYMBOL_THREAD_SUSPEND_BY_PEER_WARNING_14;
}
}
com_thread_suspend_hook.cpp
#include <jni.h>
#include <atomic>  // std::atomic
#include <cstring> // strcmp
#include <string>
#include <shadowhook.h>
#include <android/log.h>
#include <pthread.h>
#include "sys_stub.h"
#include <android/api-level.h>
#define TARGET_ART_LIB "libart.so"
#define LOG_TAG "thread_suspend_hook"
namespace hookThreadSuspendAbort {
JavaVM *gVm = nullptr;
jobject callbackObj = nullptr;
std::atomic<LogSeverity> m_severity{INFO};
void *originalFunction = nullptr;
void *stubFunction = nullptr;
typedef void (*ThreadSuspendByPeerWarning)(void *self, LogSeverity severity,
const char *message, jobject peer);
void triggerSuspendTimeout();
JNIEnv *getJNIEnv();
void hookPointFailed(const char *msg);
void cleanup(JNIEnv *env);
void threadSuspendByPeerWarning(void *self, LogSeverity severity, const char *message,
jobject peer) {
__android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hooked point success : %s", message);
if (severity == FATAL && strcmp(message, SUSPEND_LOG_MSG) == 0) {
severity = m_severity.load();
triggerSuspendTimeout();
}
((ThreadSuspendByPeerWarning) originalFunction)(self, severity, message, peer);
}
void maskThreadSuspendTimeout(void *self, LogSeverity severity, const char *message, jobject peer) {
__android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hooked point success : %s", message);
if (severity == FATAL && strcmp(message, SUSPEND_LOG_MSG) == 0) {
// Swallow the fatal log entirely and only notify the Java side.
triggerSuspendTimeout();
}
}
void setLogLevel(LogSeverity severity) {
m_severity.store(severity);
}
void releaseHook();
void prepareSetSuspendTimeoutLevel() {
releaseHook();
stubFunction = shadowhook_hook_sym_name(TARGET_ART_LIB,
getThreadSuspendByPeerWarningFunctionName(),
(void *) threadSuspendByPeerWarning,
(void **) &originalFunction);
if (stubFunction == nullptr) {
const int err_num = shadowhook_get_errno();
const char *errMsg = shadowhook_to_errmsg(err_num);
if (errMsg == nullptr || callbackObj == nullptr) {
return;
}
__android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hook setup failed: %s", errMsg);
hookPointFailed(errMsg);
// errMsg was not allocated with new in this code, so it must not be deleted.
} else {
__android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hook setup success");
}
}
void preparedMaskThreadTimeoutAbort() {
releaseHook();
stubFunction = shadowhook_hook_sym_name(TARGET_ART_LIB,
getThreadSuspendByPeerWarningFunctionName(),
(void *) maskThreadSuspendTimeout,
(void **) &originalFunction);
if (stubFunction == nullptr) {
const int err_num = shadowhook_get_errno();
const char *errMsg = shadowhook_to_errmsg(err_num);
if (errMsg == nullptr || callbackObj == nullptr) {
return;
}
__android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hook setup failed: %s", errMsg);
hookPointFailed(errMsg);
// errMsg was not allocated with new in this code, so it must not be deleted.
} else {
__android_log_print(ANDROID_LOG_INFO, LOG_TAG, "Hook setup success");
}
}
void releaseHook() {
if (stubFunction != nullptr) {
shadowhook_unhook(stubFunction);
stubFunction = nullptr;
}
}
void cleanup(JNIEnv *env) {
if (callbackObj) {
env->DeleteGlobalRef(callbackObj);
callbackObj = nullptr;
}
if (gVm->DetachCurrentThread() != JNI_OK) {
__android_log_print(ANDROID_LOG_ERROR, LOG_TAG, "Could not detach current thread.");
}
}
JNIEnv *getJNIEnv() {
JNIEnv *env = nullptr;
if (gVm == nullptr) {
return nullptr;
}
jint result = gVm->GetEnv(reinterpret_cast<void **>(&env), JNI_VERSION_1_6);
if (result == JNI_EDETACHED) {
if (gVm->AttachCurrentThread(&env, nullptr) != 0) {
return nullptr;
}
} else if (result != JNI_OK) {
return nullptr;
}
return env;
}
void hookPointFailed(const char *errMsg) {
JNIEnv *pEnv = getJNIEnv();
if (pEnv == nullptr) {
return;
}
jclass jThreadHookClass = pEnv->FindClass(
"com/thread_hook/ThreadSuspendTimeoutCallback");
if (jThreadHookClass != nullptr) {
jmethodID jMethodId = pEnv->GetMethodID(jThreadHookClass, "onError",
"(Ljava/lang/String;)V");
if (jMethodId != nullptr) {
pEnv->CallVoidMethod(callbackObj, jMethodId, pEnv->NewStringUTF(errMsg));
}
}
cleanup(pEnv);
}
void triggerSuspendTimeout() {
JNIEnv *pEnv = getJNIEnv();
if (pEnv == nullptr) {
return;
}
jclass jThreadHookClass = pEnv->FindClass(
"com/thread_hook/ThreadSuspendTimeoutCallback");
if (jThreadHookClass != nullptr) {
jmethodID jMethodId = pEnv->GetMethodID(jThreadHookClass, "triggerSuspendTimeout",
"()V");
if (jMethodId != nullptr) {
pEnv->CallVoidMethod(callbackObj, jMethodId);
}
}
}
}
JNIEXPORT jint JNI_OnLoad(JavaVM *vm, void *) {
using namespace hookThreadSuspendAbort;
gVm = vm;
return JNI_VERSION_1_6;
}
extern "C" JNIEXPORT void JNICALL
Java_com_thread_1hook_ThreadHook_setNativeThreadSuspendTimeoutLogLevel(JNIEnv *env,
jobject,
jint logLevel,
jobject callback) {
using namespace hookThreadSuspendAbort;
if (callbackObj != nullptr) {
env->DeleteGlobalRef(callbackObj);
}
callbackObj = env->NewGlobalRef(callback);
setLogLevel(ToLogSeverity(logLevel));
prepareSetSuspendTimeoutLevel();
}
extern "C" JNIEXPORT void JNICALL
Java_com_thread_1hook_ThreadHook_maskNativeThreadSuspendTimeoutAbort(JNIEnv *env,
jobject /*this*/,
jobject callback) {
using namespace hookThreadSuspendAbort;
if (callbackObj != nullptr) {
env->DeleteGlobalRef(callbackObj);
}
callbackObj = env->NewGlobalRef(callback);
preparedMaskThreadTimeoutAbort();
}
The tricky part is multi-version compatibility: the mangled name of the hooked function changes between Android versions, so it has to be adapted and tested on each one.
To find the mangled name for a given version, use the readelf -Ws command; we will not explain it in detail here.
readelf -Ws libart_android_5_1.so | grep ThreadSuspendByPeerWarning
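Once readelf prints a candidate symbol, it helps to confirm that the mangled string really decodes to the signature you expect. A minimal sketch using the Itanium ABI demangler that ships with GCC and Clang (abi::__cxa_demangle from cxxabi.h); the helper name Demangle is my own.

```cpp
#include <cassert>
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Returns the demangled form of `mangled`, or an empty string on failure.
std::string Demangle(const char* mangled) {
  assert(mangled != nullptr);
  int status = 0;
  // Passing nullptr for the buffer makes __cxa_demangle allocate one with malloc.
  char* text = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
  if (status != 0 || text == nullptr) {
    return {};
  }
  std::string result(text);
  std::free(text);  // the caller owns the malloc'ed buffer
  return result;
}
```

Feeding it the Android 5 symbol from sys_stub.cpp should give back a readable signature containing ThreadSuspendByPeerWarning and its parameter types, which makes it easy to see at a glance which versions take a Thread pointer and which take a ScopedObjectAccess reference as the first parameter.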
How do I test that it works?
Since the problem itself is hard to reproduce, we have to call ThreadSuspendByPeerWarning() directly at some point from mock code.
#include <jni.h>
#include <shadowhook.h>
#include <dlfcn.h>
#include <android/log.h>
#include "sys_stub.h"
#define TARGET_ART_LIB "libart.so"
#define LOG_TAG "suspend_hook_test"
namespace suspend_hook_test {
typedef void (*ThreadSuspendByPeerWarning)(void *self,
enum LogSeverity severity,
const char *message,
jobject peer);
extern "C" JNIEXPORT
void JNICALL
Java_com_thread_1hook_ThreadHook_callNativeThreadSuspendTimeout(JNIEnv *env,
jobject javaThread /* this */,
jlong nativePeer,
jobject peer) {
void *handle = shadowhook_dlopen(TARGET_ART_LIB);
auto hookPointFunc = (ThreadSuspendByPeerWarning) shadowhook_dlsym(handle,
getThreadSuspendByPeerWarningFunctionName());
if (hookPointFunc != nullptr) {
void *child_thread = reinterpret_cast<void *>(nativePeer);
// only 14 worked for test.
__android_log_print(ANDROID_LOG_INFO, LOG_TAG, "thread_point : %p", child_thread);
hookPointFunc(child_thread, FATAL, SUSPEND_LOG_MSG, peer);
} else {
__android_log_print(ANDROID_LOG_ERROR, LOG_TAG, "ELF symbol not found!");
}
}
}
In the code above, shadowhook_dlopen() obtains a handle to libart.so, shadowhook_dlsym() resolves the symbol, and the function is then called directly. One note:
When the mock function is triggered from the application side, you need the value of the nativePeer field of java.lang.Thread, which holds the address of the native Thread object (thread.cc). It is obtained through reflection.
object Utils {
fun getNativePeer(thread: Thread): Long? {
try {
val threadClass = Class.forName("java.lang.Thread")
val nativePeerField: Field = threadClass.getDeclaredField("nativePeer")
nativePeerField.isAccessible = true
return nativePeerField.getLong(thread)
} catch (e: ClassNotFoundException) {
e.printStackTrace()
} catch (e: NoSuchFieldException) {
e.printStackTrace()
} catch (e: IllegalAccessException) {
e.printStackTrace()
}
return null
}
}
thread {
myThread = thread(name = "EdisonLi-init-name") {
callThreadSuspendTimeout(myThread!!)
while (true) {
// Log.d("EdisonLi", myThread?.name.toString())
}
}
while (true) {
Thread.sleep(1000)
myThread?.name = "Thread-${Random.nextLong(1, 1000)}"
break
}
}
callThreadSuspendTimeout(myThread!!) must be called by the thread whose name is being changed; otherwise it reports an error. In my test on Android 14, after this function ran, the process was no longer terminated by the abort() signal.
On other versions this mock call cannot be used as is, because the first parameter is a native-side thread pointer that has to be obtained directly.
Other ways to reproduce
We can also hook FromManagedThread() and sleep for about 5 seconds in the proxy function. The subsequent timeout check then detects the timeout and calls ThreadSuspendByPeerWarning().
This also demonstrates that the fix works.
Expected risk
To restate: by hooking ThreadSuspendByPeerWarning(), we stop it from logging at ::android::base::FATAL, which is what would otherwise make the process exit. What happens next depends on the Android version. There are two cases.
On Android 6 to 12, the code breaks out of the spin and returns nullptr, so the caller receives a null pointer instead of the thread it expected to have suspended.
On Android 12.1 to 14, it does not return nullptr.
Case A: the spin check ends and nullptr is returned.
Case B: the spin is unaffected and continues until the target thread is suspended.
So, for now, we need to consider two things.
First, does a failed suspension cause setName() or VMStack.getThreadStackTrace() to return an empty object to Java? In the Kotlin coroutine code, the thread name only matters for debugging (seeing which thread the current coroutine is attached to), so there is no functional impact for now.
Second, does it cause an ANR if the caller keeps spinning while waiting for the suspension?
Both setName() and VMStack.getThreadStackTrace() first check whether the caller is operating on itself; if so, no suspension spin is triggered. The spin happens only when one thread operates on another. That leaves the case where the main thread renames a child thread or calls VMStack.getThreadStackTrace() on it: if the wait is too long, an ANR is possible, but that is still better than a crash. Since the number of these suspension timeouts is small, the probability of hitting either problem is low.
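The difference between the two cases, and where the ANR risk comes from, can be put into numbers with a small model. This is a sketch under my own assumptions, not ART code: TimeoutPolicy, SpinOutcome and SpinAfterMaskedTimeout are hypothetical names, and time is counted in abstract ticks.

```cpp
#include <cassert>

// What the spin loop does once the (now masked) timeout is reached.
enum class TimeoutPolicy {
  kReturnNull,   // Case A: leave the loop and hand nullptr to the caller
  kKeepSpinning  // Case B: carry on until the target really suspends
};

struct SpinOutcome {
  bool suspended;    // true if the target was eventually suspended
  int ticks_waited;  // how long the calling thread was blocked
};

// ticks_to_checkpoint: how long the target needs to reach a checkpoint.
SpinOutcome SpinAfterMaskedTimeout(int ticks_to_checkpoint, int timeout_ticks,
                                   TimeoutPolicy policy) {
  assert(ticks_to_checkpoint >= 0 && timeout_ticks > 0);
  for (int tick = 0;; ++tick) {
    if (tick >= ticks_to_checkpoint) {
      return {true, tick};   // target reached a checkpoint and suspended
    }
    if (tick >= timeout_ticks && policy == TimeoutPolicy::kReturnNull) {
      return {false, tick};  // give up; the caller must handle nullptr
    }
  }
}
```

With a target that needs 10 ticks and a timeout of 4, Case A blocks the caller for 4 ticks and fails, while Case B blocks it for the full 10 ticks and succeeds. If the caller is the main thread and the target is far from a checkpoint, Case B is where the wait could grow into an ANR. In Case A the caller has to tolerate a null result, as Thread_setNativeName() already does with its thread != nullptr check.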
The self-test process above currently shows no problems, but the solution is still in the testing phase. I am sharing it early so that everyone can think about its feasibility together. If you have a better solution, or spot any problem with the above, please let me know. Thank you very much.
While the current solution reduces the native crashes caused by thread suspension timeouts, we still need to look further into our thread and coroutine management strategy (I am not sure whether we are simply using them incorrectly >_<) in order to solve the problem once and for all and improve the stability and performance of the system.
Through this analysis we have addressed a long-standing problem and deepened our understanding of Android’s underlying thread management mechanism, which will help us deal with similar problems in the future.