Cooperative pool deadlock when calling into an opaque subsystem

saagarjha · March 16, 2024, 12:43pm

This is baffling to me. I am more than happy to file feedback if @ole won't, but this cannot be resolved that way! We as developers cannot go "oh I discovered the hard way you use a semaphore in this API, can you please not do that?" There are thousands of places inside of Apple's frameworks where this is the case. This is not something third party developers can fix, and I will bet money that Apple cannot fix it either.

As far as I can tell, Vision submits work to GCD as it should. It then decides to wait on a semaphore for those to complete, which isn't great, but isn't wrong. The concurrent pool has no choice but to support this pattern! There is far too much code out there that relies on this. In fact, my understanding was that it is supposed to work. Why doesn't a new thread get spawned in this case? It seems necessary to interact with anything that tries to block…?

Nobody1707 · March 16, 2024, 1:38pm

As a general rule, nothing is suppose to block the concurrency thread pool. It was not designed for this to work, and the core team has been rather clear about that. The concurrency thread pool was specifically designed to have only as many threads as CPU cores. If you need to block for any significant amount of time, you need to do it in GCD or on a thread you spawned yourself. You might be able to make a custom actor executor to handle that, I'm not sure.

wadetregaskis · March 16, 2024, 2:32pm

What would help here - Apple and 3rd party developers alike - is updates to certain concurrency primitives, like the blocking 'semaphore' that this Vision framework is using, to alert developers to the problem. They need to be instrumented with something like:

func wait() {
    withUnsafeCurrentTask {
        assert(nil == $0, "NSSemaphore cannot be used in Swift Concurrency, as it may cause deadlock.  Use https://github.com/groue/Semaphore instead.")
    }

    …
}

The trick is that you can't catch every type - NSLock, for example, does need to work inside async contexts currently because a lot of async mutual exclusion primitives (like AsyncSemaphore) are actually built atop it. But I would think that for higher-level things, like NSCondition etc, you can get away with the assertion.

saagarjha · March 16, 2024, 4:55pm

This is a non-starter. It’s completely infeasible: there is no way to know what the code you call into is doing. It’s one thing to go “oh you can’t introduce a dependency in your own code across two different tasks” but quite another to say that it a thread in the concurrent pool can never be blocked.

That’s fine. It shouldn’t stop a different queue from scheduling in a new thread, though?

hborla · March 16, 2024, 5:13pm

No need; I filed the feedback when I read the post a few days ago.

This use pattern with semaphores and GCD is a long established anti-pattern. The Clang static analyzer has been detecting and diagnosing this pattern since Xcode 10, because this pattern has always caused performance problems. Swift concurrency aside, I think occurrences of this should always be considered bugs.

The suggested workaround in this discussion thread to invoke the API from an overcommit queue is the right way to work around these bugs in APIs you don't own (e.g. the continuation + DispatchQueue.global() dance). This works because using the continuation will suspend the task, so the current thread is freed up to perform other work, and dispatch then has the option to create another thread to unblock the work if necessary.

This pattern should absolutely be documented, and perhaps an API should go in the Concurrency library so that there's a standard way to write the workaround. I also agree that runtime diagnostics to surface when this has happened would be extremely valuable.

gwendal.roue · March 17, 2024, 1:30pm

@Joe_Groff, @hborla, is there a way for an API that can block, and can't easily be refactored into a less aggressive technique, to check if it is invoked from a thread of the cooperative thread pool? This would make it possible for the library to log a warning, or emit a runtime warning, or even raise a precondition failure, so that the application can be warned of the misuse?

Surely the library should be refactored, but this can take some time, or some language tools may be missing, so it would be cool to be able to at least warn the user.

We have language tools that prevent directly calling sync methods from async contexts, such as async overloads, and SE 0340 - Unavailable From Async, but those can't detect nested calls to apis that don't play well with the cooperative pool:

func asyncFunc() async {
  mySyncWrapper()
}

func mySyncWrapper() {
  // Compiler can't spot this.
  // It would be nice if `dragons` implementor
  // could warn at runtime.
  dragons() 
}

(Note that synchronous calls are not always bad. And I'm not 100% sure that all calls from a thread from the cooperative pool, and only them, should be detected. Maybe it is generally "called from a Task" that is the problem. I'm more trying to ask if libraries could help addressing the topic discussed in this thread.)

EDIT

Having the compiler put in the program the runtime information that could answer the question looks like something that is not needed by the Swift concurrency runtime, thanks to its heavy reliance on static information. I would not be surprised if this information was not currently available... After all I don't know, and that's why I'm asking.

wadetregaskis · March 17, 2024, 2:38pm

Does what I posted earlier not suffice?

func wait() {
    withUnsafeCurrentTask {
        assert(nil == $0, "NSSemaphore cannot be used in Swift Concurrency, as it may cause deadlock.  Use https://github.com/groue/Semaphore instead.")
    }

    …
}

ktoso · March 17, 2024, 2:42pm

No that’s not enough because having a Task doesn’t specifically tell you where it runs. If may be running on a pool dedicated to running blocking operations.

With task executors we made sure you can check if it is a specific one through unsafe calls… but there’s no great story in general for detecting “is this the default pool”. Maybe this is one of the things to try to allow testing for… (for diagnostics primarily)

I’ve been following the thread and thinking about it but the bottom line is that there’s no magical solution here. But yes there can be a series of improvements but I’m still organizing my ideas around this.

tgoyne · March 18, 2024, 12:10am

It's probably not practical to add at this point, but it does seem like noasync should be viral and functions which call noasync functions should be required to either be themselves noasync or make an explicitly unsafe call.

layoutSubviews · March 18, 2024, 12:47am

I feel like a fact missed here is that the Vision framework should never have provided a synchronous API for something which takes a significant amount of time to complete, and is actually asynchronous under the covers.
In other words, the API should have been based around a completion handler from the start. But it's understandable why it didn't: in the pre-async world, introducing asynchronicity placed a significant burden on the consumer of the API, and it was common practice to privilege call convenience in order to make the framework easier to use with the tools at our disposal -- Nothing wrong with that.

So what's my point? In parallel of making Swift more robust against these situations, some effort should also be made to "correct" these frameworks. Any interface that takes a significant amount of time to complete and/or is waiting on an asynchronous result should ideally provide an async variant. Over time, the amount of APIs which unexpectedly block under the covers will decrease, and the overall problem should decrease in scope.
This obviously will not fix this issue now, or even in short-term. But the overall Swift ecosystem would benefit from every framework developer, whether inside Apple or outside, to rely more & more on this amazing tool that is Structured Concurrency, and to let go of the (now harmful) patterns of the past.