Threads are considered heavyweight precisely because they are implemented in the kernel: every context switch from one thread to another goes through the kernel. This is why some modern languages (e.g. Go, with its goroutines) create co-routines in user space and have the run-time system schedule them onto OS threads.
This hybrid setup allows further simplifications. For example, co-routines can use cooperative multitasking, yielding to the scheduler only when they encounter a blocking operation (as defined by the language). They still make use of multiple cores as long as the internal scheduler runs its co-routines on multiple threads, and the kernel is spared from scheduling tens of thousands of threads and keeping track of which ones are blocked.