Performance of ThreadLocal variable

https://stackoverflow.com/questions/609826

03-07-2019

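Question

How much slower is reading from a ThreadLocal variable than reading from a regular field?

More concretely, is simple object creation faster or slower than access to a ThreadLocal variable?

I assume it is fast enough that keeping a ThreadLocal<MessageDigest> instance is much faster than creating a MessageDigest instance every time. But does that also apply to byte[10] or byte[1000], for example?

Edit: The question is, what really happens when ThreadLocal's get is called? If that is just a field, like any other, then the answer would be "it's always fastest", right?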

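Solution

Running unpublished benchmarks, ThreadLocal.get takes around 35 cycles per iteration on my machine. Not a great deal. In Sun's implementation, a custom linear-probing hash map in Thread maps ThreadLocals to values. Because it is only ever accessed by a single thread, it can be very fast.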
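To make the "linear-probing hash map owned by one thread" idea concrete, here is a minimal sketch of open addressing with linear probing. It is a teaching illustration only: the class name `LinearProbeMap` is made up, and the code is not the JDK's actual `ThreadLocalMap` (which, among other things, holds its keys through weak references and resizes itself).

```java
// Hypothetical, simplified open-addressing map with linear probing.
// NOT the JDK's ThreadLocalMap: no weak references, no resizing, no removal.
final class LinearProbeMap {
    private final Object[] keys;
    private final Object[] values;
    private final int mask;

    LinearProbeMap(int capacity) {
        if (Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new Object[capacity];
        values = new Object[capacity];
        mask = capacity - 1;
    }

    // No locking anywhere: the map is assumed to be owned by a single thread.
    void put(Object key, Object value) {
        int i = System.identityHashCode(key) & mask;
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == null || keys[i] == key) {
                keys[i] = key;
                values[i] = value;
                return;
            }
            i = (i + 1) & mask; // slot taken by another key: try the next one
        }
        throw new IllegalStateException("map is full");
    }

    Object get(Object key) {
        int i = System.identityHashCode(key) & mask;
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == key) return values[i];
            if (keys[i] == null) return null; // empty slot ends the probe sequence
            i = (i + 1) & mask;
        }
        return null;
    }

    public static void main(String[] args) {
        LinearProbeMap map = new LinearProbeMap(16);
        Object a = new Object();
        Object b = new Object();
        map.put(a, "first");
        map.put(b, "second");
        System.out.println(map.get(a) + " " + map.get(b));
    }
}
```

A lookup is a hash, a mask, and usually one or two array reads, which fits the "tens of cycles" figure quoted above.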

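Allocating a small object takes a similar number of cycles, although because of cache exhaustion you may get somewhat lower figures in a tight loop.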

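Constructing a MessageDigest is likely to be relatively expensive. It has a fair amount of state, and construction goes through the Provider SPI mechanism. You may be able to optimise by, for instance, cloning an existing instance or supplying the Provider explicitly.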
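The two construction shortcuts mentioned here can be sketched as follows. This is an illustration under assumptions: the class and method names are hypothetical, SHA-256 is just an example algorithm, and whether either route is actually faster than a plain `MessageDigest.getInstance(algorithm)` on a given JVM has to be measured.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.Provider;
import java.util.Arrays;

final class DigestConstruction {
    // Ordinary lookup through the provider mechanism.
    static MessageDigest lookup(String algorithm) {
        try {
            return MessageDigest.getInstance(algorithm);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Option 1: keep an untouched prototype and clone it for each use.
    // clone() copies the digest state, so the prototype must never be updated.
    static MessageDigest viaClone(MessageDigest prototype) {
        try {
            return (MessageDigest) prototype.clone();
        } catch (CloneNotSupportedException e) {
            throw new IllegalStateException("implementation is not cloneable", e);
        }
    }

    // Option 2: name the Provider explicitly instead of searching all providers.
    static MessageDigest viaProvider(String algorithm, Provider provider) {
        try {
            return MessageDigest.getInstance(algorithm, provider);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        MessageDigest prototype = lookup("SHA-256");
        byte[] cloned = viaClone(prototype).digest(data);
        byte[] provided = viaProvider("SHA-256", prototype.getProvider()).digest(data);
        System.out.println("same digest: " + Arrays.equals(cloned, provided));
    }
}
```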

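Just because caching in a ThreadLocal may be faster than creating a new instance does not necessarily mean that overall system performance will improve. You will have additional GC-related overhead, which slows everything down.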

Unless your application uses MessageDigest very heavily, you might want to consider a conventional thread-safe cache instead.
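The answer does not say what a "conventional thread-safe cache" would look like. One possible reading, offered here purely as an assumption, is a small shared pool of idle instances. The class name `DigestPool` and the choice of `ConcurrentLinkedQueue` are illustrative, and the pool is unbounded for brevity.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical shared pool: threads borrow an idle MessageDigest and return it.
final class DigestPool {
    private final ConcurrentLinkedQueue<MessageDigest> idle = new ConcurrentLinkedQueue<>();

    private static MessageDigest create() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    byte[] sha256(byte[] input) {
        MessageDigest md = idle.poll();   // borrow an idle instance, if any
        if (md == null) md = create();    // otherwise pay the construction cost
        try {
            return md.digest(input);
        } finally {
            md.reset();                   // make sure no partial state survives
            idle.offer(md);               // hand the instance back to the pool
        }
    }

    int idleCount() {
        return idle.size();
    }

    public static void main(String[] args) {
        DigestPool pool = new DigestPool();
        byte[] hash = pool.sha256("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(hash.length + " bytes, idle instances: " + pool.idleCount());
    }
}
```

Unlike a ThreadLocal, the pooled instances are not tied to the lifetime of any particular thread, at the cost of some contention on the shared queue.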

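Other tips

In 2009, some JVMs implemented ThreadLocal using an unsynchronised HashMap in the Thread.currentThread() object. This made it extremely fast (though not nearly as fast as a regular field access, of course), and it also ensured that the ThreadLocal object got tidied up when the Thread died. Updating this answer in 2016, it seems most (all?) newer JVMs use a ThreadLocalMap with linear probing. I am uncertain about the performance of those, but I cannot imagine it is significantly worse than the earlier implementation.

Of course, new Object() is also very fast these days, and the garbage collectors are very good at reclaiming short-lived objects.

Unless you are certain that object creation is going to be expensive, or you need to persist some state on a thread-by-thread basis, you are better off with the simpler allocate-when-needed solution, switching to a ThreadLocal implementation only when a profiler tells you that you need to.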
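For comparison, the two approaches being weighed might look like this side by side. It is a minimal sketch: the class and method names are hypothetical, SHA-256 stands in for whatever digest the application needs, and `ThreadLocal.withInitial` requires Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class AllocateOrCache {
    // One digest per thread, created lazily on that thread's first get().
    private static final ThreadLocal<MessageDigest> CACHED =
            ThreadLocal.withInitial(AllocateOrCache::newDigest);

    static MessageDigest newDigest() {
        try {
            return MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Simple version: allocate when needed and let the GC reclaim it.
    static byte[] hashFresh(byte[] input) {
        return newDigest().digest(input);
    }

    // ThreadLocal version: reuse this thread's instance.
    static byte[] hashCached(byte[] input) {
        MessageDigest md = CACHED.get();
        md.reset(); // defensive: discard any state left by an earlier caller
        return md.digest(input);
    }

    public static void main(String[] args) {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("same result: " + Arrays.equals(hashFresh(data), hashCached(data)));
    }
}
```

Both methods return the same bytes, so the simple version can be swapped for the cached one later if a profiler points at digest construction.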

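Good question, I've been asking myself the same thing recently. To give you definite numbers, here are the benchmarks (in Scala, which compiles to virtually the same bytecode as the equivalent Java code):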

var cnt: String = ""
val tlocal = new java.lang.ThreadLocal[String] {
  override def initialValue = ""
}

def loop_heap_write = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (cnt ne "") cnt = "!"
    i += 1
  }
  cnt
}

def threadlocal = {
  var i = 0
  val until = totalwork / threadnum
  while (i < until) {
    if (tlocal.get eq null) i = until + i + 1
    i += 1
  }
  if (i > until) println("thread local value was null " + i)
}

They were run on an AMD machine with 4x 2.8 GHz dual-core processors and on a quad-core i7 with hyperthreading (2.67 GHz).

These are the numbers:

i7

Specs: Intel i7 2x quad-core @ 2.67 GHz Test: scala.threads.ParallelTests

Test name: loop_heap_read

Thread num.: 1 Total tests: 200

Run times: (showing last 5) 9.0069 9.0036 9.0017 9.0084 9.0074 (avg = 9.1034 min = 8.9986 max = 21.0306)

Thread num.: 2 Total tests: 200

Run times: (showing last 5) 4.5563 4.7128 4.5663 4.5617 4.5724 (avg = 4.6337 min = 4.5509 max = 13.9476)

Thread num.: 4 Total tests: 200

Run times: (showing last 5) 2.3946 2.3979 2.3934 2.3937 2.3964 (avg = 2.5113 min = 2.3884 max = 13.5496)

Thread num.: 8 Total tests: 200

Run times: (showing last 5) 2.4479 2.4362 2.4323 2.4472 2.4383 (avg = 2.5562 min = 2.4166 max = 10.3726)

Test name: threadlocal

Thread num.: 1 Total tests: 200

Run times: (showing last 5) 91.1741 90.8978 90.6181 90.6200 90.6113 (avg = 91.0291 min = 90.6000 max = 129.7501)

Thread num.: 2 Total tests: 200

Run times: (showing last 5) 45.3838 45.3858 45.6676 45.3772 45.3839 (avg = 46.0555 min = 45.3726 max = 90.7108)

Thread num.: 4 Total tests: 200

Run times: (showing last 5) 22.8118 22.8135 59.1753 22.8229 22.8172 (avg = 23.9752 min = 22.7951 max = 59.1753)

Thread num.: 8 Total tests: 200

Run times: (showing last 5) 22.2965 22.2415 22.3438 22.3109 22.4460 (avg = 23.2676 min = 22.2346 max = 50.3583)

AMD

Specs: AMD 8220 4x dual-core @ 2.8 GHz Test: scala.threads.ParallelTests

Test name: loop_heap_read

Total work: 20000000 Thread num.: 1 Total tests: 200

Run times: (showing last 5) 12.625 12.631 12.634 12.632 12.628 (avg = 12.7333 min = 12.619 max = 26.698)

Test name: loop_heap_read Total work: 20000000

Run times: (showing last 5) 6.412 6.424 6.408 6.397 6.43 (avg = 6.5367 min = 6.393 max = 19.716)

Thread num.: 4 Total tests: 200

Run times: (showing last 5) 3.385 4.298 9.7 6.535 3.385 (avg = 5.6079 min = 3.354 max = 21.603)

Thread num.: 8 Total tests: 200

Run times: (showing last 5) 5.389 5.795 10.818 3.823 3.824 (avg = 5.5810 min = 2.405 max = 19.755)

Test name: threadlocal

Thread num.: 1 Total tests: 200

Run times: (showing last 5) 200.217 207.335 200.241 207.342 200.23 (avg = 202.2424 min = 200.184 max = 245.369)

Thread num.: 2 Total tests: 200

Run times: (showing last 5) 100.208 100.199 100.211 103.781 100.215 (avg = 102.2238 min = 100.192 max = 129.505)

Thread num.: 4 Total tests: 200

Run times: (showing last 5) 62.101 67.629 62.087 52.021 55.766 (avg = 65.6361 min = 50.282 max = 167.433)

Thread num.: 8 Total tests: 200

Run times: (showing last 5) 40.672 74.301 34.434 41.549 28.119 (avg = 54.7701 min = 28.119 max = 94.424)

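Summary

A thread-local read takes around 10-20x as long as a heap read. It also seems to scale well with the number of processors on this JVM implementation and these architectures.

Here is another test. The results show that ThreadLocal is a bit slower than a regular field, but in the same order of magnitude: approximately 12% slower.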

import java.util.HashMap;
import java.util.Map;

public class Test {
    private static final int N = 100000000;
    private static int fieldExecTime = 0;
    private static int threadLocalExecTime = 0;

    public static void main(String[] args) throws InterruptedException {
        int execs = 10;
        // Alternate the two variants so both see similar JIT warm-up and GC conditions.
        for (int i = 0; i < execs; i++) {
            new FieldExample().run(i);
            new ThreadLocaldExample().run(i);
        }
        System.out.println("Field avg:"+(fieldExecTime / execs));
        System.out.println("ThreadLocal avg:"+(threadLocalExecTime / execs));
    }

    // Baseline: the map lives in a regular instance field.
    private static class FieldExample {
        private Map<String,String> map = new HashMap<String, String>();

        public void run(int z) {
            System.out.println(z+"-Running  field sample");
            long start = System.currentTimeMillis();
            // Each iteration also pays for Integer.toString and two HashMap operations.
            for (int i = 0; i < N; i++){
                String s = Integer.toString(i);
                map.put(s,"a");
                map.remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            fieldExecTime += t;
            System.out.println(z+"-End field sample:"+t);
        }
    }

    // Same workload, but the map is fetched through ThreadLocal.get() twice per iteration.
    private static class ThreadLocaldExample{
        private ThreadLocal<Map<String,String>> myThreadLocal = new ThreadLocal<Map<String,String>>() {
            @Override protected Map<String, String> initialValue() {
                return new HashMap<String, String>();
            }
        };

        public void run(int z) {
            System.out.println(z+"-Running thread local sample");
            long start = System.currentTimeMillis();
            for (int i = 0; i < N; i++){
                String s = Integer.toString(i);
                myThreadLocal.get().put(s, "a");
                myThreadLocal.get().remove(s);
            }
            long end = System.currentTimeMillis();
            long t = (end - start);
            threadLocalExecTime += t;
            System.out.println(z+"-End thread local sample:"+t);
        }
    }
}

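Output:

0-Running  field sample
0-End field sample:6044
0-Running thread local sample
0-End thread local sample:6015
1-Running  field sample
1-End field sample:5095
1-Running thread local sample
1-End thread local sample:5720
2-Running  field sample
2-End field sample:4842
2-Running thread local sample
2-End thread local sample:5835
3-Running  field sample
3-End field sample:4674
3-Running thread local sample
3-End thread local sample:5287
4-Running  field sample
4-End field sample:4849
4-Running thread local sample
4-End thread local sample:5309
5-Running  field sample
5-End field sample:4781
5-Running thread local sample
5-End thread local sample:5330
6-Running  field sample
6-End field sample:5294
6-Running thread local sample
6-End thread local sample:5511
7-Running  field sample
7-End field sample:5119
7-Running thread local sample
7-End thread local sample:5793
8-Running  field sample
8-End field sample:4977
8-Running thread local sample
8-End thread local sample:6374
9-Running  field sample
9-End field sample:4841
9-Running thread local sample
9-End thread local sample:5471
Field avg:5051
ThreadLocal avg:5664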

ENV:

openjdk version "1.8.0_131"

Intel® Core™ i7-7500U CPU @ 2.70GHz × 4

Ubuntu 16.04 LTS

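@Pete is correct: test before you optimise.

I would be very surprised if constructing a MessageDigest had any serious overhead compared with actually using it.

Misusing ThreadLocal can be a source of leaks and dangling references that have no clear life cycle. I generally never use ThreadLocal without a very clear plan for when a particular resource will be removed.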
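One common way to give a thread-local value a clear life cycle is to pair every `set` with a `remove` in a `finally` block, which matters most when threads are pooled and reused. The sketch below assumes a hypothetical per-request value (`CURRENT_USER`); the names are illustrative only.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalCleanup {
    private static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    // Set the value for the duration of one unit of work, then always remove it.
    static String handle(String user) {
        CURRENT_USER.set(user);
        try {
            return "handled " + CURRENT_USER.get();
        } finally {
            CURRENT_USER.remove(); // without this, the value outlives the request
        }
    }

    // What the current thread sees when no request is in progress.
    static String leftover() {
        return CURRENT_USER.get();
    }

    public static void main(String[] args) throws Exception {
        // A single pooled thread runs both tasks, so a leak would be visible.
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> request = () -> handle("alice");
            Callable<String> check = () -> String.valueOf(leftover());
            System.out.println(pool.submit(request).get());
            System.out.println("left behind: " + pool.submit(check).get());
        } finally {
            pool.shutdown();
        }
    }
}
```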

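Build it and measure it.

Also, you only need one ThreadLocal if you encapsulate your message-digesting behaviour in an object. If you need a local MessageDigest and a local byte[1000] for some purpose, create an object with a MessageDigest field and a byte[] field, and put that object into the ThreadLocal rather than storing the two individually.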
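A minimal sketch of that suggestion, with hypothetical names (`DigestContext`, `current()`) and SHA-256 assumed as the algorithm:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// One holder object per thread instead of two separate ThreadLocals.
final class DigestContext {
    private static final ThreadLocal<DigestContext> LOCAL =
            ThreadLocal.withInitial(DigestContext::new);

    final MessageDigest digest;
    final byte[] buffer = new byte[1000];

    private DigestContext() {
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // A single ThreadLocal lookup yields both the digest and the scratch buffer.
    static DigestContext current() {
        return LOCAL.get();
    }

    public static void main(String[] args) {
        DigestContext ctx = DigestContext.current();
        ctx.digest.update(ctx.buffer, 0, 10); // hash the first 10 bytes of the buffer
        System.out.println(ctx.digest.digest().length + " byte digest");
    }
}
```

Callers pay for one `get()` per operation rather than one per resource, and there is only one value to clean up per thread.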

Licensed under: CC-BY-SA with attribution
