Faster overlapping copies

Eliminates the bounds check on every byte copied in the overlapping-copy loop.

Benchmarks were measured on AMD64, but built with `-tags=noasm`:

```
>benchstat old.txt new.txt
name        old time/op    new time/op    delta
_UFlat0-8      194µs ± 3%     150µs ± 2%  -22.59%  (p=0.000 n=10+10)
_UFlat1-8     1.62ms ± 1%    1.41ms ± 2%  -12.70%   (p=0.000 n=9+10)
_UFlat2-8     8.91µs ± 4%    8.76µs ± 2%     ~     (p=0.343 n=10+10)
_UFlat3-8      222ns ± 2%     224ns ± 1%   +1.00%   (p=0.028 n=10+9)
_UFlat4-8     28.4µs ± 2%    20.3µs ± 3%  -28.45%  (p=0.000 n=10+10)
_UFlat5-8      797µs ± 5%     603µs ± 2%  -24.34%   (p=0.000 n=10+9)
_UFlat6-8      565µs ± 1%     531µs ± 2%   -6.16%    (p=0.000 n=8+9)
_UFlat7-8      494µs ± 4%     457µs ± 2%   -7.61%  (p=0.000 n=10+10)
_UFlat8-8     1.55ms ± 4%    1.40ms ± 2%   -9.48%   (p=0.000 n=10+9)
_UFlat9-8     1.93ms ± 1%    1.83ms ± 2%   -5.44%   (p=0.000 n=10+9)
_UFlat10-8     186µs ± 2%     138µs ± 5%  -26.04%  (p=0.000 n=10+10)
_UFlat11-8     524µs ± 2%     478µs ± 3%   -8.68%  (p=0.000 n=10+10)

name        old speed      new speed      delta
_UFlat0-8    528MB/s ± 3%   682MB/s ± 2%  +29.18%  (p=0.000 n=10+10)
_UFlat1-8    434MB/s ± 1%   497MB/s ± 2%  +14.56%   (p=0.000 n=9+10)
_UFlat2-8   13.8GB/s ± 4%  14.1GB/s ± 2%     ~     (p=0.353 n=10+10)
_UFlat3-8    901MB/s ± 1%   890MB/s ± 1%   -1.18%    (p=0.008 n=9+9)
_UFlat4-8   3.60GB/s ± 2%  5.03GB/s ± 3%  +39.76%  (p=0.000 n=10+10)
_UFlat5-8    514MB/s ± 5%   679MB/s ± 2%  +32.04%   (p=0.000 n=10+9)
_UFlat6-8    269MB/s ± 1%   287MB/s ± 2%   +6.57%    (p=0.000 n=8+9)
_UFlat7-8    253MB/s ± 4%   274MB/s ± 2%   +8.23%  (p=0.000 n=10+10)
_UFlat8-8    276MB/s ± 4%   305MB/s ± 2%  +10.43%   (p=0.000 n=10+9)
_UFlat9-8    249MB/s ± 1%   263MB/s ± 2%   +5.76%   (p=0.000 n=10+9)
_UFlat10-8   637MB/s ± 2%   862MB/s ± 5%  +35.25%  (p=0.000 n=10+10)
_UFlat11-8   352MB/s ± 2%   385MB/s ± 3%   +9.51%  (p=0.000 n=10+10)
```
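
The bounds-check claim can be illustrated with a minimal sketch. The two function names below are hypothetical and exist only for this illustration; their bodies mirror the loop shapes before and after this change. The assumption is that the compiler can drop per-iteration checks in the second form because `a` and `b` are re-sliced to the same length before the loop, so `i` ranging over `a` is always in range for `b`.

```go
package main

import "fmt"

// forwardCopyIndexed (hypothetical name) mirrors the loop before the change:
// each iteration indexes dst twice with computed positions, so each access
// may carry its own bounds check.
func forwardCopyIndexed(dst []byte, d, offset, length int) int {
	for end := d + length; d != end; d++ {
		dst[d] = dst[d-offset]
	}
	return d
}

// forwardCopySliced (hypothetical name) mirrors the loop after the change:
// the slicing expressions are checked once up front, and b is cut to len(a),
// so indexing both with i stays within range by construction.
func forwardCopySliced(dst []byte, d, offset, length int) int {
	a := dst[d : d+length]
	b := dst[d-offset:]
	b = b[:len(a)]
	for i := range a {
		a[i] = b[i]
	}
	return d + length
}

func main() {
	x := make([]byte, 10)
	copy(x, "abc")
	y := append([]byte(nil), x...)

	// Copy 7 bytes starting at position 3 from 3 bytes back: the source
	// and destination regions overlap.
	dx := forwardCopyIndexed(x, 3, 3, 7)
	dy := forwardCopySliced(y, 3, 3, 7)

	fmt.Println(string(x), dx) // abcabcabca 10
	fmt.Println(string(y), dy) // abcabcabca 10
}
```

Both forms produce the same bytes; only the generated checks differ. One way to inspect which checks remain should be to build with `-gcflags=-d=ssa/check_bce/debug=1`, which asks the Go compiler to report the bounds checks it could not eliminate.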
Klaus Post 6 years ago
parent
commit
f6ad6c8bb8
1 changed file with 9 additions and 2 deletions

+ 9 - 2
decode_other.go

@@ -90,9 +90,16 @@ func decode(dst, src []byte) int {
 		// forwards, even if the slices overlap. Conceptually, this is:
 		//
 		// d += forwardCopy(dst[d:d+length], dst[d-offset:])
-		for end := d + length; d != end; d++ {
-			dst[d] = dst[d-offset]
+		//
+		// We align the slices into a and b and show the compiler they are the same size.
+		// This allows the loop to run without bounds checks.
+		a := dst[d : d+length]
+		b := dst[d-offset:]
+		b = b[:len(a)]
+		for i := range a {
+			a[i] = b[i]
 		}
+		d += length
 	}
 	if d != len(dst) {
 		return decodeErrCodeCorrupt
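
The comment in the hunk above describes the loop as a `forwardCopy` that always runs forwards, even when the slices overlap. The sketch below shows why the built-in `copy` is not a drop-in replacement here. `forwardCopy` is written out as a hypothetical helper for illustration (the diff only names it conceptually), and the buffer contents are made up.

```go
package main

import "fmt"

// forwardCopy is a hypothetical helper matching the conceptual description
// in the diff comment: it copies byte by byte from low to high index, so
// bytes written earlier in the call can be read again later in the same call.
func forwardCopy(dst, src []byte) int {
	n := len(dst)
	if len(src) < n {
		n = len(src)
	}
	for i := 0; i < n; i++ {
		dst[i] = src[i]
	}
	return n
}

func main() {
	// d = 2, offset = 2, length = 6: repeat the 2-byte pattern "ab".
	fwd := []byte("ab______")
	forwardCopy(fwd[2:8], fwd[0:])
	fmt.Println(string(fwd)) // abababab

	// The built-in copy handles overlap as if the source were read first,
	// so the pattern is not repeated.
	builtin := []byte("ab______")
	copy(builtin[2:8], builtin[0:])
	fmt.Println(string(builtin)) // abab____
}
```

The pattern-repeating behaviour is what a copy with an offset smaller than its length relies on, which is why the change keeps an explicit forward loop and only restructures how the slices are indexed.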