Matmul tests don't seem to be representative (Julia vs Python experiments) #312
Comments
Hi, these are very interesting results; you should post this as an article somewhere (on Reddit, for example). I think a multithreaded comparison is out of scope for this project, and the multithreaded versions were added just as additional comparisons.
You've raised a lot of questions here, but I'm not sure we want to go deeper, because there is confusion about the benchmarks. After we resolve it, we can discuss the matters in detail. We have the reference implementations there (without any library usage); those are the real benchmarks of the languages (and you can easily see that Julia, even without BLAS, is much faster than Python). However, in real life we use libraries, which is why they're included, but only as samples. And a proper comparison in that regard would cover not only Python+NumPy vs Julia+OpenBLAS with different inputs and thread counts, but also:
As you can see, one would need a special benchmark just for linear algebra comparisons. However, I think you're wondering why Julia+OpenBLAS was slower than Python+NumPy. First, let me note that the number of threads doesn't matter (I've tested both examples with
Python uses the special NumPy API, while Julia doesn't use anything special; the matrices are created as plain arrays. This means Julia has overhead for communicating with OpenBLAS, as it needs to transform Julia's arrays into OpenBLAS input data. If you use the proper API, I think Julia will have better numbers (that's my educated guess, though, and if you're willing to look into that, we'd gladly accept a PR with the optimized
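The conversion-overhead point above is easy to illustrate on the Python side (a sketch, not code from the benchmark suite; the sizes and names are illustrative): handing non-native data to an optimized backend costs a conversion on every call, while native arrays go straight to BLAS.

```python
import time
import numpy as np

n = 300
# Plain nested lists, as a "generic" implementation might build them.
a_lists = [[float(i * j) for j in range(n)] for i in range(n)]
b_lists = [[float(i + j) for j in range(n)] for i in range(n)]

t = time.time()
# Converting on every call mimics the overhead of transforming
# non-native arrays into the BLAS backend's input format.
c1 = np.dot(np.array(a_lists), np.array(b_lists))
with_conversion = time.time() - t

a = np.array(a_lists)
b = np.array(b_lists)
t = time.time()
c2 = np.dot(a, b)  # native ndarrays are passed to BLAS directly
native = time.time() - t
```

The two products are identical; only the conversion cost differs, which is the kind of overhead the comment attributes to the Julia-to-OpenBLAS path.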
@jtrakk You should check the code before commenting; it's already using a warm start:

```julia
left = calc(101)
right = -18.67
if abs(left - right) > 0.1
    println(stderr, "$(left) != $(right)")
    exit(1)
end
notify("Julia (no BLAS)\t$(getpid())")
t = time()
results = calc(n)
elapsed = time() - t
notify("stop")
```
Not exactly part of the discussion, but closely related to the topic. I've just now spotted that, unlike many other languages (Rust, V, Nim, etc.) which do bounds checking, the Julia solution has it turned off. Any reasons bounds checking is disabled in Julia? (Other languages also support directives and means to suppress bounds checking, but these are not considered "standard practice" in their ecosystems, even when doing scientific calculations. I'd also guess the Julia ecosystem doesn't encourage turning off bounds checking.)
I think bounds checking should be enabled for all languages, as it would likely be enabled in production. But instead we should just use language constructs that avoid the checks, like iterators.
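The iterator suggestion above can be sketched in Python (purely illustrative, not code from the benchmark suite): iterating directly, or via `zip`, sidesteps the explicit index arithmetic that otherwise implies a bounds check on every access.

```python
def dot_indexed(row_a, col_b):
    # Indexed access: every row_a[k] / col_b[k] lookup goes through
    # a bounds-checked subscript operation.
    s = 0.0
    for k in range(len(row_a)):
        s += row_a[k] * col_b[k]
    return s

def dot_zipped(row_a, col_b):
    # Iterator-based access: zip walks both sequences in lockstep
    # without any explicit indexing.
    s = 0.0
    for x, y in zip(row_a, col_b):
        s += x * y
    return s
```

Both compute the same dot product; the iterator form is the kind of construct that lets a compiler elide per-element checks without an unsafe annotation.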
There is a reason why we don't have consistency here: the "cheating" underlying libraries for some languages. We may have safe bounds-checked and/or iterator-based code, but it will never beat NumPy, OpenBLAS, etc., as they contain quite unsafe and aggressively optimized C and Fortran code. Therefore, there is no way to have a fully fair comparison here, so I allowed skipping bounds checking where possible. If we introduce it, then we have to rewrite the existing code (for example, use C++'s
This is IMHO totally fine and understandable. I was just surprised by the explicit disallowance of bounds checking in Julia (an average developer, myself included, wouldn't turn off bounds checking in this case IMHO), nothing more, nothing less 😉.
The following tests were performed with:

- Python 3.8.2 + numpy 1.20.1
- Julia 1.5.3
- Windows 10
- Ryzen 7 1700
I am currently trying to figure out which language to use for my project, which will include a lot of smaller matrix multiplications. I looked at Rust, Julia, and Python. According to the Matmul benchmark here, Julia-native is twice as slow as Python-numpy, with very little variance. That seemed kind of odd, since Julia was designed for the scientific realm/computer science, where one would assume linear algebra is a foundation for many things; if its developers had found Julia to be that much slower, they surely would have had a look at numpy and improved their own implementation. It's probably the one thing Julia should be somewhat good at. ^^ I tested the two scripts (which I modified to output on stdout) and had about the same results, but only for very small matrix sizes and a mismatched thread count.
Experiments:
So I tried it with 10000 as a parameter for the matrix sizing instead:
Hm. Why Julia uses 8 threads on my 16-thread CPU, I have no idea; this should maybe be looked at by the Julia team, as should the fact that BLAS ignores the `--thread` option and the thread environment variable and instead has to be set manually in code or via the specific BLAS environment variable. I changed the code to allow specifying the number of threads and tested again, for 4, 8 and 16 threads each (10000 again as the parameter). Note that I ran every test only once. After the change to `calc(n + (i % 2))` (see below) I reran the tests, but the outcome was roughly the same in terms of relative time difference. The commands below show how you would call the final code; the parameters are `matrix sizing`, `number of iterations` and `number of threads to use`:

- `julia matmul-native.jl 10000 1 4`
- `python matmul-numpy.py 10000 1 4`
- `julia matmul-native.jl 10000 1 8`
- `python matmul-numpy.py 10000 1 8`
- `julia matmul-native.jl 10000 1 16`
- `python matmul-numpy.py 10000 1 16`
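One hedged note for anyone reproducing the thread experiments on the NumPy side: NumPy itself has no thread option; the BLAS backend's thread count is typically controlled via environment variables that must be set before NumPy is imported (which variable applies depends on the BLAS build, e.g. `OMP_NUM_THREADS` for OpenBLAS/OpenMP builds or `MKL_NUM_THREADS` for MKL). A minimal sketch:

```python
import os

# Must happen before `import numpy`; once the BLAS thread pool is
# initialized, these variables are no longer consulted.
os.environ.setdefault("OMP_NUM_THREADS", "4")
os.environ.setdefault("MKL_NUM_THREADS", "4")

import numpy as np

a = np.ones((500, 500))
b = np.ones((500, 500))
c = a @ b  # each entry is a dot of two all-ones vectors of length 500
```

This matches the complaint above that the thread count ends up being configured out-of-band rather than through a command-line flag.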
Then I looked into the Matmul Makefile and found out that 1500 was used there:

- `julia matmul-native.jl 1500 1 4`
- `python matmul-numpy.py 1500 1 4`
- `julia matmul-native.jl 1500 1 8`
- `python matmul-numpy.py 1500 1 8`
- `julia matmul-native.jl 1500 1 16`
- `python matmul-numpy.py 1500 1 16`
So, for larger matrices, Julia appears to be faster, given the same thread count. Even for the size tested here, I get slightly faster computation with Julia. Weirdly enough, my Python numbers weren't far off from the ones on the front page, but Julia on my machine performed far better than the roughly 200 ms with 8 threads in the benchmarks here; for me it only took 110 ms. Probably something for the Julia people to look into(?)
But what I need for my project is lots of small multiplications. So I added yet another parameter to set the number of times `calc()` should be called in a loop. I changed `calc(n)` to `calc(n + (i % 2))` to avoid optimization/unrolling, with the results getting summed up. I tested 100 as the matrix size:

- `python matmul-numpy.py 100 100000 1`
- `julia matmul-native.jl 100 100000 1`
- `python matmul-numpy.py 100 100000 2`
- `julia matmul-native.jl 100 100000 2`
- `python matmul-numpy.py 100 100000 4`
- `julia matmul-native.jl 100 100000 4`
- `python matmul-numpy.py 100 100000 8`
- `julia matmul-native.jl 100 100000 8`
- `python matmul-numpy.py 100 100000 16`
- `julia matmul-native.jl 100 100000 16`
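The repeated-call setup described above could look roughly like this in Python; `calc()` here is a hypothetical stand-in for the benchmark's function (the real one differs), kept only to show the `calc(n + (i % 2))` alternation:

```python
import numpy as np

def calc(n):
    # Hypothetical stand-in for the benchmark's calc(): build two
    # n-by-n matrices and return one element of their product.
    a = np.fromfunction(lambda i, j: (i - j) * (i + j) / n, (n, n))
    b = np.fromfunction(lambda i, j: (i - j) * (i + j) / n, (n, n))
    c = a @ b
    return c[n // 2, n // 2]

def run(n, iterations):
    total = 0.0
    for i in range(iterations):
        # Alternating between n and n + 1 keeps successive calls from
        # being identical, so no result can be cached or folded away.
        total += calc(n + (i % 2))
    return total
```

Summing the per-call results, as the experiment does, also forces every iteration to actually execute.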
Yes, numpy and Julia were both slower for me with more threads than with fewer; my guess is that this is due to thread creation/sleep overhead on such a small task. Julia peaked at 4 threads, numpy at 2.
Conclusion
I am unsure how valuable the current benchmark for matrix multiplication is. There seems to be a small overhead for the first call in Julia, and I would assume this to be the case for many other languages as well. The question I want to raise is: is a single small-ish matrix multiplication in the realm of 100 ms a benchmark that is of value to anyone, when successive calls and bigger multiplications are all much faster? From my point of view, no; it seems a bit misleading. I would thus suggest performing multiple sub-benchmarks and posting their results (and maybe a summed-up number for the "overall duration"): bigger matrices with no repetition, and smaller matrices with and without repetition. I think that would give better insight into how each language performs, and why.
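A minimal sketch of the suggested sub-benchmark structure (the sizes and repetition counts are illustrative assumptions, not the suite's parameters), with a warm-up call so that first-call overhead is excluded from the measurement:

```python
import time
import numpy as np

def bench(size, repetitions):
    # Time `repetitions` multiplications of two size-by-size matrices,
    # returning total elapsed seconds.
    a = np.random.rand(size, size)
    b = np.random.rand(size, size)
    a @ b  # warm-up call, deliberately not measured
    t = time.time()
    for _ in range(repetitions):
        a @ b
    return time.time() - t

# One large single-shot case and one small repeated case, per the
# suggestion above; a summed total could serve as the overall number.
cases = [(500, 1), (100, 200)]
timings = {(size, reps): bench(size, reps) for size, reps in cases}
overall = sum(timings.values())
```

Reporting the per-case timings alongside the overall sum would show both the large-matrix and repeated-small-matrix behavior at a glance.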
One thing I am not sure about: according to the main README, the CPU used in the benchmarks is an Intel(R) Core(TM) i7-10710U; that CPU has 12 threads. Why was Julia tested with 8 and not the full 12 threads (which, e.g., numpy will use by default)? My guess is that Julia just uses 8 threads by default and wasn't manually set to 12, but I don't know for certain.
Code: