
I've checked with ClickHouse and the result is better than I expected: it runs in 0.043 sec. on my machine, which is faster than any other result.

The code:

SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC FORMAT Null

or:

clickhouse-local --query "SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC" > /dev/null

It is using only a single thread.
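For readers who don't use ClickHouse, here is a rough Python sketch of what the query computes, as I read it: split each line on single spaces, lowercase, drop empty strings, count, and sort by count descending. The function name `count_words` and the sample lines are made up for illustration, and this says nothing about how ClickHouse executes the query internally.

```python
from collections import Counter


def count_words(lines):
    """Count space-separated, lowercased words across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # Split on single spaces only, mirroring splitByChar(' ', lower(line)).
        # Consecutive spaces produce empty strings, which are skipped here
        # the same way WHERE notEmpty(word) filters them out.
        for word in line.rstrip("\n").lower().split(" "):
            if word:
                counts[word] += 1
    # most_common() orders by count, highest first, like ORDER BY c DESC.
    return counts.most_common()


# Hypothetical sample input, not taken from the benchmark file.
sample = ["In the beginning God created", "the heaven and the  earth"]
result = count_words(sample)
```

Note that Python's `str.lower()` is Unicode-aware, so results could differ from the query on non-ASCII input.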




>it runs in 0.043 sec. on my machine, which is faster than any other result.

Did you run the other benchmarks on your machine as well?


Yes, but only the scripted ones, not the ones that need compilation:

Language | Simple | Optimized | Notes
`grep` | 0.03 | 0.03 | `grep` baseline; optimized sets `LC_ALL=C`
`wc -w` | 0.18 | 0.25 | `wc` baseline; optimized sets `LC_ALL=C`
SQL | 0.26 | | by Alexey Milovidov
Perl | 1.22 | | by Charles Randall
Python | 1.42 | 0.86 |
Tcl | 5.30 | | by William Ross
Shell | 9.66 | 1.79 | optimized does `LC_ALL=C sort -S 2G`


N.B. the Tcl script is absurdly inefficient. A single simple optimization cuts the run time in half.


I forgot to repeat the file 10 times. When I do, the result is 0.209 sec., which is still better than every other result.
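In case it helps anyone reproduce this, a minimal sketch of building the 10x input by writing the source file back to back ten times. The helper name `repeat_file` and the output filename in the usage note are my own, not from the benchmark.

```python
import shutil


def repeat_file(src_path, dst_path, times=10):
    """Write `times` back-to-back copies of src_path into dst_path."""
    with open(dst_path, "wb") as dst:
        for _ in range(times):
            # Reopen the source for every pass so each copy starts at byte 0.
            with open(src_path, "rb") as src:
                shutil.copyfileobj(src, dst)
```

Usage would look like `repeat_file("kjvbible.txt", "kjvbible_x10.txt")`, where the second filename is just a placeholder.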


You are also using a built-in function to read the file. The 'official' GitHub implementations have to accept the data line by line from stdin, which is likely slower than reading a file directly.
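One way to get a feel for how much the reading strategy matters is to time the same count done line by line and done on one bulk read. This is only a sketch under assumptions: it uses an in-memory `io.StringIO` as a stand-in for the input, so it does not capture pipe or stdin overhead at all, and the function names are made up.

```python
import io
import time
from collections import Counter


def count_line_by_line(stream):
    """Count words while iterating over the stream one line at a time."""
    counts = Counter()
    for line in stream:
        counts.update(line.lower().split())
    return counts


def count_whole(stream):
    """Count words after reading the whole stream in a single call."""
    return Counter(stream.read().lower().split())


def timed(fn, text):
    """Run fn on an in-memory stream; return (counts, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(io.StringIO(text))
    return result, time.perf_counter() - start
```

Both functions should return identical counts, so any difference in the elapsed times comes from how the input is consumed rather than from the counting itself.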



