
I am porting this Python code…

import numpy as np

with open(filename, 'r') as f:
    results = [np.array(line.strip().split(' ')[:-1], float)
               for line in filter(lambda l: l[0] != '#', f.readlines())]

…to Julia. I came up with:

results = [map(ss -> parse(Float64, ss), split(s, " ")[1:end-1])
        for s in filter(s -> s[1] !== '#', readlines(filename))];

The main reason for this porting is a potential performance gain, so I timed the two snippets in a Jupyter notebook:

  • Python, using %%timeit: 12.8 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each).
  • Julia, using @benchmark: among other statistics, a mean time of 8.250 ms (2.62% GC). So far so good; I do get a performance boost.
  • However, Julia's @time gives me something along the lines of 0.103095 seconds (130.44 k allocations: 11.771 MiB, 91.58% compilation time). From this thread I inferred that the overhead was probably caused by my -> function declarations; a minimal illustration of what I mean follows below.
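
A minimal sketch of the effect as I understand it (not my actual code, just an illustration): every top-level evaluation of a -> literal defines a fresh closure type, so a function like map has to be recompiled for it each time, whereas a named function only pays that cost once.

xs = rand(1000)

@time map(x -> 2x, xs)   # reports compilation time
@time map(x -> 2x, xs)   # still reports compilation time: the literal is a new type on every evaluation

double(x) = 2x
@time map(double, xs)    # compiles on the first call...
@time map(double, xs)    # ...and runs with no compilation overhead afterwards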

Indeed, if I replace my code with:

filt = s -> s[1] !== '#';
pars = ss -> parse(Float64, ss);
res = [map(pars, split(s, " ")[1:end-1])
        for s in filter(filt, readlines(filename))];

and time only the last line, I get a more encouraging 0.073007 seconds (60.58 k allocations: 7.988 MiB, 88.33% compilation time); hurray! However, this defeats the purpose of anonymous functions (at least as I understand it) and could leave me with a pile of f1, f2, f3, … For comparison, moving the Python lambda out of the list comprehension and giving it a name does not noticeably change the Python runtime.

My question is: to get normal performance, should I systematically name my Julia functions? Note that this particular snippet is to be called in a loop over ~30k files. (Basically, what I am doing is reading files that are mixtures of space-separated floats and comment lines; each float line can have a different length, and I am not interested in the last element of the line. Any comments on my solutions are appreciated.)

(Side comment: wrapping s with strip completely messes up @benchmark, adding 10 ms to the mean, but does not seem to affect @time. Any reason why?)
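
For completeness: BenchmarkTools recommends interpolating non-constant globals such as filename into the benchmarked expression with $, so that access to the global is not part of what gets measured. I do not know whether this explains the strip discrepancy, but here is the sketch I would use to rule it out:

using BenchmarkTools
# Interpolate the global `filename` with `$` so @benchmark treats it as a constant:
@benchmark [map(ss -> parse(Float64, ss), split(s, " ")[1:end-1])
        for s in filter(s -> s[1] !== '#', readlines($filename))]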


Edit: Putting everything in a function, as suggested by DNF, fixes my "have to name my anonymous functions" problem. Using one of Vincent Yu's formulations:

function results(filename::String)::Vector{Vector{Float64}}
    [[parse(Float64, s) for s in @view split(line, ' ')[1:end-1]]
        for line in Iterators.filter(!startswith('#'), eachline(filename))]
end

@benchmark results(FN)
BenchmarkTools.Trial: 
  memory estimate:  3.74 MiB
  allocs estimate:  1465
  --------------
  minimum time:     7.108 ms (0.00% GC)
  median time:      7.458 ms (0.00% GC)
  mean time:        7.580 ms (1.58% GC)
  maximum time:     9.538 ms (14.84% GC)
  --------------
  samples:          659
  evals/sample:     1

@time called on this function reports equivalent numbers after the first run, which triggers compilation. I am happy with that.
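
For context, this is roughly how the function is meant to be used over the ~30k files; filenames below is just a placeholder for my actual list of paths:

# `results` is compiled once and then reused for every file in the loop.
all_results = [results(fn) for fn in filenames]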

However, my issue with strip persists:

function results_strip(filename::String)::Vector{Vector{Float64}}
    [[parse(Float64, s) for s in @view split(strip(line), ' ')[1:end-1]]
        for line in Iterators.filter(!startswith('#'), eachline(filename))]
end

@benchmark results_strip(FN)
BenchmarkTools.Trial: 
  memory estimate:  3.74 MiB
  allocs estimate:  1465
  --------------
  minimum time:     15.155 ms (0.00% GC)
  median time:      15.742 ms (0.00% GC)
  mean time:        15.885 ms (0.75% GC)
  maximum time:     19.089 ms (10.02% GC)
  --------------
  samples:          315
  evals/sample:     1

The median time more than doubles. If I time strip on its own:

function only_strip(filename::String)
    [strip(line) for line in Iterators.filter(!startswith('#'), eachline(filename))]
end

@benchmark only_strip(FN)
BenchmarkTools.Trial: 
  memory estimate:  1.11 MiB
  allocs estimate:  475
  --------------
  minimum time:     223.868 μs (0.00% GC)
  median time:      258.227 μs (0.00% GC)
  mean time:        325.389 μs (9.41% GC)
  maximum time:     56.024 ms (75.09% GC)
  --------------
  samples:          10000
  evals/sample:     1

The figures just do not add up. Could there be a type mismatch? Should I cast the results to something else?
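
In case it helps, this is the kind of diagnostic I have in mind (just a sketch, using the same test file FN as above):

line = first(Iterators.filter(!startswith('#'), eachline(FN)))
@show typeof(line)                      # eachline yields String
@show typeof(strip(line))               # strip returns a SubString{String}
@show eltype(split(line, ' '))          # element type parse sees without strip
@show eltype(split(strip(line), ' '))   # element type parse sees with strip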