Effect of taskset on FermiGrid jobs

Adam Lyon, March 2018

Abstract

Using taskset -c 0 ... on FermiGrid leads to self-inflicted low efficiencies: it appears to constrain all jobs that use that command to the same core on the node, even though the jobs run in separate Docker containers. In addition, FermiGrid monitoring sometimes reports the User CPU time incorrectly.

Skip to the executive summary.

Introduction

Muon g-2 runs "unpacking" jobs to convert the MIDAS (our DAQ system) raw binary data files into art ROOT files in preparation for reconstruction. Because we have 24 independent calorimeters, we made our calorimeter unpacking module multi-threaded so that it could unpack the calorimeter banks in parallel. Running these jobs on FermiGrid was especially difficult because of very large memory needs: nearly all jobs required 8 GB and some required 12 GB or more. For these jobs to run at all, we needed to request 4 or 6 cores in the job submission. The CPU efficiency for these jobs reported by FermiGrid was very low (< 20%), and the experiment had a very difficult time running these important jobs.

Muon g-2 formed a task force to address this problem and discovered the following:

  • The threading system (Intel Threading Building Blocks, TBB) is initialized by art and asks to start the "default" number of threads
  • On FermiGrid, the default number of threads is the maximum for the node (e.g. 52 threads for a 26-core hyperthreaded node), even if we only asked for 6 cores (FermiGrid supplies an environment variable containing the requested number of cores)
  • There is a memory allocation per core, and that seemed to be a cause of our large memory footprint
  • Because almost all of the CPU time was going to single-threaded ROOT I/O (especially compression), the extra threads were idle almost all the time, leading to the very low efficiency

Since multithreading is not useful for these jobs, the solution is to run single-threaded. We would very much like to avoid changing our g-2 code, so there are three options for externally constraining the job to one thread:

  1. Set a FHiCL parameter in art to specify the number of threads. Due to a bug, this parameter does not work in the current release of g-2 code. We are moving as fast as possible to a new release where this parameter will work.
  2. Use the numactl command to constrain the number of cores. This command is not available on FermiGrid.
  3. Use the taskset command to constrain an executable to a particular core.

The only option that currently works for us is #3. In our job script, we start the art executable (called gm2 in our case) this way,

taskset -c 0 gm2 ...

We have implemented this change for our unpacking jobs. We saw our efficiency improve greatly, but there was still a wide range of efficiencies for the jobs, from low to high. Let's create a controlled environment and see if taskset is doing something evil.

In [2]:
# Load Julia packages
using AnalyzeJobs  # This is my package that does elastic searches for job information
using DataFrames
using DataFramesMeta
using StatPlots
using FreqTables
using Lazy: @>

AllCPU code

I devised a job that takes 100% of a core with no I/O and no CVMFS access. Here is the C code, built with the native gcc compiler,

/* Build with gcc -std=c99 allCPU.c -o allCPU */

#include <stdio.h>
#include <time.h>
#include <stdlib.h>

// the function f() does some time-consuming work
void f()
{
  volatile double d = 0;
  for(int n=0; n<10000; ++n)
    for(int m=0; m<10000; ++m)
      d += d*n*m;
}

int main(int argc, char* argv[])
{
  int nSec;
  clock_t t, t0;
  double time_taken;

  t0 = clock();

  if (argc != 2 ) {
    printf("Usage: allCPU <seconds>\n");
    return 1;
  }

  nSec = atoi(argv[1]);
  printf("Running for at least %d seconds\n", nSec);

  while ( 1 == 1 ) {

      f();

      t = clock() - t0;
      time_taken = ((double)t)/CLOCKS_PER_SEC; // in seconds

      if (time_taken >= nSec ) break;

  }

  printf("f() took %f CPU time seconds to execute \n", time_taken);

  return 0;
}

The job runs for at least as many seconds as the argument specifies (maybe a little more, depending on how long the f() function takes). The job script is quite simple, with the important part being,

# Non-taskset job
time ./allCPU 300

or

# taskset job
time taskset -c 0 ./allCPU 300

Hence, the job will run for at least five minutes with or without taskset.

I ran 200 jobs of each variety (non-taskset and taskset) on FermiGrid.

We can examine the performance and FermiGrid metrics for these jobs.

Non-taskset jobs (4380837)

I ran 200 jobs without the use of taskset (see above).

In [224]:
nonTasksetJobs = jobInfoMain(4380837,"2018.03.06", "jobsub01.fnal.gov") ; head(nonTasksetJobs)  # Just show the first 6 entries
Out[224]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeExitBySignalJobStartDateJobsubJobIdMATCH_GLIDEIN_SiteMachineAttrMachine0OwnerRequestCpusResidentSetSize_RAWRSSsyscpuusercpuwallTimeeff
115203256330.0288.0304.0false1520325329[email protected]FermiGridfnpc9035.fnal.govlyon16446440.0288.0304.00.947368
215203256390.0287.0307.0false1520325332[email protected]FermiGridfnpc6022.fnal.govlyon16366360.0287.0307.00.934853
315203256390.0287.0307.0false1520325332[email protected]FermiGridfnpc17159.fnal.govlyon16366360.0287.0307.00.934853
415203256390.0287.0305.0false1520325334[email protected]FermiGridfnpc5008.fnal.govlyon16406400.0287.0305.00.940984
515203256420.0286.0308.0false1520325334[email protected]FermiGridfnpc7008.fnal.govlyon16406400.0286.0308.00.928571
615203256420.0286.0310.0false1520325332[email protected]FermiGridfnpc5003.fnal.govlyon16446440.0286.0310.00.922581
In [3]:
h1 = @df nonTasksetJobs histogram(:usercpu, nbins=20, line=nothing, legend=false, 
                                  title="Non-taskset 4380837", ylab="counts")
h2 = @df nonTasksetJobs histogram(:usercpu, nbins=20, line=nothing, legend=false,
                                  yaxis=:log10, xlab="FermiGrid User CPU time (s)", ylab="counts (log10)")
plot(h1, h2, layout=(2,1))
Out[3]:
[Plot: Non-taskset 4380837; histograms of FermiGrid User CPU time (s), counts on linear and log10 scales]

This looks strange. The jobs were all supposed to run for 300s (5min) at least. Let's look at the wall clock time.

In [4]:
h1 = @df nonTasksetJobs histogram(:wallTime, nbins=20, line=nothing, legend=false, 
                                  title="Non-taskset 4380837", ylab="counts")
h2 = @df nonTasksetJobs histogram(:wallTime, nbins=20, line=nothing, legend=false,
                                  yaxis=:log10, xlab="FermiGrid Wall clock time (s)", ylab="counts (log10)")
plot(h1, h2, layout=(2,1))
Out[4]:
[Plot: Non-taskset 4380837; histograms of FermiGrid Wall clock time (s), counts on linear and log10 scales]

Let's see what the jobs themselves say. I have all of the job log files, and I ran the following to extract the executable wall and user times, respectively,

grep '^real' jobScript.sh_20180306_022044_2665212_0_1_*.err > nontaskset_wall.txt
grep '^user' jobScript.sh_20180306_022044_2665212_0_1_*.err > nontaskset_user.txt

A line in one of these files looks like

jobScript.sh_20180306_022044_2665212_0_1_cluster.4380837.119.err:user\t5m0.652s

So we need to pull out the relevant parts. Let's write a function to make those data frames.

In [5]:
const exeExtractRE = r"cluster\.(\d+)\.(\d+).+\t(\d+)m([\d\.]+)*s"  # captures cluster, process, minutes, seconds
function exeTimingDF(filename)
    # Build a data frame of job id and time (in seconds) from a grep'd timing file
    df = DataFrame([String, Float64], [:jobid, :sec], 0)
    open(filename) do f
        for l in eachline(f)
            R = match(exeExtractRE, l)
            cluster, part, min, sec = R.captures
            jobid = "$cluster.$part"                                  # e.g. "4380837.119"
            theTime = parse(Float64, min)*60.0 + parse(Float64, sec)  # convert minutes and seconds to seconds
            append!(df, DataFrame(jobid=jobid, sec=theTime))
        end
    end
    df
end
Out[5]:
exeTimingDF (generic function with 1 method)
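
As a quick sanity check, the regular expression pulls the cluster number, process number, minutes, and seconds out of a line like the example above (a small sketch; sampleLine is just that example line):

# Apply exeExtractRE to the sample log line; captures are cluster, process, minutes, seconds
sampleLine = "jobScript.sh_20180306_022044_2665212_0_1_cluster.4380837.119.err:user\t5m0.652s"
m = match(exeExtractRE, sampleLine)
m.captures                                                           # expect ("4380837", "119", "5", "0.652")
parse(Float64, m.captures[3])*60.0 + parse(Float64, m.captures[4])   # 300.652 seconds
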
In [6]:
userExeTimeNontaskset = exeTimingDF("nontaskset_user.txt") ; head(userExeTimeNontaskset)
Out[6]:
jobidsec
14380837.0299.304
24380837.1300.668
34380837.10300.176
44380837.100300.47
54380837.101300.176
64380837.102300.347
In [7]:
@df userExeTimeNontaskset histogram(:sec, line=nothing, legend=false, title="Non-taskset 4380837", xlab="User CPU time from Executable (s)")
Out[7]:
[Plot: Non-taskset 4380837; histogram of User CPU time from Executable (s), all values near 300 s]

That's 5 minutes!

And look at the Wall clock time reported by the executable.

In [8]:
wallExeTimeNontaskset = exeTimingDF("nontaskset_wall.txt") 
@df wallExeTimeNontaskset histogram(:sec, line=nothing, legend=false, title="Non-taskset 4380837", xlab="Wall time from Executable (s)")
Out[8]:
[Plot: Non-taskset 4380837; histogram of Wall time from Executable (s), roughly 300 to 306 s]

Let's do a direct comparison at the job level. This will require some gymnastics.

In [9]:
function mergeDFs(jobInfo, userExe, wallExe)
    # Extract out the job id and put it in a new column :jobid (this matches the exe data frames)
    jobInfoWithIds = @transform jobInfo jobid=map(i -> match(r"[\d\.]+", i).match, :JobsubJobId)

    # Join the job information data frame with the executable timing data frames
    outDF = join(jobInfoWithIds, userExe, on=:jobid, kind=:inner)
    outDF = join(outDF,          wallExe, on=:jobid, kind=:inner, makeunique=true)

    # Fix some column names
    rename!(outDF, [:sec => :exeCpu, :sec_1 => :exeWall])
    
    outDF
end

nonTasksetJobsWithExe = mergeDFs(nonTasksetJobs, userExeTimeNontaskset, wallExeTimeNontaskset)
head(nonTasksetJobsWithExe)
Out[9]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeExitBySignalJobStartDateJobsubJobIdMATCH_GLIDEIN_SiteMachineAttrMachine0OwnerRequestCpusResidentSetSize_RAWRSSsyscpuusercpuwallTimeeffjobidexeCpuexeWall
115203256330.0288.0304.0false1520325329[email protected]1.fnal.govFermiGridfnpc9035.fnal.govlyon16446440.0288.0304.00.9473684380837.168300.464300.629
215203256330.0287.0304.0false1520325329[email protected]FermiGridfnpc17138.fnal.govlyon16366360.0287.0304.00.9440794380837.169300.247300.74
315203256350.0285.0305.0false1520325330[email protected]FermiGridfnpc7306.fnal.govlyon16366360.0285.0305.00.9344264380837.170300.099300.242
415203256360.0286.0305.0false1520325331[email protected]FermiGridfnpc17157.fnal.govlyon16366360.0286.0305.00.9377054380837.171299.889300.382
515203256380.0289.0305.0false1520325333[email protected]FermiGridfnpc9035.fnal.govlyon16406400.0289.0305.00.9475414380837.179300.237300.369
615203256380.0286.0306.0false1520325332[email protected]FermiGridfnpc9015.fnal.govlyon16366360.0286.0306.00.9346414380837.173300.256300.631
In [10]:
@assert nrow(nonTasksetJobs) == nrow(nonTasksetJobsWithExe)  # Check that we didn't lose any rows
In [251]:
userTimeDiff = @df nonTasksetJobsWithExe histogram(:usercpu - :exeCpu, legend=false, line=nothing,
               xlab="FermiGrid reported User CPU time - Executable reported User CPU time (s)")
Out[251]:
[Plot: histogram of FermiGrid reported User CPU time minus Executable reported User CPU time (s), ranging from about -60 s to 0 s]
In [250]:
savefig("incorrectCPUTime.png")

I'm assuming that the executable CPU time is definitive. FermiGrid monitoring got the CPU time close to correct for only one job. The vast majority of jobs were off by about 15 s, and several jobs were off by more than a minute! This will affect the FermiGrid reported efficiency.
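
To get a feel for how much this matters for a five-minute job, here is a rough illustration with assumed round numbers (chosen to resemble the typical offsets above, not taken from any particular job):

# Effect of an undercounted User CPU time on the FermiGrid-reported efficiency
wall    = 305.0                 # typical FermiGrid wall time (s)
trueCpu = 300.0                 # CPU time the executable actually used (s)
trueCpu/wall                    # ≈ 0.98, the "real" efficiency
(trueCpu - 15.0)/wall           # ≈ 0.93, with a 15 s undercount
(trueCpu - 60.0)/wall           # ≈ 0.79, with a one-minute undercount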

In [12]:
@df nonTasksetJobsWithExe scatter(:wallTime, :exeWall, legend=false, title="Non-taskset 4380837 (note the axis ranges)",
                                             xlab="FermiGrid Wall time (s)", ylab="Executable Wall time (s)" )
Out[12]:
[Plot: Non-taskset 4380837 (note the axis ranges); FermiGrid Wall time (s) vs Executable Wall time (s)]

This plot looks roughly as expected. The FermiGrid Wall time should be a bit longer than the executable Wall time to account for transferring the job payload to the worker node and other overhead. Perhaps the points on the far right of the plot correspond to slow transfers.
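
A quick numeric check of that overhead (a sketch; here overhead just means the FermiGrid wall time minus the executable wall time):

# Difference between FermiGrid wall time and executable wall time for the non-taskset jobs
overhead = nonTasksetJobsWithExe[:wallTime] .- nonTasksetJobsWithExe[:exeWall]
extrema(overhead), median(overhead)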

Again, if we take the executable times as definitive, then the efficiency is...

In [254]:
nonTasksetExeEff = @df nonTasksetJobsWithExe histogram(:exeCpu./:exeWall, line=nothing, legend=false, 
                                    title="Non-taskset 4380837", xlab="Efficiency as reported by the executable")
Out[254]:
[Plot: Non-taskset 4380837; histogram of efficiency as reported by the executable, roughly 0.98 to 1.00]

That looks right.

In [14]:
@df nonTasksetJobsWithExe histogram(:eff, line=nothing, legend=false, 
                                    title="Non-taskset 4380837", xlab="Efficiency as reported by FermiGrid")
Out[14]:
[Plot: Non-taskset 4380837; histogram of efficiency as reported by FermiGrid, roughly 0.75 to 0.95]

The FermiGrid reported efficiency has many low values due to the mistakes in the monitoring of the CPU time.

Taskset jobs (4380943)

These jobs had taskset -c 0 set, constraining the job to "core #0" (one would think that would be core #0 inside the Docker container, but it looks like it is really core #0 on the node).
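
As an aside, one way to confirm the pinning from inside a running job would be to ask the kernel which CPUs the process may use (a sketch I did not run as part of these tests; it assumes taskset is on the PATH and a Linux /proc filesystem):

# Print the CPU affinity of the current process two ways
println(readstring(`taskset -cp $(getpid())`))     # e.g. "pid 12345's current affinity list: 0"
for l in eachline("/proc/self/status")
    startswith(l, "Cpus_allowed_list") && println(l)
end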

In [15]:
tasksetJobs = jobInfoMain(4380943,"2018.03.06", "jobsub01.fnal.gov") ; head(tasksetJobs)
Out[15]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeExitBySignalJobStartDateJobsubJobIdMATCH_GLIDEIN_SiteMachineAttrMachine0OwnerRequestCpusResidentSetSize_RAWRSSsyscpuusercpuwallTimeeff
115203281520.0301.02672.0false1520325480[email protected]FermiGridfnpc17153.fnal.govlyon16366360.0301.02672.00.11265
215203281550.0300.02668.0false1520325487[email protected]FermiGridfnpc9086.fnal.govlyon16366360.0300.02668.00.112444
315203281660.0301.02633.0false1520325533[email protected]FermiGridfnpc8202.fnal.govlyon16406400.0301.02633.00.114318
415203283460.0301.02936.0false1520325410[email protected]FermiGridfnpc17113.fnal.govlyon16366360.0301.02936.00.10252
515203284120.0301.02617.0false1520325795[email protected]FermiGridfnpc17153.fnal.govlyon16366360.0301.02617.00.115017
615203279881.0300.02349.0false1520325639[email protected]FermiGridfnpc6022.fnal.govlyon16366361.0300.02349.00.12814
In [16]:
nrow(tasksetJobs)
Out[16]:
182

Hmm. We didn't get all 200 of the jobs that I launched; I'll look into that later. This time we'll concentrate on the timing information from the executable first.

In [17]:
userExeTimeTaskset = exeTimingDF("taskset_user.txt") ; 
wallExeTimeTaskset = exeTimingDF("taskset_wall.txt") ;
tasksetJobsWithExe = mergeDFs(tasksetJobs, userExeTimeTaskset, wallExeTimeTaskset)
head(tasksetJobsWithExe)
Out[17]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeExitBySignalJobStartDateJobsubJobIdMATCH_GLIDEIN_SiteMachineAttrMachine0OwnerRequestCpusResidentSetSize_RAWRSSsyscpuusercpuwallTimeeffjobidexeCpuexeWall
115203281520.0301.02672.0false1520325480[email protected]FermiGridfnpc17153.fnal.govlyon16366360.0301.02672.00.112654380943.38300.1152665.74
215203281550.0300.02668.0false1520325487[email protected]FermiGridfnpc9086.fnal.govlyon16366360.0300.02668.00.1124444380943.49299.9782654.93
315203281660.0301.02633.0false1520325533[email protected]FermiGridfnpc8202.fnal.govlyon16406400.0301.02633.00.1143184380943.101300.2182625.55
415203283460.0301.02936.0false1520325410[email protected]FermiGridfnpc17113.fnal.govlyon16366360.0301.02936.00.102524380943.26299.9322930.22
515203284120.0301.02617.0false1520325795[email protected]FermiGridfnpc17153.fnal.govlyon16366360.0301.02617.00.1150174380943.190299.9482612.48
615203279881.0300.02349.0false1520325639[email protected]FermiGridfnpc6022.fnal.govlyon16366361.0300.02349.00.128144380943.146299.8962343.95

Let's look at the efficiency as reported by the executable.

In [18]:
l = @layout [ h1 h2 
               h3  ]
h1 = @df tasksetJobsWithExe histogram(:exeCpu, nbin=20, legend=false, line=nothing,
                     title="Taskset jobs 4380943", xlab="Executable CPU time (s)")
h2 = @df tasksetJobsWithExe histogram(:exeWall, nbin=20, legend=false, line=nothing, xlab="Executable Wall time (s)")
h3 = @df tasksetJobsWithExe histogram(:exeCpu./:exeWall, nbin=20, legend=false, line=nothing, xlab="Efficiency reported by executable")
plot(h1, h2, h3, layout=l)
Out[18]:
[Plots: Taskset jobs 4380943; histograms of Executable CPU time (s), Executable Wall time (s), and efficiency reported by executable]

Whoa! That looks quite bad! What's going on here? Burt suggested that the low efficiency could be due to some of these jobs running on the same node and all getting core zero. Since all of these jobs specify taskset -c 0, all of those that land on the same node may run on the same core and interfere with each other. Therefore, assuming that no one else on FermiGrid runs with taskset, we should see high efficiency for jobs that are alone on a node and poor efficiency for jobs that share a node. Let's figure that out.
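
If that theory were the whole story, the naive expectation (a sketch that assumes the competing jobs fully overlap in time) would be an efficiency of roughly 1/k for k taskset jobs sharing core #0:

# Naive expected efficiency when k CPU-bound jobs are pinned to the same core
naiveEff(k) = 1/k
naiveEff.(1:5)     # 1.0, 0.5, 0.333, 0.25, 0.2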

Taskset jobs - look for multiple jobs on same nodes

In [19]:
# Turn the machine name into a categorical variable
tasksetJobsWithExe = @transform tasksetJobsWithExe node=categorical(:MachineAttrMachine0) ; head(tasksetJobsWithExe)
Out[19]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeExitBySignalJobStartDateJobsubJobIdMATCH_GLIDEIN_SiteMachineAttrMachine0OwnerRequestCpusResidentSetSize_RAWRSSsyscpuusercpuwallTimeeffjobidexeCpuexeWallnode
115203281520.0301.02672.0false1520325480[email protected]FermiGridfnpc17153.fnal.govlyon16366360.0301.02672.00.112654380943.38300.1152665.74fnpc17153.fnal.gov
215203281550.0300.02668.0false1520325487[email protected]FermiGridfnpc9086.fnal.govlyon16366360.0300.02668.00.1124444380943.49299.9782654.93fnpc9086.fnal.gov
315203281660.0301.02633.0false1520325533[email protected]FermiGridfnpc8202.fnal.govlyon16406400.0301.02633.00.1143184380943.101300.2182625.55fnpc8202.fnal.gov
415203283460.0301.02936.0false1520325410[email protected]FermiGridfnpc17113.fnal.govlyon16366360.0301.02936.00.102524380943.26299.9322930.22fnpc17113.fnal.gov
515203284120.0301.02617.0false1520325795[email protected]FermiGridfnpc17153.fnal.govlyon16366360.0301.02617.00.1150174380943.190299.9482612.48fnpc17153.fnal.gov
615203279881.0300.02349.0false1520325639[email protected]FermiGridfnpc6022.fnal.govlyon16366361.0300.02349.00.128144380943.146299.8962343.95fnpc6022.fnal.gov

How many of our jobs landed on a particular node?

In [20]:
bynode = @> by(tasksetJobsWithExe, :node, nrow) rename!(:x1 => :nNodes) sort!(cols=order(:nNodes))
tail(bynode)
Out[20]:
nodenNodes
1fnpc7026.fnal.gov3
2fnpc7005.fnal.gov3
3fnpc9022.fnal.gov3
4fnpc7003.fnal.gov3
5fnpc5013.fnal.gov4
6fnpc17109.fnal.gov5

Merge that back into the data frame

In [21]:
nrow(bynode)
Out[21]:
124
In [22]:
tasksetJobsWithExeN = join(tasksetJobsWithExe, bynode, on=:node, kind=:left) ; head(tasksetJobsWithExeN)
Out[22]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeExitBySignalJobStartDateJobsubJobIdMATCH_GLIDEIN_SiteMachineAttrMachine0OwnerRequestCpusResidentSetSize_RAWRSSsyscpuusercpuwallTimeeffjobidexeCpuexeWallnodenNodes
115203281520.0301.02672.0false1520325480[email protected]FermiGridfnpc17153.fnal.govlyon16366360.0301.02672.00.112654380943.38300.1152665.74fnpc17153.fnal.gov3
215203281550.0300.02668.0false1520325487[email protected]FermiGridfnpc9086.fnal.govlyon16366360.0300.02668.00.1124444380943.49299.9782654.93fnpc9086.fnal.gov3
315203281660.0301.02633.0false1520325533[email protected]FermiGridfnpc8202.fnal.govlyon16406400.0301.02633.00.1143184380943.101300.2182625.55fnpc8202.fnal.gov2
415203283460.0301.02936.0false1520325410[email protected]FermiGridfnpc17113.fnal.govlyon16366360.0301.02936.00.102524380943.26299.9322930.22fnpc17113.fnal.gov1
515203284120.0301.02617.0false1520325795[email protected]al.govFermiGridfnpc17153.fnal.govlyon16366360.0301.02617.00.1150174380943.190299.9482612.48fnpc17153.fnal.gov3
615203279881.0300.02349.0false1520325639[email protected]FermiGridfnpc6022.fnal.govlyon16366361.0300.02349.00.128144380943.146299.8962343.95fnpc6022.fnal.gov3

For example,

In [23]:
@> @where(tasksetJobsWithExeN, :node .== "fnpc17153.fnal.gov") @select(:JobsubJobId, :node, :nNodes, :exeCpu, :exeWall)
Out[23]:
JobsubJobIdnodenNodesexeCpuexeWall
1[email protected]fnpc17153.fnal.gov3300.1152665.74
2[email protected]fnpc17153.fnal.gov3299.9482612.48
3[email protected]fnpc17153.fnal.gov3300.0632647.57

Is there a correlation between the efficiency and the number of jobs that were running on a particular node?

In [226]:
@df tasksetJobsWithExeN scatter(:nNodes, :exeCpu./:exeWall, legend=false, ylim=(0, 1),
                                 title="Taskset jobs 4380943",
                                 xlab="Number of 4380943 jobs running on a particular node", ylab="efficiency reported by executable" )
Out[226]:
[Plot: Taskset jobs 4380943; efficiency reported by executable vs number of 4380943 jobs running on a particular node]

So if the theory is correct that we see poor efficiency because jobs assigned to the same node compete for core #0, then we should see high efficiency when one of our jobs is alone on a node, ~50% efficiency when two share a node, and so on. That is not what we see. When our job lands on a node alone, that job can still have low efficiency, and some jobs that shared a node can still have high efficiency. This theory is not supported by these data.

Let's look at the FermiGrid reported efficiency, with all its warts.

In [227]:
@df tasksetJobsWithExeN scatter(:nNodes, :eff, legend=false, ylim=(0, 1),
                                 title="Taskset jobs 4380943",
                                 xlab="Number of 4380943 jobs running on a particular node", ylab="efficiency reported by FermiGrid" )
Out[227]:
[Plot: Taskset jobs 4380943; FermiGrid reported efficiency vs number of 4380943 jobs running on a particular node]

Taskset jobs - Look for other g-2 jobs running on the same nodes

What if there were other jobs running under gm2pro? Those would be running with taskset -c 0 too. Let's find out. We know when our jobs started and ended; let's get the full range.

In [41]:
earliestStart = @> parse.(Int,tasksetJobsWithExeN[:JobStartDate]) minimum
@> earliestStart Dates.unix2datetime
Out[41]:
2018-03-06T08:35:56
In [44]:
latestEnd = @> parse.(Int, tasksetJobsWithExeN[:CompletionDate]) maximum
@> latestEnd Dates.unix2datetime
Out[44]:
2018-03-06T09:38:38

These times are in UTC, but that's OK; we'll keep them in Unix time. The range isn't all that big (just about an hour).

In [46]:
(latestEnd - earliestStart)/60
Out[46]:
62.7

Let's look for all gm2pro jobs (and jobs run by Anna, as she was also trying taskset) that occurred during that time window and where they were running. We want to find jobs that started on or before latestEnd AND ended on or after earliestStart. We'll need to use the basic query functions of AnalyzeJobs.jl.
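
Written out as a small helper, the overlap condition looks like this (a sketch; the Elasticsearch query below applies the same logic directly):

# A job overlaps our time window if it started before our latest end
# AND completed after our earliest start
overlapsWindow(jobStart, jobEnd) = jobStart <= latestEnd && jobEnd >= earliestStart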

In [126]:
otherJobQuery = AnalyzeJobs.makeQueryJSON(
    "CompletionDate:>=$earliestStart AND JobStartDate:<=$latestEnd AND Owner:(gm2pro OR adriutti) AND "*
                "CumulativeRemoteUserCpu:>=1 AND MATCH_GLIDEIN_Site:FermiGrid",
    ["JobsubJobId","JobStartDate","CompletionDate","Owner","MachineAttrMachine0"], 5000)
otherJobRes = AnalyzeJobs.exeElasQuery(AnalyzeJobs.urlBase, otherJobQuery, "fifebatch-history-2018.03.*")
otherJobDF = AnalyzeJobs.resultsAsDF(otherJobRes)
otherJobDF = @transform(otherJobDF, 
                            Started=Dates.unix2datetime.(parse.(Int, :JobStartDate)),
                            Completed=Dates.unix2datetime.(parse.(Int, :CompletionDate))
            )
head(otherJobDF)
Out[126]:
CompletionDateJobStartDateJobsubJobIdMachineAttrMachine0OwnerStartedCompleted
115203259311520321057[email protected]fnpc7204.fnal.govgm2pro2018-03-06T07:24:172018-03-06T08:45:31
215203259681520322339[email protected]fnpc7304.fnal.govgm2pro2018-03-06T07:45:392018-03-06T08:46:08
315203259711520318900[email protected]fnpc17139.fnal.govgm2pro2018-03-06T06:48:202018-03-06T08:46:11
415203259721520317558[email protected]fnpc17131.fnal.govgm2pro2018-03-06T06:25:582018-03-06T08:46:12
515203256301520319349[email protected]fnpc9045.fnal.govgm2pro2018-03-06T06:55:492018-03-06T08:40:30
615203256461520323479[email protected]fnpc6010.fnal.govgm2pro2018-03-06T08:04:392018-03-06T08:40:46

How many?

In [71]:
nrow(otherJobDF)
Out[71]:
1531

Let's look for our jobs that ran on a machine that didn't get any of the gm2pro/Anna jobs.

In [119]:
ourOneNode = @where(tasksetJobsWithExeN, :nNodes .== 1)[:MachineAttrMachine0];
In [120]:
ranTruelyAlone = setdiff(ourOneNode, otherJobDF[:MachineAttrMachine0])
Out[120]:
6-element Array{String,1}:
 "fnpc6020.fnal.gov"
 "fnpc5028.fnal.gov"
 "fnpc9029.fnal.gov"
 "fnpc7013.fnal.gov"
 "fnpc4571.fnal.gov"
 "fnpc4560.fnal.gov"
In [117]:
@> filter( r -> r[:MachineAttrMachine0] in ranTruelyAlone, tasksetJobsWithExeN) @select :node :eff :exeCpu :exeWall
Out[117]:
nodeeffexeCpuexeWall
1fnpc6020.fnal.gov0.934853300.218301.466
2fnpc5028.fnal.gov0.919614299.033302.947
3fnpc9029.fnal.gov0.902516300.356300.731
4fnpc7013.fnal.gov0.925566299.864302.073
5fnpc4571.fnal.gov0.919614300.438302.485
6fnpc4560.fnal.gov0.949527300.873306.287

Those are my taskset jobs that each ran alone on their node. Note the very high efficiency.

All g-2 jobs with taskset

Let's get all of the taskset jobs (mine and the other g-2 jobs) that ran while I was running my jobs.

In [127]:
allJobQuery = AnalyzeJobs.makeQueryJSON(
    "CompletionDate:>=$earliestStart AND JobStartDate:<=$latestEnd AND Owner:(gm2pro OR adriutti OR lyon) " * 
                  "AND CumulativeRemoteUserCpu:>=1 AND MATCH_GLIDEIN_Site:FermiGrid",
    ["JobsubJobId","JobStartDate","CompletionDate","Owner","MachineAttrMachine0", 
                                "CumulativeSlotTime", "CumulativeRemoteSysCpu", "CumulativeRemoteUserCpu"], 5000)
allJobRes = AnalyzeJobs.exeElasQuery(AnalyzeJobs.urlBase, allJobQuery, "fifebatch-history-2018.03.*")
allJobDF = AnalyzeJobs.resultsAsDF(allJobRes)
allJobDF = @transform(allJobDF, 
                            Started=Dates.unix2datetime.(parse.(Int, :JobStartDate)),
                            Completed=Dates.unix2datetime.(parse.(Int, :CompletionDate))
            )
head(allJobDF)
Out[127]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeJobStartDateJobsubJobIdMachineAttrMachine0OwnerStartedCompleted
115203259630.0286.0309.01520325654[email protected]fnpc7013.fnal.govlyon2018-03-06T08:40:542018-03-06T08:46:03
215203259710.0286.0308.01520325663[email protected]fnpc4528.fnal.govlyon2018-03-06T08:41:032018-03-06T08:46:11
315203259720.0300.0565.01520325407[email protected]fnpc9080.fnal.govlyon2018-03-06T08:36:472018-03-06T08:46:12
415203259742.0284.0311.01520325663[email protected]fnpc5028.fnal.govlyon2018-03-06T08:41:032018-03-06T08:46:14
515203259910.0287.0318.01520325673[email protected]fnpc9029.fnal.govlyon2018-03-06T08:41:132018-03-06T08:46:31
615203256330.0288.0304.01520325329[email protected]fnpc9035.fnal.govlyon2018-03-06T08:35:292018-03-06T08:40:33
In [124]:
nrow(allJobDF)
Out[124]:
1829
In [128]:
# Turn the machine name into a categorical variable
allJobDF = @transform allJobDF node=categorical(:MachineAttrMachine0) ; head(allJobDF)
Out[128]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeJobStartDateJobsubJobIdMachineAttrMachine0OwnerStartedCompletednode
115203259630.0286.0309.01520325654[email protected]fnpc7013.fnal.govlyon2018-03-06T08:40:542018-03-06T08:46:03fnpc7013.fnal.gov
215203259710.0286.0308.01520325663[email protected]fnpc4528.fnal.govlyon2018-03-06T08:41:032018-03-06T08:46:11fnpc4528.fnal.gov
315203259720.0300.0565.01520325407[email protected]fnpc9080.fnal.govlyon2018-03-06T08:36:472018-03-06T08:46:12fnpc9080.fnal.gov
415203259742.0284.0311.01520325663[email protected]fnpc5028.fnal.govlyon2018-03-06T08:41:032018-03-06T08:46:14fnpc5028.fnal.gov
515203259910.0287.0318.01520325673[email protected]fnpc9029.fnal.govlyon2018-03-06T08:41:132018-03-06T08:46:31fnpc9029.fnal.gov
615203256330.0288.0304.01520325329[email protected]fnpc9035.fnal.govlyon2018-03-06T08:35:292018-03-06T08:40:33fnpc9035.fnal.gov

How many of these jobs landed on a particular node?

In [232]:
allByNode = @> by(allJobDF, :node, nrow) rename!(:x1 => :nJobsOnNode) sort!(cols=order(:nJobsOnNode))  # Sorted
head(allByNode) # Look at the beginning of the list
Out[232]:
nodenJobsOnNode
1fnpc4571.fnal.gov1
2fnpc4560.fnal.gov1
3fnpc6020.fnal.gov1
4fnpc4546.fnal.gov1
5fnpc4531.fnal.gov1
6fnpc9034.fnal.gov1
In [233]:
tail(allByNode)  # Look at the end
Out[233]:
nodenJobsOnNode
1fnpc7309.fnal.gov14
2fnpc8204.fnal.gov14
3fnpc17121.fnal.gov15
4fnpc7307.fnal.gov15
5fnpc17134.fnal.gov16
6fnpc17123.fnal.gov16
In [238]:
@df allByNode histogram(:nJobsOnNode, nbins=20, line=nothing, legend=false, xlab="Number of jobs on node")
Out[238]:
[Plot: histogram of number of jobs on node]
In [140]:
allJobDFN = join(allJobDF, allByNode, on=:node, kind=:left) ; head(allJobDFN)
Out[140]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeJobStartDateJobsubJobIdMachineAttrMachine0OwnerStartedCompletednodenJobsOnNode
115203259630.0286.0309.01520325654[email protected]fnpc7013.fnal.govlyon2018-03-06T08:40:542018-03-06T08:46:03fnpc7013.fnal.gov2
215203259710.0286.0308.01520325663[email protected]fnpc4528.fnal.govlyon2018-03-06T08:41:032018-03-06T08:46:11fnpc4528.fnal.gov3
315203259720.0300.0565.01520325407[email protected]fnpc9080.fnal.govlyon2018-03-06T08:36:472018-03-06T08:46:12fnpc9080.fnal.gov4
415203259742.0284.0311.01520325663[email protected]fnpc5028.fnal.govlyon2018-03-06T08:41:032018-03-06T08:46:14fnpc5028.fnal.gov2
515203259910.0287.0318.01520325673[email protected]fnpc9029.fnal.govlyon2018-03-06T08:41:132018-03-06T08:46:31fnpc9029.fnal.gov2
615203256330.0288.0304.01520325329[email protected]fnpc9035.fnal.govlyon2018-03-06T08:35:292018-03-06T08:40:33fnpc9035.fnal.gov4
In [175]:
allJobDFN = @transform(allJobDFN,
                      usercpu=parse.(Float64, :CumulativeRemoteUserCpu),
                      wallTime=parse.(Float64, :CumulativeSlotTime))   # Convert strings to numbers

allJobDFN = @transform(allJobDFN, eff=(:usercpu)./:wallTime) ; head(allJobDFN)   # Calculate efficiency (good enough)
allJobDFN = @transform(allJobDFN, topJob = map(i -> match(r"[\d]+", i).match, :JobsubJobId) )  # Determine the main job # (before the .)
head(allJobDFN)
Out[175]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeJobStartDateJobsubJobIdMachineAttrMachine0OwnerStartedCompletednodenJobsOnNodeusercpuwallTimeefftopJob
115203259630.0286.0309.01520325654[email protected]fnpc7013.fnal.govlyon2018-03-06T08:40:542018-03-06T08:46:03fnpc7013.fnal.gov2286.0309.00.9255664380943
215203259710.0286.0308.01520325663[email protected]fnpc4528.fnal.govlyon2018-03-06T08:41:032018-03-06T08:46:11fnpc4528.fnal.gov3286.0308.00.9285714380943
315203259720.0300.0565.01520325407[email protected]fnpc9080.fnal.govlyon2018-03-06T08:36:472018-03-06T08:46:12fnpc9080.fnal.gov4300.0565.00.5309734380943
415203259742.0284.0311.01520325663[email protected]fnpc5028.fnal.govlyon2018-03-06T08:41:032018-03-06T08:46:14fnpc5028.fnal.gov2284.0311.00.9131834380943
515203259910.0287.0318.01520325673[email protected]fnpc9029.fnal.govlyon2018-03-06T08:41:132018-03-06T08:46:31fnpc9029.fnal.gov2287.0318.00.9025164380943
615203256330.0288.0304.01520325329[email protected]fnpc9035.fnal.govlyon2018-03-06T08:35:292018-03-06T08:40:33fnpc9035.fnal.gov4288.0304.00.9473684380837

One of these job clusters is my non-taskset job (4380837). Let's remove it for now.

In [182]:
allJobDFN_notaskset = @where allJobDFN :topJob .!= "4380837" ;
In [256]:
allTasksetJobs = @df allJobDFN_notaskset boxplot(:nJobsOnNode, :eff, legend=false, ylim=(0, 1.0),
                                 xlab="Number of taskset g-2 jobs running on a particular node", ylab="efficiency reported by job" )
Out[256]:
[Plot: box plots of efficiency reported by job vs number of taskset g-2 jobs running on a particular node]

This plot shows the efficiency distribution vs. the number of g-2 jobs with taskset running on the node. A clear degradation is seen.
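
A numeric companion to the box plot (a sketch using the data frame defined above):

# Median job-reported efficiency for each value of nJobsOnNode
by(allJobDFN_notaskset, :nJobsOnNode, d -> DataFrame(medianEff = median(d[:eff])))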

In [267]:
allJobDFN_lyon = @where allJobDFN :Owner .== "lyon" ; nrow(allJobDFN_lyon)
twoMainJobs = @df allJobDFN_lyon boxplot(:nJobsOnNode, :eff, ylim=(0, 1.05), group=:topJob, alpha=0.5, legend=:bottomleft, 
                                 xlab="Number of g-2 jobs running on a particular node", ylab="efficiency reported by job" )
Out[267]:
[Plot: box plots of efficiency reported by job vs number of g-2 jobs running on a particular node, grouped by job cluster 4380837 and 4380943]
  • 4380837 (blue) is my non-taskset job
  • 4380943 (brown) is my taskset job

There's a clear difference.

Addendum - Exploring the CPU time differences

Burt Holzman asked me to run longer jobs to further explore the differences between the User CPU time reported by FermiGrid and that reported by the executable. I submitted 200 jobs to run for an hour (without taskset). All jobs should have a User CPU time of 3600 seconds (one hour).

In [4]:
longerJobs = jobInfoMain(5010296,"2018.03.16", "jobsub02.fnal.gov") ; head(longerJobs)
Out[4]:
CompletionDateCumulativeRemoteSysCpuCumulativeRemoteUserCpuCumulativeSlotTimeExitBySignalJobStartDateJobsubJobIdMATCH_GLIDEIN_SiteMachineAttrMachine0OwnerRequestCpusResidentSetSize_RAWRSSsyscpuusercpuwallTimeeff
115211731480.03587.03613.0false1521169535[email protected]FermiGridfnpc7027.fnal.govlyon15285280.03587.03613.00.992804
215211731480.03585.03612.0false1521169536[email protected]FermiGridfnpc7030.fnal.govlyon16726720.03585.03612.00.992525
315211731490.03586.03611.0false1521169538[email protected]FermiGridfnpc5010.fnal.govlyon18408400.03586.03611.00.993077
415211731502.03583.03614.0false1521169536[email protected]FermiGridfnpc5010.fnal.govlyon18168162.03583.03614.00.991976
515211731530.03540.03618.0false1521169535[email protected]FermiGridfnpc6031.fnal.govlyon18448440.03540.03618.00.978441
615211731680.03598.03632.0false1521169536[email protected]FermiGridfnpc9070.fnal.govlyon15245240.03598.03632.00.990639
In [5]:
nrow(longerJobs)
Out[5]:
200
In [9]:
@df longerJobs histogram(:usercpu.-3600, nbins=20, line=nothing, legend=false, xlab="User CPU time (difference from one hour) (s) ")
Out[9]:
[Plot: histogram of User CPU time (difference from one hour) (s), ranging from about -60 s to 0 s]
In [10]:
@df longerJobs histogram(:eff, nbins=20, line=nothing, legend=false, xlab="Efficiency")
Out[10]:
[Plot: histogram of Efficiency, roughly 0.96 to 0.99]

So it looks like the jitter in measuring the User CPU time is on the order of a minute. The impact of this jitter is smaller for long-running jobs.
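
To put that jitter in perspective (a rough sketch with assumed round numbers):

# A fixed ~60 s undercount in User CPU time shifts the reported efficiency by about jitter/wallTime
jitter = 60.0
jitter / 300.0     # ≈ 0.20 for a five-minute job
jitter / 3600.0    # ≈ 0.017 for a one-hour job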

Conclusions and Executive Summary

See above for the detailed analysis.

A job was created that takes 100% of a CPU core and has no I/O. Jobs that ran with taskset -c 0 ... appear to be constrained to core #0 on the node. Therefore, if multiple such jobs land on the same node, they all share that core and experience degraded efficiency. Jobs that were not run with taskset achieved high efficiency. However, the User CPU time reported by FermiGrid was sometimes inaccurate, leading to artificially low efficiency values.

For jobs that ran without taskset, here is the difference between the User CPU time reported by FermiGrid and the same value reported by the executable (measured with the Unix time command and extracted from the log files). Note that jobs that ran much longer (an hour as opposed to five minutes) showed similar jitter in the FermiGrid reported User CPU time; of course, the impact on efficiency decreases the longer the job runs (see the addendum above).

In [253]:
userTimeDiff
Out[253]:
[Plot: histogram of FermiGrid reported User CPU time minus Executable reported User CPU time (s), repeated from above]

Here is the efficiency for jobs not running with taskset as reported by the executable.

In [255]:
nonTasksetExeEff
Out[255]:
[Plot: Non-taskset 4380837; histogram of efficiency as reported by the executable, repeated from above]

Here is a plot of the efficiency of all g-2 jobs running with taskset -c 0 ... against the number of simultaneous taskset jobs running on the node.

In [258]:
allTasksetJobs
Out[258]:
[Plot: box plots of efficiency reported by job vs number of taskset g-2 jobs running on a particular node, repeated from above]

A clear degradation in efficiency is seen with the number of simultaneous jobs running on the node. This suggests that the jobs are all sharing the same core.

We can also compare the efficiencies of jobs run with taskset to jobs that did not use taskset.

In [268]:
twoMainJobs
Out[268]:
[Plot: box plots of efficiency vs number of g-2 jobs per node for clusters 4380837 (non-taskset) and 4380943 (taskset), repeated from above]

The blue box plots (job 4380837) represent the non-taskset jobs and the brown box plots (job 4380943) represent the taskset jobs. The difference is very clear.

Running many jobs with taskset leads to degraded efficiency. Muon g-2 will cease using taskset as quickly as possible. This analysis also indicates that it is possible for one user's jobs to affect the efficiency of jobs from other owners.

Back matter

In [307]:
versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
In [222]:
# Set up the notebook style
function css_styling()
    styles = open("custom.css") do f
        readstring(f)
    end
end
display(HTML(css_styling()))