Incomplete list of jobs leads to jobs that do not get submitted
Created by: sc-jumper
I have a program that submits a number of jobs via JobManagerSGE.submit.
Each job is relatively small, but each one requires slightly different parameters. Rather than create an array job, I am just submitting hundreds of smaller jobs. Although I submit 300 jobs, only about 295 of them complete.
The standard output shows...

    'jman'> | all.q : failure (70) -- ... ' was not executed successfully (maybe a time-out happened). Please check the log files
I check the logs for the failed job and I see:

    File "/opt/gridengine/ots/spool/chl-compute1237-ib0/job_scripts/6965", line 29, in <module>
      sys.exit(gridtk.script.jman.main())
    File ".../gridtk/gridtk/script/jman.py", line 381, in main
      args.func(args)
    File ".../gridtk/gridtk/script/jman.py", line 224, in run_job
      jm.run_job(job_id, array_id)
    File ".../gridtk/gridtk/sge.py", line 179, in run_job
      raise ValueError("Could not find job id '%d' in the database'" % job_id)
    ValueError: Could not find job id '12345' in the database'
After receiving the above error, I went back and queried the sqlite database directly, and that job id, 12345, is there.
I added a "sleep" immediately following job = add_job(self.session... in gridtk.gridtk.sge.py:

    import time
    time.sleep(1)
This corrected the issue. I was not able to figure out why the database was not correctly adding the job id "in time" for the qsub command to execute, or whether my sleep hack impacted something else. After adding the sleep, every job gets submitted (slowly) and every job completes.
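For comparison, one possible alternative to a fixed one-second sleep is to poll the database until the new row is actually visible, and give up after a timeout. The following is only a minimal sketch using the standard sqlite3 module, not gridtk's code: the function name wait_for_job, the table name jobs and its id column are hypothetical and do not reflect gridtk's real schema. It also assumes, without confirmation, that the failure is a visibility race (the job starts before the row has been committed).

```python
import sqlite3
import time


def wait_for_job(db_path, job_id, timeout=10.0, interval=0.2):
    """Poll until a row with this job id is visible, or the timeout expires.

    Returns True if the row was found, False otherwise.
    NOTE: the table name 'jobs' and column 'id' are hypothetical.
    """
    deadline = time.monotonic() + timeout
    while True:
        # Open a fresh connection per attempt so each read sees the
        # latest committed state of the database file.
        conn = sqlite3.connect(db_path)
        try:
            row = conn.execute(
                "SELECT 1 FROM jobs WHERE id = ?", (job_id,)
            ).fetchone()
        finally:
            conn.close()
        if row is not None:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

With something like this, the wait ends as soon as the row appears, so the slowdown is only paid when the row is genuinely late. In sqlite, a row inserted on one connection is not visible to other connections until that connection commits, so checking that the insert is committed before qsub runs may also be worth a look.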