I know that parallelizing over that many compute units is very tricky, but 50 to 62% utilization doesn't seem very good. With those numbers, 38 to 50% of the money you are putting in is getting wasted.
What I don't know, however, are the utilization numbers of prior approaches. Do you (or the paper) have these numbers? Some reference numbers from prior work would be a great addition to the article.
Thanks :)