Binpacking SQS batches

Posted on May 14, 2020 by wjwh


The other day an ex-colleague messaged me with a question about efficiently packing SQS messages into batches. He knows I am easily nerd-sniped by these kinds of problems and passed it on to me. The system in question is a background job processing library for Ruby that uses AWS SQS as the job queue. Using SQS has several scaling benefits, since it can process a near infinite number of messages per second and has tremendous storage capacity. There have been occasions where a queue contained millions of unprocessed messages without any problems. However, scaling horizontally is usually not free and SQS is not without its limitations. One such limitation is that you can submit no more than 10 messages per request, and the maximum allowed size, both per individual message and per batch of 10, is 256 KB.

A lot of the background jobs we had could generate additional jobs. Since you pay per request for SQS, it makes sense to save up messages until the end of the job, so you have the maximum opportunity to group jobs into batches of ten. If you only have small messages this is easy: you won't go over the 256 KB limit and can just use the each_slice method to split the array of jobs. However, when some messages are very large and others are quite small it becomes much more problematic. If you are unlucky and too many big messages end up in the same batch, the size limit is exceeded and the request fails. You want a way to distribute the messages across batches such that the total number of batches is minimized and no batch is larger than 256 KB. This problem comes up in many different contexts (for example, fitting the maximum number of virtual machines onto a set of physical servers, and various logistics problems) and is known as the bin packing problem.
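For the all-small-messages case, a minimal sketch of that each_slice approach could look something like the following. This is not taken from the library in question; the queue URL and job payloads are made up for illustration, and it assumes the aws-sdk-sqs gem:

require "aws-sdk-sqs"
require "json"

# Hypothetical queue URL and job payloads, purely for illustration
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/background-jobs"
jobs = [{ worker: "EmailWorker", args: [42] }] * 25

sqs = Aws::SQS::Client.new(region: "eu-west-1")

jobs.each_slice(10) do |slice|
  # every entry in a batch request needs an id that is unique within the request
  entries = slice.each_with_index.map do |job, i|
    { id: i.to_s, message_body: job.to_json }
  end
  sqs.send_message_batch(queue_url: QUEUE_URL, entries: entries)
end

As long as each serialized job stays small, every slice of 10 is guaranteed to stay under the 256 KB request limit and this is all you need.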

Bin packing is NP-hard to solve optimally. However, you can use both greedy algorithms to get a solution quickly (even though it might not be an optimal solution) or you can try to cover the search space efficiently with various combinations of branch & bound algorithms. In this post we take a quick look at several ways to tackle this problem.

Problem setup

Say we have an array of messages to be submitted of the following sizes: 2 messages of 200 KB, 5 messages of 100 KB, 5 messages of 50 KB, 5 messages of 5 KB, 5 messages of 2 KB and 10 messages of 1 KB, for 32 messages in total.

From this we can immediately derive some upper and lower bounds for how many batches we need. In the least optimal case we simply use batches of one message each. In that case we can be sure no batch is larger than 256 KB, but we'd need 10 + 5 + 5 + 5 + 5 + 2 = 32 batches. Based on the 10-message limit, the best we could hope for is a solution where every batch but the last contains the full 10 messages, giving ceil(32 / 10) = 4 batches. The total size of all messages is 200*2 + 100*5 + 50*5 + 5*5 + 2*5 + 1*10 = 1195 KB, so based on total message size we need at least ceil(1195 / 256) = 5 batches. The actual minimum is the bigger of these two numbers, so 5 batches. Note that we don't know yet whether this is actually achievable (perhaps the message sizes interact in just the wrong way), but we definitely know we can't go lower than that.
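These bounds are cheap to compute up front. A quick sketch in Ruby, mirroring the message sizes we will use throughout the post:

batch_size_limit = 256   # KB per request
batch_length_limit = 10  # messages per request

message_sizes = [[200]*2, [100]*5, [50]*5, [5]*5, [2]*5, [1]*10].flatten

count_bound = (message_sizes.length.to_f / batch_length_limit).ceil  # => 4
size_bound  = (message_sizes.sum.to_f / batch_size_limit).ceil       # => 5
lower_bound = [count_bound, size_bound].max                          # => 5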

Simplest algorithm

The most straightforward way to approach this would be to simply start putting messages into a batch until the batch contains 10 messages or until adding the next message would overflow the batch:

BATCH_SIZE_LIMIT = 256 # batch size limit of SQS in kilobytes
BATCH_LENGTH_LIMIT = 10 # max number of messages per request to SQS

# Usually this would be filled with structs or classes, but for simplicity
# we'll just use some integers
message_size_array = [[200]*2, [100]*5, [50]*5, [5]*5, [2]*5, [1]*10].flatten

def split_into_batches_simple(message_size_array)
  current_batch = []
  current_batch_size = 0
  batches_so_far = []

  until message_size_array.empty?
    current_message_size = message_size_array.last
    if current_batch_size + current_message_size > BATCH_SIZE_LIMIT || 
         current_batch.length >= BATCH_LENGTH_LIMIT
      # time for a new batch!
      batches_so_far.push(current_batch)
      current_batch = []
      current_batch_size = 0
    else
      # add the message to this batch
      current_batch_size += current_message_size
      current_batch.push(message_size_array.pop)
    end
  end

  batches_so_far.push(current_batch)
  return batches_so_far
end

Testing this with split_into_batches_simple(message_size_array) gives a result of [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [2, 2, 2, 2, 2, 5, 5, 5, 5, 5], [50, 50, 50, 50, 50], [100, 100], [100, 100], [100], [200], [200]], so it uses 8 batches in total, well above the lower bound of 5 we computed earlier. The batches of small messages hit the 10-message limit long before they come anywhere near 256 KB, while the batches of bigger messages get close to the size limit but only contain one or two messages each. In a better solution, we could pack some of the smaller messages in with the bigger ones and save some batches.

Greedy algorithm

A greedy algorithm follows some heuristic that is cheaper to compute than solving the entire problem exactly, in the hope of finding “good” solutions even though they may not be the absolute best achievable. For bin packing a good heuristic is trying to fit the biggest elements first, since they are the most difficult to fit into a batch. This is formally known as the “first fit decreasing” heuristic. A good way to visualize this intuitively is a collection of rocks, gravel and sand that you want to fit into pots: if you put the biggest rocks in first, you can pour the sand in later and it will fill up the voids between the rocks. However, if you put the sand in first, the space left over might not be able to fit all the rocks.

Using the same constants as above:

def split_into_batches_greedy(message_size_array)
  current_batch = []
  current_batch_size = 0
  batches_so_far = []

  reverse_sorted_message_size_array = message_size_array.sort.reverse

  until reverse_sorted_message_size_array.empty?
    index = 0
    while index < reverse_sorted_message_size_array.length
      # add the message to this batch if it fits and remove it
      # from the available list
      new_size = current_batch_size + reverse_sorted_message_size_array[index]
      if new_size <= BATCH_SIZE_LIMIT
        current_batch_size += reverse_sorted_message_size_array[index]
        current_batch.push(reverse_sorted_message_size_array[index])
        reverse_sorted_message_size_array.slice!(index)
        index -= 1 # otherwise you skip over elements
      end

      if current_batch.length == BATCH_LENGTH_LIMIT || 
          current_batch_size == BATCH_SIZE_LIMIT
        # time for a new batch!
        batches_so_far.push(current_batch)
        current_batch = []
        current_batch_size = 0
        index = 0
        next
      end
      index += 1
    end

    # We're at the end of the array and there is nothing more to add, new batch!
    batches_so_far.push(current_batch)
    current_batch = []
    current_batch_size = 0
    index = 0
  end

  return batches_so_far
end

Running this gives an output of [[200, 50, 5, 1], [200, 50, 5, 1], [100, 100, 50, 5, 1], [100, 100, 50, 5, 1], [100, 50, 5, 2, 2, 2, 2, 2, 1, 1], [1, 1, 1, 1]], for a total of six batches. This is a significant improvement over the 8 batches needed with the previous algorithm. You can see that the first batches are now packed full until they contain exactly 256 KB. The second-to-last batch is filled with 10 messages and there are a few small messages left over that get stuffed into the final batch.

This outcome is already much better than the naive algorithm, but we can do better. The first four batches are full by size but not by message count, while the fifth is full by message count but not by size. If we swap the 5 KB message in the fourth batch for five of the leftover 1 KB messages (keeping that batch at exactly 256 KB), the displaced 5 KB message fits into the slot that frees up in the fifth batch, the final batch disappears entirely, and we could make do with only 5 batches.
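As a quick sanity check, we can verify that this hand-built assignment respects both limits and uses every message exactly once. The batch contents below are just the result of the swap described above, spelled out:

improved_batches = [
  [200, 50, 5, 1],
  [200, 50, 5, 1],
  [100, 100, 50, 5, 1],
  [100, 100, 50, 1, 1, 1, 1, 1, 1],
  [100, 50, 5, 5, 2, 2, 2, 2, 2, 1]
]

all_messages = [[200]*2, [100]*5, [50]*5, [5]*5, [2]*5, [1]*10].flatten

improved_batches.all? do |batch|
  batch.sum <= BATCH_SIZE_LIMIT && batch.length <= BATCH_LENGTH_LIMIT
end
# => true

improved_batches.flatten.sort == all_messages.sort
# => true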

For this example we only cared about the outcome and not about the efficiency of the data structures used. Please don’t delete from the middle of an array in production.

Knowing there is a better solution is fine, but manually looking at the data and trying to find improvements is not a scalable approach. To provably find the optimal solution we can’t escape searching through every possible combination of assignments. Since each message has five options (one for each of the 5 batches we are aiming for), the search space grows rapidly as more messages are added. For our relatively small example with 32 messages, there would already be 5^32 = 23283064365386962890625 different possibilities. Luckily, there are a lot of opportunities to “prune” this search space so we don’t have to search through quite as many options. For example, any partial assignment that puts both 200 KB messages in the same batch is already invalid, so it is not necessary to look at how the remaining messages would be assigned; we know it will be invalid no matter what. Let’s use depth-first search to find a feasible solution:

NUMBER_OF_BATCHES = 5

def split_into_batches_dfs(assignments, message_size_array, sizes_available)
  # if the size has been exceeded this assignment and all
  # recursing from it are infeasible
  return nil if sizes_available.any? {|element| element < 0 }
  # if any bin contains more than BATCH_LENGTH_LIMIT elements this assignment
  # and all recursing from it are infeasible
  return nil if assignments.values.any? {|ary| ary.length > BATCH_LENGTH_LIMIT }
  return assignments if message_size_array.empty?
  
  newly_assigned_element = message_size_array.last
  resulting_message_size_array = message_size_array[0..-2]

  # Try assigning the message to each batch in turn and return the first
  # complete valid assignment we find
  (0..(NUMBER_OF_BATCHES-1)).each do |bin_assignment|
    new_assignments = assignments.dup
    new_assignments[bin_assignment] = new_assignments[bin_assignment]
                                       + [newly_assigned_element]
    new_sizes_available = sizes_available.dup
    new_sizes_available[bin_assignment] -= newly_assigned_element 
    result = split_into_batches_dfs(
      new_assignments, 
      resulting_message_size_array, 
      new_sizes_available
    )
    # result is nil only if no good assignment was found
    return result unless result.nil?
  end

  return nil
end

Running this with split_into_batches_dfs(Hash.new([]), message_size_array.sort, [BATCH_SIZE_LIMIT] * NUMBER_OF_BATCHES) gives an output of {0=>[200, 50, 5, 1], 1=>[200, 50, 5, 1], 2=>[100, 100, 50, 5, 1], 3=>[100, 100, 50, 1, 1, 1, 1, 1, 1], 4=>[100, 50, 5, 5, 2, 2, 2, 2, 2, 1]}. This happens to be exactly the assignment we constructed by hand at the end of the greedy section, so 5 batches is indeed achievable. One downside is that we usually do not know in advance how many batches we should aim for. You could start at the lower bound and try successively higher values until a feasible solution is found, but this is likely to be quite slow.
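If we tweak split_into_batches_dfs to iterate over sizes_available.length batches instead of the hard-coded NUMBER_OF_BATCHES, such an incremental search could look something like the sketch below. This is only an illustration of the idea, not something benchmarked for this post:

def split_into_minimal_batches(message_size_array)
  # reuse the lower-bound calculation from earlier in the post
  count_bound = (message_size_array.length.to_f / BATCH_LENGTH_LIMIT).ceil
  size_bound  = (message_size_array.sum.to_f / BATCH_SIZE_LIMIT).ceil
  lower_bound = [count_bound, size_bound].max

  # worst case we fall back to one message per batch
  (lower_bound..message_size_array.length).each do |number_of_batches|
    result = split_into_batches_dfs(
      Hash.new([]),
      message_size_array.sort,
      [BATCH_SIZE_LIMIT] * number_of_batches
    )
    return result unless result.nil?
  end
end

Because the batch counts are tried in increasing order starting at the lower bound, the first feasible assignment found is also one with the minimal number of batches.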

There are several other approaches we could try if this were not fast enough. The problem has a lot of symmetry: it does not matter which of the 50 KB messages ends up in which batch; SQS does not mind either way. We could also try various combinations of branch & bound to prune the search space further, by estimating the best possible solution reachable from the current partial assignment.
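As an illustration of what such pruning could look like, here are two small helpers that could be wired into split_into_batches_dfs: return nil early whenever prune? is true, and branch over candidate_bins(sizes_available) instead of over every batch. The helper names and the exact form are my own sketch, not something measured for this post:

def prune?(message_size_array, sizes_available)
  # bound: if the messages still to be assigned are together larger than the
  # total capacity left across all batches, no completion can be feasible
  message_size_array.sum > sizes_available.sum
end

def candidate_bins(sizes_available)
  # symmetry: batches with identical remaining capacity are interchangeable,
  # so it is enough to branch into one batch from each group of identical ones
  (0...sizes_available.length).to_a.uniq { |i| sizes_available[i] }
end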

Conclusion

Bin packing is a fascinating problem with huge practical applications. For this reason it has occupied computer scientists for decades and is still an area of active research, and there are a great many sophisticated algorithms that can be applied. The problem remains NP-hard of course, which means that provably optimal solutions are often too expensive to compute: the greedy approach takes only 0.085 seconds on my laptop (even un-optimized), while the tree search approach takes a whopping 1.646 seconds (although it too is not very optimized). Whether that is worth it depends on your use case; CPU time is not free either. These times only diverge further as the problem size grows, since the greedy algorithm is roughly O(number of messages * number of bins) while the tree search takes time exponential in the number of messages.

If you are interested in these types of problems and are looking for an introductory course, I can recommend the Discrete Optimisation course on Coursera; it covers all these algorithms in depth, and many more. If you have one of these problems on your hands and want help, you can also hire me as a consultant. Finally, whatever you do today, don’t forget to make some time for fun.