Innocent-looking array_unique – 2 stories of performance hogs in Shopware 6 and Tideways' own backend codebase

This post tells the story of a performance mistake that even experienced developers make quickly: expecting a built-in PHP function to outperform a better-suited data structure written in PHP.

The protagonist in these stories is array_unique, a function that takes a list of values and removes duplicate entries, returning a list in which each value occurs only once.

Take these two code examples, the first from Shopware 6.3:

$this->tags = array_unique(array_merge($this->tags, array_values($tags)));

The second one from our own Tideways backend code:

$versions[$row['version']]['organizations'][] = $row['organization_id'];
$versions[$row['version']]['unique'] = count(array_unique(
    $versions[$row['version']]['organizations']
));

At first glance these snippets look innocent: append new values to an array and then make sure they aren’t duplicates of previously known values by applying array_unique.

With the help of Tideways Profiler, we found both are executed in loops and perform the unique operation many thousands of times.

The algorithmic complexity of array_unique was improved in PHP 7.2, but it is still “just” O(n), so for larger arrays the operation can take a while. Here is roughly what array_unique would look like if it were written in PHP:

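// Simplified illustration of the default behaviour (values compared as strings).
// Rename or namespace it to run it, since array_unique() is already built in.
function array_unique(array $values) {
    $seen = [];
    $uniqueValues = [];

    foreach ($values as $key => $value) {
        // Hash lookup: has this value been seen before?
        if (!isset($seen[$value])) {
            // Like the built-in, keep the key of the first occurrence.
            $uniqueValues[$key] = $value;
            $seen[$value] = true;
        }
    }

    return $uniqueValues;
}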
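
To get a feeling for what calling array_unique inside a loop costs, here is a minimal sketch that counts how many elements array_unique has to walk when it runs after every merge, compared with running once at the end. The function names and the sample data (1,000 batches with one distinct tag each) are made up for illustration and are not taken from Shopware or Tideways.

```php
<?php
// Hypothetical illustration: count the elements array_unique() has to walk.

/** Dedupe after every batch, as in the two snippets at the top of the post. */
function scannedWhenDedupingEachBatch(array $batches): int
{
    $tags = [];
    $scanned = 0;
    foreach ($batches as $batch) {
        $merged = array_merge($tags, array_values($batch));
        $scanned += count($merged); // array_unique() visits every merged element
        $tags = array_unique($merged);
    }
    return $scanned;
}

/** Append everything first, then dedupe a single time. */
function scannedWhenDedupingOnce(array $batches): int
{
    $tags = [];
    foreach ($batches as $batch) {
        foreach ($batch as $tag) {
            $tags[] = $tag;
        }
    }
    $scanned = count($tags); // the one and only array_unique() pass
    $tags = array_unique($tags);
    return $scanned;
}

// Made-up input: 1,000 batches, each adding one new tag.
$batches = [];
for ($i = 0; $i < 1000; $i++) {
    $batches[] = ["tag-$i"];
}

echo scannedWhenDedupingEachBatch($batches), "\n"; // 500500
echo scannedWhenDedupingOnce($batches), "\n";      // 1000
```

Each individual call is linear, but the total work grows quadratically with the number of loop iterations once the call sits inside the loop.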

array_unique itself is an efficient enough algorithm, but only if you don’t repeat the operation over and over again. There are two solutions to this performance problem:

1. Calculate array_unique only once at the end of the loop, while appending duplicates to the array during the loop. This strategy is simple, but might cause memory problems if the array is very large or the elements are long strings.
2. Incrementally calculate uniqueness by keeping the array of $seen values around over all merges of new elements.
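
The first strategy could look roughly like the sketch below. The $batches sample data and the variable names are assumptions for illustration; they stand in for whatever values each loop iteration produces.

```php
<?php
// Sketch of the first strategy: collect with duplicates, dedupe once.
// $batches is made-up sample data.
$batches = [['a', 'b'], ['b', 'c'], ['a', 'c', 'd']];

$all = [];
foreach ($batches as $batch) {
    foreach ($batch as $value) {
        $all[] = $value; // duplicates are allowed while looping
    }
}

// array_unique() preserves keys, so reindex the result with array_values().
$unique = array_values(array_unique($all));

print_r($unique); // a, b, c, d
```

Note that $all holds seven entries here to produce four unique values, which is the memory trade-off mentioned above.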

The second solution is easy to implement with PHP arrays by storing the values as keys instead of appending them as values. Technically, this means using the $seen array from the code example above as the data structure.

Both code examples from the beginning of the post can be rewritten by storing the values as array keys:

foreach ($tags as $tag) {
    $this->tags[$tag] = true;
}

The list of unique tags is then fetched using array_keys($this->tags) later. We found that this change was already made from Shopware 6.3 to 6.4 in the following commit.
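
One general PHP detail is worth keeping in mind when reading values back with array_keys: keys that look like decimal integers are stored as integers. The sketch below uses a hypothetical $tagSet with made-up tag names, mirroring the loop above, and casts the keys back to strings where that matters.

```php
<?php
// Hypothetical tag set using array keys; the tag names are made up.
$tagSet = [];
foreach (['product-1', 'category-7', 'product-1', '42'] as $tag) {
    $tagSet[$tag] = true; // a repeated tag just overwrites its own key
}

$tags = array_keys($tagSet);
// The numeric string '42' comes back as int 42.

// Cast back if the consumer expects strings only.
$tagsAsStrings = array_map('strval', $tags);

var_dump($tags, $tagsAsStrings);
```

If all values are non-numeric strings, as tag names usually are, the cast is unnecessary.
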
We rewrote our own Tideways backend code to make use of the same approach, using array keys:

$versions[$row['version']]['organizations'][$row['organization_id']] = true;
$versions[$row['version']]['unique'] = count(
    $versions[$row['version']]['organizations']
);

This benefits from the O(1) average complexity of adding an element to a PHP array (a hash map), and array_unique is no longer needed.
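
Here is a small self-contained variation of that idea with made-up sample rows, since the real query and data are not shown in this post. As a further assumption, it moves the count out of the loop, which is not necessarily how the Tideways backend does it.

```php
<?php
// Made-up sample rows standing in for the real result set.
$rows = [
    ['version' => '2021.1', 'organization_id' => 10],
    ['version' => '2021.1', 'organization_id' => 11],
    ['version' => '2021.1', 'organization_id' => 10],
    ['version' => '2021.2', 'organization_id' => 10],
];

$versions = [];
foreach ($rows as $row) {
    // Assigning an existing key overwrites it, so duplicates never accumulate.
    $versions[$row['version']]['organizations'][$row['organization_id']] = true;
}

// The unique count only needs to be computed once per version.
foreach ($versions as $version => $data) {
    $versions[$version]['unique'] = count($data['organizations']);
}

echo $versions['2021.1']['unique'], "\n"; // 2
echo $versions['2021.2']['unique'], "\n"; // 1
```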

Looking at a comparison of both optimizations in Tideways Callgraph Profiler shows how massive the gains are for these changes.

In Shopware 6.3, we replaced the old CacheTagCollection code with the code from 6.4, and the Product Export improved from 7.3 minutes to 51 seconds. That is an improvement of 88%, of which 4.08 minutes (54%) can be attributed to the removal of array_unique and another 40 seconds to the use of array_merge in the same line. Conclusion: Upgrade to Shopware 6.4!

In the case of Tideways' own backend code, removing array_unique cut 925 ms from a request that originally took 1.87 seconds, roughly a 50% improvement.

And here is the moral of our story: array_unique should generally be called for any array only once, not over and over again.

Intrigued by PHP performance story-telling? We are currently looking for a Developer Advocate to write about bottlenecks just like this.

Benjamin 25.05.2021