Comments on: Efficient Data Collection with Hash Tables: ArangoDB Insights https://arangodb.com/2015/04/collecting-with-a-hash-table/ The database for graph and beyond Wed, 26 Jun 2024 07:38:46 +0000 hourly 1 https://wordpress.org/?v=6.7.1 By: jsteemann https://arangodb.com/2015/04/collecting-with-a-hash-table/#comment-602 Tue, 02 Jun 2015 06:51:00 +0000 http://www.arangodb.com/?p=7420#comment-602 In reply to CoDEmanX.

Thanks for the feedback. A 120x improvement looks good! Props also go to @weinberger for suggesting the hash collect method at all!

]]>
By: CoDEmanX https://arangodb.com/2015/04/collecting-with-a-hash-table/#comment-601 Mon, 01 Jun 2015 21:24:00 +0000 http://www.arangodb.com/?p=7420#comment-601 In reply to jsteemann.

I just re-ran my script to aggregate all fields with a minor change to the AQL query to utilize WITH COUNT INTO and on a slightly different data source (negligible though), and it’s 120x faster!!! (from 8 hours down to 4 minutes). I did/do not use indexes, since I would need them on every attribute on all hundred collections. This is an awesome improvement. Thanks, Jan!

]]>
By: CoDEmanX https://arangodb.com/2015/04/collecting-with-a-hash-table/#comment-600 Fri, 08 May 2015 15:45:00 +0000 http://www.arangodb.com/?p=7420#comment-600 In reply to jsteemann.

Will there be a 2.6 beta build for windows anytime soon? I’d like to test it sooner than later, but I’m a bit afraid of the windows compilation process. Linux builds are way easier, but would require the source data to be available inside the vm if I’m not mistaken.

]]>
By: jsteemann https://arangodb.com/2015/04/collecting-with-a-hash-table/#comment-599 Fri, 08 May 2015 14:19:00 +0000 http://www.arangodb.com/?p=7420#comment-599 In reply to CoDEmanX.

Fingers crossed! Please let us know if performance of your queries improves with ArangoDB 2.6. Thanks!

]]>
By: CoDEmanX https://arangodb.com/2015/04/collecting-with-a-hash-table/#comment-598 Tue, 05 May 2015 08:54:00 +0000 http://www.arangodb.com/?p=7420#comment-598 In reply to jsteemann.

Thanks Jan for the in-depth explanation! I should definitely see an improved performance in my case, because there were no indexes at all.

I aggregated all fields taken from a RDBMS, so flat documents with a limited number of fields per collection, but without indexes, since the generation of them would had taken a lot of time and memory (and a script to set them all up, >10k). The time consumption was still acceptable, but should be a matter of seconds instead of hours now.

]]>
By: jsteemann https://arangodb.com/2015/04/collecting-with-a-hash-table/#comment-597 Mon, 27 Apr 2015 09:24:00 +0000 http://www.arangodb.com/?p=7420#comment-597 In reply to CoDEmanX.

It is definitely possible though not guaranteed. I’ll try to explain when to expect a speedup and when not.

I understood the goal is to sort by the count values. Then the COLLECT will require a post-SORT by count in the AQL query. This SORT cannot be optimized away and it cannot use any index, because its input values are generated during the grouping phase of the `COLLECT`.
However, if the input values to the COLLECT repeat a lot (as you mentioned), the output of the COLLECT should be much smaller than its input. And the smaller the output of the COLLECT is, the less expensive a SORT on it will become.

That means if your query is something like `FOR doc IN collection COLLECT value = doc.value WITH COUNT INTO count SORT count DESC RETURN { value: value, count: count }` and there are many repeating values in `doc.value` and no sorted index on this attribute, the query can employ the hash table collect and run faster than before, when it required the input of the COLLECT to be sorted by `doc.value`. This sorting would have been the most expensive part of the query, and the goal of the hash table collect is to avoid it. So yes, here you should see a speedup.

When there is already a sorted index on the group criterion (i.e. `doc.value` in the previous query), then the optimizer will likely use it to avoid the pre-COLLECT sorting altogether. In this case, it won’t use the hash table collect and query execution time shouldn’t change too much (at least not directly related to COLLECT and SORT, we have applied some other optimizations elsewhere that might still speed it up).

Finally, if after a COLLECT there are as much output values as there are input values, then sorting by the final count values will still be expensive (though unavoidable). But you might still save the pre-COLLECT sort due to the optimizer using the hash table version of COLLECT if there is no index on the group criterion.

]]>
By: CoDEmanX https://arangodb.com/2015/04/collecting-with-a-hash-table/#comment-596 Fri, 24 Apr 2015 14:08:00 +0000 http://www.arangodb.com/?p=7420#comment-596 Will there be a performance gain if you use WITH COUNT INTO and sort by count on repeating numbers and strings?

]]>