By: Alex Geenen

Alex Geenen — Tue, 06 Jul 2021 13:44:38 +0000

In reply to Fabio Mencoboni. Hi Fabio,

"If I understand correctly, this approach is using the DistilBERT model in Python to calculate embeddings for documents which are then stored in ArangoDB."

Yes, that's correct!

"I have seen elsewhere the use of ArangoSearch, which I think did tokenization and embedding directly in the database. Do I understand the difference between these approaches correctly?"

Yes, ArangoSearch lets you perform tokenization and full-text search directly in the database. At this point, word embeddings aren't directly supported, which is the gap this tutorial fills. ArangoSearch does support vector space models such as BM25 and TF-IDF for scoring search results. Please see here if you want to learn more about them.

"The query uses the expression below to calculate the dot product of the query embedding and the document embedding. This implies a slower single-thread approach, though if ArangoDB is calculating this value for multiple documents concurrently under the hood it would still get the benefit of multi-core processors. Any thoughts/comments on performance?"

Great question! The answer is that it depends. If you're querying a single server, the query uses a sequential scan, so a single thread. If you're querying a collection on a cluster and the collection is sharded across different servers, the shards are processed concurrently at the database-server level, but within each server process the documents are still scanned sequentially.
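To make the shape of that answer concrete, here is a minimal Python sketch. It is not ArangoDB's implementation: `scan_shard`, `scan_cluster`, and the document layout with `_key` and `word_emb` fields are assumptions made for illustration, and the worker threads only stand in for separate database servers. Each shard is walked one document at a time, and the only concurrency is across shards.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def scan_shard(query_emb, shard):
    """Sequential scan: score the documents of one shard one after another."""
    return [(doc["_key"], dot(query_emb, doc["word_emb"])) for doc in shard]


def scan_cluster(query_emb, shards):
    """Scan every shard in its own worker (hypothetical stand-in for a
    database server), then merge the results and sort by score."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        per_shard = list(pool.map(lambda s: scan_shard(query_emb, s), shards))
    scored = [hit for hits in per_shard for hit in hits]
    return sorted(scored, key=lambda hit: hit[1], reverse=True)
```

With a single shard this reduces to one sequential pass, which corresponds to the single-server case. The sketch shows structure only and says nothing about real throughput.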

By: Fabio Mencoboni

Fabio Mencoboni — Fri, 02 Jul 2021 12:24:49 +0000

Very cool tutorial, thanks for sharing. I am really excited about using ArangoDB with semantic queries, and this is a great overview. A couple of questions:
* If I understand correctly, this approach is using the DistilBERT model in Python to calculate embeddings for documents which are then stored in ArangoDB.
* I have seen elsewhere the use of ArangoSearch, which I think did tokenization and embedding directly in the database. Do I understand the difference between these approaches correctly?
* The query uses the expression below to calculate the dot product of the query embedding and the document embedding. This implies a slower single-thread approach, though if ArangoDB is calculating this value for multiple documents concurrently under the hood it would still get the benefit of multi-core processors. Any thoughts/comments on performance?
    LET numerator = (SUM(
        FOR i IN RANGE(0, 767)
            RETURN TO_NUMBER(NTH(descr_emb, i)) * TO_NUMBER(NTH(v.word_emb, i))
    ))
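
For readers more at home in Python, the expression above can be read roughly as the sketch below. `RANGE(0, 767)` is inclusive in AQL, so the loop covers 768 components. The name `numerator` suggests the value is later divided by the vector norms to get a cosine similarity, but that is an assumption here, and `cosine_similarity` is a hypothetical helper rather than something taken from the tutorial.

```python
import math

EMBEDDING_DIM = 768  # RANGE(0, 767) is inclusive, so 768 components


def dot_product(query_emb, doc_emb):
    """Rough equivalent of the AQL loop: multiply matching components, then sum."""
    return sum(float(query_emb[i]) * float(doc_emb[i]) for i in range(EMBEDDING_DIM))


def cosine_similarity(query_emb, doc_emb):
    """Assumed use of the numerator: divide it by the product of the two norms."""
    numerator = dot_product(query_emb, doc_emb)
    denominator = (math.sqrt(dot_product(query_emb, query_emb))
                   * math.sqrt(dot_product(doc_emb, doc_emb)))
    return numerator / denominator if denominator else 0.0
```

One difference worth noting: AQL's `TO_NUMBER` turns non-numeric input into 0, whereas `float()` raises an error, so the sketch is stricter about malformed embeddings.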

Comments on: Word Embeddings in ArangoDB
