Announcing our new article from Jack Morris!

Article Preview:

… Now imagine you’re a software engineer building a RAG system for your company. You decide to store your vectors in a vector database. You notice that what the vector database stores are embedding vectors, not the text data itself. It fills up with rows and rows of random-seeming numbers that represent text data, but it never ‘sees’ any text data at all. You know that the text corresponds to customer documents that are protected by your company’s privacy policy.

But you’re never really sending the text off-premises; you only ever send the vectors, which look like random numbers. What if someone hacks into the database and gains access to all your text embedding vectors? Would this be bad? Or if the service provider wanted to sell your data to advertisers, could they? Both scenarios involve being able to take embedding vectors and somehow invert them back to text.

The problem of recovering text from embeddings is exactly the scenario we tackle in our paper Text Embeddings Reveal (Almost) As Much As Text (EMNLP 2023). Are embedding vectors a secure format for information storage and communication? Put simply: can input text be recovered from output embeddings?


Read More in The Gradient