Data Machina #202 – Data Machina

Generative AI and Vector DBs. Vector DBs have been around for years and have been used in search, NLP and ML applications. But now the boom in LLMs has spawned a new wave of vector DBs. A typical use case is augmenting an LLM with a vector DB for “search”. This sounds simple in theory, but in practice it is not. Accuracy, latency, similarity, scale, and yes… hallucinations are still issues with LLMs. I think it helps to know how vector DBs work before integrating them with an LLM.

A refresher on vector DBs. There is a tendency to lump search engines, search libraries, similarity search, vector search, and vector DBs into the same bucket. Maybe it would help to have a refresher on vector DBs:

A comprehensive guide to vector databases. A nice introduction to vector DBs, basic use cases, and some, but not all, of the features a vector DB requires.

Vector Databases – A Complete Introduction. A great post that describes the differences between vector DBs and vector search libraries, the benefits of vector DBs, and some of the technical challenges.

Not all vector DBs are equal. There are many flavors of vector DBs, and some of them differ in subtle but important key features. Here is a list of what are, imo, the most popular vector DBs:

  • Pinecone: Vector DB for building high-performance vector search applications

  • Weaviate: AI-native vector DB

  • Milvus: imo the “best” open source vector DB for scalable, fast similarity search

  • Chroma: AI-native open source embeddings DB

  • Vespa: Vector, lexical, and structured data retrieval, all in the same query

  • Qdrant: Advanced, high-performance vector similarity search

  • pgvector: Open source vector similarity search for PostgreSQL

  • Vald: A highly scalable distributed vector search engine

Want to learn more? Read this detailed comparison of Milvus, Pinecone, Vespa, Weaviate, Vald, GSI and Qdrant.

Similarity search, kNN and ANN algorithms. A key feature of vector DBs is similarity search, which finds the k vectors closest to the query vector as measured by a similarity metric. But exact kNN is computationally expensive because it compares the query against every stored vector, which is why vector DBs rely on approximate nearest neighbor (ANN) algorithms. eBay has developed a powerful billion-scale vector similarity engine which uses two ANN algos, HNSW and ScaNN.
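To make the cost of exact kNN concrete, here is a minimal brute-force sketch in NumPy. It is not how eBay's engine or any particular vector DB works; the function name `knn_search` and the random data are assumptions made up for illustration. Every query scans all stored vectors, which is the linear cost that ANN indexes such as HNSW try to avoid.

```python
import numpy as np

def knn_search(vectors, query, k):
    """Exact (brute-force) kNN with Euclidean distance.

    Returns the indices and distances of the k stored vectors closest
    to the query, nearest first. Cost is O(n * d) per query.
    """
    distances = np.linalg.norm(vectors - query, axis=1)  # one distance per stored vector
    nearest = np.argsort(distances)[:k]                  # indices of the k smallest
    return nearest, distances[nearest]

# Hypothetical data: 10,000 random 64-dimensional vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10_000, 64))

# A query that is a slightly perturbed copy of vector 42.
query = vectors[42] + 0.01 * rng.normal(size=64)

ids, dists = knn_search(vectors, query, k=5)
print(ids, dists)
```

Since the query is a small perturbation of vector 42, that vector should come back as the nearest neighbor. At billion scale this full scan is impractical, hence the approximate algorithms.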

Similarity and distance metrics in vector DBs. Vector DBs have built-in distance metrics to calculate similarity. But there are many different distance measures, each with pros and cons. Choose carefully. This is a good read: Advantages and pitfalls of 9 common distance measures. Interested in hardcore distance metrics? Check out Distance, Divergence, and Dissimilarity.
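As a small illustration of why the choice matters, the sketch below computes four common measures on toy 2-D vectors. The vectors and function names are made up for this example and are not taken from the linked posts. Note how Euclidean distance and cosine similarity disagree about which vector is "closest".

```python
import numpy as np

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def manhattan_distance(a, b):
    return float(np.sum(np.abs(a - b)))

def dot_product(a, b):
    return float(np.dot(a, b))

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
b = np.array([10.0, 0.0])  # same direction as a, much larger magnitude
c = np.array([0.0, 1.0])   # orthogonal to a, same magnitude

for name, v in [("b", b), ("c", c)]:
    print(
        name,
        "euclidean:", euclidean_distance(a, v),
        "manhattan:", manhattan_distance(a, v),
        "dot:", dot_product(a, v),
        "cosine:", cosine_similarity(a, v),
    )
```

By Euclidean distance, c is closer to a than b is. By cosine similarity, b is a perfect match and c is as dissimilar as a non-negative vector can be. Cosine ignores magnitude, while Euclidean and dot product do not.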

Vector compression and quantization. Another key feature of vector DBs is the use of vector compression techniques to reduce storage space and improve query performance. This is a great post: Vector DBs 101 – Scalar Quantization and Product Quantization.
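Here is a rough sketch of the scalar quantization idea, assuming a simple per-dimension min/max scheme that maps each float32 value to one of 256 uint8 levels. The function names are hypothetical. Real vector DBs, and the linked post, may choose ranges and rounding differently.

```python
import numpy as np

def scalar_quantize(vectors):
    """Map each float32 dimension onto 256 uint8 levels (per-dimension min/max)."""
    lo = vectors.min(axis=0)
    hi = vectors.max(axis=0)
    scale = (hi - lo) / 255.0
    scale[scale == 0] = 1.0  # guard against constant dimensions
    codes = np.round((vectors - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Approximate reconstruction of the original vectors."""
    return codes.astype(np.float32) * scale + lo

# Hypothetical data: 1,000 random 32-dimensional float32 vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 32)).astype(np.float32)

codes, lo, scale = scalar_quantize(vectors)
restored = dequantize(codes, lo, scale)

print("bytes before:", vectors.nbytes, "bytes after:", codes.nbytes)
print("max reconstruction error:", float(np.abs(restored - vectors).max()))
```

Storage drops by 4x (one byte per dimension instead of four), at the price of a reconstruction error of at most about half a quantization step per dimension. Product quantization goes further by encoding groups of dimensions with learned codebooks.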

Indexing, in-context learning, and vector DBs. In-context learning is a few-shot prompting technique that lets the model see examples in the prompt before performing a task. Check out Indexify, a knowledge and memory retrieval and indexing service that facilitates in-context learning for LLMs. It provides relevant context using SOTA embedding models and pluggable vector stores.
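The sketch below shows the general shape of retrieval-backed in-context learning: embed a pool of labeled examples, retrieve the ones most similar to the new task, and put them in the prompt as few-shot demonstrations. This is not Indexify's API. The toy bag-of-words `embed` function stands in for a real embedding model, the NumPy matrix stands in for a vector store, and all names and examples are assumptions for illustration.

```python
import numpy as np

def build_vocab(texts):
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}

def embed(text, vocab):
    """Toy bag-of-words embedding (stand-in for a real embedding model)."""
    vec = np.zeros(len(vocab))
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def build_few_shot_prompt(task, examples, k=2):
    """Retrieve the k examples most similar to the task and format a prompt."""
    vocab = build_vocab([text for text, _ in examples])
    index = np.stack([embed(text, vocab) for text, _ in examples])  # the "vector store"
    scores = index @ embed(task, vocab)                             # cosine similarity
    top = np.argsort(-scores)[:k]
    shots = "\n\n".join(
        f"Input: {examples[i][0]}\nOutput: {examples[i][1]}" for i in top
    )
    return f"{shots}\n\nInput: {task}\nOutput:"

# Hypothetical labeled examples for a sentiment task.
examples = [
    ("the movie was great", "positive"),
    ("the food was terrible", "negative"),
    ("what a great film", "positive"),
    ("terrible service and cold food", "negative"),
]

prompt = build_few_shot_prompt("the pizza was terrible", examples, k=2)
print(prompt)
```

The resulting prompt would then be sent to an LLM, which completes the final "Output:" line. Swapping the toy pieces for a real embedding model and a vector DB changes the retrieval quality, not the overall flow.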

Neural search and LLMs require more efficient NLP algos. To address that, StanfordNLP released String2String, a new open source library of efficient NLP algorithms for string pair alignment, distance measurement, lexical and semantic search, and similarity analysis. And FacebookAI's FAISS is still one of the best libraries for efficient similarity search and clustering of dense vectors.

Vector DBs, generative AI content and hallucinations. An example of retrieving information from Weaviate to prompt a generative model, and then vectorizing and saving the generated content back to the DB, and also of how to avoid hallucinated results. Check out Generative Feedback Loops with LLMs for Vector Databases.

Have a nice week!

  1. On NLP use cases. Against LLM maximalism

  2. AI and the future of programming

  3. “Stop uploading test data in plain text”

  4. Cargo Cult AI

  5. Numbers every LLM developer should know

  6. Large scale unsupervised anomaly detection @Lyft

  7. LMQL. A programming language for language models

  8. Brex Prompt Engineering Guide. GPT-4 Tips and Tricks

  9. Google Research – Making ML models differentially private

  10. BLOOMChat 176B. A new open multilingual chat LLM

Share Data Machina with your friends!

  1. OpenLLaMA. Open reproduction of LLaMA

  2. An open source platform for artificial intelligence software developers

  3. SuperAgent – A powerful tool for configuring and deploying LLM agents

  1. Fair ML models. estimation, tuning and forecasting

  2. Create complex heat maps

  3. Child care and XGBoost model setup with early stopping

  1. BindDiffusion. One diffusion model to bind them all

  2. Data-efficient contrastive training

  3. Transfer Learning for Computer Vision and CNNs

  1. [wow!] Learn physically simulated tennis skills from videos

  2. A video is worth 4096 tokens. Zero-shot video comprehension

  3. Tree of thoughts. Deliberate problem solving with LLMs

  1. [awesome!] Visual history. the gap between migrants’ reality and search

  2. histomatic of Magic the Gathering Card game

  3. Mobility exploration in Python with dynamic heatmaps

  1. Elon Musk presented a new video clip of Optimus robot

  2. Autonomous drone navigation and dense forest mapping

  3. Phoenix – A new general purpose humanoid designed for work

  1. [LLMops] MS: guidance – Language for controlling LLM streams

  2. An MLOps template for model inference in AI applications

  3. ML observability in a notebook

  1. Hippocrates – Safe, SoTA AI Models for Healthcare

  2. Helicone – Observability for generative AI

  3. Union – A production-grade AI orchestration platform

  1. Shopify Entrepreneurship Index Dataset

  2. Datalab. A Linter for ML Datasets

  3. Actors-HQ. A high-fidelity dataset of clothed people in motion

Did you like this post? Tell your friends about Data Machina. Thanks for reading.


Tips? Suggestions? Feedback? Email Carlos

Curated by @ds_ldn in the middle of the night.


