AI search
February 6, 2024 — February 10, 2025
Placeholder on AI search. Retrieval-augmented generation etc. Vector databases.
Theory and practice.
1 This blog has AI indexing
Those “suspiciously similar posts” links at the top of the page are generated by an AI model that indexes the text of the posts and gives each a topic embedding. This is a naïve but pretty easy and effective way of finding similar posts.
The script is open source. You can download it as similar_posts_static_site.py.
Non-obvious discovery: I auditioned two embedding models for the task, nomic-ai/nomic-embed-text-v1.5 and mixedbread-ai/mxbai-embed-large-v1. The latter gave more intuitively correct results even though it ignores most of each post, looking only at the first 512 tokens, which is basically the title, categories and a paragraph or two. The nomic model takes in most of the post, with a context of 8192 tokens, but that seemed to produce generally worse results.
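For the curious, here is a minimal sketch of the general idea: embed each post, then rank the other posts by cosine similarity. It is not the actual script. `toy_embed` is a hypothetical stand-in (a bag-of-words counter) so that the example runs without downloading a model; in practice you would swap in a real embedding model there. The `max_tokens` cutoff mimics the "only the first 512 tokens" behaviour, except that "tokens" here are just lowercase words rather than model tokens.

```python
import math
import re
from collections import Counter


def toy_embed(text, max_tokens=512):
    """Hypothetical stand-in for an embedding model: sparse word counts.

    Only the first `max_tokens` words are used, so everything after the
    cutoff is invisible to the similarity search.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())[:max_tokens]
    return Counter(tokens)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similar_posts(posts, top_k=3, embed=toy_embed):
    """Map each post id to the ids of its `top_k` most similar other posts.

    `posts` is a dict of post id -> post text. `embed` is any function
    that turns text into a vector `cosine` can compare.
    """
    vectors = {pid: embed(text) for pid, text in posts.items()}
    result = {}
    for pid, vec in vectors.items():
        scored = [
            (cosine(vec, other), oid)
            for oid, other in vectors.items()
            if oid != pid
        ]
        scored.sort(reverse=True)  # highest similarity first
        result[pid] = [oid for _, oid in scored[:top_k]]
    return result


# Made-up example posts, purely for illustration.
posts = {
    "ai-search": "AI search with vector databases and embedding models",
    "vector-db": "Vector databases store embedding vectors for search",
    "gardening": "Growing tomatoes in a small garden",
}
related = similar_posts(posts, top_k=1)
```

With these toy posts, `related["ai-search"]` comes out as `["vector-db"]`, since the two share several words and the gardening post shares none. A real embedding model would pick up similarity in meaning rather than just shared words, but the ranking step would look much the same.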
The Algolia search behind the search box at the top of the page presumably works in a similar way, but it is run by a third party, who cleverly serve the results from their own servers, at the cost of less control for me.
2 Retrieval-augmented generation
TBD
3 Tools
3.1 Commercial services searching the internet
See internet search.
3.2 Free/FOSS-ish
- ChromaDB is a vector database with a focus on search and retrieval. I used it to store the vector embeddings behind the “similar posts” links on this site, and I can report it was incredibly simple for my use case.
- nilsherzig/LLocalSearch: LLocalSearch is a completely locally running search aggregator using LLM Agents. The user can ask a question and the system will use a chain of LLMs to find the answer. The user can see the progress of the agents and the final answer. No OpenAI or Google API keys are needed.
- nashsu/FreeAskInternet: FreeAskInternet is a completely free, PRIVATE and LOCALLY running search aggregator & answer generator using MULTI LLMs, without GPU needed. The user can ask a question and the system will make a multi-engine search and combine the search result to LLM and generate the answer based on search results. It’s all FREE to use.
3.3 Others
4 Incoming
- LlamaIndex - Build Knowledge Assistants over your Enterprise Data
- Getting Started - Chroma Docs
- Roaming RAG – RAG without the Vector Database - Arcturus Labs
- On nomic embeddings: Introducing Nomic Embed: A Truly Open Embedding Model
- mxbai embeddings: Open Source Strikes Bread - New Fluffy Embedding Model - Blog | Mixedbread