AI search

February 6, 2024 — February 10, 2025

boring

computers are awful together

faster pussycat

incentive mechanisms

mind

NLP

provenance

wonk

Suspiciously similar content

Placeholder on AI search. Retrieval-augmented generation etc. Vector databases.

Theory and practice.

1 This blog has AI indexing

Those “suspiciously similar posts” links at the top of the page are generated by an embedding model: it reads the text of each post and maps it to a vector (a topic embedding), and posts whose vectors lie close together are listed as similar. This is a naïve but easy and effective way of finding similar posts.

The script is open source. You can download it as similar_posts_static_site.py.
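For readers who want the gist without reading the script, here is a minimal sketch of the idea. It is not the script itself. It assumes each post already has an embedding vector, and it ranks the other posts by cosine similarity, which is a common choice but an assumption here. The names `cosine_similarity`, `most_similar` and `toy_embeddings` are hypothetical, and the three-dimensional vectors are made-up stand-ins for real model output.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def most_similar(slug, embeddings, k=3):
    """Return the k posts closest to `slug` as (slug, score), best first."""
    query = embeddings[slug]
    scored = [
        (other, cosine_similarity(query, vec))
        for other, vec in embeddings.items()
        if other != slug  # a post is trivially similar to itself
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors (assumption): a real embedding model emits hundreds of dimensions.
toy_embeddings = {
    "ai_search": [0.9, 0.1, 0.0],
    "vector_databases": [0.8, 0.2, 0.1],
    "sourdough": [0.0, 0.1, 0.9],
}
neighbours = most_similar("ai_search", toy_embeddings, k=2)
```

With these toy vectors, `vector_databases` ranks above `sourdough` as a neighbour of `ai_search`. For a static site the ranking can be computed once at build time and written into each page, so nothing needs to run on the server.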

Non-obvious discovery: I auditioned two embedding models for the task, nomic-ai/nomic-embed-text-v1.5 and mixedbread-ai/mxbai-embed-large-v1. The latter gave more intuitively correct results even though it ignores most of each post: it looks only at the first 512 tokens, which is basically the title, categories and a paragraph or two. nomic accepts up to 8192 tokens, which covers most of a post, but that seemed to produce generally worse results.
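To make the truncation concrete, here is a rough sketch of what a model with a fixed token budget gets to see. It assumes the input is assembled as title, then categories, then body. Whitespace-separated words stand in for the model's real subword tokens, so the counts are only approximate. The function `model_input` is hypothetical and is not taken from the script.

```python
def model_input(title, categories, body, max_tokens=512):
    """Join the post fields, then keep only the first `max_tokens` tokens.

    Whitespace-separated words are a crude stand-in for subword tokens.
    """
    text = " ".join([title, " ".join(categories), body])
    tokens = text.split()
    return " ".join(tokens[:max_tokens])


# A fake 2000-word post body (assumption: real posts vary widely in length).
post_body = "word " * 2000

seen_short = model_input("AI search", ["NLP", "provenance"], post_body, max_tokens=512)
seen_long = model_input("AI search", ["NLP", "provenance"], post_body, max_tokens=8192)
```

Under the smaller budget the title and categories make up a larger share of the input, which might be part of why the shorter-context model did well here, although that is speculation.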

The Algolia search behind the search box at the top of the page presumably works similarly, but it is run by a third party who cleverly serves the content from their own servers, which leaves me with less control.

2 Retrieval-augmented generation

TBD

3 Tools

3.1 Commercial services searching the internet

See internet search.

3.2 Free/FOSS-ish:

3.3 Others

4 Incoming

5 References

Es, James, Espinosa Anke, et al. 2024. “RAGAs: Automated Evaluation of Retrieval Augmented Generation.” In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations.

Fan, Ding, Ning, et al. 2024. “A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models.” In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. KDD ’24.

Gao, Xiong, Gao, et al. 2024. “Retrieval-Augmented Generation for Large Language Models: A Survey.”

Venkit, Laban, Zhou, et al. 2024. “Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses.”