Internet search engines

Tips, tricks, confidentiality

August 2, 2021 — April 26, 2024

adversarial
computers are awful together
cooperation
faster pussycat
NLP
provenance
search

Finding things on the internet! At one point this felt like a solved problem, but it seems to have gotten unsolved.

Famously, Google no longer seems to be particularly good at search. Speculative reasons include that it is losing the battle against SEO, that human-friendly content is being squeezed aside in general, that Google is spending down its credibility to bring in advertising revenue, or that some other, more complicated mechanisms and incentives are making things terrible or boring.

For some quantifiable data on the theme, see webis-de/ecir24-seo-spam-in-search-engines (Bevendorff et al. 2024).

Regardless of the details or reasons, it does seem to be true for me that search results are bad right now.

In addition, I am uncomfortable with the surveillance and tracking involved in search engines. Insofar as they are the way I access the world, they can potentially know too much about me.

I am interested in solving these problems: the badness of search results, the [skeeziness of search providers](./avoiding_corporate_spying.qmd), and more general knowledge discovery and synthesis problems. Read on to see if I got anywhere.

1 Better commercial search providers

Here are some links to search engines which may reduce the degree of user surveillance, or at least, diffuse the surveillance across a few different players, or provide added value over the classic searches.

Many of these make strong claims to protect user privacy, although few offer substantive guarantees in excess of inspecting tracking headers. Some of them repackage other searches; some run their own indices. Most of them have very unclear business models, which makes me uneasy.

1.1 Kagi

An exception to the opaque-business-model rule is Kagi. Their value proposition is, they claim, to be credibly user-centric:

Kagi has no ads and is fully supported only by its users. We worked very hard to provide high quality, fast and tracking-free results at a minimum cost to ensure sustainability of our operation.

By choosing a paid Kagi plan, you are also helping accelerate our mission of humanising the web.

The free plan is pretty good, and they will happily sell you extra features/more searches:

  • Kagi search features | Kagi Blog

  • No ads

  • Ability to block/boost domains

  • Bangs allow you to quickly jump to all popular sites on the web.

  • zero telemetry, zero tracking

  • See how fast is a website or how many ads/trackers it has before clicking the result.
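The block/boost idea in the list above can be sketched client-side. This is a minimal illustration of the concept, not Kagi's implementation. The domain names, the `BLOCKED` and `BOOSTED` sets, and the `rerank` helper are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical user preferences; the domain names are placeholders.
BLOCKED = {"spam.example"}
BOOSTED = {"docs.example"}


def domain_of(url: str) -> str:
    """Return the host part of a URL, lower-cased, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def rerank(urls: list[str]) -> list[str]:
    """Drop blocked domains and move boosted domains to the front.

    sorted() is stable, so the engine's original order is preserved
    within the boosted group and within the rest.
    """
    kept = [u for u in urls if domain_of(u) not in BLOCKED]
    return sorted(kept, key=lambda u: domain_of(u) not in BOOSTED)


results = [
    "https://www.spam.example/best-10-things",
    "https://blog.example/post",
    "https://docs.example/reference",
]
print(rerank(results))
```

A real engine presumably applies such preferences as ranking weights rather than as a hard reorder; the stable sort here is just the simplest thing that shows the effect.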

They have been criticised for being chaos pants. These criticisms seem reasonable to me, but not fatal.

Obviously if I become a subscriber, they can in principle track me, so the privacy angle hinges upon some trust.

1.2 Marginalia

File under quirky/quixotic/small web, Marginalia Search:

This is an independent DIY search engine that focuses on non-commercial content, and attempts to show you sites you perhaps weren’t aware of in favour of the sort of sites you probably already knew existed.

The software for this search engine is all custom-built, and all crawling and indexing is done in-house. The project is open source. Feel free to poke about in the source code or contribute to the development!

The search engine is currently serving about 107 queries/minute.

1.3 Startpage

Startpage claims to repackage Google search results anonymously, AFAIK, although I cannot see much information about why I should believe them on this. It is a Dutch company. To use it as a search bar in Firefox I needed to add a browser extension, for some tedious reason.

1.4 DuckDuckGo

Perennial favourite DuckDuckGo is a search engine run by strident privacy advocates, which is laudable I s’pose. The search is… OK. Usually not as good as Google. Every now and again it is serendipitously wonderful, but not reliably.

1.5 Brave

Brave Search recently launched, backed by the creators of the Brave browser. TBC.

1.6 Mojeek

Mojeek/Mojeek Focus (Bookmark) Search Engine

Mojeek was created to provide a globally competitive and genuine alternative search engine based in the UK, and from the outset one that didn’t track its users nor simply retrieve its results from another engine (i.e. to provide real alternative results).

Mojeek’s technology has been developed entirely from scratch by Marc Smith, mostly using the C programming language, and uses no pre-existing search or web crawler technology. All technology and IP is fully owned by Mojeek Limited.

1.7 Qwant

Qwant promises to forget user data rapidly. French company.

1.8 Runnaroo

Similar to Qwant? See Runnaroo. It promises to aggregate many other search engines and review sites. Its business model is opaque.

1.9 Search encrypt

Search Encrypt claims additional privacy via encryption with perfect forward secrecy. Presumably this is supposed to prevent them from assembling a history of my searches?

1.10 Suppressing spam in search results

2 DIY search proxies

A.k.a. meta-searching. I suspect these imply maintenance overhead as the search companies attempt to circumvent this circumvention of their business model. Effectively, you would be participating in an arms race.
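To make meta-searching concrete, here is a minimal sketch of one way to merge ranked result lists from several engines: reciprocal rank fusion. This is a generic technique chosen for illustration; I am not claiming any of the tools below use it. The engine names and URLs are hypothetical, and `k = 60` is a commonly used value rather than anything tuned.

```python
from collections import defaultdict


def fuse(rankings: dict[str, list[str]], k: int = 60) -> list[str]:
    """Merge ranked URL lists with reciprocal rank fusion.

    Each URL scores sum(1 / (k + rank)) over the engines that returned it,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for urls in rankings.values():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / (k + rank)
    # Sort by descending score; break ties by URL so output is deterministic.
    return sorted(scores, key=lambda u: (-scores[u], u))


# Hypothetical engine names and result lists.
rankings = {
    "engine_a": ["https://a.example/", "https://b.example/", "https://c.example/"],
    "engine_b": ["https://b.example/", "https://d.example/"],
}
print(fuse(rankings))
```

URLs are deduplicated by exact string match only, so `http` versus `https` or a trailing slash would count as different results. A real proxy would need to normalise URLs first, and that is before the harder problem of fetching the results at all.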

2.1 searx

The searx family is a network of metasearch engine portals with the aim of protecting the privacy of users. Searx does not share users’ IP addresses or search history with the search engines from which it gathers results. Tracking cookies served by the search engines are blocked, etc. The flagship instance is searx.me. There are many user-operated instances, and it is open source. Advanced: run your own DIY search anonymiser!
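If you run your own instance, you can query it programmatically. The following is a sketch under some assumptions: that the instance has JSON output enabled (many public instances do not), that it accepts `format=json` on its `/search` endpoint, and that the response carries a `results` list whose entries have `title` and `url` keys. The instance URL is a placeholder and the helper names are mine.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical instance URL; substitute one you run or trust.
INSTANCE = "https://searx.example.org"


def search_url(query: str, instance: str = INSTANCE) -> str:
    """Build a search URL that asks the instance for JSON output."""
    params = urlencode({"q": query, "format": "json"})
    return f"{instance}/search?{params}"


def parse_results(payload: str) -> list[tuple[str, str]]:
    """Extract (title, url) pairs from a JSON response body (assumed shape)."""
    data = json.loads(payload)
    return [(r.get("title", ""), r["url"]) for r in data.get("results", [])]


def search(query: str) -> list[tuple[str, str]]:
    """Fetch and parse results. Needs network access and a cooperative instance."""
    req = Request(search_url(query), headers={"User-Agent": "searx-sketch/0.1"})
    with urlopen(req, timeout=10) as resp:
        return parse_results(resp.read().decode("utf-8"))


# Offline demonstration with a hand-written payload in the assumed shape.
sample = '{"results": [{"title": "Example", "url": "https://example.org/"}]}'
print(search_url("perfect forward secrecy"))
print(parse_results(sample))
```

Only `search()` touches the network; the last lines exercise the URL building and parsing offline. Check your instance's own documentation for the formats it actually serves.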

2.2 mysearch

mysearch — Local search engine portal designed to anonymise search requests and display search results better. A public instance is available at search.jesuislibre.net. Dead AFAICT.

6 Incoming

7 References

Bevendorff, Wiegmann, Potthast, et al. 2024. “Is Google Getting Worse? A Longitudinal Investigation of SEO Spam in Search Engines.” In Advances in Information Retrieval.