Text processing
April 25, 2015 — July 13, 2016
Information retrieval via string metrics. Part-of-speech tagging. Vector spaces induced by document structures, compared using measures such as cosine similarity, and word2vec-style embeddings.
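As a concrete instance of the vector-space idea, here is a minimal sketch of cosine similarity between two documents represented as bag-of-words count vectors. The tokenizer, the function names, and the convention of returning 0.0 for an empty document are all assumptions made for illustration; a real system would likely use weighting such as TF-IDF or learned embeddings instead of raw counts.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-count vectors of two documents."""
    counts_a = Counter(tokenize(doc_a))
    counts_b = Counter(tokenize(doc_b))
    # Only terms present in both documents contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an empty document matches nothing
    return dot / (norm_a * norm_b)
```

Documents with identical word counts score 1, and documents with no words in common score 0, regardless of length.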
Metrics based on generation by finite state machines. Maybe co-occurrence metrics would also be useful as musical metrics? Inference complexity.
If I were to actually write this entry, it would be a big research project.
1 Software
- “Lucene is an Open Source, mature and high-performance Java search engine. It is highly flexible, and scalable from hundreds to millions of documents.
  Luke is a handy development and diagnostic tool, which accesses already existing Lucene indexes and allows you to display and modify their content in several ways…”
- “Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly.”
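
To make "full-text indexing and searching" concrete, here is a toy in-memory inverted index using only the Python standard library. It is not the API or the implementation of Lucene or Whoosh; the class name `TinyIndex` and its methods are hypothetical, and it omits ranking, stemming, and persistence, which the real libraries provide.

```python
import re
from collections import defaultdict


def terms_of(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TinyIndex:
    """Toy inverted index mapping each term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        """Record every term of the document under its id."""
        for term in terms_of(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the sorted ids of documents that contain every query term."""
        terms = terms_of(query)
        if not terms:
            return []
        # .get avoids creating empty posting sets for unknown terms.
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return sorted(result)
```

For example, after `idx = TinyIndex()` and a few `idx.add(doc_id, text)` calls, `idx.search("full text")` returns the ids of the documents containing both words.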