Neural nets that do symbolic mathematics, logic and other reasoning tasks

December 8, 2019 — March 11, 2025

compsci

language

machine learning

meta learning

networks

neural nets

NLP

stringology

Suspiciously similar content

Somewhere between computational symbolic mathematics and automated proof assistants and the modern large language models are models that can solve mathematical problems more effectively than my feeble brain.

cf differentiable automata learning.

Watch this space.

1 Test time scaling

Getting models to self critique is unreasonably effective.

2 Incoming

simplescaling/s1: s1: Simple test-time scaling (Muennighoff et al. 2025)

Test-time scaling is a promising new approach to language modelling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at this https URL
We may finally crack Maths. But should we?
Improving Mathematical Reasoning with Process Supervision
FranxYao/chain-of-thought-hub: Benchmarking large language models’ complex reasoning ability with chain-of-thought prompting/ Towards Complex Reasoning: the Polaris of Large Language Models (Fu et al. 2023)
sanderwood/bgpt: Beyond Language Models: Byte Models are Digital World Simulators (Wu et al. 2024)
Neural algorithmic reasoning
Unveiling Transformers with LEGO (Zhang et al. 2022)

3 References

Akyürek, Damani, Qiu, et al. 2024. “The Surprising Effectiveness of Test-Time Training for Abstract Reasoning.”

Bansal, Hosseini, Agarwal, et al. 2024. “Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling.”

Bubeck, Chandrasekaran, Eldan, et al. 2023. “Sparks of Artificial General Intelligence: Early Experiments with GPT-4.”

Clark, Tafjord, and Richardson. 2020. “Transformers as Soft Reasoners over Language.” In IJCAI 2020.

Dehghani, Gouws, Vinyals, et al. 2019. “Universal Transformers.”

Fu, Ou, Chen, et al. 2023. “Chain-of-Thought Hub: A Continuous Effort to Measure Large Language Models’ Reasoning Performance.”

Garcez, and Lamb. 2020. “Neurosymbolic AI: The 3rd Wave.”

Hao, Sukhbaatar, Su, et al. 2024. “Training Large Language Models to Reason in a Continuous Latent Space.”

Kiciman, Ness, Sharma, et al. 2024. “Causal Reasoning and Large Language Models: Opening a New Frontier for Causality.” Transactions on Machine Learning Research.

Kwa, West, Becker, et al. 2025. “Measuring AI Ability to Complete Long Tasks.”

Lample, and Charton. 2019. “Deep Learning for Symbolic Mathematics.” arXiv:1912.01412 [Cs].

Mahowald, Ivanova, Blank, et al. 2024. “Dissociating language and thought in large language models.” Trends in Cognitive Sciences.

Mirzadeh, Alizadeh, Shahrokhi, et al. 2024. “GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models.”

Muennighoff, Rush, Barak, et al. 2023. “Scaling Data-Constrained Language Models.” Advances in Neural Information Processing Systems.

Muennighoff, Yang, Shi, et al. 2025. “S1: Simple Test-Time Scaling.”

Radford, Wu, Child, et al. 2019. “Language Models Are Unsupervised Multitask Learners.”

Schuurmans, Dai, and Zanini. 2024. “Autoregressive Large Language Models Are Computationally Universal.”

Wang, Wei, Schuurmans, et al. 2023. “Self-Consistency Improves Chain of Thought Reasoning in Language Models.”

Wu, Tan, Wang, et al. 2024. “Beyond Language Models: Byte Models Are Digital World Simulators.”

Ye, Gong, Chen, et al. 2024. “Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models.” In.

Yu, Xu, Weston, et al. 2024. “Distilling System 2 into System 1.”

Zhang, Backurs, Bubeck, et al. 2022. “Unveiling Transformers with LEGO: A Synthetic Reasoning Task.”