# SLMs

Specialised and Small Language Models (SLMs), including information on alternative approaches to transformers.

# Less is More: Recursive Reasoning with Tiny Networks

Paper: [https://arxiv.org/html/2510.04871v1](https://arxiv.org/html/2510.04871v1)

### Abstract

[Hierarchical Reasoning Model (HRM)](https://wiki.jamesravey.me/books/ai-and-ml/page/hierarchical-reasoning-model "Hierarchical Reasoning Model") is a novel approach using two small neural networks recursing at different frequencies. This biologically inspired method beats Large Language models (LLMs) on hard puzzle tasks such as Sudoku, Maze, and ARC-AGI while trained with small models (27M parameters) on small data (∼ 1000 examples). HRM holds great promise for solving hard problems with small networks, but it is not yet well understood and may be suboptimal. We propose Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers. With only 7M parameters, TRM obtains 45% test-accuracy on ARC-AGI-1 and 8% on ARC-AGI-2, higher than most LLMs (e.g., Deepseek R1, o3-mini, Gemini 2.5 Pro) with less than 0.01% of the parameters.
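To put the "less than 0.01% of the parameters" claim in concrete terms, the arithmetic can be sketched as below. The 7M figure comes from the abstract. The 671B figure is DeepSeek R1's publicly released total parameter count, an outside number that the abstract does not state. Parameter counts for o3-mini and Gemini 2.5 Pro are not public, so they are left out.

```python
# TRM's size is taken from the abstract; DeepSeek R1's total parameter
# count is an outside figure (assumption: 671B total, as publicly released).
TRM_PARAMS = 7e6
DEEPSEEK_R1_PARAMS = 671e9


def param_share_percent(small: float, large: float) -> float:
    """Return `small` expressed as a percentage of `large`."""
    return 100.0 * small / large


share = param_share_percent(TRM_PARAMS, DEEPSEEK_R1_PARAMS)
print(f"TRM is {share:.4f}% the size of DeepSeek R1")

# Smallest comparison model for which 7M parameters is at most 0.01%.
threshold = TRM_PARAMS / (0.01 / 100.0)
print(f"The 0.01% bound holds against models of {threshold / 1e9:.0f}B parameters or more")
```

In other words, the claim holds against any comparison model of roughly 70B parameters or larger.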

# Mistral Small 3.2

Mistral Small 3.2 is a 24B-parameter LLM.

### Running in Ollama

```bash
ollama pull hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M
```
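Once the model has been pulled, it can be queried from a script through Ollama's local REST API. The following is a minimal sketch, assuming the Ollama server is running on its default address (`http://localhost:11434`). The helper names `build_request` and `generate` are illustrative and are not part of any library.

```python
import json
import urllib.request

# Model tag copied from the pull command above.
MODEL = "hf.co/gabriellarson/Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q4_K_M"
# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = MODEL) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("Explain what a GGUF file is in one sentence.")` should return the model's reply as a string, provided the server is up and the model has finished downloading.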

### References

- [https://simonwillison.net/2025/Jun/20/mistral-small-32/](https://simonwillison.net/2025/Jun/20/mistral-small-32/)

# Hierarchical Reasoning Model

**Paper URL:** [https://arxiv.org/pdf/2506.21734](https://arxiv.org/pdf/2506.21734)   
**Code Repo:** [https://github.com/sapientinc/HRM](https://github.com/sapientinc/HRM)

HRM is an alternative to the standard transformer architecture that, according to its authors, is better able to reason. The paper reports that it outperforms transformer-based LLMs on ARC-AGI-2 with only 27M parameters.

### Training a 27M Parameter Model with 1000 Examples

The authors state that they train on only about 1,000 to 10,000 examples per problem domain:

- **Sudoku-Extreme**: 1000 training examples (used in main experiments)
- **Sudoku-Extreme-Full**: ~10,000 examples (used in analysis experiments for convergence guarantees)
- **ARC-AGI**: ~1000 examples from the official dataset, heavily augmented with translations, rotations, flips, and color permutations

This may seem quite low for a 27M-parameter neural network: with so few examples, one would expect the network to generalise poorly, most likely by overfitting. The authors offer several points in response:

1. Data augmentation is used to effectively increase the size of the training set.
2. The authors use deep supervision, which applies the loss at several intermediate stages of the computation, to strengthen the training signal beyond a single loss at the end.
3. The problem domains are simpler than natural language. Sudoku and ARC-AGI in particular are structured grid problems.
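The augmentation point can be made concrete with a small sketch. The code below is not the authors' pipeline. It is a minimal illustration of rotations, flips and colour permutations on grids stored as nested lists. Keeping colour 0 fixed (treating it as background) is an assumption, translations are omitted, and all function names are hypothetical.

```python
import random

Grid = list[list[int]]


def rotate90(grid: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror a grid left to right."""
    return [row[::-1] for row in grid]


def dihedral_variants(grid: Grid) -> list[Grid]:
    """Return the 8 rotations and reflections of a grid."""
    variants = []
    current = [list(row) for row in grid]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants


def make_colour_map(rng: random.Random) -> dict[int, int]:
    """Random relabelling of colours 1-9; colour 0 stays fixed (assumption)."""
    colours = list(range(1, 10))
    shuffled = colours[:]
    rng.shuffle(shuffled)
    return {0: 0, **dict(zip(colours, shuffled))}


def apply_colour_map(grid: Grid, mapping: dict[int, int]) -> Grid:
    """Relabel every cell of a grid using the given colour mapping."""
    return [[mapping[cell] for cell in row] for row in grid]


def augment_pair(inp: Grid, out: Grid, rng: random.Random) -> tuple[Grid, Grid]:
    """Apply one shared random transform to an input/output pair."""
    k = rng.randrange(8)
    mapping = make_colour_map(rng)
    return (
        apply_colour_map(dihedral_variants(inp)[k], mapping),
        apply_colour_map(dihedral_variants(out)[k], mapping),
    )
```

The input and output grids of a puzzle share the same transform so that the underlying rule is preserved. With 8 geometric variants and 9! colour relabellings, a single example can in principle yield a very large number of distinct training pairs.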

# Gemma 4

Released **March 31, 2026** by Google DeepMind. Apache 2.0 licensed. Multimodal (text + image, audio on small models).

## Model Sizes

| Model | Type | Effective Params | Context | Modalities |
|--------|------|-----------------|---------|------------|
| **E2B** | Dense | 2.3B (5.1B w/ embeddings) | 128K | Text, Image, Audio |
| **E4B** | Dense | 4.5B (8B w/ embeddings) | 128K | Text, Image, Audio |
| **26B A4B** | MoE | 3.8B active / 25.2B total | 256K | Text, Image |
| **31B** | Dense | 30.7B | 256K | Text, Image |

The **26B A4B** is the standout — a MoE model that runs almost as fast as a 4B model despite 26B total params. The **E2B/E4B** use Per-Layer Embeddings for on-device efficiency.

## Local Running Options

1. **Ollama** — `ollama run google/gemma-4` (all sizes). Easiest one-command setup.
2. **llama.cpp** — GGUF quantized versions available on Hugging Face. Good for CPU/GPU hybrid inference.
3. **vLLM** — For higher-throughput server deployment. Supports the native HF safetensor weights.
4. **LM Studio** — GUI-based, supports GGUF formats. Good for desktop use.
5. **Hugging Face Transformers** — Direct Python API. Full precision or QLoRA fine-tuning.

## Hardware Requirements (rough)

- **E2B (2.3B eff.)** — Runs on phones, any modern laptop (4-8 GB RAM)
- **E4B (4.5B eff.)** — 8-16 GB RAM, most 2024+ MacBooks
- **26B A4B** — 16-24 GB VRAM (single GPU), or CPU with enough RAM
- **31B** — 24-48 GB VRAM (A100/H100 recommended), or multi-GPU
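As a rough cross-check on these figures, the memory needed for the weights alone can be estimated as parameters times bits per weight. The sketch below uses the total parameter counts from the Model Sizes table. The 4-, 8- and 16-bit settings are nominal assumptions, since real quantised formats carry some per-block overhead, and the estimate ignores the KV cache, activations and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


# Total parameter counts (in billions) from the Model Sizes table.
# For the MoE model the total count is used rather than the active count,
# on the assumption that all experts must be resident in memory.
MODELS = {"E2B": 5.1, "E4B": 8.0, "26B A4B": 25.2, "31B": 30.7}

for name, params in MODELS.items():
    sizes = ", ".join(
        f"{bits}-bit: {weight_memory_gb(params, bits):.1f} GB" for bits in (4, 8, 16)
    )
    print(f"{name:8s} {sizes}")
```

Under these assumptions the 26B A4B weights come to about 12.6 GB at 4 bits, which fits the 16-24 GB range once cache and overhead are added. Its low active parameter count reduces compute per token but not the memory needed to hold the weights.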

The **26B A4B** is generally considered the sweet spot for local use — frontier-level benchmarks (82.6 MMLU Pro, 88.3 AIME) with ~4B active parameter compute cost.

All models are on Hugging Face under `google/gemma-4-*`.