I built a local-first UI that adds two reasoning architectures on top of small models like Qwen, Llama, and Mistral: a sequential Thinking Pipeline (Plan → Execute → Critique) and a parallel Agent Council, where multiple expert models debate in parallel and a Judge synthesizes the best answer. No API keys, zero .env setup — just pip install multimind. Benchmarks on GSM8K show measurable accuracy gains vs. single-model inference.
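For readers who want the shape of the Agent Council in code: the sketch below is my own minimal illustration of the pattern, not MultiMind's actual API. `call_model` is a stub standing in for a local model call (e.g. via Ollama or llama.cpp), and the role names are made up; only the control flow — independent parallel advisors, then a judge that synthesizes — reflects what the post describes.

```python
# Hypothetical sketch of the Agent Council pattern (parallel advisors + judge).
# `call_model` is a stub; a real version would prompt a local model with a
# role-specific system prompt. All names here are illustrative, not MultiMind's.
from concurrent.futures import ThreadPoolExecutor

ADVISOR_ROLES = ["mathematician", "skeptic", "explainer"]  # assumed roles

def call_model(role: str, prompt: str) -> str:
    # Stub: replace with an actual local-model call.
    return f"[{role}] answer to: {prompt}"

def run_council(question: str) -> str:
    # Advisors run independently and in parallel.
    with ThreadPoolExecutor(max_workers=len(ADVISOR_ROLES)) as pool:
        candidates = list(pool.map(lambda r: call_model(r, question), ADVISOR_ROLES))
    # The judge sees all candidate answers and synthesizes one; it is
    # instructed not to introduce ideas absent from the advisor outputs.
    judge_prompt = "Synthesize the best answer from:\n" + "\n".join(candidates)
    return call_model("judge", judge_prompt)

print(run_council("What is 17 * 24?"))
```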
[flagged]
That's a valid point. While the GSM8K gains show promise for structured reasoning, I'm also curious how the council approach scales to more open-ended, less structured tasks like summarization. I'm planning to run tests on those scenarios once I have time to evaluate the results properly.
[flagged]
Great question. Currently, we don’t track a formal disagreement rate between the judge and candidate 1, so I can’t give a reliable percentage yet. By design, the judge synthesizes advisor outputs and is instructed not to introduce new ideas, which usually leads to refinement rather than completely new conclusions. That said, advisors run independently and can produce different lines of reasoning, so the judge can still diverge from candidate 1 when another advisor’s argument is stronger. This keeps the final output grounded in the council’s collective input rather than the judge acting as a standalone model.
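If I do add that metric, one simple approach would be GSM8K-style: extract the final numeric answer from the judge's output and from candidate 1's, and count how often they differ. The sketch below is just that idea as a toy; the last-number extraction heuristic is an assumption of mine, not how MultiMind parses answers.

```python
# Hypothetical sketch: disagreement rate between the judge and candidate 1,
# comparing final numeric answers GSM8K-style. The "last number in the text"
# heuristic is an assumption for illustration, not MultiMind's actual logic.
import re

def extract_answer(text: str):
    # Take the last number mentioned as the final answer, if any.
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def disagreement_rate(judge_outputs, candidate1_outputs) -> float:
    pairs = list(zip(judge_outputs, candidate1_outputs))
    diverged = sum(extract_answer(j) != extract_answer(c) for j, c in pairs)
    return diverged / len(pairs)

# Toy example: the judge diverges on 1 of 3 items.
judges = ["The answer is 42.", "So we get 7.", "Final: 100"]
cands  = ["I compute 42.",     "I get 9.",     "Answer: 100"]
print(disagreement_rate(judges, cands))  # prints 0.3333333333333333
```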