Full LLM training and evaluation toolkit

(github.com)

247 points | by testerui 3 days ago

7 comments

  • timhigins 3 days ago

    Might be worth updating the title to "SmolLM: state-of-the-art small language model trained on open datasets" (See the first table of https://huggingface.co/blog/smollm for benchmarks)

    It was fascinating digging into this and finding their dataset weights defined in a declarative YAML file [2]. 70% is from FineWeb/Common Crawl, but filtered using a classifier trained on Llama-3-70B's 0-5 ratings of the educational content of the text [3]. This is something we know small models like Phi-3 have been doing for a while, but it's great to see a fully open reproduction of it that beats them on benchmarks. It definitely supports the idea that you can get even better reasoning at smaller model sizes by carefully filtering and curating your training data (and generating good synthetic data from / distilling bigger models).
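
    As a toy sketch, weight-driven sampling from a YAML mixture like that could look roughly like this (the dataset names and schema below are illustrative, not the actual fields from the repo's config):

      # Illustrative mixture config; names and schema are made up, not the repo's actual YAML.
      import random

      import yaml  # pip install pyyaml

      config_text = """
      datasets:
        - name: fineweb-edu
          weight: 0.7
        - name: cosmopedia
          weight: 0.2
        - name: python-edu
          weight: 0.1
      """

      config = yaml.safe_load(config_text)
      names = [d["name"] for d in config["datasets"]]
      weights = [d["weight"] for d in config["datasets"]]

      # Each training example is drawn from one source with probability equal to its weight.
      for _ in range(5):
          print(random.choices(names, weights=weights, k=1)[0])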

    You can see the 450k Llama educational value scores here: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu-ll... It's interesting: I think the texts scored 3 are really good, but the texts scored 5 tend to be content that isn't especially reasoning- or information-heavy and just mentions education or a worksheet. For SmolLM they simply took the documents with scores >= 3, so it doesn't matter a ton.
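
    If you wanted to reproduce that cut, it's roughly a one-liner with the datasets library (the dataset id and the "score" column name here are my assumptions, so check the card of the dataset linked above):

      # Sketch: keep only documents whose Llama educational score is >= 3.
      # Dataset id and column name are assumptions; verify against the dataset card.
      from datasets import load_dataset  # pip install datasets

      ds = load_dataset("HuggingFaceFW/fineweb-edu-llama3-annotations", split="train")
      kept = ds.filter(lambda ex: ex["score"] >= 3)
      print(f"kept {len(kept):,} of {len(ds):,} documents")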

    2. https://github.com/huggingface/smollm/blob/9efce803bc7e37727...
    3. https://huggingface.co/HuggingFaceFW/fineweb-edu-classifier

    • timhigins 3 days ago

      Update: While SmolLM was SOTA at the time of release in July, SmolLM 2 1.7B (which is the newest release) is not currently the best model under 2B params on https://huggingface.co/spaces/open-llm-leaderboard/open_llm_...

    • pixelart34 3 days ago

      Exactly, I love the openness about the training data, which none of the big labs disclose. The v2 apparently uses a better mix, according to the model card and the scores.

  • abeppu 3 days ago

    While it's great that this is open source, and I understand the pressure for smaller models that can be run in a wider range of contexts, I continue to be annoyed that authors keep posting comparisons to models which are slightly smaller.

    On this page, SmolLM2-1.7B does a bit better than Qwen2.5-1.5B, which is ahead of Llama3.2-1B. At the next size level up, in other comparisons I've seen, e.g. Phi-3.5 (which is ~3.8B params) does a bit better than Llama 3.2 3B. Gemma 2 has a 9B size, Llama 3.1 has an 8B size, and I think when that came out Mistral had a 7B model. So whenever a new "small" model does "better" than its slightly smaller peers, we can't easily tell whether any of the many small choices the authors made were actually better, or whether the gain just comes from the extra parameters.

  • bashfulpup 3 days ago

    Pythia is stupidly easy to use.

    Then hook up a simple test harness. That's a grand total of about 3 commands: git pull, install, then point it at a model and run.
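
    The "point it at a model and run" part is roughly this with transformers (a sketch, assuming transformers and torch are installed; any Pythia checkpoint works the same way):

      # Load a Pythia checkpoint from the Hub and generate a few tokens.
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_name = "EleutherAI/pythia-160m"  # example size; swap in a bigger one if you like
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForCausalLM.from_pretrained(model_name)

      inputs = tokenizer("The capital of France is", return_tensors="pt")
      outputs = model.generate(**inputs, max_new_tokens=20)
      print(tokenizer.decode(outputs[0], skip_special_tokens=True))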

  • pg5 2 days ago

    It seems that SmolLM has some issues forming grammatically correct sentences in longer responses.

  • 2 days ago
    [deleted]