🚀 Building and Training DeepSeek from Scratch for Children's Stories
Prashant Lakhera

A few days ago, I shared how I trained a tiny 30-million-parameter model based on the GPT-2 architecture (“Trained a Tiny Model to Tell Children's Stories!”: https://www.linkedin.com/posts/prashant-lakhera-696119b_ai-genai-tinyml-activity-7340544698115112960-PcAn). Thank you all for the overwhelming response!

Since GPT-2 has already been extensively explored, I’m excited to take things further.

🚀 Introducing DeepSeek-Children-Stories, a purpose-built model that leverages DeepSeek’s advanced architecture (MLA + MoE + Multi-token prediction) to generate creative children’s stories with just ~15–18M parameters.

🔥 And the best part? With just a single command, setup.sh, you can automatically pull the dataset, train the model, and get everything running end to end without hassle.

📌 Why I Built It
Large language models are powerful, but they are often resource-intensive. I wanted to explore:
✅ Can DeepSeek's cutting-edge architecture be adapted for niche storytelling tasks?
✅ Can a model this small still create engaging and high-quality content?

📌 What’s Inside
Advanced Architecture:
✅ Multi-head Latent Attention (MLA): Efficient attention that compresses keys and values into a shared latent representation
✅ Mixture of Experts (MoE): 4 experts with top-2 routing to boost capacity (see the sketch after this list)
✅ Multi-token Prediction: Predicts the next 2 tokens simultaneously for faster inference
✅ Rotary Positional Encodings (RoPE): Better positional understanding
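
To make the MoE idea concrete, here is a minimal sketch of a top-2 Mixture-of-Experts feed-forward layer in PyTorch. The class name, dimensions, and expert layout are illustrative assumptions for this post, not the repository's actual implementation:

```python
# Minimal top-2 MoE feed-forward layer (illustrative sketch, not the repo's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    def __init__(self, d_model=256, d_hidden=512, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)   # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (batch, seq, d_model)
        logits = self.router(x)                # (batch, seq, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = (idx[..., k] == e)      # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage: y = MoEFeedForward()(torch.randn(2, 16, 256))  -> same shape as the input
```

Only the two selected experts run per token, which is how MoE adds capacity without adding much compute per step.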

Training Pipeline (a minimal training-loop sketch follows this list):
✅ Dataset: 2,000+ high-quality children's stories from Hugging Face
✅ Tokenizer: GPT-2 tokenizer for broader compatibility
✅ Training: Mixed precision with gradient scaling
✅ Optimization: PyTorch 2.0 compilation for speed
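
Here is a self-contained sketch of that training-loop pattern: GPT-2 tokenizer, mixed precision with gradient scaling, and PyTorch 2.0 compilation. The ToyLM stand-in model, hyperparameters, and demo text are placeholders so the loop runs; the real project wires in the DeepSeek-style model and the Hugging Face dataset:

```python
# Mixed-precision training loop with gradient scaling + torch.compile (sketch).
import torch
import torch.nn as nn
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")          # GPT-2 tokenizer for compatibility

class ToyLM(nn.Module):
    """Tiny stand-in language model so the loop runs; swap in the real model."""
    def __init__(self, vocab_size, d_model=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, d_model)
        self.head = nn.Linear(d_model, vocab_size)
    def forward(self, ids):
        return self.head(self.emb(ids))

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.compile(ToyLM(tokenizer.vocab_size).to(device))  # PyTorch 2.0 compilation
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda")) # gradient scaling for fp16
loss_fn = nn.CrossEntropyLoss()

text = "Once upon a time there was a brave little robot."
ids = tokenizer(text, return_tensors="pt")["input_ids"].to(device)
inputs, targets = ids[:, :-1], ids[:, 1:]                      # next-token prediction targets

for step in range(3):                                          # a few demo steps
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, enabled=(device == "cuda")):
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    scaler.scale(loss).backward()                              # scale the loss, then backprop
    scaler.step(optimizer)                                     # unscales grads and steps
    scaler.update()                                            # adjusts the scale factor
```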

❓ Why Build From Scratch?
Why go through the extra effort of implementing DeepSeek’s architecture instead of fine-tuning an existing model?
✅ Fully customize the architecture for storytelling
✅ Integrate state-of-the-art components like MLA and MoE
✅ Minimize inference cost and environmental impact
✅ Deeply understand how modern model architectures function

💡 If you’re looking for a single tool to simplify your GenAI workflow—including MCP integration—check out IdeaWeaver, your one-stop CLI for Generative AI.
🔗 Docs: https://ideaweaver-ai-code.github.io/ideaweaver-docs/
🔗 GitHub: https://github.com/ideaweaver-ai-code/ideaweaver
🔗 DeepSeek-Children-Stories repo: github.com/ideaweaver-ai/DeepSeek-Children-Stories-15M-model

⭐ Star it if you believe Advanced Architecture + Tiny Models = Big Possibilities!
🔗 Try the model:
https://huggingface.co/lakhera2023/deepseek-children-stories
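
If the checkpoint is exported in a transformers-compatible format, something like the snippet below should generate a story. That compatibility is an assumption on my part, so check the model card for the exact loading instructions:

```python
# Sketch of loading the published checkpoint and sampling a story (assumes the
# repo works with AutoModelForCausalLM; see the model card for the real usage).
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "lakhera2023/deepseek-children-stories"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Once upon a time, a curious fox"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```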
