Unlocking Efficient AI with Smaller, Smarter Neural Networks
Reduce model size while maintaining accuracy. Our neural network pruning techniques improve AI efficiency, lower computational costs, and speed up inference—making machine learning models more lightweight and effective.
Contact Us
Our Services
Smarter, not just bigger – We can reduce model size by 50% or more while maintaining performance.
LLM Optimization
We specialize in optimizing large language models, significantly reducing their size, often by 50-90%, with minimal perplexity degradation. This makes powerful AI models more efficient, affordable, and easier to deploy across a wide range of platforms, typically cutting inference costs by 50-80%.
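The pipeline we apply to a given model is customer-specific and not described here; purely as a generic illustration of what "pruning" means (not our proprietary method), magnitude pruning zeroes out the weights with the smallest absolute values:

```python
import numpy as np

def magnitude_prune(W, sparsity):
    """Zero the fraction `sparsity` of weights with the smallest magnitude.

    Illustrative only: real pruning pipelines add structure constraints
    and retraining/calibration to keep perplexity degradation minimal.
    """
    k = int(W.size * sparsity)
    if k == 0:
        return W.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    return np.where(np.abs(W) <= threshold, 0.0, W)

# Prune a toy 8x8 weight matrix to 50% sparsity.
W = np.random.default_rng(1).standard_normal((8, 8))
Wp = magnitude_prune(W, 0.5)
```

A pruned matrix needs to store only its surviving entries, which is where the size and cost reductions come from.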
Mobile Deployment
We enable AI models to run efficiently on mobile and edge devices by reducing their memory and storage footprint. Our solutions ensure that even resource-constrained environments can benefit from advanced neural networks.
Other
Not seeing what you need? If your use case doesn’t fit neatly into these categories, reach out anyway. We’re always open to exploring new challenges and tailoring solutions to fit unique AI needs.
News
Our team introduced a new matrix storage format and multiplication algorithm that finally enables practical speedups and memory savings from sparse neural networks. Although sparsity has been known for decades to preserve model performance, it historically offered little benefit on GPUs without custom hardware or significant accuracy loss.
Our approach delivers meaningful reductions in memory use and faster inference even at low sparsity levels, without constraints on sparsity structure or GPU capabilities. It applies directly to all LLMs using standard linear layers.
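Our actual storage format and multiplication kernels are not detailed here; as a rough sketch of why sparse linear layers save memory and compute at all, the classic CSR (compressed sparse row) format stores only the nonzero weights and multiplies in time proportional to their count (helper names below are illustrative):

```python
import numpy as np

def dense_to_csr(W):
    """Convert a dense matrix to CSR arrays: values, column indices, row pointers."""
    values, col_idx, row_ptr = [], [], [0]
    for row in W:
        nz = np.nonzero(row)[0]          # columns holding nonzero weights
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))      # cumulative nonzero count per row
    return np.array(values), np.array(col_idx), np.array(row_ptr)

def csr_matvec(values, col_idx, row_ptr, x):
    """y = W @ x touching only stored nonzeros: O(nnz) instead of O(rows*cols)."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):
        start, end = row_ptr[i], row_ptr[i + 1]
        y[i] = values[start:end] @ x[col_idx[start:end]]
    return y

# A ~50%-sparse toy weight matrix stores roughly half its entries.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 6))
W[rng.random(W.shape) < 0.5] = 0.0
x = rng.standard_normal(6)

vals, cols, ptr = dense_to_csr(W)
assert np.allclose(csr_matvec(vals, cols, ptr, x), W @ x)
```

Naive formats like this are exactly what historically failed to beat dense GPU kernels at low sparsity; the point of our work is a format and algorithm for which the savings materialize in practice.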
Our founder, Vladimir Macko, took part in a special edition of the BETTER_AI_MEETUP in collaboration with prg.ai on 13 November during Dny AI. The event focused on robustness and efficiency in large-scale AI and featured talks from Vladimir and Stanislav Fort. The session was streamed online and supported by the Slovak Diaspora Project within Slovaks.ai and the lorAI Project.
Our team members Vladimir Macko and Vladimir Boza published a new research paper on arXiv presenting an effective method for compressing large language models. Their work showed that a Llama2-7B–level model can be reduced by up to a factor of ten while running up to 2.5× faster, making it practical for embedded devices, mobile hardware, and other low-resource environments. In testing, the compressed model reached 78 tokens per second on a seven-year-old GPU. Code and model releases are planned.
Our founder, Vladimir Macko, delivered the closing talk at this year’s Machine Learning Prague conference, one of Europe’s leading industrial ML events. His presentation, “Fitting LLMs into a Single GPU: Making Neural Networks Smaller,” explored advanced techniques for compressing large language models while maintaining performance.
He also highlighted real examples of how our customers are already reducing model size to simplify deployment, lower hosting costs, and improve speed. The session focused on practical approaches to cutting inference expenses and making state-of-the-art AI more accessible to organizations of all sizes.
On the first day of ICLR 2025 in Singapore, our team members Vladimir Macko and Vladimir Boza presented their work on reducing the size of machine learning models. Their poster attracted steady interest and discussion for several hours from both academic and industry attendees.
ICLR 2025 registered more than 8,000 participants and over 3,600 papers, making it one of the largest ML conferences to date. The event also featured a small but active Slovak presence, with Comenius University as the only Slovak institution represented.
Our Results
LLM Optimization
Pruned Llama2-70B from 140GB to 26GB, with only a small perplexity degradation, from 3.12 to 3.76.
Mobile Deployment
Pruned a 120MB base model to 6.2MB.
Meet the Team
Vladimír Macko
Founder
Vladimír Boža
Chief Scientific Officer
Let’s Talk!
Optimizing neural networks isn’t just a technical improvement—it’s a business imperative.
Contact Us
About Our Company
At GrizzlyTech, we specialize in optimizing machine learning models through advanced neural network pruning techniques. Our mission is to help businesses create faster, more efficient AI solutions by reducing model size without sacrificing accuracy. With our cutting-edge methods, we improve computational efficiency, lower operational costs, and accelerate inference times, making AI models more lightweight and powerful. Led by a team of top machine learning experts, we’re dedicated to pushing the boundaries of AI optimization for a smarter, more sustainable future.