NVIDIA Optimizes Google’s Gemma LLM For AI Accelerators & RTX AI PCs

NVIDIA has been moving quickly to optimize industry-standard LLMs for its AI-capable RTX GPUs, and the firm's latest effort targets Google's cutting-edge Gemma models.

NVIDIA Enhances TensorRT-LLM & Multiple Software Resources To Provide Cost-Efficient Performance With Its AI & RTX GPUs

[Press Release]: NVIDIA, in collaboration with Google, today launched optimizations across all NVIDIA AI platforms for Gemma — Google’s state-of-the-art new lightweight 2 billion– and 7 billion-parameter open language models that can be run anywhere, reducing costs and speeding innovative work for domain-specific use cases.

Teams from the companies worked closely together to accelerate the performance of Gemma — built from the same research and technology used to create the Gemini models — with NVIDIA TensorRT-LLM, an open-source library for optimizing large language model inference, when running on NVIDIA GPUs in the data center, in the cloud and on PCs with NVIDIA RTX GPUs.

This allows developers to target the installed base of over 100 million NVIDIA RTX GPUs available in high-performance AI PCs globally.
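For developers who want to try Gemma on a local GPU right away, the sketch below shows a plain baseline using the Hugging Face transformers library rather than TensorRT-LLM itself. The checkpoint ID (google/gemma-2b-it), precision, and prompt are illustrative assumptions, and the gated model requires accepting Google's license on Hugging Face first.

```python
# Minimal sketch: run a Gemma checkpoint on a local NVIDIA GPU with the
# Hugging Face transformers library (a plain baseline, not the TensorRT-LLM
# path described in the press release). The checkpoint ID and prompt are
# assumptions for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2b-it"  # assumed instruct-tuned 2B checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision fits comfortably in RTX VRAM
).to("cuda")

prompt = "Explain retrieval-augmented generation in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```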

Developers can also run Gemma on NVIDIA GPUs in the cloud, including on Google Cloud’s A3 instances based on the H100 Tensor Core GPU and soon, NVIDIA’s H200 Tensor Core GPUs — featuring 141GB of HBM3e memory at 4.8 terabytes per second — which Google will deploy this year.

Enterprise developers can additionally take advantage of NVIDIA’s rich ecosystem of tools — including NVIDIA AI Enterprise with the NeMo framework and TensorRT-LLM — to fine-tune Gemma and deploy the optimized model in their production application.
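The press release names NeMo and TensorRT-LLM as the fine-tuning path; purely as a rough sketch of what lightweight, domain-specific fine-tuning of Gemma can look like, here is an alternative toolchain using Hugging Face PEFT with LoRA. The adapter rank, target module names, and other settings are assumptions for illustration, not NVIDIA's recommended workflow.

```python
# Rough sketch of parameter-efficient fine-tuning (LoRA) on a Gemma checkpoint
# using Hugging Face PEFT -- a different toolchain from NVIDIA NeMo, shown only
# to illustrate the general workflow. Ranks and module names are assumptions.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", torch_dtype=torch.float16
)

lora_cfg = LoraConfig(
    r=8,                                   # low-rank adapter dimension (assumed)
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections (assumed names)
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter weights are trained
# ...train on domain data, then merge or export the adapters for deployment.
```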

Gemma Coming to Chat With RTX

Adding support for Gemma soon is Chat with RTX, an NVIDIA tech demo that uses retrieval-augmented generation and TensorRT-LLM software to give users generative AI capabilities on their local, RTX-powered Windows PCs. Chat with RTX lets users personalize a chatbot with their data by easily connecting local files on a PC to a large language model.

Since the model runs locally, it provides results fast, and user data stays on the device. Rather than relying on cloud-based LLM services, Chat with RTX lets users process sensitive data on a local PC without the need to share it with a third party or have an internet connection.
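Chat with RTX ships as a packaged demo, but the retrieval-augmented generation pattern it relies on is straightforward: index the user's local files, pull back the passages most relevant to a question, and fold them into the prompt handed to the local model. The toy sketch below illustrates that flow with simple TF-IDF retrieval; it is not the Chat with RTX implementation, and the sample documents are made up.

```python
# Toy illustration of the retrieval-augmented generation flow: index local
# text, retrieve the most relevant passage, and build a prompt for a locally
# running model. Not the actual Chat with RTX code; documents are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [                      # stand-ins for a user's local files
    "Q3 report: revenue grew 12% driven by the gaming segment.",
    "Meeting notes: the design review for the new enclosure is on Friday.",
]
question = "When is the design review?"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
query_vector = vectorizer.transform([question])
best = cosine_similarity(query_vector, doc_vectors).argmax()

prompt = f"Context: {documents[best]}\n\nQuestion: {question}\nAnswer:"
# The prompt would then be passed to the local Gemma model, e.g. via the
# generate() call sketched earlier; everything stays on the user's PC.
print(prompt)
```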

[Journalist Note]: The most exciting part of this announcement is that NVIDIA has optimized Gemma for its consumer RTX GPUs as well, an excellent step toward letting developers experiment without high-end equipment such as dedicated AI accelerators.

This has been a point of debate recently, with the perception that manufacturers were steering development toward specific data-center GPUs because consumer hardware lacked adequate libraries and resources, but NVIDIA appears to be trying to bring everyone on board here, which is a much-appreciated step.

News Source: NVIDIA Blog
