Build LLMs for low-resource languages with SageMaker HyperPod clusters (DEV340)
Summary of the Video Transcription
Introduction
The presenter is Jackie Chen from Hong Kong, an AWS Community Builder for machine learning and generative AI and the CEO of V AI.
He has been giving talks at AWS events in Hong Kong and China for over a year, and this is the first time he's speaking at re:Invent.
Challenges of Training Large Language Models for Low-Resource Languages
There are over 6,000 (some say over 7,000) human languages in the world, and a study shows that over 1,500 languages could be lost in the next 100 years.
Safeguarding human languages as the carrier of intangible cultural heritage is important, and generative AI technologies like large language models or large multimodal models can help preserve cultures, knowledge, and mindsets.
According to multilingual statistics from Wikipedia, English and Chinese are considered high-resource languages, while Cantonese is a low-resource language, ranked 83rd, with limited linguistic resources available for computational processing and analysis.
Despite being a low-resource language, Cantonese is one of the most spoken languages in the world, with over 86 million speakers, ranking in the top 20, which presents an opportunity to build large language models for low-resource languages.
Data Preparation for Low-Resource Languages
When preparing training data for a low-resource language, the team faced several challenges; even official data sources can have quality issues.
They gathered various types of data (vocabulary from dictionaries, text from surveys and reviews, news, textbooks, and conversations) from sources such as social listening, web scraping, and public datasets.
The key part of handling a low-resource language is data processing, which includes data cleaning, deduplication, and quality control to ensure the dataset is accurate and representative.
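A minimal sketch of that kind of pipeline is shown below; the normalization rules and the exact-hash deduplication are illustrative assumptions, not the team's actual code.

```python
import hashlib
import re
import unicodedata

def normalize(text: str) -> str:
    """Basic cleaning: Unicode normalization, HTML-tag removal, whitespace collapse."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"<[^>]+>", " ", text)        # strip leftover HTML tags
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

def deduplicate(records):
    """Exact deduplication by content hash; near-duplicate detection (e.g. MinHash) could follow."""
    seen = set()
    for record in records:
        cleaned = normalize(record["text"])
        if not cleaned:                          # drop empty documents
            continue
        digest = hashlib.sha256(cleaned.encode("utf-8")).hexdigest()
        if digest in seen:                       # skip exact duplicates
            continue
        seen.add(digest)
        yield {**record, "text": cleaned}

corpus = [
    {"source": "news", "text": "今日天氣好好。<br>"},
    {"source": "forum", "text": "今日天氣好好。"},
]
print(list(deduplicate(corpus)))                 # second entry is dropped as a duplicate
```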
They also encountered issues with the official datasets, such as content being lost because richly structured metadata was not fully parsed, and addressed them by properly extracting all the necessary information, such as pronunciations, geolocations, and background information.
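As a hypothetical illustration of what "properly extracting" can mean for a dictionary-style source, the sketch below walks the whole record instead of keeping only the headword and first definition; every field name here is an assumption, not the actual dataset schema.

```python
# Hypothetical dictionary entry with rich metadata; field names are illustrative.
entry = {
    "headword": "飲茶",
    "senses": [{"definition": "to have dim sum and tea", "register": "colloquial"}],
    "pronunciations": [{"system": "jyutping", "value": "jam2 caa4"}],
    "geolocation": {"region": "Hong Kong"},
    "background": "A social custom centred on Cantonese teahouses.",
}

def flatten_entry(entry: dict) -> str:
    """Pull pronunciations, geolocation, and background notes into the training text
    instead of keeping only the headword and first definition."""
    parts = [entry["headword"]]
    parts += [sense["definition"] for sense in entry.get("senses", [])]
    parts += [p["value"] for p in entry.get("pronunciations", [])]
    if region := entry.get("geolocation", {}).get("region"):
        parts.append(region)
    if background := entry.get("background"):
        parts.append(background)
    return " | ".join(parts)

print(flatten_entry(entry))
```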
Building the Training Environment
They set up the training environment on AWS using CloudFormation, which includes creating the cluster admin role, the lifecycle scripts, an S3 bucket, and the VPC and subnets.
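Once those resources exist, the HyperPod cluster itself can be created through the SageMaker API; the rough boto3 sketch below assumes the CloudFormation stack has already been deployed, and the cluster name, instance count, S3 path, role ARN, and network IDs are all placeholders.

```python
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-west-2")

# Placeholder values: the bucket, role ARN, subnet, and security group must already exist.
response = sagemaker.create_cluster(
    ClusterName="cantonese-llm-hyperpod",
    InstanceGroups=[
        {
            "InstanceGroupName": "worker-group",
            "InstanceType": "ml.trn1.32xlarge",       # 16 Trainium accelerators per instance
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-hyperpod-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh",           # lifecycle script run at provisioning time
            },
            "ExecutionRole": "arn:aws:iam::111122223333:role/HyperPodExecutionRole",
            "ThreadsPerCore": 1,
        }
    ],
    VpcConfig={
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],
    },
)
print(response["ClusterArn"])
```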
They used the Amazon EFS file system, a high-performance shared file system, for handling tasks such as training large language models.
The training ran on ml.trn1.32xlarge instances, each with 16 Trainium accelerators providing 32 NeuronCores; the upcoming Trainium2, announced during re:Invent, provides an even more powerful architecture.
The training process was automated with a launcher script that wraps the Llama training script, which uses tensor parallelism and the Zero Redundancy Optimizer (ZeRO).
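The talk summary does not reproduce the actual Neuron launcher, so the snippet below is only an analogue in plain PyTorch: it shows the ZeRO-1 idea of sharding optimizer state across data-parallel ranks via ZeroRedundancyOptimizer, while tensor parallelism (splitting individual weight matrices across accelerators) would be handled by the Neuron distributed-training libraries on Trainium.

```python
# Illustrative analogue, not the actual Neuron/HyperPod training script:
# ZeRO-1 shards optimizer state across data-parallel workers so each rank
# only stores and updates 1/world_size of the AdamW moments.
import torch
import torch.distributed as dist
from torch.distributed.optim import ZeroRedundancyOptimizer

def build_optimizer(model: torch.nn.Module) -> ZeroRedundancyOptimizer:
    return ZeroRedundancyOptimizer(
        model.parameters(),
        optimizer_class=torch.optim.AdamW,   # each rank holds only its shard of AdamW state
        lr=1e-4,
        weight_decay=0.01,
    )

if __name__ == "__main__":
    # Launched with torchrun so that rank and world size are set in the environment.
    dist.init_process_group(backend="gloo")
    model = torch.nn.Linear(1024, 1024)
    optimizer = build_optimizer(model)
    loss = model(torch.randn(8, 1024)).sum()
    loss.backward()
    optimizer.step()                          # optimizer state stays sharded across ranks
    dist.destroy_process_group()
```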
Model Evaluation and Real-World Applications
The evaluation shows promising results, with the Cantonese large language model achieving the lowest (i.e., best) score on the capacity metric, and the team is working on a more sophisticated benchmarking paper to demonstrate state-of-the-art performance.
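The summary does not define the capacity metric; if it behaves like perplexity, where lower scores are better, the computation is essentially the exponentiated mean token loss, sketched below with a hypothetical Hugging Face model id.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "example-org/cantonese-llm"   # hypothetical model id, not the released checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def perplexity(text: str) -> float:
    """Lower is better: exp of the mean negative log-likelihood per token."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())

print(perplexity("香港人平時會去邊度飲茶？"))
```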
They showcased examples of the Cantonese chatbot, which can understand and respond to a mix of Cantonese and English, and the use of the model for sentiment analysis on social media posts and comments.
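A minimal sketch of how a chat model like this could be prompted for sentiment analysis is shown below, assuming a Hugging Face-style checkpoint whose tokenizer ships a chat template; the model id and prompt are placeholders rather than the team's released artifacts.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "example-org/cantonese-llm-chat"   # placeholder, not the actual released model id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

# Ask the chat model to label the sentiment of a Cantonese social-media comment.
messages = [
    {"role": "user",
     "content": "呢間餐廳啲嘢食好好味，不過等位等咗成個鐘。請判斷呢段留言嘅情感（正面／負面／中性）。"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```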
The team has open-sourced the data, models, and weights, and they are willing to help others build language models for their local languages.