    MLCommons and Hugging Face Unveil Groundbreaking Speech Dataset

    By EchoCraft AI, February 1, 2025

    MLCommons, a nonprofit focused on AI safety, and Hugging Face, a leading AI development platform, have joined forces to release one of the world’s largest collections of public domain voice recordings.

    Named Unsupervised People’s Speech, the dataset spans over a million hours of audio in at least 89 languages. This ambitious initiative promises to advance AI research, particularly in the areas of speech recognition, synthesis, and natural language processing.

    Aims and Impact on Speech Technology

    The release of Unsupervised People’s Speech is aimed at driving progress in speech technology, with a particular focus on languages beyond English.

    By offering a vast and diverse collection of speech data, the initiative hopes to enhance AI models, especially for low-resource languages, improving speech recognition across various accents and dialects.

    The dataset is expected to fuel innovations in speech synthesis, making AI technology more accessible worldwide.

    Unforeseen Ethical Concerns

    While the dataset’s potential for AI research is immense, it also raises several ethical issues. One significant concern is the inherent bias within the data.

    A large portion of the recordings comes from Archive.org, a platform predominantly used by English-speaking, American contributors.

    As a result, the dataset is heavily skewed toward American-accented English, which may hinder the performance of AI systems when dealing with non-native English speakers or underrepresented languages.

    Data Ownership and Consent Challenges

    Another pressing issue involves the consent of the individuals whose voices are included in the dataset.

    While MLCommons asserts that all recordings are in the public domain or covered under Creative Commons licenses, questions remain about whether contributors were fully aware their voices would be used for AI research.

    This brings up significant concerns surrounding data privacy and the ethics of using publicly available content without explicit consent, especially when it comes to commercial applications.

    The Difficulty of Opting Out

    The challenge of opting out of AI datasets is another area of concern. Many AI ethics advocates, including Ed Newton-Rex, CEO of Fairly Trained, have criticized current opt-out methods for being unclear and ineffective.

    As generative AI increasingly relies on publicly available data for model creation, the responsibility often falls on creators to remove their work from these datasets, even if they were unaware of its inclusion.

    The Scale of the Dataset

    Despite the ethical challenges, the Unsupervised People’s Speech dataset is monumental in scale. With over a million hours of audio in at least 89 languages, it represents one of the largest public domain voice collections ever assembled for AI research.

    This expansive dataset provides a unique opportunity for advancements in natural language processing and speech synthesis, areas where the availability of high-quality, diverse data is critical.

    Bridging Language Gaps

    The creation of Unsupervised People’s Speech is driven by a desire to democratize AI advancements and address language barriers.

    While English-language speech models have made significant strides, many other languages still lack adequate representation in AI training datasets.

    By focusing on languages with fewer resources, the dataset aims to improve speech recognition and synthesis, especially for global dialects and underserved languages, contributing to more inclusive communication technologies.

    Addressing Bias and Lack of Linguistic Diversity

    Despite its potential, the dataset’s composition raises concerns about AI bias. Because most of the recordings originate from Archive.org, which is used primarily by English-speaking Americans, the dataset is predominantly in American-accented English.

    This lack of linguistic and regional diversity could result in AI systems struggling with non-native English speakers or languages that are not well-represented in the dataset.

    Ownership, Consent, and Transparency in AI Data

    Another significant issue revolves around ownership and consent in AI datasets. While MLCommons states that the recordings are either in the public domain or under Creative Commons licenses, doubts persist about whether contributors fully understood the extent of the dataset’s usage.

    Furthermore, recent analyses, such as a report from MIT, highlight a lack of transparency in AI training data, with many publicly available datasets failing to provide clear licensing information.

    The Need for Clear Opt-Out Mechanisms

    The ongoing debate about data ownership is exacerbated by the current challenges in opting out of AI datasets.

    AI ethics advocates argue that the burden of opting out should not fall solely on content creators, as existing methods for removal are often convoluted and difficult to navigate.

    As a result, content can end up in AI training data even though its creators never intended that use.

    Moving Forward

    As MLCommons continues to update and maintain Unsupervised People’s Speech, it remains committed to improving the dataset while addressing its inherent biases and ethical concerns.

    Developers are urged to proceed with caution and carefully consider the potential ethical issues when using large-scale, publicly sourced datasets for AI training.
