A recent paper from LG AI Research finds that 80% of open datasets used for training AI models may pose legal risks due to undisclosed copyrighted material and restrictive licensing terms.
The paper proposes AI-based compliance agents that scan a dataset's licensing history for legal issues faster and more accurately than human lawyers.
Only 21% of datasets labeled as commercially usable were deemed legally safe to commercialize after in-depth analysis.
Companies developing AI models are struggling to navigate an uncertain legal landscape around dataset copyright and licensing.
Transparency around dataset sources is becoming a critical issue, as concerns grow about hidden copyrighted data inside training datasets.
Initiatives to ensure license compliance in datasets are emerging, but the new research shows that dataset licenses themselves often contain errors and ambiguities.
The Nexus Data Compliance framework proposed in the paper relies on an AI-driven tool, AutoCompliance, to assess legal risk and license compliance across a dataset's entire chain of dependencies.
In the paper's evaluation, AutoCompliance identified dataset dependencies and license terms more accurately and efficiently than human experts, underscoring its potential for automated compliance checking.
The investigation uncovered numerous cases of non-compliant dataset redistribution, including redistribution that licenses explicitly prohibit and redistribution under mutually conflicting license conditions.
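The kind of conflict described above can be illustrated with a minimal sketch: restrictions are propagated up a dataset's dependency chain, and any dataset whose declared terms exceed what its sources allow is flagged. The dataset names, license terms, and rule set here are hypothetical illustrations, not AutoCompliance's actual logic.

```python
"""Sketch of license-conflict detection over a dataset dependency graph.
All names and terms below are invented for illustration."""

# Each dataset maps to (declared license terms, list of source datasets).
DATASETS = {
    "web-corpus": ({"commercial_use": False, "redistribution": True}, []),
    "qa-pairs":   ({"commercial_use": True,  "redistribution": True}, ["web-corpus"]),
    "final-mix":  ({"commercial_use": True,  "redistribution": True}, ["qa-pairs"]),
}

def effective_terms(name, catalog):
    """A dataset is only as permissive as the most restrictive
    dataset anywhere in its dependency chain."""
    declared, sources = catalog[name]
    merged = dict(declared)
    for src in sources:
        upstream = effective_terms(src, catalog)
        for key in merged:
            merged[key] = merged[key] and upstream[key]
    return merged

def find_conflicts(catalog):
    """Flag datasets whose declared terms exceed what their sources allow."""
    conflicts = []
    for name, (declared, _) in catalog.items():
        effective = effective_terms(name, catalog)
        for key, allowed in effective.items():
            if declared[key] and not allowed:
                conflicts.append((name, key))
    return conflicts

print(find_conflicts(DATASETS))
# → [('qa-pairs', 'commercial_use'), ('final-mix', 'commercial_use')]
```

Both downstream datasets claim commercial usability that their non-commercial source forbids, which is exactly the "commercially usable" label the paper found unreliable in practice.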
The study stresses the need to clearly identify non-compliance in datasets to avoid legal consequences, and calls for continued improvement of AI-driven legal review processes.