AI's Unpaid Debt: How LLM Scrapers Destroy the Social Contract of Open Source

Posted by birdculture 16 hours ago

Comments

Comment by p0w3n3d 12 hours ago

Normally people get punished for downloading pirated books. Allegedly someone at Meta downloaded a hella ton of pirated books and trained the LLM on them, and they said "oh, it was for his/her private usage". You won't get justice here.

Comment by muldvarp 11 hours ago

This, to me, is the most ridiculous thing about the whole AI situation. Piracy is now apparently just fine as long as you do it on an industrial scale and with the express intention of hurting the economic prospects of the authors of the pirated work.

Seems completely ridiculous when compared to the trouble I was in that one time I pirated a single book that I was unable to purchase.

Comment by p0w3n3d 3 hours ago

Recently archive.org got into trouble for lending out one copy of a book (or a fixed number of copies) to the whole world, like a library does. Grim men from a law office came and made an example of them, yet it seems that if they had used those books to train an AI and served the content back in "remembered" form, they would have gotten away with it.

Comment by Llamamoe 11 hours ago

In recent years we've essentially given up on pretending that corporations are also held accountable for their crimes, and I think that's more worrying than anything.

Comment by Mathnerd314 8 hours ago

Well, the actual ruling was that using the books was okay, but only if they were legally obtained, so the authors could still proceed with a lawsuit over illegally downloading them. But then presumably compensation for torrenting the books was folded into the out-of-court settlement. So the lesson is something like: AI is fine, but torrenting books is still not acceptable, m'kay, wink wink.

Comment by lifestyleguru 11 hours ago

Hollywood and media publishers run entire franchises of legal bullies across the developed world to harass individuals, and lobby for laws allowing easy prosecution of whoever holds the ISP contract. Even Google Books was castrated because of IP rights. Now I have a hard time imagining how this IP+AI cartel operates. Nowadays everyone and their cat is throwing millions at AI, so I imagine the IP owners get their share.

Comment by 1gn15 6 hours ago

This article commits several common and disappointing fallacies:

1. Open weight models exist, guys.

2. It assumes that copyright is stripped when doing essentially Img2Img on code. That's not true. (Also, copyright != attribution.)

3. It assumes that AI is "just rearranging code". That's not true. Speaking about provenance in learning is as nonsensical as asking one to credit the creators of the English alphabet. There's a reason why literally every single copyright-based lawsuit against machine learning has failed so far, around the world.

4. It assumes that the reduction in posts on StackOverflow is due to people no longer wanting to contribute. That's likely not true. It's just that most questions were "homework questions" that didn't really warrant a volunteer's time.

Comment by p0w3n3d 3 hours ago

Regarding 3: AI is indeed a lossy compression of text. I recommend searching YouTube for "karpathy deep dive LLM" (/7xTGNNLPyMI) - he shows that open texts used in the training data are regurgitated nearly unchanged when you talk to the raw base model. If you say to the model "oh say can you", it will answer "see by the dawn's early light", or something close like "by the morning's sun" or whatever. So it's very lossy, but it is compression, and it would produce something else entirely without the given text that was used in the training.
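
A minimal sketch of that kind of regurgitation probe, assuming the Hugging Face transformers library and the small open gpt2 checkpoint purely as stand-ins (neither is what Karpathy uses in the video):

    # Hypothetical memorization probe: greedy-decode a continuation from a raw
    # base model and check whether it reproduces well-known training text.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in base model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("O say can you see, by the", return_tensors="pt")

    # Greedy decoding: always take the single most likely next token.
    output = model.generate(**inputs, max_new_tokens=12, do_sample=False)
    print(tokenizer.decode(output[0], skip_special_tokens=True))
    # A base model that saw the anthem in its training data will often continue
    # with "dawn's early light" or a near-paraphrase: lossy, but traceably the source.

Whether gpt2 in particular completes this exact line isn't guaranteed; the point is only that the test is a few lines of code against any open base model.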

Comment by citizenpaul 15 hours ago

I'm not sure how this is much different than Amazon, which has basically monetized the entire Apache Software Foundation and donates a pittance back to them in the single-digit millions while profiting in the trillions.

Comment by y0eswddl 14 hours ago

It's not different.

There's also a huge problem with for-profit companies building on the work of FOSS without contributing resources or knowledge back.

Comment by p0w3n3d 12 hours ago

Nor sources

Comment by AndrewKemendo 8 hours ago

This article could just have been a link to the tragedy of the commons Wikipedia page

Humans destroying common resources until they're depleted is a feature, not a bug.

Comment by fithisux 13 hours ago

Personally, I view the use of AI as fencing.

Comment by stuaxo 13 hours ago

Thank you for this wonderfully succinct description; I shall steal it.

Comment by djmips 5 hours ago

without attribution?