I’m glad to see a new platform that isn’t completely locked down, allowing analysis like this.
The trend toward everything being a walled garden is unfortunate.
I’m conflicted. I agree with everything you say, but I’m concerned about Bluesky eventually being flooded with AI posts trained on its public dataset. Being open could very easily lead to its downfall.
I feel like LLM models have had the opportunity to be trained on a sufficient amount of social media posts at this point that it's unlikely to matter.
I’m curious to see how things look, say, ten years from now. The way people use social networks, the language they use, even the memes they trade in all change over time. I can absolutely imagine an out-of-date AI giving itself away by repeating today’s equivalent of “rawr xD” to a future audience.
[flagged]
This is wonderful! Openness means open to all.
And open data decreases barriers to entry.
If all data were paywalled, only large companies could use it. With open data, it's very foreseeable that in 10 years' time a hobbyist will be able to train a performant LLM at home from scratch.
Actually, since this isn’t locked up by the big copyright holders, we can all use it and profit.
How will you use it to profit? You don’t have sweetheart cloud deals on ML training clusters. This benefits big players, not us.
"us" is relative.
There are plenty of people on HN who have their own ML training clusters and aren't really big tech. For example natfriedman has https://andromeda.ai/
And right now, today, I can fine-tune LLMs on this scale of data at home. In 5 or 10 years, people will be able to train from scratch at home.
Computational resource barriers are temporary. Licensing is forever.
Most impactful ML can be created on Colab. Not ChatGPT, but most of the stuff that isn't on the long tail.
Not all the profit, really. "All" would imply there was no value to begin with. I get the dislike, but I still comment on the open web because it has value to me. I'm still willing to answer questions on SO/Reddit/etc. because it has value to at least one person (and hopefully more). That hasn't changed.
Not sure what to say about companies making money off of my data, but the posting itself doesn't seem to be that much of a negative.
Thoughts? I see this sentiment a lot, and it almost feels like "open" is bad these days. If anything, I feel it's more important than ever, as we're on the cusp of never needing to go to forums/interact/etc.
Meanwhile, over at Bluesky a few days ago, the users were incensed about a 1M-post dataset and hounded the creator into withdrawing it.
https://bsky.app/profile/danielvanstrien.bsky.social/post/3l...
Because unlike the authors of this set, who stripped usernames and permalinks out of the posts to anonymize them, the set you mention just grabbed data out of the API as-is (at least based on the Hugging Face description that's left over).
That's the difference.
Just a reminder that anonymization is much harder than merely removing metadata:
Every time I hear "anonymous data," I think of the time AOL published anonymized search logs (for academic research). The anonymization was negligent, and an NYT reporter de-anonymized and tracked down one of the users from the local and personal info present in the search queries.
https://en.wikipedia.org/wiki/AOL_search_log_release
https://web.archive.org/web/20130404175032/http://www.nytime...
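To make that concrete: "removing metadata" is often just a salted-hash swap like the sketch below (the field names are hypothetical, not from the dataset), and it does nothing about identifying details inside the text itself:

```python
import hashlib

def pseudonymize(record: dict, salt: bytes) -> dict:
    """Replace the author handle with a salted hash. This strips the
    metadata but leaves any identifying details in the text untouched."""
    out = dict(record)
    out["user"] = hashlib.sha256(salt + record["user"].encode()).hexdigest()[:16]
    return out

post = {
    "user": "alice.example.social",
    "text": "anyone else flying out of Lexington, KY this weekend?",
}
anon = pseudonymize(post, salt=b"release-v1")
# anon["user"] is now an opaque token, but the city in anon["text"] still
# narrows the author down -- the same failure mode as the AOL logs.
```

The handle is gone, yet the free text carries enough quasi-identifiers that cross-referencing a few posts can re-identify the author.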
associated paper: https://arxiv.org/abs/2404.18984
hmm, could you find the GitHub? I couldn't find it in the paper's Code Availability section
Is it "scripts.tar.gz. A collection of Python scripts, including the ones originally used to crawl the data, and to perform experiments. These scripts are detailed in a document released within the folder" in the OP?
The "code availability" says it's released "alongside [the dataset]", which appears to be the OP.
oh good eye, I didn't catch that
The paper is from the end of April and they say the data was collected in February, March and April. I guess we can talk about it now, though.
Due to high growth since then, this is from before most current users joined.
If you just want to play around with the data, check out the bsky dataset on Axiom https://play.axiom.co/axiom-play-qf1k/stream/bsky (700M+ events and counting)
Please upload it on the Hugging Face Hub!
Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.
The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.
Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.
This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.
[deleted]
"Personal data" that was voluntarily published on a public microblogging platform with the explicit intention to share it with the world?
Would it make a difference if we were talking about articles on a news website? I'm kind of on the fence on this one but I can see the point of view that just posting something online doesn't necessarily grant the end user an unlimited license to use the data. Source code is another example; open-sourcing a project doesn't automatically give someone else the right to use that code in their own projects.
Does Bluesky explicitly state the license the user will be publishing under (Creative Commons or whatever), or allow them to choose one?
> Would it make a difference if we were talking about articles on a news website?
News articles are pretty explicitly copyrighted and published for a commercial purpose. The websites make their terms clear when you visit. I don't think anyone can argue that it is legal to copy and distribute these articles, same as a book or movie or song.
Data posted on Bluesky on the other hand is meant to be broadly shared using the AT protocol. It is quite literally a feature. If you create your own Bluesky client, for example, you aren't committing copyright violation by downloading someone else's posts on there. Similarly, you aren't going against any terms of service by consuming a firehose of data from an AT relay.
Right, that's why I asked about Bluesky's content license; just because it's not in your face when you visit, doesn't mean you don't have to abide by it.
You understand that categories of usage are important, right? No-one is breaking the GPL by reading source code, but incorporating into your own codebase can be problematic if not done correctly. Similarly, human beings reading the data posted by a Bluesky user is not the same as aggregating and analysing the data of thousands of users. As I said I'm on the fence with this, but I do understand why someone might have a problem with it.
[deleted]
The data is public by default. You know this when you sign up and use the service. This should inform your expectations of how the data will be used.
Don't make it public then.
[deleted]
Is it more entitled to observe public data than to willingly put data in the public and then expect to control the actions of others?
Is you reading my comment on HN also entitlement? I certainly didn't give you permission to do it. It may have some personal details that I don't want you to see. Why do you think that is okay?
I wonder how much time it takes to run this / what the script is / how resource-intensive it is? Bsky is public, right? So do you get rate limited? Do you scrape or use an official API? So many questions.
Also, I feel like only recently has there been an influx of people who actually have interesting things to say, so I'd love to see next year's dataset.
Not sure about bulk export but you can set up a full stream of all activity without even registering an account.
Blows my mind that they can send that much for free.
I was checking out the Python API today (the "firehose" via the "atproto" package) and got 5000 posts in 7.5 seconds.
I believe they are enabling (or have enabled?) filters so you can control how much and what you actually get from the firehose.
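For anyone wondering what consuming the firehose looks like, here's a minimal sketch using the third-party `atproto` package. The client and parser names follow the SDK's documented firehose example, but treat the exact API as an assumption and check the current docs; `stream_commits` and `posts_per_second` are hypothetical helpers for illustration.

```python
import time

def posts_per_second(n_events: int, elapsed_s: float) -> float:
    """Throughput helper: 5000 events in 7.5 s works out to ~667/s."""
    return n_events / elapsed_s

def stream_commits(limit: int = 5000) -> float:
    """Count `limit` repo commits from the public firehose; return events/sec.

    No account or API key is needed; the relay stream is open. The import is
    lazy so the file can be read (and the helper above used) without the
    `atproto` SDK installed.
    """
    from atproto import FirehoseSubscribeReposClient, models, parse_subscribe_repos_message

    client = FirehoseSubscribeReposClient()
    seen = 0
    start = time.monotonic()

    def on_message(message) -> None:
        nonlocal seen
        commit = parse_subscribe_repos_message(message)
        # Only commit frames carry new records (posts, likes, follows, ...).
        if not isinstance(commit, models.ComAtprotoSyncSubscribeRepos.Commit):
            return
        seen += 1
        if seen >= limit:
            client.stop()

    client.start(on_message)  # blocks until stop() is called
    return posts_per_second(seen, time.monotonic() - start)
```

(5000 posts in 7.5 s, as quoted above, is roughly 667 events/s, so the read stream clearly isn't throttled to a trickle.)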
Sounds like this could be used to train an open-source LLM.