7d7n's comments

haha that wasn't me ;)


Oops, sorry! Wasn’t trying to make you look bad. Just a fan of your writing.


Not at all! I appreciate the kind words. Thank you!


Thank you for the feedback! I'm sorry you found it jargony/less accessible than you'd like.

The intended audience was my team and fellow practitioners; assuming some understanding of the jargon allowed me to skip the basics and write more concisely.


Pollution of online social spaces caused by rampaging d/misinformation is a growing societal concern. However, recent decisions to reduce access to social media APIs are causing a shortage of publicly available, recent, social media data, thus hindering the advancement of computational social science as a whole. To address this pressing issue, we present a large, high-coverage dataset of social interactions and user-generated content from Bluesky Social.

The dataset contains the complete post history of over 4M users (81% of all registered accounts), totaling 235M posts. We also make available social data covering follow, comment, repost, and quote interactions.

Since Bluesky allows users to create and bookmark feed generators (i.e., content recommendation algorithms), we also release the full output of several popular algorithms available on the platform, along with their timestamped “like” interactions and time of bookmarking.

This dataset allows unprecedented analysis of online behavior and human-machine engagement patterns. Notably, it provides ground-truth data for studying the effects of content exposure and self-selection, and performing content virality and diffusion analysis.


[deleted]


"Personal data" that was voluntarily published on a public microblogging platform with the explicit intention to share it with the world?


Would it make a difference if we were talking about articles on a news website? I'm kind of on the fence on this one but I can see the point of view that just posting something online doesn't necessarily grant the end user an unlimited license to use the data. Source code is another example; open-sourcing a project doesn't automatically give someone else the right to use that code in their own projects.

Does Bluesky explicitly state the license the user will be publishing under (Creative Commons or whatever), or allow them to choose one?


> Would it make a difference if we were talking about articles on a news website?

News articles are pretty explicitly copyrighted and published for a commercial purpose. The websites make their terms clear when you visit. I don't think anyone can argue that it is legal to copy and distribute these articles, same as a book or movie or song.

Data posted on Bluesky on the other hand is meant to be broadly shared using the AT protocol. It is quite literally a feature. If you create your own Bluesky client, for example, you aren't committing copyright violation by downloading someone else's posts on there. Similarly, you aren't going against any terms of service by consuming a firehose of data from an AT relay.


Right, that's why I asked about Bluesky's content license; just because it's not in your face when you visit, doesn't mean you don't have to abide by it.

You understand that categories of usage are important, right? No one is breaking the GPL by reading source code, but incorporating it into your own codebase can be problematic if not done correctly. Similarly, human beings reading the data posted by a Bluesky user is not the same as aggregating and analysing the data of thousands of users. As I said, I'm on the fence with this, but I do understand why someone might have a problem with it.


[deleted]


The data is public by default. You know this when you sign up and use the service. This should inform your expectations of how the data will be used.


Don't make it public then.


[deleted]


Is it more entitled to observe public data than to willingly put data in the public and then expect to control the actions of others?


Is you reading my comment on HN also entitlement? I certainly didn't give you permission to do it. It may have some personal details that I don't want you to see. Why do you think that is okay?


I wonder how much time it takes to run this / what the script is / how resource intensive it is? Bsky is public right, so do you get rate limited? Do you scrape or use an official API? So many questions

Also, I feel like only recently there's been an influx of people who actually have interesting things to say, so I'd love to see next year's dataset


Not sure about bulk export but you can set up a full stream of all activity without even registering an account.


Blows my mind that they can send that much for free.


I was checking out the Python API today (the "firehose" via the "atproto" package) and got 5000 posts in 7.5 seconds.
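
For scale, here is a quick back-of-the-envelope sketch of what that works out to. The function name `posts_per_second` is made up for illustration, and the daily figure assumes the rate seen in that 7.5-second window holds for a full day, which it probably doesn't.

```python
# Rough throughput arithmetic for the numbers quoted above.
# Assumption: the 7.5-second sample is representative, which is unverified.

SECONDS_PER_DAY = 24 * 60 * 60


def posts_per_second(posts: int, seconds: float) -> float:
    """Average number of posts received per second over a sample window."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return posts / seconds


rate = posts_per_second(5000, 7.5)
per_day = rate * SECONDS_PER_DAY  # naive extrapolation, not a measurement

print(f"about {rate:.0f} posts per second")
print(f"about {per_day:,.0f} posts per day if that rate were sustained")
```

This only measures what one client happened to receive in a short window, so treat it as an order-of-magnitude estimate rather than a property of the network.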


I believe they are enabling(ed?) filters so you can control how much and what you actually get from the firehose


haha I'm glad you noticed!

it's originally "Ready to ~~delve~~ dive in?" but something got lost in translation


100% agree with Ted's take. One of the authors wrote about splitting up prompts here too: https://eugeneyan.com/writing/prompting/#split-catch-all-pro...


The goal is to solve complex problems with as simple a solution as possible.


The goal is to provide user value. Complexity is falsely valued.


Who says that's the goal? Providing user value is falsely valued.


Why do people need goals to do things? Having a goal is falsely valued.


Why do people do things? Doing things is falsely valued.

(Am partially sentient collection of coral, YMMV)


Why are things? Reality is falsely valued.


For some use cases, legal constraints such as proprietary/private data, copyright, or terms of service prevent the use of a 3rd-party API.

On the other hand, directly using an off-the-shelf model, even the best ones, may not meet your performance requirements.

That’s where fine-tuning an open LLM is necessary.


I've noticed that senior devs tend to spend more time writing documents (e.g., design docs, API specs) than coding.

How do you measure the impact of writing a document (vs coding a feature)? What's the right balance between both?


When you examine the interests of junior developers, it typically boils down to: how to write code. This is the foundation for which tools and frameworks to use, what rules to apply, and so forth. That line of thinking has virtually nothing to do with product.

As the article states very well, one of the most important impacts of writing is planning. If you want to do anything original or competitive, you must have some idea of what you are doing before writing code. Writing addresses that concern. Eventually, with practice, the planning and vision become a quick thought exercise, and writing is the means by which to express those thoughts to other people.

If you want to measure the impact, step back and examine the end product. Are the internals well communicated? How much ramp time does a new developer require to become a productive contributor?

Think of writing as a defensive tool. It’s not going to push your software to make more money. It’s going to reduce expenses.


This, 100%. I wrote about this exact thing a while ago. In my opinion, it's how developers truly get to 10x.

https://got.phillipwills.com/10x-as-an-individual-contributo...




