You know when you buy something from a website and then you get loads of bollocks sent to your email address, even though you ticked the 'Don't send me bollocks' checkbox?
I mean, occasionally I go "Hmm, I might wanna buy that", but usually I don't. But they continue to send the bollocks all the same.
Hospitals now send out customer satisfaction surveys after appointments and ask if you’d recommend them to friends and family.
“Yes, I really enjoyed seeing the doctor for [insert medical condition], so I really hope all my friends and family can get [insert medical condition] so they have the joy of experiencing the facilities here.”
So months back they dug up the road. CityFibre put cables down to upgrade. Keep getting junk mail inviting me to upgrade my broadband... "It's here. Full fibre. Upgrade now", and each time I check one out and put my address in, it just says "On our way. We'll let you know when full fibre is available in your area". It really is a metaphor for the un-joined-up way much of the UK "works". #UK #broadband #FullFibre #bollocks
A question about which states were most frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York). Results are further confounded by other factors.
HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by providing a list of corresponding date specifications, e.g.:
So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.
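For anyone wanting to try something similar, here's a rough sketch of that kind of date-driven crawl. It is not my actual script: the `?day=YYYY-MM-DD` query format, the function names, the User-Agent string, and the 30-second delay are all assumptions for illustration, so check what the site actually tolerates before running it.

```python
from datetime import date, timedelta
import time
import urllib.request

# Assumed URL pattern for a given day's front page (not confirmed in the thread).
FRONT_URL = "https://news.ycombinator.com/front?day={day}"


def day_specs(start, end):
    """Yield ISO dates (YYYY-MM-DD) from start to end, inclusive."""
    current = start
    while current <= end:
        yield current.isoformat()
        current += timedelta(days=1)


def front_page_urls(start, end):
    """Build one front-page URL per day in the range."""
    return [FRONT_URL.format(day=day) for day in day_specs(start, end)]


def fetch_html(url):
    """Download a page and return its HTML as text."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "hn-front-archive-sketch"}  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(urls, delay_seconds=30.0, fetch=fetch_html, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests; returns {url: html}."""
    pages = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # crude rate limiting between requests
        pages[url] = fetch(url)
    return pages
```

Something like `crawl(front_page_urls(date(2007, 2, 20), date.today()))` would then walk the whole range. With a delay that long, several thousand days takes a day or more, which is roughly the pace described above.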
But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns. I can also look at mean points and comments across various dimensions.
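As a sketch of what "mean points by some dimension" might look like once the pages are parsed: the record layout here (`date`, `site`, `points`, `comments` keys) is hypothetical, and the three sample stories are made-up placeholders, not real results.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def mean_by(stories, key_func, field):
    """Group stories by key_func(story) and average the given numeric field."""
    groups = defaultdict(list)
    for story in stories:
        groups[key_func(story)].append(story[field])
    return {key: mean(values) for key, values in groups.items()}


def weekday_of(story):
    """Day-of-week name for a story's front-page date."""
    return WEEKDAYS[story["date"].weekday()]


# Made-up placeholder records, only to show the shape of the data.
sample = [
    {"date": date(2015, 1, 5), "site": "example.com", "points": 100, "comments": 40},
    {"date": date(2015, 1, 12), "site": "example.org", "points": 300, "comments": 60},
    {"date": date(2015, 1, 6), "site": "example.com", "points": 50, "comments": 10},
]

points_by_weekday = mean_by(sample, weekday_of, "points")
comments_by_site = mean_by(sample, lambda s: s["site"], "comments")
```

Swapping the key function (month, year, submitter, site) gives the other breakdowns without changing the aggregation itself.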
One surprise: as of January 2015, The Guardian is among the most consistently highly voted sites. I'd thought HN leaned consistently less liberal.
The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.
Contents are the 30 top-voted stories for each day since 20 February 2007.
If anyone has suggestions for other questions to ask of this, fire away.
NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.
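A minimal sketch of the toponym counting, showing how publication names inflate the tallies. The alias table and the noise list are illustrative guesses rather than the lists I actually used, and the titles in the usage note are invented.

```python
import re
from collections import Counter

# Hypothetical alias table: place -> strings that count as a mention.
ALIASES = {
    "California": ["California", "Silicon Valley"],
    "New York": ["New York", "NY", "NYC"],
    "Washington": ["Washington"],
}

# Hypothetical publication names that mention a place without being about it.
NOISE = ["New York Times", "NY Times", "NY Post",
         "Washington Post", "Washington Times"]


def count_toponyms(titles, aliases=ALIASES, noise=()):
    """Count titles mentioning each place, optionally blanking noise phrases first."""
    counts = Counter()
    for title in titles:
        text = title
        for phrase in noise:
            text = re.sub(re.escape(phrase), " ", text, flags=re.IGNORECASE)
        for place, names in aliases.items():
            # Whole-word, case-sensitive match so "NY" doesn't hit "any".
            if any(re.search(r"\b" + re.escape(name) + r"\b", text) for name in names):
                counts[place] += 1
    return counts
```

Running `count_toponyms(titles)` and `count_toponyms(titles, noise=NOISE)` side by side on the same titles shows how much of a place's score comes from newspaper names rather than stories about the place.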