dredmorbius,

Hacker News front-page analytics

A question about what states were most-frequently represented on the HN homepage had me do some quick querying via Hacker News's Algolia search ... which is NOT limited to the front page. Those results were ... surprising (Maine and Iowa outstrip the more probable results of California and, say, New York). Results are further confounded by other factors.

Thread: https://news.ycombinator.com/item?id=36076870

HN provides an interface to historical front-page stories (https://news.ycombinator.com/front), and that can be crawled by providing a list of corresponding date specifications, e.g.:

https://news.ycombinator.com/front?day=2023-05-25

Easy enough.
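For the curious, a minimal sketch of that crawl loop, assuming GNU date. The url-list / fetchlog / raw-crawl names (and the 2-second wait) are illustrative, not my actual scripts:

```shell
# Generate one front-page URL per day, 2007-02-20 through 2023-05-25.
# Requires GNU date; file names here are illustrative.
start=2007-02-20
end=2023-05-25
d=$start
while [ "$d" != "$end" ]; do
    echo "https://news.ycombinator.com/front?day=$d"
    d=$(date -d "$d + 1 day" +%F)
done > url-list
echo "https://news.ycombinator.com/front?day=$end" >> url-list

# Then feed the list to wget, politely rate-limited (commented out here):
# wget --input-file=url-list --wait=2 --output-file=fetchlog \
#      --directory-prefix=raw-crawl
```

That date range works out to 5,939 URLs, matching the file count in the completed-crawl summary further down-thread.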

So I'm crawling that and compiling a local archive. Rate-limiting and other factors mean that's only about halfway complete, and a full pull will take another day or so.

But I'll be able to look at story titles, sites, submitters, time-based patterns (day of week, day of month, month of year, yearly variations), and other patterns. There's also looking at mean points and comments by various dimensions.

Among the surprises: as of the January 2015 point in the crawl, one of the most consistently highly-voted sites is The Guardian. I'd thought HN leaned less liberal.

The full archive will probably be < 1 GB (raw HTML), currently 123 MB on disk.

Contents are the 30 top-voted stories for each day since 20 February 2007.

If anyone has suggestions for other questions to ask of this, fire away.

And, as of early 2015, top state mentions are:

 1. new york:         150
 2. california:       101
 3. texas:             39
 4. washington:        38
 5. colorado:          15
 6. florida:           10
 7. georgia:           10
 8. kansas:            10
 9. north carolina:     9
10. oregon:             9

NY is highly overrepresented (NY Times, NY Post, NY City), likewise Washington (Post, Times, DC). Adding in "Silicon Valley" and a few other toponyms boosts California's score markedly. I've also got some city-based analytics.
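A toy version of that tally, using grep's whole-word matching. The hn-titles file here is fabricated sample data (one title per line), not the real archive:

```shell
# Toy data standing in for the real one-title-per-line file.
printf '%s\n' \
    'New York bans pet rocks' \
    'Show HN: a relief map of California' \
    'Texas grid fails again' \
    'Californian startup raises a round' > hn-titles

# -w keeps "California" from matching inside "Californian".
for state in 'new york' 'california' 'texas'; do
    printf '%-15s %d\n' "$state:" "$(grep -Eiwc "$state" hn-titles)"
done
```

Note that -w doesn't help with the "NY Times"-style confounders described above; those need explicit extra patterns.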

dredmorbius,

Crawl complete:

FINISHED --2023-05-27 20:11:03--
Total wall clock time: 1d 17h 55m 39s
Downloaded: 5939 files, 217M in 9m 48s (378 KB/s)

NB: wget performed admirably:

grep 'HTTP request sent' fetchlog | sort | uniq -c | sort -k1nr
5939 HTTP request sent, awaiting response... 200 OK
  14 HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers.
   1 HTTP request sent, awaiting response... Read error (Operation timed out) in headers.

Each of the read errors succeeded on a 2nd try.

I'm working on parsing. Playing with identifying countries most often mentioned in titles right now, on still-partial data (missing the past month or so's front pages).

Countries most likely to be confused with a major celebrity and/or IT/tech sector personality: Cuba & Jordan.

Country most likely to be confused with a device connection standard: US (USB).

Raw stats, top-20, THERE ARE ISSUES WITH THESE DATA:

     1  US:  1350  (186 matched "USB")
     2  U.S.:  1073 (USA: 59, U.S.A.: 2, America/American: 979)
     3  China:  634
     4  Japan:  526
     5  India:  477
     6  UK:  288
     7  EU:  225 (E.U.: 54)
     8  Russia:  221
     9  Germany:  165
    10  Canada:  162
    11  Australia:  157
    12  Korea:  140 (DRK: 69, SK: 38)
    13  France:  116
    14  Iran:  91
    15  Dutch:  80 (25 Netherlands)
    16  United States:  75
    17  Brazil:  69
    18  North Korea:  69
    19  Sweden:  68
    20  Cuba:  67 (32 "Mark Cuban")
dredmorbius,

How Much Colorado Love? Or, 16 Years of Hacker News Front-Page Analytics

I've pulled 5,939 front pages from Hacker News, dating from 20 February 2007 to 25 May 2023, initially to answer the question "how often is Colorado mentioned on the front page?" (38 times, 5th most frequent US state). This also affords the opportunity to ask and answer other questions.

Preliminary report: https://news.ycombinator.com/item?id=36098749

#HackerNews #dataAnalysis #wget #awk #gawk #media #colorado

denspier,
@denspier@mastodon.green avatar

@dredmorbius This is a must read thread! I look forward to more analysis.

dredmorbius,

@denspier Thanks, I'm working on it :)

dredmorbius,

I've confirmed that the story shortfall does represent actual HN experience. Several days with fewer-than-usual stories, one day of complete outage, mostly in the first year of operations:

2007-03-10:  29
2007-03-24:  26
2007-03-25:  25
2007-05-19:  27
2007-05-26:  26
2007-05-28:  29
2007-06-02:  19
2007-06-16:  28
2007-06-23:  17
2007-06-24:  28
2007-06-30:  20
2007-07-01:  28
2007-07-07:  27
2007-07-15:  26
2007-07-28:  27
2014-01-06:  0
dredmorbius,

I'm wanting to test some reporting / queries / logic based on a sampling of data.

Since my file-naming convention follows ISO-8601 (YYYY-MM-DD), I can just lexically sort those.

And to grab a random year's worth (365 days) of reports from across the set:

ls rendered-crawl/* | sort -R | head -365 | sort

(I've rendered the pages, using w3m's -dump feature, to speed processing).

The full dataset is large enough and my awk code sloppy enough (several large sequential lists used in pattern-matching) that a full parse takes about 10 minutes, so the sampling shown here speeds development better than 10x while still providing representative data across time.

#ShellScripting #StupidBashTricks #Linux #DataAnalysis

denspier,
@denspier@mastodon.green avatar

@dredmorbius Why anyone anywhere would use any other way to render dates than ISO-8601 is a mystery!

dredmorbius,

One of the challenges of having an Eminently Queryable Data Trove is ... deciding what to query it about.

I've long thought that HN was fairly obsessed with various aspects of the hiring process, from both employer and worker perspectives.

Let's check that ...

$ egrep -i '(interview|hiring|recruiting)' <(grep '^  Title:' parse.log ) | wc -l
    1282

Ayup.

That's 1,282 stories out of 178,072, or just over 0.7%, but still a healthy chunk. By contrast, "housing" gets 90 hits, "Tesla" 413, and "Musk", 114.

Or the FAANG+M set:

Facebook:      2,414
Apple:         2,495
Amazon:        1,467
Netflix:         326
Google:        5,900
Microsoft:     1,523

I'm still trying to sort out a way to search / determine "statistically interesting terms", that is words or phrases which are disproportionately represented in submission titles.

#HackerNewsAnalytics #hiring #interviews #recruiting

loke,
@loke@functional.cafe avatar

@dredmorbius Isn't that because YC companies get to post their job openings on HN and those get artificially promoted?

dredmorbius,

@loke In part, and I can look for some specific strings...

"Who's Hiring" is a fairly regular post, I find 25 instances, oh, "Who is hiring?" is the more uniform version, 171 instances. About 13% of the initial total.

Filtering against "who is hiring", "is hiring" shows 24 hits for various YC ventures.

And excluding "is hiring" entirely ... mostly seems to land discussions about hiring (or stories about personnel changes at companies) rather than solicitations. 213 hits.

"Interview" also captures other forms of dialogue, e.g., "Interview with Michael Wesch", "Interview: Jimmy Wales", but also "Y Combinator interview tips".

The more specific "interviewing" gives 22 results, of which .... 21 are jobs-related ("Interviewing my mother, a mainframe COBOL programmer" is the exception).

"recruiting" seems to be entirely discussion / stories. 35 hits.
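The filtering steps above boil down to grep pipelines over a one-title-per-line file. A sketch on toy data (hn-titles is an illustrative name, not my real file):

```shell
printf '%s\n' \
    'Ask HN: Who is hiring? (June 2023)' \
    'Acme (YC W12) is hiring engineers' \
    'The hiring process is broken' \
    'Interview with Michael Wesch' > hn-titles

grep -ic 'hiring' hn-titles                         # all hiring mentions: 3
grep -i 'hiring' hn-titles | grep -ivc 'is hiring'  # minus solicitations: 1
```

The second pipeline is the "excluding 'is hiring' entirely" filter: it keeps discussions about hiring while dropping job ads.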

dredmorbius,

HN Front Page / Global Cities Mentions

One question I've had about HN is how well or poorly it represents non-US (or even non-Silicon Valley) viewpoints and issues.

Pulling from the Globalization and World Cities Research Network list, the top 50 global cities names appearing in HN front-page titles:

  1   191  San Francisco
  2   164  London
  3   117  Boston
  4    86  Seattle
  5    60  Tokyo
  6    58  Paris
  7    56  Chicago
  8    56  Hong Kong
  9    55  New York City
 10    50  Berlin
 11    50  Phoenix
 12    45  Rome
 13    40  Detroit
 14    36  Singapore
 15    31  Vancouver
 16    30  Los Angeles
 17    27  Austin
 18    23  Beijing
 19    20  Dubai
 20    19  Shenzhen
 21    19  Toronto
 22    17  Amsterdam
 23    16  Copenhagen
 24    16  Houston
 25    16  Moscow
 26    15  Atlanta
 27    14  Barcelona
 28    14  Denver
 29    13  Baltimore
 30    13  San Jose
 31    13  Stockholm
 32    12  San Diego
 33    12  Sydney
 34    11  Cairo
 35    10  Munich
 36    10  Wuhan
 37     9  Helsinki
 38     9  Miami
 39     9  Mumbai
 40     9  Philadelphia
 41     9  Shanghai
 42     9  Vienna
 43     8  Montreal
 44     7  Beirut
 45     7  Dublin
 46     7  Istanbul
 47     6  Bangalore
 48     6  Dallas
 49     6  Kansas City
 50     6  Minneapolis

(Best viewed in original on toot.cat.)

Note that some idiosyncrasies affect this, e.g., "New York City" appears rarely, whilst "New York" may refer to the city, state, any of several newspapers, universities, etc. "New York" appears 315 times in titles (mostly as "New York Times").

I've independently verified that, for example, "Ho Chi Minh City" doesn't appear, though "Ho Chi Minh" alone does:

https://news.ycombinator.com/item?id=15374051, on the 2017-09-30 front page: https://news.ycombinator.com/front?day=2017-09-30

So apply salt liberally.

Edits: tyops & speling.

dredmorbius,

According to the Hacker News front page, there are ...:

  • 313 things that suck.
  • 18 things that will fail.
  • 116 things that rock.
  • 157 things that are awesome.
  • 0 things that are bollocks.
  • 685 things that are great.
  • 75 things that are terrible.
  • 1 thing that is both terrible and amazing. And it is you.
  • 28 things that are horrible.
  • 22 things that are a list of some number of things.
  • 33 things that are a list of some number of reasons.
  • 0 hot takes.
  • 3,101 things that are how to's.
  • 6,434 things that are "hows" but not how to's.
  • 98 things that are how not to's.
  • 21 things that are silly.
  • 86 things that are clever.
  • 318 things that are smart, none of which are phones.
  • 58 things that are brilliant.
  • 147 things that are stupid.
  • 20 things that are terrifying.
  • 19 things that you must do.

Edit: Hashtag surgery (whitespace in hashtags is a thing that sucks).

#HackerNews #HackerNewsAnalytics #TooMuchFunWithGrep #Suck #Fail #Rock #Awesome #Bollocks

penguin42,
@penguin42@mastodon.org.uk avatar

@dredmorbius How many of those 18 things have failed so far?

dredmorbius,

The Hacker News front page has noted that 282 people have died.

#HackerNews #HackerNewsAnalytics #MediaAnalysis

dredmorbius,

Things about which Hacker News cares being down, and of which it has noticed:

Skype network is down, possibly under viral DoS attack. Lessons?
Is this why Twitter is down? Their Engineer Speaks
Amazon is down ... implications for AWS?
The Website Is Down (Hilarious 10 Minute Video)
Matthew Simmons: The only way is down
GitHub is down
KK on Unabomber: pounce on [technology] when it is down and kill it before it rises again
Yes, Rackspace Is Down And So Are Many Of Your Favorite Sites
Tell HN: Authorize.net is down
Dreamhost is down. All of it.
Most of Slicehost is Down
Ubisoft DRM authentification server is down, Assassin's Creed 2 unplayable
Dropbox is down
Heroku is down for the third time today
Tumblr is Down – Fans Angry
Great. Skype is down.
Reddit Is Down To One Developer
AWS is down, but here's why the sky is falling
Amazon EC2 EU-West is down
Reddit is down for 12 hours protest SOPA and PIPA.
Java.sun.com is down again - breaking bad apps across the land
Heroku is down
Tell HN: Heroku is Down (update: recovering as of 10PM PST)
AWS is down due to an electrical storm in the US
Heroku is down again
Google Talk is down
GoDaddy's DNS Service is Down
Github is down
Netflix is Down
Hacker News is down, so we made five issues free
This site is down because the owner stiffed the web designer
Dropbox is down
WhatsApp is down
DreamObjects is down
Facebook is down (09:08AM PDT Aug 1, 2014)
YTMND is down for temporary maintenance
Google Cloud Is Down
GitHub is down
DigitalOcean block storage is down
Firefox usage is down despite Mozilla's top exec pay going up
Slack is down
[dupe] Slack is down
Tell HN: GitHub is down again
Kiwi Farms is down across all domains as DDoS-Guard terminates service
Twitter's API is down?

dredmorbius,

So ... after a side conversation where I said I'd probably not do it, I built a couple of command-line query tools to my HN front-page archive which let me look at specific sites or users, and summarize activity by year, weekday, and the corresponding other entity (sites submitted by users, users who submitted a specific site), along with story count, and point and comment totals and means.

It's ... a bit of a god-mode view onto the site.

Mind that HN does show user history at a detail level (posts and comments), but doesn't give a good way to get an overview.

I quickly discovered (divide-by-zero error) that one person never posts on a specific day of the week. And there are various comings and goings as well.

It is a handy reminder that our activities online are highly public, and can be assimilated with relatively little effort in many cases.

There are reasons I went all but completely pseudonymous well over a decade ago.

dredmorbius,

Many moons ago I made a conscious decision to abstain from working in fields involving intrusive data-based surveillance, including advertising, marketing, and (mostly) social networking. It's become apparent to me that my own personal boycott of such fields has probably had very little measurable impact on their development.

And so ... what I've been toying with over the past few days and the HN archive ... is something any modestly-motivated individual, let alone organisation, could do. The dataset itself is small (< 250 MB for the raw HTML, < 50 MB for the rendered content, slightly larger for my parsed output, and almost exactly 10 MB for the date summary: date, post number, site, submitter, votes, comments), plus a few kB of awk code. My own not revealing this ... is a low hurdle for anyone else to do similar work. And in fact there are far more extensive analytic projects, e.g.:

"Top Hacker News commenters of 2021"
https://whaly.io/posts/top-10k-commenters-of-hacker-news-in-2021

And that's just one which is making itself known. It's the archives and analyses which aren't publicly acknowledged or accessible that are far more prevalent and, as I see it, more troubling.

(This is one of the reasons that objections to public search/archival of Fediverse content strikes me ... and others (@alex has commented similarly a few times) as ... not only futile but largely misguided. The archives all but certainly do exist, they're just not serving the general public.)

I don't know how to re-cork this smoke / genie. I've relatively few thoughts on how to effectively combat it. There are the notions of the Internet as a Dark Forest (see: https://scribe.bus-hit.me/@onezero/the-dark-forest-theory-of-the-internet-7dc3e68a7cb1) and a Dead Internet (https://en.wikipedia.org/wiki/Dead_Internet_theory) which emerge as well: real people driven to small closed spaces, the public Web / Internet overrun by bots, spam, AI, and meme-generators. That these may already have occurred is a common belief.

dredmorbius,

And for a set of privacy tips for the Fediverse, @jerry's advice is solid:

https://infosec.exchange/@jerry/110454585873865152

(Ironically: infosec.exchange is blocked by many instances for ... predictable reasons. You may need to click through directly to view this post.)

dredmorbius,

Hacker News "Leaders" front-page activity

So, more on that thing I said I wouldn't do but did anyway ...

Backstory: a dumb question led me to crawl the HN front-page (FP) archive from 2007-present, just shy of 6,000 pages, representing 178,162 stories, 52,400 distinct sites, and 43,491 distinct submitters. Each page has up to 30 stories, such that a fully-populated year has 10,950 or 10,980 (leap year) stories.

HN also provides a "leaders" page showing the top-100 members and "karma" (overall votes) --- the latter being obscured for the top-10 members, though it can be found on their profile pages. (https://news.ycombinator.com/leaders)

So ... I can get a summary of front-page activity for all leaders. It's ... interesting.

To assuage my guilt somewhat I'm only reporting overall / summary or anonymised stats. My goal isn't to out anyone specifically, but to give a sense of what HN front-page and "leader" member activity is like.

Seven leaders have no front-page posts at all, 17 have single-digit counts. The range is from 0 (obv!) to 1,183, mean 175.7, median 129, st.dev. 201.32, 10%ile: 3, 25%ile: 11, 75%ile: 253.5, 90%ile: 493.5.
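The summary stats come out of awk; a minimal sketch of the mean/median part, on fabricated per-member counts (input pre-sorted; this is not my actual reporting code):

```shell
# Mean and median over a pre-sorted list of per-member FP counts (toy data).
printf '%s\n' 0 3 11 129 253 494 1183 | awk '
    { v[++n] = $1; sum += $1 }           # accumulate sorted values
    END {
        printf "n %d  mean %.1f  median %d\n", n, sum / n, v[int((n + 1) / 2)]
    }'
```

Percentiles follow the same pattern: index into the sorted array at the appropriate rank.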

Active years (years in which there is nonzero front-page activity) is ... all over the map -- there are members with results over 17 years, and with none at all.

What's ... peculiar ... is the points/karma% ratios. "Points" are votes on stories, "karma" is supposedly overall points (sum of story + comment moderation, less some for negative votes). The percentage of votes to overall karma ranges from 0 (no front-page activity) ... to 150.94%: more votes than cumulative karma. Points > overall karma (ratio > 100%) happens sixteen times, which is ... odd.

(Well, I mean, 16 is an even number, but the fact is odd-as-in-strange.)

One reason I've been doing this is to come up with some sense of overall quality metric. Engagements (votes and comments) are a highly-imperfect indicator, but looking at the arithmetic mean of votes and comments is interesting. I'm looking here at the average over all a member's front-page submissions:

Votes range from 0 to 634, mean 196.50, median 105.91, st.dev. 101.92, 25%ile: 150, 75%ile: 239.95.

Comments range from 0 to 323.75, mean 102.06, median: 96.38, 25%ile: 60.67, 75%ile: 123.16.

As might be expected, several members with lower-than-average submissions see high averages (there's more variance in small-n stats). One of the top-10 submitters (by average points and comments) has 514 FP stories, with an average of 236.37 points and 176.96 comments, and the most prolific submitter is very nearly median by votes and comments.

It's also possible to look at who's submitting a small or large range of sites by calculating a sites/stories% ratio. I'm finding, for example, one leader with 414 FP stories, from only 30 distinct sites, with the top site representing over half their submissions. (The site in question is legit and interesting, this does not appear to be spammy.) Several appear to favour their own personal sites / blogs, though again, not in a noxious way that I see. And 18 leaders have posted only a single item per site (each post is its own site), ranging from 1 to 20 FP items overall.

The ratio ranges from 0 (obv!) to 100 (obv!), mean 67.03%, median 71.83%, 25%ile: 51.82%, 75%ile: 89.72%.
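A sketch of the sites/stories% computation: count stories and distinct sites per submitter, then take the ratio. Toy data; the submissions file and its two-column layout are my illustration, not the real parse output:

```shell
# Toy submission log: "submitter site", one front-page story per line.
printf '%s\n' 'alice example.com' 'alice example.com' 'alice blog.example' \
              'bob one.example' > submissions

# stories[] counts posts per user; seen[] de-duplicates (user, site) pairs
# so sites[] counts distinct sites per user.
awk '{
        stories[$1]++
        if (!(($1, $2) in seen)) { seen[$1, $2] = 1; sites[$1]++ }
     }
     END {
        for (u in stories)
            printf "%s %d %d %.1f%%\n", u, stories[u], sites[u],
                   100 * sites[u] / stories[u]
     }' submissions | sort
```

A low percentage flags members who post heavily from a small stable of sites; 100% means every story came from a different site.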

denspier,
@denspier@mastodon.green avatar

@dredmorbius “16 is an even number, but the fact is odd” was a bit of a gem here 😂

dredmorbius,

@denspier I've got to keep myself entertained somehow.

dredmorbius,

I was able to draw on my HN FP archive to respond in part to concerns over topic suppression by an HN member:

https://news.ycombinator.com/item?id=36191005

This is an interesting superpower ...

Not an awesome superpower, mind, but an interesting one.

#HackerNews #HackerNewsAnalytics

dredmorbius,

In fact-checking my own comment, I found that my success rate in reaching the HN front page is not the roughly 10% I'd thought.

It's pretty much spang on 3%, which is the overall site average.

That's based on my archive's count of my own FP submissions (60) and Algolia search's results for all my article submissions, whether or not they hit the front page (1,974).

So I guess I'm just about average.

This gives me the idea of checking against the HN Leaders list to see if anyone's markedly above 3% for FP placements.

#HackerNewsAnalytics #HackerNews

dredmorbius,

HN Front Page: Foreign Policy Top 100 Global Thinkers (2014)

I pulled a copy of the "global thinkers" list I'd used as an indicator of website salience in a 2015 study.

The HN front page offers a limited opportunity for matches --- titles are 80 characters only, and HN's editorial policy is to not list authors of works, so what will show here is likely a subset of actual mentions.

That said: nearly a quarter of the list (23 entries) appear, from 1 to 11 times each. Paul Krugman (11), Lawrence Lessig (10), and Richard Dawkins (10) top the list.

     1  Paul Krugman:  11
     2  Lawrence Lessig:  10
     3  Richard Dawkins:  10
     4  Freeman Dyson:  9
     5  Daniel Kahneman:  8
     6  Noam Chomsky:  8
     7  Jaron Lanier:  6
     8  Steven Pinker:  5
     9  Daniel Dennett:  4
    10  Christopher Hitchens:  2
    11  Craig Venter:  2
    12  Edward O. Wilson:  2
    13  Jared Diamond:  2
    14  Richard Posner:  2
    15  Steven Weinberg:  2
    16  Thomas Friedman:  2
    17  Gary Becker:  1
    18  Hernando de Soto:  1
    19  James Lovelock:  1
    20  Larry Summers:  1
    21  Martha Nussbaum:  1
    22  Peter Singer:  1
    23  Salman Rushdie:  1

The 2015 post, "Tracking the Conversation", is here: https://old.reddit.com/r/dredmorbius/comments/3hp41w/tracking_the_conversation_fp_global_100_thinkers/

dredmorbius,

Hacker News Analytics: ~3% of submissions reach front page, with half of comments on FP articles

This is a finding based on maths and a previous study by Whaly in 2022 based on HN 2021 activity, rather than my own crawl, though it's informed by the latter.

https://whaly.io/posts/hacker-news-2021-retrospective

The HN front page is a limited resource --- there are 365 * 30 == 10,950 front-page slots in a year, another 30, or 10,980, in a leap year, and regardless of site activity over a year, those slots are fixed. It's somewhat of a reminder that regardless of how much information we can access, our time to process that information is finite. Or as Herbert Simon observed: what information consumes is attention.

Whaly saw 386,663 total story submissions for 2021. I'm pretty sure that this is net of moderation (user flags, auto-kills, spam detection, voting-ring detection, and the like). It works out that a hair under 3% of the stories which don't catch on any of those tripwires land on the HN front page.

Mind that that's actually a somewhat low estimate, as a story may appear for part of the day on the front page but not be represented on the end-of-day front-page archive.

I'm now thinking of doing some spot checks to see what kinds of success rates individual submitters have in landing on the front page. From what I've seen, even well-known and popular members have at best a modest chance of success.

Whaly also give a total number of comments: 3,769,520. That I can compare to my own front-page stats for 2021: 1,859,933, or 49.34% of all comments. That is, half of HN comments appear on the 3% of stories which reach the front page. That percentage is lower than what I'd have expected, though it's still a very strong bias toward the front page.

(Now I want to complete another analysis I'd thought of: mean votes and comments by story position (1--30), by year. Hrm...)

dredmorbius,

So ... I'm playing with a report showing how often F500 companies are mentioned in HN submission titles.

As I've noted, most of my scripting is in awk (gawk), and it's ... usually pretty good.

I'm toying with a couple of loops where I read all 178k titles, and all 500 company names, into arrays, then check to see if the one appears in the other.

The first iteration of that was based on the index() function, which is a simple string match. Problem is that there are substring matches, for example "Lear" (the company) will match on "Learn", "Learning", etc., and so is strongly overrepresented.

So I swapped in match(), which is a regular-expression match, and added \W-style patterns as word boundaries.

The index-based search ran in about 20 seconds. That's a brief wait, but doable.

The match (regex) based search ... just finished as I'm writing this. 13 minutes 40 seconds.

Regexes are useful, but can be awfully slow.

Which means that my first go at this --- still using gawk but having it generate grep searches and printing the match count only --- is much faster whilst remaining accurate. That runs in just under a minute here. I'd looked for another solution as that awk is "dumb" re the actual output: it doesn't read or capture the actual counts, so I'll either have to tweak that program or feed its output to an additional parser. Neither of which is a big deal, mind.
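To make the Lear/Learn distinction concrete, a toy contrast of the two strategies (fabricated titles, not my actual code; gawk also offers the \y word-boundary escape as an alternative to spelling out the character classes):

```shell
printf '%s\n' 'Learn Rust in a weekend' 'Lear posts record profits' > titles

awk -v co='Lear' '
    { if (index($0, co)) nsub++              # plain substring: hits both lines
      if (match($0, "(^|[^A-Za-z0-9_])" co "($|[^A-Za-z0-9_])")) nword++ }
    END { print nsub + 0, nword + 0 }' titles
```

index() counts 2 (it matches inside "Learn"); the word-bounded match() counts only the genuine mention, at the cost of compiling a dynamic regex per company per title.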

Oh, and Apple seems to be the most-mentioned company, though the F500 list omits Google (or YouTube, or Android), listing only Alphabet, which probably results in a severe undercount.

Top 10 using the F100 list:

     1  Apple:  2447
     2  Microsoft:  1517
     3  Amazon:  1457
     4  Intel:  554
     5  Tesla:  404
     6  Netflix:  322
     7  IBM:  309
     8  Adobe:  180
     9  Oracle:  167
    10  AT&T:  143

Add to those:

$ egrep -wc '(Google|Alphabet|You[Tt]ube|Android)' hn-titles
7163
$ egrep -wc '(Apple|iPhone|iPad|iPod|Mac[Bb]ook)' hn-titles
3656
$ egrep -wc '(Facebook|Instagram)' hn-titles
2512

Note I didn't even try "Meta", though let's take a quick look ... yeah, that's a mess.

Up until 2021-10-28, "Meta" is a concept, with 33 entries. That was the day Facebook announced its name change. 82 total matches (so low overall compared to the earlier numbers above), 49 post-announcement, of which two are not related to Facebook a/k/a Meta. Several of the titles mention both FB & Meta ... looks like that's four of 'em.

So "Meta" boosts FB's count by 45.

There are another 296 mentions of Steve Jobs and Tim Cook which don't also include "Apple".

And "Alphabet" has 54 matches, six of which don't relate to the company.

Of the MFAANG companies:

Google: 5796
Apple: 2447
Facebook: 2371
Microsoft: 1517
Amazon: 1457
Netflix: 322

(Based on grep.)

#DataAnalysis #awk #grep #bash #HackerNewsAnalytics

dredmorbius,

I have Found My People: "What gets to the front page of Hacker News? A data project"

Some marketing dude is also looking at the HN front page. We're comparing notes ...

https://randomshit.dev/posts/what-gets-to-the-front-page-of-hacker-news

https://news.ycombinator.com/item?id=36521887

#HackerNewsAnalytics

dredmorbius,

gagejustins's HN analysis has inspired me to take a crack at typifying Hacker News front page stories by type.

Whilst he'd manually assessed each front-page story, I'm classifying by site, so that an NY Times article on, say, quantum computing would still be described as "general news".

I've classified 10,200 of 52,642 domains, the first 300 or so manually, much of the rest using regexes and imputation (e.g., ".edu", ".gov", and sites on Blogspot, Substack, Medium, etc.).
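A sketch of that classify-by-site-with-imputation approach: explicit hand-keyed entries first, then pattern-based fallbacks. Category strings and the toy domain list are illustrative, not my actual sites file:

```shell
printf '%s\n' nytimes.com mit.edu example.substack.com mystery.example > domains

awk '
    BEGIN {                                   # a few manually keyed sites
        cls["nytimes.com"] = "general news"
        cls["github.com"]  = "software"
    }
    {
        if ($0 in cls)                         c = cls[$0]
        else if ($0 ~ /\.edu$/)                c = "academic / science"
        else if ($0 ~ /\.gov$/)                c = "government"
        else if ($0 ~ /(blogspot|substack|medium)\.com$/) c = "blog"
        else                                   c = "???"
        print $0, c
    }' domains
```

Anything that falls through every rule lands in the "???" bucket, which is where the manual winnowing happens.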

Results by story count:

     1  13782  general news
     2  13398  software
     3  10473  tech news
     4   8677  blog
     5   7651  academic / science
     6   7294  n/a
     7   4750  ???
     8   4600  business news
     9   3546  corporate comm.
    10   1504  general magazine
    11   1291  general information
    12   1162  general interest
    13   1132  technology
    14   1099  videos
    15   1073  social media
    16    975  government
    17    568  corporate comm
    18    559  tech discussion
    19    505  tech law
    20    251  tech publications
    21    171  tech blog
    22    170  science news
    23    136  business education
    24    104  corporate comm.
    25    103  video
    26     99  corporate commm.
    27     96  general discussion
    28     80  misc
    29     71  technology / security
    30     61  law
    31     59  webcomic
    32     49  translation
    33     48  health news
    34     47  images
    35     46  podcast
    36     32  law
    37      7  legal news

  Unclassified: 93213

"n/a" indicates no site, e.g., an Ask, Tell, or Show HN post.

"???" indicates I couldn't (quickly) assess a domain. Examples: 37signals.com, readwriteweb.com, thenextweb.com, archive.org, anandtech.com, avc.com, docs.google.com, righto.com, slideshare.net, infoq.com, hackaday.com, gamasutra.com, marco.org, smashingmagazine.com, highscalability.com, catonmat.net, centernetworks.com, jvns.ca, scribd.com, about.gitlab.com, cloud.google.com, alleyinsider.com, msn.com, firstround.com, axios.com, openculture.com, onstartups.com, ejohn.org, dadgum.com, shkspr.mobi, mixergy.com, geek.com, gmane.org, foundread.com.

"corporate commm." is an obvious typo. This is very rough code & classification.

#HackerNewsAnalytics #MediaAnalysis #HackerNews
dredmorbius,

I'm continuing to play with this, and have classified a whole mess more sites (reminder to self: update that count) (response to self: 13,150 sites classified).

So that's about 25% of all sites that are classified. Looking by story count ... it's about 55% of all FP stories. (Power laws are your friend here...)

Looking at my current breakdowns (and again, this is all VERY ROUGH):

     1   15770  8.82%  blog
     2   15034  8.40%  general news
     3   13899  7.77%  software
     4   12889  7.21%  tech news
     5    7960  4.45%  academic / science
     6    7294  4.08%  n/a
     7    6025  3.37%  corporate comm.
     8    4859  2.72%  business news
     9    2120  1.19%  social media
    10    2031  1.14%  general interest
    11    1557  0.87%  general magazine
    12    1397  0.78%  general information
    13    1239  0.69%  technology
    14    1099  0.61%  videos
    15     975  0.55%  government
    16     607  0.34%  ???
    17     559  0.31%  tech discussion
    18     505  0.28%  tech law
    19     497  0.28%  misc documents
    20     420  0.23%  science news
    21     316  0.18%  mailing list
    22     251  0.14%  tech publications
    23     171  0.10%  tech blog
    24     149  0.08%  literature
    25     136  0.08%  business education
    26     133  0.07%  cryptocurrency
    27     126  0.07%  law
    28     118  0.07%  webcomic
    29     109  0.06%  entertainment news
    30     103  0.06%  health news
    31     103  0.06%  video
    32      96  0.05%  general discussion
    33      80  0.04%  misc
    34      71  0.04%  technology / security
    35      49  0.03%  translation
    36      47  0.03%  images
    37      46  0.03%  podcast
    38      42  0.02%  journalism
    39      30  0.02%  propaganda
    40      29  0.02%  healthcare / medicine
    41      18  0.01%  medicine
    42       7  0.00%  legal news

Classified:    98966
Unclassified:  79916
Total:        178882
Ratio:             0.553

My classifications are rough and I may revisit these. "blog" covers a lot of sins, though most are tech blogs (which makes the separate "tech blog" category largely redundant).

What I'd really like to do is to look at how trends vary over the years. Perhaps also by day of week / month of year. Finally answer that age-old question of whether HN is turning into Reddit....

As noted above, this is based on classifying the site rather than interpreting the title or reading the source article, so it's all a bit wobbly.

(This post formats better on toot.cat or on sites that render Markdown.)

dredmorbius,

I've got this to about 60% of posts classified (by submitted site). I can continue winnowing this down, though there's obviously diminishing returns.

I've also revised my analysis code so that anything that's not classified defaults to "UNCLASSIFIED", without having to explicitly code that in the sites file.

I'm thinking of how I might crossref / correlate the site-based findings with title-based analysis. I'm also thinking of looking at average comments / votes by classification, as well as looking at the ratio of comments to votes (HN uses this as a very rough "flamewar" heuristic, though on somewhat shaky grounds IMO).

My sense is that many of the less-frequently-posted sites will turn out to be blogs of some form. I'm thinking of how I might assess this without having to key all of them.

<stage_whisper>random sampling</stage_whisper>

One issue for less-frequently-occurring sites is that it's easier to code those which match a pattern (twitter, blogspot, livejournal, medium, substack, etc.) than those which are idiosyncratic. Note that a lot of Medium blogs don't appear on Medium domains, as well.
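A minimal sketch of the pattern approach: classify hosts that match easy regexes, and let everything else fall through to UNCLASSIFIED. These regexes are toy stand-ins, not my actual rules file:

```shell
# Illustrative only: classify hosts that match easy patterns, and let
# everything else fall through to UNCLASSIFIED. These regexes are toy
# stand-ins, not my actual rules file.
printf '%s\n' alice.blogspot.com bob.substack.com medium.com/@carol example.org |
awk '
    /\.blogspot\.com$|\.substack\.com$|^medium\.com/ { print $0 "\tblog"; next }
    /twitter\.com|livejournal\.com/                  { print $0 "\tsocial media"; next }
    { print $0 "\tUNCLASSIFIED" }
'
```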

#HackerNewsAnalytics #HackerNews #MediaAnalysis

dredmorbius,

OK, current stats are 63.5% of posts classified, with 29.8% of sites classified, a/k/a the old 65/30 rule. The mean posts per unclassified site is 1.765, so my returns for further classification will be ... small.

Full breakdown:

   4 20
  14 19
  13 18
  23 17
  32 16
  37 15
  48 14
  55 13
  96 12
 120 11
 122 10
 168 9
 247 8
 315 7
 396 6
 622 5
1052 4
2016 3
5103 2
26494 1
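That sort of breakdown (count of sites having N front-page stories) is the classic double `uniq -c`, assuming a file with one site name per story (toy data here):

```shell
# The posts-per-site breakdown is a classic double uniq -c over a file of
# one site name per front-page story (toy data here):
printf '%s\n' a.com a.com a.com b.com b.com c.com d.com |
sort | uniq -c |       # stories per site
awk '{ print $1 }' |   # keep just the story counts
sort -rn | uniq -c     # sites per story-count
```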

A ... large number of sites w/ <= 20 posts are actually classified, mostly by regexp rules & patterns. Oh, hey, I can dump that breakdown as well:

  35 20
  27 19
  47 18
  31 17
  33 16
  41 15
  51 14
  45 13
  42 12
  29 11
  46 10
  46 9
  47 8
  91 7
 138 6
 178 5
 269 4
 524 3
1624 2
11472 1

I could pick up just under 4% more posts by classifying another 564 sites, but ... that sounds a bit too much like work at the moment. Compromises and trade-offs.

Now to try to turn this into an analysis over time.

I've been working with a summary of activity by site, so running analysis has been pretty quick (52k records and gawk running over that).

To do full date analysis requires reading nearly 180k records, and ... hopefully not having to loop through 52k sites for each of those. Gawk's runtimes start to asplode when running tens of millions of loop iterations, especially if regexes are involved.
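The way around that is to load the site-to-class table into an associative array once, so each of the ~180k story records costs a single hash lookup rather than a scan over 52k sites. A toy sketch with hypothetical file layouts:

```shell
# Toy sketch (hypothetical file layouts): build the site -> class map once
# in an associative array, then each story record is a single O(1) lookup
# instead of a scan over 52k sites.
printf '%s\t%s\n' nytimes.com 'general news' github.com programming \
    > /tmp/siteclass.tsv
printf '%s\t%s\t%s\n' \
    2023-05-25 nytimes.com 'Some headline' \
    2023-05-25 github.com  'Show HN: a repo' \
    2023-05-25 example.org 'Mystery post' > /tmp/stories.tsv

awk -F'\t' '
    NR == FNR { class[$1] = $2; next }      # pass 1: load the map
    { n[($2 in class) ? class[$2] : "UNCLASSIFIED"]++ }
    END { for (c in n) print n[c], c }
' /tmp/siteclass.tsv /tmp/stories.tsv
```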

dredmorbius,

So ... I'm starting to get the reporting by site classification across years down and ... it is interesting.

Preliminary and buggy code yet. Also this is highly dependent on how I've actually classified sites.

I've got a few classifications I'd wanted to keep an eye on:

  • Programming-specific sites. A lot of this is github and gitlab: basically, software projects with code. I'm distinguishing software (which is mostly about use) from programming, which involves, or at least anticipates, actual development.
  • "Political commentary". I used this as a description for ... highly political sites (spot-checking to see what stories actually hit the front page, though I should be more robust in that). The list: reason.com, rt.com, bostonreview.net, alternet.org, cato.org, rootsofprogress.org, breitbart.com, dailykos.com, mises.org, dailycaller.com, jacobinmag.com, rawstory.com, tribunemag.co.uk, hoover.org, heritage.org, theroot.com, wsws.org, adamsmith.org, manhattan-institute.org, theblaze.com.

And there's "academic / science" which is mostly university and academic press / journal sites.

Anywho....

... at least from initial takes, the trend on these does not suggest a drift toward sensationalistic topics and/or sites, but the opposite: many more programming FP stories in recent years, less political commentary, and more academic/science items.

Presuming this holds up as I code further.

This is one of the fun things about data analysis: stuff jumps out at you, sometimes confirming hunches, but often radically violating preconceptions.

I want to look more closely at what happens in the lead-up and follow-on to the 2016 US elections cycle in particular....

Hrm. What does spike is cryptocurrency-specific sites in 2014. Though that falls off again. (I suspect as that discussion enters more mainstream sources.)

And "general info" and "general interest" sites seem to rise in recent years.

denspier,
@denspier@mastodon.green

@dredmorbius His data collection method leads to a much smaller dataset. Does it still match to your conclusions, at least generally?

dredmorbius,

@denspier I'm honestly not entirely sure, as it's not clear to me how much data he actually collected.

Oh dear goddess: a spreadsheet.

(I've done all my own analysis with textfiles and awk / shell tools.)

I can't even scroll the damned spreadsheet in my browser what with lag and stuff....

He's got the records running horizontally. (My practice, and the one I've generally seen, is records go down, fields go across.)

Looks as if he's only looking back to 2023-4-17. So that's a much smaller dataset than mine (to 2007).

dredmorbius,

The Hacker News Ratio

One concept Hacker News uses to moderate discussions is a "flamewar detector" which, based on moderator comments over the years, is triggered when a discussion has > 40 comments AND more comments than votes on the article.
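The heuristic as I understand it (thresholds inferred from moderator comments, not from HN's actual source) is trivially expressible in awk over a hypothetical (title, votes, comments) file:

```shell
# The heuristic as I understand it (thresholds inferred from moderator
# comments, not from HN source), over a hypothetical file of
# (title, votes, comments):
printf '%s\t%s\t%s\n' \
    'Story A' 120 95 \
    'Story B'  50 80 \
    'Story C'  10 30 > /tmp/frontpage.tsv

awk -F'\t' '$3 > 40 && $3 > $2 { print "flamewar? " $1 }' /tmp/frontpage.tsv
```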

That had long struck me as questionable, but it's now something I can look at, and ... it seems reasonably accurate. I've calculated ratios for all 178,882 HN Front Page stories (as of 2023-06-30), and ... do I have some ratios.

Basic stats:
n: 178882, sum: 89796.9, min: 0.00, max: 21.00, mean: 0.501990, median: 0.4, sd: 0.432899

Percentiles:
%-ile: 5: 0.08, 10: 0.13, 15: 0.17, 20: 0.21, 25: 0.24, 30: 0.27, 35: 0.3, 40: 0.33, 45: 0.37, 55: 0.44, 60: 0.48, 65: 0.53, 70: 0.58, 75: 0.64, 80: 0.72, 85: 0.82, 90: 0.96, 95: 1.22
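For the curious, percentiles like these can be pulled from a sorted list with the nearest-rank method (my actual computation may differ in interpolation details; toy data here):

```shell
# Percentiles via the nearest-rank method over sorted ratios (my actual
# computation may differ in interpolation details):
printf '%s\n' 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 |
sort -n |
awk '
    { v[NR] = $1 }
    END {
        split("5 25 50 75 95", p, " ")
        for (i = 1; i <= 5; i++) {
            r = int(p[i] / 100 * NR + 0.5)   # nearest rank
            if (r < 1) r = 1
            printf "%d: %s  ", p[i], v[r]
        }
        print ""
    }'
```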

Because of how I've parsed and processed data, it's not entirely straightforward to pull up the specific posts, though I can find those by the date and story position (ranked 1--30 on the page).

And ... yeah, the stories that tend to rate high based on this metric do tend to be sort of flamey.

The most ratioed post of all time was "juwo beta is released (at last!) Please use it and help improve it!", from 18 April 2007, at 21.0:

https://news.ycombinator.com/item?id=14253

Sometime around 2009--2010 the flamewar detector seems to have been implemented, and ratios tend to be much lower, though there are still some pretty spicy discussions. One from the National Institutes of Health titled "Mental illness, mass shootings, and the politics of American firearms", posted on 26 May 2022 (for a story originally dating from 2015), is the highest-ratioed post after the flamewar detector came into use, at 5.99:

https://news.ycombinator.com/item?id=31511274

I find it interesting how being able to query my archive affords insights on HN which aren't available through the standard search tools. It's possible to look for specific keywords, or submissions or comments from a specific account, but searching for contentious posts isn't really A Thing.

I'm doing some further digging to see what patterns might emerge by site, though finding a good minimum number of front-page appearances is one question I'm looking at.

#HackerNews #HackerNewsAnalytics

dredmorbius,

The 20 "spiciest" sites seem to be (using a cut-off of 20+ stories):

apnews.com                     36      14674      17512     1.193
sfchronicle.com                25       5771       6174     1.070
variety.com                    24       5479       4992     0.911
mattmaroon.com                 73       3332       3023     0.907
axios.com                      92      38075      34150     0.897
bizjournals.com                20       2183       1959     0.897
cnbc.com                      174      59983      53056     0.885
apple.com                     241      99945      88396     0.884
reason.com                     70      13143      11614     0.884
nypost.com                     28       5851       5088     0.870
markevanstech.com              22        290        251     0.866
macrumors.com                  62      18700      16162     0.864
nikkei.com                     56      17568      15174     0.864
economist.com                 829     119205     102702     0.862
thewalrus.ca                   30       6194       5199     0.839
techradar.com                  30       7227       6053     0.838
backreaction.blogspot.com      33       7209       5968     0.828
strongtowns.org                27       8279       6857     0.828
mondaynote.com                 45       7581       6268     0.827
coindesk.com                   22      10236       8355     0.816

And the 20 least spicy sites are:

particletree.com               37        997        227     0.228
brendangregg.com               40      11135       2512     0.226
intruders.tv                   28        324         73     0.225
aphyr.com                      34       8514       1910     0.224
andrewchen.typepad.com         51        757        168     0.222
michaelnielsen.org             31       3335        723     0.217
igvita.com                     38       3626        767     0.212
startuplessonslearned.blo      24       1101        232     0.211
citusdata.com                  51       8361       1717     0.205
ferd.ca                        21       5883       1132     0.192
ocks.org                       27       6036       1120     0.186
tensorflow.org                 22       5612       1020     0.182
aosabook.org                   21       3899        669     0.172
ocw.mit.edu                    41       8793       1500     0.171
david.weebly.com               20       1364        226     0.166
jslogan.com                    24         97         16     0.165
burningdoor.com                23        149         23     0.154
linusakesson.net               26       4531        684     0.151
github.com/0xax                22       2168        121     0.056

#HackerNews #HackerNewsAnalytics

dredmorbius,

I should have named the fields above:

  • Site name (domain, sometimes more)
  • Story count
  • Total votes across all stories.
  • Total comments across all stories.
  • Ratio (comments / votes).
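The cutoff-and-rank step that produced the tables above can be sketched like so (toy data; my real input is the ~52k-record per-site summary):

```shell
# Toy data; real input is the per-site summary of
# (site, stories, votes, comments):
printf '%s\t%s\t%s\t%s\n' \
    hot.example  25 1000 1200 \
    mild.example 30 2000  900 \
    rare.example  5  100  400 > /tmp/persite.tsv

awk -F'\t' '$2 >= 20 && $3 > 0 { printf "%s\t%.3f\n", $1, $4 / $3 }' \
    /tmp/persite.tsv |
sort -k2,2 -rn      # spiciest first; sites under the story cut-off dropped
```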

And, as usual for my data-dump posts, these read better on toot.cat than instances which don't pick up Markdown formatting. Click the date stamp to view original / formatted version.

dredmorbius,

Hacker News "Ratio": political commentary sites

Continuing my look at the comments/votes ratio: a look at sites which tend to focus on political commentary and their "spiciness". These tend to sit well above the mean (0.63) and median (0.52), and are often a standard deviation or more above the mean (1 sd: 0.78, 2 sd: 0.92, 3 sd: 1.06).

Stories Vote    Comm   Ratio  Site
   2      18      57   3.167  heritage.org
   4     143     224   1.566  hoover.org
   9     473     603   1.275  breitbart.com
   8    1724    1873   1.086  cityobservatory.org
   9     364     379   1.041  mises.org
   1      56      55   0.982  adamsmith.org
   7    2488    2372   0.953  city-journal.org
   1      92      85   0.924  manhattan-institute.org
  70   13143   11614   0.884  reason.com
   5     854     722   0.845  jacobinmag.com
   1     204     153   0.750  theblaze.com
  13    1607    1202   0.748  bostonreview.net
   5    1682    1252   0.744  tribunemag.co.uk
   4     629     465   0.739  nationaljournal.com
   5    1907    1400   0.734  americanaffairsjournal.or
  12    2164    1584   0.732  alternet.org
  10    1302     871   0.669  cato.org
   5     738     493   0.668  dailycaller.com
   9    1387     844   0.609  dailykos.com
   5     759     450   0.593  rawstory.com
  10    2538    1455   0.573  rootsofprogress.org
   2     552     275   0.498  theroot.com
  30    7881    3850   0.489  rt.com
   2    1256     467   0.372  wsws.org

Note that general news tends somewhat toward spicy, though not as much as the explicitly political sites. Of the 147 sites I'd identified as "general news", ratio statistics are:

n: 147, sum: 94.415, min: 0.092, max: comms,, mean: 0.642279, median: 0.605, sd: 0.433165

%-ile:

5: 0.234, 10: 0.341, 15: 0.4515,
20: 0.491, 25: 0.51, 30: 0.5305,
35: 0.5415, 40: 0.566, 45: 0.581,
55: 0.614, 60: 0.6285, 65: 0.654,
70: 0.68, 75: 0.716, 80: 0.734,
85: 0.7875, 90: 0.8715, 95: 1.1925

(As with other toots in this series, Markdown formatting is used, toot.cat may be better than your own instance's presentation.)

#HackerNews #HackerNewsAnalytics

dredmorbius,

Hacker News changed how it dealt with highly-active discussions around January 2009, based on the evidence I see (far fewer spicy threads after that date).

I'm also seeing that spicy stories actually tend to rank slightly higher on the page (a lower "storypos", that is, story position, value), which is counter to my expectation. This may of course be due to selection bias --- moderators specifically lift the limit on some overheated stories, so that those stories which do survive are more appropriate to HN.

I'd like to look at semantic / sentiment elements here as well, words or phrases which seem more prevalent on high-ratio stories. Here my analytic methods work against me as the HN title of a post is often quite short and not especially descriptive, though with some examples (as with the mental health study mentioned earlier).
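A crude first cut at that would be word frequencies over high-ratio titles. A sketch with toy data; the 1.0 ratio cutoff is arbitrary and there's no stopword handling:

```shell
# Toy data; the 1.0 ratio cutoff is arbitrary and there is no stopword
# handling here:
printf '%s\t%s\n' \
    1.8 'Guns and politics again' \
    0.2 'Show HN: a tiny parser' \
    1.5 'Politics of mass surveillance' > /tmp/titles.tsv

awk -F'\t' '$1 > 1.0 { print tolower($2) }' /tmp/titles.tsv |
tr -cs '[:alpha:]' '\n' |        # one word per line
sort | uniq -c | sort -rn | head # most frequent words in spicy titles
```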

#HackerNews #HackerNewsAnalytics
