What is a good way of crawling and archiving a complete website for offline viewing? One of the services that has a lot of data that is important to me is shutting down without any options for archiving my data. It has a pretty JavaScript heavy UI and is protected with a login page that includes MFA. Ideally I'd be able to save it to a nice format like WARC.
What’s the Best Way to Store Data for Decades or Centuries? Bottom line: No Technology that is Easy or Practical
The concern keeps coming up (I’ve also been pondering it a lot, and last week I posted about what I’ve been doing).
This linked article does sum up the essentials very well, and this helps illustrate why this is a challenge for 20 or 60 years or especially ...continues
#MastodonHelp
Is there any tool to download public posts and retoots and reply toots of an inactive Mastodon account? I forgot to backup my posts before migrating, and I don't know a method that doesn't involve disabling inactivity on the account (and not sure what that will do).
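One possible approach, sketched here rather than offered as a finished tool: many Mastodon instances serve an account's public statuses over the REST API without a login, through the `accounts/lookup` and `accounts/:id/statuses` endpoints. The instance URL and username below are placeholders, the function names are made up for illustration, and some instances require authentication or limit what a moved account still exposes, so treat this as a starting point.

```python
import json
import urllib.parse
import urllib.request


def get_json(url):
    """Fetch a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def fetch_public_statuses(instance, acct, fetch=get_json, page_size=40):
    """Collect the public statuses (posts, boosts, replies) of one account.

    `instance` is a base URL such as "https://example.social" (placeholder);
    `acct` is the username on that instance.
    """
    query = urllib.parse.urlencode({"acct": acct})
    account_id = fetch(f"{instance}/api/v1/accounts/lookup?{query}")["id"]

    statuses = []
    max_id = None
    while True:
        params = {"limit": page_size}
        if max_id is not None:
            # Ask only for statuses older than the last one already seen.
            params["max_id"] = max_id
        query = urllib.parse.urlencode(params)
        page = fetch(f"{instance}/api/v1/accounts/{account_id}/statuses?{query}")
        if not page:
            break
        statuses.extend(page)
        max_id = page[-1]["id"]
    return statuses


if __name__ == "__main__":
    # Placeholder instance and username; replace with the old account's.
    posts = fetch_public_statuses("https://example.social", "someuser")
    with open("statuses.json", "w", encoding="utf-8") as out:
        json.dump(posts, out, ensure_ascii=False, indent=2)
```

In the returned JSON, boosts carry a non-null `reblog` field and replies carry `in_reply_to_id`, so the three kinds can be separated afterwards. Adding a short pause between pages would be kinder to the server.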
Some folks on Mastodon delete posts after a period, sometimes for privacy, sometimes to save server space.
Is there a nice way to download a thread/archive it?
I don’t want to distribute them. I think there are two cases:
1. I like having an archive of stuff I’ve said so I can look at it years later.
2. People have good advice/essays and I’d like to read them in the future.
#MUSIC #IDENTIFICATION HELP WANTED!
I've finally had an open reel transferred that I bought from a recycling shop in #Toronto, ON, many, many years ago.
Unfortunately I have no idea who the artist is or what the album or song names are!
Do you recognize any of the music, vocals or instrumentation in this sample I've prepared?
"In the in-depth study that was published in partnership with the Software Preservation Network, it was revealed that just 13% of all games released before 2010 are commercially or readily available today."
So uh. Best software to rip DVDs? I've tried with VLC, but it spent half an hour going through the entire 2 hour movie and then rendered only the 8 second intro to file 😬
I don't need the whole menu and all, but I need to be able to get the video, the right audio track, and the right subtitle track. I've got a bunch of old DVDs here, 10+ years old and sometimes more, that I'd like to archive before bitrot sets in.
I have an org file for a long-running project. It's getting hard to manage because there are lots of different tasks, events, etc.
I think I want to create an "archive version" of that file, which would have the same structure but store only items with a timestamp older than, say, 2 months. That would require two basic steps:
1. extracting a subtree from the original file;
2. merging the extracted subtree into the archived version.
I could implement that, but I wonder whether there is an existing way to do it, or some other approach that would address the same issue?
Thanks Amy @grinn for pointing me to the necessary pieces of org-refile! It would have taken much longer to figure out otherwise.
I've made a function that org-refiles the entry at point into "archive/<file-name>.org", preserving the header structure. I only had to implement creating nonexistent headers because `org-refile' can create just one level out-of-the-box.
And another function that performs that operation on all entries found by `org-ql'.
“NARA will block access to commercial ChatGPT on NARANet [an internal network] and on NARA issued laptops, tablets, desktop computers, and mobile phones beginning May 6, 2024,” an email sent to all employees, and seen by 404 Media, reads. “NARA is taking this action to protect our data from security threats associated with use of ChatGPT.”
The move is particularly notable considering that this directive is coming from, well, the National Archives, whose job is to keep an accurate historical record. The email explaining the ban says the agency is particularly concerned with internal government data being incorporated into ChatGPT and leaking through its services.
What’s the Value of 3 Million LPs in a Digital World? Easy! They can be Played still in 50+ Years’ Time!
The ARChive of Contemporary Music has one of the largest collections of vinyl records in the world and is in danger of losing its home. Its champions are making a case for the future of physical media.
If someplace like a university starts a digitization p ...continues
Harvard Library Innovation Lab: WARC-GPT: An Open-Source Tool for Exploring Web Archives Using AI
"...an open-source, highly-customizable Retrieval Augmented Generation tool the web archiving community can use to explore the intersection between web #archiving and #AI. WARC-GPT allows for creating custom chatbots that use a set of #web #archive files as their knowledge base, letting users explore collections through conversation." 👏
Occasional reminder that the Internet Archive provides a number of tools and browser plugins to let you send pages to the Wayback Machine (as well as check if a given page has been saved):
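For anyone who prefers a script to a plugin, the "has this page been saved" check can be sketched against the Wayback Machine availability endpoint at `archive.org/wayback/available`. The helper names here are made up for illustration, and the response handling assumes the JSON layout that endpoint has returned in the past (`archived_snapshots` containing a `closest` entry), so verify it before relying on it.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(page_url):
    """Build the availability query for one page."""
    return AVAILABILITY_ENDPOINT + "?" + urllib.parse.urlencode({"url": page_url})


def closest_snapshot(response):
    """Return the URL of the closest snapshot, or None if nothing is archived."""
    snapshot = response.get("archived_snapshots", {}).get("closest")
    if snapshot and snapshot.get("available"):
        return snapshot["url"]
    return None


def lookup(page_url):
    """Ask the Wayback Machine whether it holds a copy of page_url."""
    with urllib.request.urlopen(availability_url(page_url), timeout=30) as resp:
        return closest_snapshot(json.load(resp))


if __name__ == "__main__":
    print(lookup("https://example.com/"))
```

Pages that come back as `None` are candidates for the save tools mentioned above.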
Checked my 6,921 bookmarks on Pinboard.in: 3,462 hit dead ends with 404s or expired domains, and many of the 3,459 left show fake content or parking pages. Only 21% from the last 2 years still work as expected. The lifespan of URLs is definitely shrinking.
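A check like this can be roughly reproduced with a short script. The sketch below is not the method used for the numbers above; the category names and the `bookmarks.txt` file (one URL per line) are assumptions for illustration. It only looks at HTTP status and reachability, so it cannot spot parking pages or fake content that answer with a 200.

```python
from collections import Counter
import urllib.error
import urllib.request


def check_url(url, timeout=10):
    """Classify one bookmark as "ok", "gone", "http_error" or "unreachable"."""
    try:
        req = urllib.request.Request(url, headers={"User-Agent": "bookmark-check-sketch"})
        with urllib.request.urlopen(req, timeout=timeout):
            return "ok"
    except urllib.error.HTTPError as err:
        # 404 and 410 mean the page itself is missing.
        return "gone" if err.code in (404, 410) else "http_error"
    except (OSError, ValueError):
        # DNS failures, timeouts, refused connections, malformed URLs.
        return "unreachable"


def summarize(urls, check=check_url):
    """Count how many bookmarks fall into each category."""
    return Counter(check(url) for url in urls)


if __name__ == "__main__":
    # bookmarks.txt is a hypothetical export with one URL per line.
    with open("bookmarks.txt", encoding="utf-8") as handle:
        urls = [line.strip() for line in handle if line.strip()]
    for category, count in summarize(urls).most_common():
        print(f"{category}: {count}")
```

Telling parking pages apart from real content would need a comparison against an archived copy or a manual look.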
#Archiving #AcademicPublishing #DigitalPreservation: "When Eve broke down the results by publisher, less than 1 percent of the 204 publishers had put the majority of their content into multiple archives. (The cutoff was 75 percent of their content in three or more archives.) Fewer than 10 percent had put more than half their content in at least two archives. And a full third seemed to be doing no organized archiving at all.
At the individual publication level, under 60 percent were present in at least one archive, and over a quarter didn't appear to be in any of the archives at all. (Another 14 percent were published too recently to have been archived or had incomplete records.)
The good news is that large academic publishers appear to be reasonably good about getting things into archives; most of the unarchived issues stem from smaller publishers.
Eve acknowledges that the study has limits, primarily in that there may be additional archives he hasn't checked. There are some prominent dark archives that he didn't have access to, as well as things like Sci-hub, which violates copyright in order to make material from for-profit publishers available to the public. Finally, individual publishers may have their own archiving system in place that could keep publications from disappearing."
Need help on saving reddit threads (for post-blackout reasons) to Obsidian
AI-TRIGGER WARNING: I've asked ChatGPT to revise my writing because it was ass (writing a stream of coherent looking text is not my forte). Proceed at your own discretion....
OC If you want to save the existing reddit content for future off-reddit use, you should get involved with Archiveteam
Archiveteam's Reddit project is working to save reddit content from the hungry maw of corporate destruction....