Show HN: Scanned 1927-1945 Daily USFS Work Diary
by dogline
My great-grandfather Reuben P. Box was a US Forest Ranger in Northern California, and I've got his daily work diary from 1927-1945, through the depression, WWII, Conservation Corps, and lots of forest fires. I've scanned the entire thing, had Claude help with transcription, indexing, and web site building, and put the whole thing here:
This is one of those projects I've sat on for years, but Claude and Mistral helped with the handwriting recognition, and even helped me write a custom scanning app that would auto scan each page and put it into a database as I assembled everything.
As far as I know, this is the only US Forestry Diary that has been fully scanned in and published. I understand that there are other diaries in some collections, but none have been scanned in. I hope this helps somebody. Please let me know if it does.
This is the sort of project Claude and AI can help with: a personal project that would otherwise sit on the shelf forever, but is now reasonable to publish in my spare time. I'm not trying to earn money on this, just trying to improve our knowledge and history a little bit.
Also, just to clarify, I scanned all 7488 pages in personally (Fujitsu ScanSnap ix500). With Claude's help, I found some undocumented SANE features to auto crop and fix the scans, then had a Python script in Linux auto scan them and put them into a Postgres database as I went. Other scripts would add transcription, summaries, and auto index everything.
"mistral-ocr-latest" did really good handwriting transcription, considering how tight and small some of the handwriting is. Then back to Claude API calls to summarize by month and collect people and places from all of the entries.
Claude then created static html pages from what started as a Flask app. Published on Dreamhost.
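For anyone wanting to try the "put each scan into a database as you go" step described above, here is a minimal sketch. It is not the author's script: it assumes scans land as PNG files in a folder, it swaps in Python's built-in sqlite3 for Postgres so it runs without a server, and the table name, column names, and function names are all made up for illustration.

```python
import hashlib
import sqlite3
from pathlib import Path


def open_db(path=":memory:"):
    """Create the (hypothetical) pages table if needed and return the connection."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               id INTEGER PRIMARY KEY,
               filename TEXT NOT NULL,
               sha256 TEXT NOT NULL UNIQUE,
               transcription TEXT
           )"""
    )
    return conn


def ingest_scans(conn, scan_dir):
    """Record every *.png in scan_dir once; return the number of new rows."""
    added = 0
    for image in sorted(Path(scan_dir).glob("*.png")):
        # The content hash makes re-running the script safe: a page that was
        # already recorded is skipped by the UNIQUE constraint.
        digest = hashlib.sha256(image.read_bytes()).hexdigest()
        cur = conn.execute(
            "INSERT OR IGNORE INTO pages (filename, sha256) VALUES (?, ?)",
            (image.name, digest),
        )
        added += cur.rowcount
    conn.commit()
    return added
```

Because the insert is idempotent, the script can be re-run after every batch of pages, and later passes (transcription, summaries) can fill in the empty transcription column.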
Oh boy. #3 on front page, 19k page hits in the first hour. 8243 static html pages, 15728 webp images (10k-50k each).
I've never had one of my sites with this much traffic. With everything as static files, website is still holding. Thank you all.
Before your server catches fire and burns down the originals: please also send them to archive.org
A fresh training dataset ;-)
Yeah. If there are groups that want the high resolution images, talk to me.
Could consider putting it up as a dataset on Kaggle, perhaps? I would think they'd provide hosting for such things?
Archive.org would be another option as a repository for the high-res scans in an accessible / discoverable location.
That's amazing!
I'm working on a kinda similar project (documenting bank runs from historical newspapers) and also opted for Claude to build a static website. Crazy that the two sites have a very similar look and feel: https://www.finhist.com/bank-runs/index.html . The only big difference is that mine lacks a map, which I should hopefully fix soon (I already have lat and lon and am linking to google maps).
PS: Do you know if Mistral works better at OCRing handwritten text than Gemini 3? I was planning on going with Gemini 3 for another project.
That's cool! I've noticed when asking Claude for a website, it does have a certain look, like our two sites, if you don't give it any more guidance. I'm not sure if that's a good thing or not.
Digitizing history in different ways, with different resources that are unique or only known to small groups, might be a new development area, and that's exciting. As I've shown, and as other people have shared, using AI tools to digitize things that haven't been digitized before is now possible. Are there ways to make this easier for everybody? New techniques to discuss? I don't know, and I'd love to talk about it.
Concerning OCR: I used Mistral because of a posting here describing advancements with handwriting recognition a month or so ago. I didn't actually compare them. And I've got my setup that I can rerun everything again later if there are advancements in the area. Again, another area to keep track of and discuss.
Thanks for the insights! I'll try Mistral as well. Gemini has worked well for me so far, but which model is SOTA is changing quite frequently these days.
This is great! I love it when people take bits of history that would be forgotten and put them out in the world (to be further vacuumed up by Internet Archive). Thank you for doing it.
Beej! Thank you very much! Your networking guides have long been a great contribution to everybody, and collectively improves what we know.
These diary pages come largely from Stirling City, just north of Chico, and later from the Hat Creek district, on Hwy 89 north of Mt. Lassen. Nearby, many historical records were lost in the Paradise Camp Fire, and digitizing some of the records in some of the local museums is something this is a test run for.
-- Lance (CSU Chico '93 Computer Engineering)
Nice work! For others with journals in the U.S., but not feeling up to all the scanning and transcription work, I volunteer with the American Diary Project (https://americandiaryproject.com/) based in Cleveland, Ohio. You can donate journals to be archived and shared. It's only been established in the past few years, and all scanning/transcription is done by volunteers, but we are currently evaluating more automated pipelines like OP's. So great to see it in practice!
Do you know of any similar organization that would be interested in tackling a lavishly written and illustrated school newspaper published in Germany from 1902-1906? It's about 50 issues total, each about 12 pages long, and I've already scanned them all; I just need somebody to do the heavy lifting of setting up the OCR/transcription pipeline that can handle old German script. (Bonus: the newspaper was co-created by the future wife of a pretty famous person.)
Very cool. Some feedback:
- I think it would be a very large improvement if the actual diary pages/transcriptions were more accessible. I found the LLM summaries completely uncompelling, and did not particularly appreciate having to scroll through 5+ pages of LLM summary to get to the part where I could actually read the diary entries for a given month.
- The dates of the diary entries for many months are broken. For example, in the final month, all of the entries are labelled 1945-03-19. From a cursory examination, I believe the dating broke on 24th July 1941 and stayed broken for every month from there to the end.
- The page for Nov 1941 seems entirely broken. For some reason, the dates labelling the pages are described in a different format that included the name of the month rather than a numeric representation, the pages are out of order, and then all manner of months are mixed in. The first pages are "November 1941", "April 1941", "October 2 1941", "October 3 1941", "November 4 1941", "November 12 1941", "November 7 1941" ... and so on. The LLM summary notes an "Event", a construction project that took place from 1931 to 1934, despite this being the entry for Nov 1941.
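The date problems in the list above (mixed label formats, entries from other months, one date repeated across a whole month) are the kind of thing a small audit script can flag before publishing. This is only a sketch under assumptions: the two label formats are the ones quoted in this comment, and `parse_label` and `audit_month` are hypothetical helpers, not anything from the site's actual code.

```python
from collections import Counter
from datetime import datetime

# Formats seen in the thread: "1945-03-19" and "October 2 1941".
FORMATS = ("%Y-%m-%d", "%B %d %Y")


def parse_label(label):
    """Return a date for a page label, or None if no known format matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(label.strip(), fmt).date()
        except ValueError:
            continue
    return None


def audit_month(labels, year, month):
    """Return (unparsed, wrong_month, repeated) for one month page.

    unparsed:    labels with no usable day, e.g. "November 1941"
    wrong_month: labels that parse but belong to another month
    repeated:    ISO dates that label more than one page
    """
    unparsed, wrong_month = [], []
    seen = Counter()
    for label in labels:
        day = parse_label(label)
        if day is None:
            unparsed.append(label)
        elif (day.year, day.month) != (year, month):
            wrong_month.append(label)
        else:
            seen[day] += 1
    repeated = sorted(d.isoformat() for d, n in seen.items() if n > 1)
    return unparsed, wrong_month, repeated
```

A repeated date is not always an error (an entry can span two pages), so the third list is a prompt for a human look rather than a verdict.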
Addendum: After further consideration, I would like to offer two specific suggestions regarding the first point.
Low effort, minimal change suggestion: a link or table of contents header at the top of each month's page to jump to the diary entries.
Higher effort, bigger change suggestion: I think it would make for a significantly better reading experience if all of the diary pages and their transcriptions for a month were listed sequentially, such that you could seamlessly read them without clicking previous/next page.
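Both suggestions can be sketched in one small static-page generator: a jump link at the top, then every entry for the month on a single page. This is an illustration only, assuming each entry is a (date label, image URL, transcription) tuple; `render_month_page` and the `summary`/`entries` ids are made-up names, not the site's real markup.

```python
from html import escape


def render_month_page(title, summary, entries):
    """Return one HTML page: jump link, summary, then every entry in order.

    entries is a list of (date_label, image_url, transcription) tuples.
    """
    parts = [
        "<!doctype html>",
        f"<title>{escape(title)}</title>",
        f"<h1>{escape(title)}</h1>",
        # Low-effort suggestion: let readers skip straight past the summary.
        '<p><a href="#entries">Jump to diary entries</a></p>',
        f'<section id="summary"><p>{escape(summary)}</p></section>',
        '<section id="entries">',
    ]
    # Higher-effort suggestion: all scans and transcriptions on one page.
    for label, image_url, text in entries:
        parts.append(
            f"<article><h2>{escape(label)}</h2>"
            f'<img src="{escape(image_url)}" alt="Scan for {escape(label)}">'
            f"<p>{escape(text)}</p></article>"
        )
    parts.append("</section>")
    return "\n".join(parts)
```

With 10k-50k webp images a full month on one page is a fairly heavy load, so adding `loading="lazy"` to the `img` tags would be a reasonable follow-up.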
I think it's a bit of a waste to have put so much effort into preserving this, but the actual ability to read it is de-prioritised relative to the ability to read an LLM summary.
You're right, and it's a good idea. The summary started out small, as a header to the actual daily pages, but then I realized I could have AI do a lot more work here, including silly things like collect weather references and assemble them together. My prompt kept getting bigger to find trends in the data. But, it takes away from the view-ability of the site, which is not good.
An LLM's ability to take 7400+ handwritten entries and try to make a narrative out of them is amazing. With all of the AI experiments on HN lately, we're figuring out the power of LLMs, but in most cases, it still needs a human refining touch, and we need to remember that. Or else it just looks like AI slop.
I certainly don't think it's a bad thing to try to refine the information into a more digestible form. I think, for example, the dedicated "People", "Places", "Events", and "Map" sections are well-organized and interesting[1]. I would simply prefer if the presentation of this information did not detract from the ability to read the diary itself, as it does on the month pages. I am rather fond of reading historical diaries as part of a general curiosity about the past, and reading the experiences as they were written is as interesting to me if not more so than the aggregate information, personally.
[1] Although, of course, there is the question of reliability. For example, the "Boy Scouts" page says Boy Scouts have 2 mentions, but has references to 3 diary entries! Also, on further examination, Sep 1931 has broken dates (meaning my previous theory about it breaking only after Jul 1941 was wrong), and some pages appear to be out of order.
Imagine how many unanticipated historical perspectives might be uncovered if everyone uploaded paraphernalia of long-deceased ancestors like this. Once indexed and searched as one hyper-amalgamated crowdsourced knowledge graph, it could show who was where doing what in, say, the 1920s, 1930s, and 1940s in a way that mainstream history might fall short of capturing.
As a byproduct of wandering southern Indiana, I’ve taken several shallow dives into the history of people whose names I’ve found in cemeteries, or the counties I’ve explored, and it’s addictive.
There are so many interesting stories out there, from the attorney general who summed up the evidence presented at a trial as “the victim lynched himself and his fellow thieves in jail” to the couple with 6 male children who named each successive pair of boys using the same initials (e.g. Carl Ervin, Carwin Earl; Truman William, Tresman Walter; Llewellyn Purcell, Lealyn Percy).
The resources that are already available are amazing (one woman, Violet Toph, assembled thousands of pages of memories and genealogy records for her county in the mid-20th century) but obviously very incomplete. Your idea would be a terrific way to help fill in some of those gaps and encourage people to keep their own memoirs somewhere outside Facebook.
AI really loves that card style
Nicely done.
It inspires me to tackle a project I've been holding off on for many years: OCR my grandmother/great-grandmother's cookbook. It's about 100 pages of collected and annotated recipes from the 1930s-1980s.
OCR and AI have become sufficiently capable (as you've demonstrated) to properly scan, index, and classify the recipes into something I can share with relatives online or as an ebook.
Why are the scans not deskewed, and in some cases have their edges cut off?
Fun fact: "Government mule" isn't just an expression, it's a real thing. And the U.S. government, including the Forest Service, still employs teams of mules to carry things to places that can't be reached any other way.
I did a quick search, mules are mentioned 75 different times. Like this one at random from Sept 1942: https://forestrydiary.com/page/019bd90a-f176-713f-9999-b14b6...
"Fix up my packs. Load the 2 mules with 225# each. Take the 2 loads to trail camp at Lake Everett, Unload. Have lunch with the Trail cook. Haze mules & ride to 7 1/2 PM."
Horses are mentioned 2586 times. That'd be a whole study on how they're used in the back country. (Edit: horse number is inflated since part of the diary form at one point asks for "Horse Mileage". Will have to refine search).
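One possible way to refine that search is a regex that counts "horse"/"horses" but skips the printed form label. A minimal sketch, assuming the label appears in the transcriptions literally as "Horse Mileage" (the pattern and function name are illustrative, not from the site's code):

```python
import re

# Whole-word "horse" or "horses", unless immediately followed by "Mileage"
# (the printed form field mentioned above). Case-insensitive.
HORSE = re.compile(r"\bhorses?\b(?!\s+mileage)", re.IGNORECASE)


def count_horse_mentions(text):
    """Count horse mentions in a transcription, ignoring the form label."""
    return len(HORSE.findall(text))


sample = "Horse Mileage: 12. Rode my horse to the lookout. Shod two horses."
print(count_horse_mentions(sample))  # prints 2
```

The word boundaries also keep words like "horseshoe" or "horseback" out of the count; drop the trailing `\b` if those should be included.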
If you go backpacking in the Sierra Nevada (or other mountains, surely) you may just run into a mule train carrying a trail maintenance crew and their gear.
Well done! Have you uploaded these scans to the Internet Archive? If not, please consider doing so.
https://help.archive.org/help/uploading-a-basic-guide/
https://help.archive.org/help/managing-and-editing-your-item...
Trail Crew Stories and Mountain Gazette might also be interested in this.
Hadn't thought about it, but will take a look. Also, the two Forestry type links look very interesting. I figure there must be interest in this sort of thing - this is one resource, and the Stirling City Historical Society (Lassen NF) has a bunch of other documents I'd love to digitize soon.