Hey, all.
I didn’t see ed.gov on our list. I’m currently scraping it; it’s been running for some time now. I’ll let you know when I have it and how it looks.
Thanks.
thgie
Any success? I also got all the files from https://ocrdata.ed.gov/data
AlRoeh
Hey,
I guess there is much more to this, but I’m looking into the ERIC full-text DB.
It’s huge. I’ve had to pause/resume more than once. It’s still working. Currently at just over 2GB and ~3k files.
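(Side note for anyone following along: recursive grabs can be paused and resumed. A minimal sketch, assuming GNU wget; the thread doesn’t say which tool is actually in use, so treat the flags and URL as illustrative:)

```
# Recursive, resumable grab: re-running the same command continues
# partially downloaded files (-c) instead of starting over.
# --wait adds a one-second delay between requests to be gentle on the server.
wget -r -np -c --wait=1 https://eric.ed.gov/
```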
Quick update. Well over 3.5GB now. Still going.
13.5GB now, almost 8k files. The heavy lift is the PDFs: there are hundreds, maybe thousands of them, and they’re huge. I don’t think anyone using that system knows how to compress a file.
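(On the compression point: scanned PDFs can often be shrunk a lot with Ghostscript’s pdfwrite device. A sketch with placeholder file names; /ebook downsamples images to roughly 150 dpi, so spot-check the output before replacing anything:)

```
# Lossy re-encode of a PDF with downsampled images (verify output quality)
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/ebook \
   -dNOPAUSE -dQUIET -dBATCH \
   -sOutputFile=compressed.pdf original.pdf
```

For archival copies, though, it’s probably safer to keep the originals byte-for-byte and only compress derivatives.)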
Any advice from anyone on this? Should I just keep going? The main thing is that I have no idea how deep it goes; it could be a TB. Is there a command-line way to get the total storage size for a domain?
@schoeneh?
Have you got the storage capacity?
If yes, I say keep going
(Currently no idea re command line to get storage size)
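(For the record, there are two halves to that question. Locally, du reports how much has been pulled so far; for the remote side there’s no single command, but wget’s spider mode can walk the site without saving files and sum the sizes the server reports. A rough sketch, assuming GNU wget and awk; the directory name and URL are illustrative:)

```
# How much has been downloaded so far (local mirror directory)
du -sh eric.ed.gov/

# Rough remote estimate: crawl without saving files and add up the
# Content-Length values wget prints. Slow on a large site, and anything
# served without a length header counts as 0.
wget --spider -r -np https://eric.ed.gov/ 2>&1 \
  | awk '/^Length:/ { gsub(/,/, "", $2); sum += $2 }
         END { printf "%.1f GB\n", sum / 1e9 }'
```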
I have about 400GB. I can use about 300GB of that for this one site, then I’m stuck. And I’ll need to find a permanent home for it after that.
OK, that should be enough - those are PDFs, not datasets.
Don’t worry about the permanent home; we’ve got things in the works.
(A few days would be enough)
Valjean
I have ~10 TB available if you want me to make a copy. My internet is slow, though, and I can’t guarantee I won’t break something or lose files somehow, but I’m happy to hold a copy.
Thanks, everyone. If I get to a dangerous place, I’ll pause and report back on my status. Then we can figure out what to do from there.