Hey, all.

I didn’t see ed.gov on our list, so I’m currently scraping it. It’s been running for a while now; I’ll let you know when I have it here and how it looks.

Thanks.


Any success? I also got all the files from https://ocrdata.ed.gov/data


Hey,
I suspect there’s much more to this, but I’m looking into the ERIC full-text DB.

It’s huge. I’ve had to pause/resume more than once, but it’s still going. Currently at just over 2 GB and ~3k files.
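
For anyone curious, the pause/resume part is roughly this; a minimal sketch, assuming the server honors HTTP Range requests (the URL is a placeholder, not a real document ID):

```python
# Rough shape of the pause/resume logic; a sketch, assuming the server
# honors HTTP Range requests. The URL below is a placeholder.
import os
import requests

def resume_download(url: str, dest: str) -> None:
    # Pick up from however many bytes are already on disk.
    pos = os.path.getsize(dest) if os.path.exists(dest) else 0
    headers = {"Range": f"bytes={pos}-"} if pos else {}
    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        r.raise_for_status()
        # 206 Partial Content means the server resumed; a plain 200 means
        # it ignored the Range header and is resending the whole file.
        mode = "ab" if r.status_code == 206 else "wb"
        with open(dest, mode) as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)

resume_download("https://files.eric.ed.gov/fulltext/ED000001.pdf", "ED000001.pdf")
```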

Quick update: well over 3.5 GB now. Still going.

13.5 GB now, almost 8k files. The heavy lift is the PDFs: there are hundreds, maybe thousands of them, and they’re huge. I don’t think anyone using that system knows how to compress a file.
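
(Those numbers come from walking the local mirror; a quick sketch, with the directory name as a placeholder:)

```python
# Count files and total bytes under the local mirror directory
# ("eric-mirror" is a placeholder for wherever the copy lives).
import os

total_bytes, file_count = 0, 0
for root, _dirs, files in os.walk("eric-mirror"):
    for name in files:
        total_bytes += os.path.getsize(os.path.join(root, name))
        file_count += 1

print(f"{file_count} files, {total_bytes / 1e9:.1f} GB")
```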

Any advice on this? Should I just keep going? The main thing is that I have no idea how deep it goes; it could be a terabyte. Is there a command-line way to get the total storage size for a domain before downloading it?
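
The closest I’ve come up with is summing Content-Length from HEAD requests over a URL list from a spider pass; a rough sketch (the sample URL is a placeholder), in case there’s nothing simpler:

```python
# Estimate total download size for a set of remote files by summing
# Content-Length headers from HEAD requests. Assumes a URL list from a
# spider pass; files whose servers omit Content-Length are just counted.
import requests

def estimate_total(urls):
    known_bytes, unknown = 0, 0
    for url in urls:
        r = requests.head(url, allow_redirects=True, timeout=30)
        size = r.headers.get("Content-Length")
        if size is None:
            unknown += 1
        else:
            known_bytes += int(size)
    return known_bytes, unknown

known, unknown = estimate_total([
    "https://files.eric.ed.gov/fulltext/ED000001.pdf",  # placeholder
])
print(f"~{known / 1e9:.1f} GB reported; {unknown} files of unknown size")
```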

@schoeneh?


Have you got the storage capacity?
If yes, I say keep going.

(No idea offhand about a command-line way to get storage size.)


I have about 400 GB. I can use about 300 GB of that for this one site; after that I’m stuck, and I’ll need to find a permanent home for the data.


OK, that should be enough; those are PDFs, not datasets.
Don’t worry about the permanent home, we’ve got things in the works.
(Holding it for a few days would be enough.)


I have ~10 TB available if you want me to make a copy. My internet is slow, though, and I can’t guarantee I won’t break something or lose files somehow, but I’m happy to hold one.


Thanks, everyone. If I start running low on space, I’ll pause and report back on my status. Then we can figure out what to do from there.
