How to Cache or copy a website?
Tenshi_420
Member
I am trying to cache or copy an entire website's contents to save as a backup. I am doing this out of self-preservation, not to clone the site or set up phishing. How can I go about this?
The site contains mostly text with images. Nothing too fancy, just cache/copy.
If something happens, I want to be able to offer copies in an emergency. It's a bit of a secret and very time-sensitive project that I am trying to accomplish as soon as I can.
Thanks for taking time to read. (:
P.S. I do not own or have access to this website other than its public www.
Comments
HTTrack should do the trick: http://www.httrack.com/
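If you prefer the command line, HTTrack also ships a CLI. A rough example of a mirror run; the URL, output directory, and filter pattern below are only placeholders:
httrack "http://www.domain.com/" -O "$HOME/mirrors/domain" "+*.domain.com/*" -v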
Wget will do the job
wget \
  --recursive \
  --no-clobber \
  --page-requisites \
  --html-extension \
  --convert-links \
  --restrict-file-names=windows \
  --domains www.domain.com \
  --no-parent \
  http://www.domain.com/folder/
Just a friendly reminder from the admin of a website that gets "copied" every day by people who want to "archive" the content: it's a pain in the ass, can put heavy load on the server, and causes quite a bit of trouble for the admin. Be gentle...
For the others, thanks. I'll try it when I have more time. And:
Oh, thanks for the heads up. I think I would cache (when I can) a page every 9 seconds, maybe write a cheesy but workable bash script and cron it (with supervision).
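For reference, a minimal sketch of such a script, assuming wget is used; the URL, output path, and cron schedule are placeholders, and --wait=9 gives the roughly one-request-per-9-seconds pacing mentioned above:
#!/usr/bin/env bash
# backup-site.sh -- hypothetical example; URL and output directory are placeholders
set -euo pipefail

URL="http://www.domain.com/"
DEST="$HOME/site-backups/$(date +%F)"   # one dated folder per run
mkdir -p "$DEST"

# --wait=9 pauses ~9 seconds between requests to keep the load on the server low
wget \
  --recursive \
  --page-requisites \
  --convert-links \
  --no-parent \
  --wait=9 \
  --directory-prefix="$DEST" \
  "$URL"

# example crontab entry to run it nightly at 02:00 (supervise the first few runs):
#   0 2 * * * /home/user/backup-site.sh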
What site do you run O.O
I am responsible for the technical part (not the content) of a niche adult gallery. 15,000+ images and there are people who try to "cache" them all locally on a daily basis...
@Amitz can you share a link please?
A link would be better.
No, sorry.
I do not want the site to get associated (publicly) with me. I just take care of the backend, not the content.
Use wget, as suggested. There are lots of command-line options to do what you want, including "mirror".
And there's a "wait" parameter (-w) to insert a delay between requests so you don't create an issue like this. Use a healthy wait and queue the job up before going to bed... Next morning you're good to go.
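For example, something along these lines (the domain and delay values are only illustrative):
wget --mirror --convert-links --page-requisites --no-parent --wait=10 --random-wait http://www.domain.com/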
easy way to do it on your home desktop:
WinHTTrack
Download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer.
Personally I block it on my servers for the abuse issues that @Amitz raised above.
Create a weekly archive by cron (it may create some load on the server, but it's only once a week). Use nginx on an unmetered server to host the archive. :P
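A rough sketch of that setup, assuming the mirror command lives in a script like the one sketched earlier; the schedule, paths, and hostname are placeholders:
# crontab: refresh the archive every Sunday at 03:00
0 3 * * 0 /home/user/backup-site.sh

# minimal nginx server block to serve the archived copy
server {
    listen 80;
    server_name archive.example.com;
    root /home/user/site-backups;
    autoindex on;   # simple directory listing of the dated snapshots
}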