Archive additional wiki resources (WIP) #223

Open
wants to merge 2 commits into
base: master
23 changes: 21 additions & 2 deletions dumpgenerator.py
@@ -1171,6 +1171,12 @@ def getParameters(params=[]):
         help='store only the current version of pages')
     groupDownload.add_argument(
         '--images', action='store_true', help="generates an image dump")
+    groupDownload.add_argument(
+        '--resources',
+        default="html",
+        choices=["html","dir","warc"],
+        help="""generate a backup of Main Page as HTML or with resources (CSS, etc.). The dir
+        and warc options require wget and may leave your IP address in the requisites.""")
     groupDownload.add_argument(
         '--namespaces',
         metavar="1,2,3",
@@ -1200,7 +1206,7 @@ def getParameters(params=[]):
         sys.exit(1)

     # No download params and no meta info params? Exit
-    if (not args.xml and not args.images) and \
+    if (not args.xml and not args.images and args.resources == 'html') and \
Member

and not?

Member Author

I kept html as the default though (the same as the current version which saves only index.html). This is because the other settings don't remove the IP address (yet).

             (not args.get_wiki_engine):
         print 'ERROR: Use at least one download param or meta info param'
         parser.print_help()
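
To illustrate the check discussed in this thread, here is a minimal sketch (not the PR's code) of how the `--resources` default interacts with the "no download params" test. The `--xml`, `--images`, and `--resources` options mirror the diff; the `--get-wiki-engine` spelling, the `build_parser` and `has_work` names, and the Python 3 syntax are assumptions made for illustration.

```python
import argparse


def build_parser():
    """Build a reduced parser with only the options relevant to the check."""
    parser = argparse.ArgumentParser(prog='dumpgenerator.py')
    parser.add_argument('--xml', action='store_true')
    parser.add_argument('--images', action='store_true')
    # 'html' stays the default, so plain runs behave as before the change.
    parser.add_argument('--resources', default='html',
                        choices=['html', 'dir', 'warc'])
    # Assumed flag spelling; the diff only shows args.get_wiki_engine.
    parser.add_argument('--get-wiki-engine', action='store_true')
    return parser


def has_work(args):
    """Return True if any download or meta info option was requested."""
    wants_download = (args.xml or args.images
                      or args.resources != 'html')
    return bool(wants_download or args.get_wiki_engine)
```

With this shape, `--resources dir` or `--resources warc` counts as a download request on its own, while the default `html` value does not, which matches the condition in the diff.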
@@ -1335,6 +1341,7 @@ def getParameters(params=[]):
         'xml': args.xml,
         'namespaces': namespaces,
         'exnamespaces': exnamespaces,
+        'resources': args.resources,
         'path': args.path or '',
         'cookies': args.cookies or '',
         'delay': args.delay
@@ -1641,7 +1648,19 @@ def saveSpecialVersion(config={}, session=None):

 def saveIndexPHP(config={}, session=None):
     """ Save index.php as .html, to preserve license details available at the bottom of the page """

+    escaped_index = "'" + config['index'].replace("'", "'\\''") + "'"
+    escaped_path = "'" + (config['path'] + '/requisites').replace("'", "'\\''") + "'"
Member

Why not use subprocess.check_call? This seems like a simple enough command, so it should work, I think.
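
A minimal sketch of that suggestion, under stated assumptions: passing the arguments as a list to `subprocess.check_call` bypasses the shell, so the manual single-quote escaping would no longer be needed. The wget flags are copied from the diff; the helper names `build_wget_command` and `download_requisites` are hypothetical and not part of dumpgenerator.py.

```python
import subprocess


def build_wget_command(index_url, dest_dir, warc=False):
    """Build the wget argument list used to fetch a page with its requisites.

    Because the arguments are passed as a list, no shell is involved and
    neither the URL nor the path needs quoting.
    """
    cmd = ['wget', '-e', 'robots=off', '-p', '-k', '-H', '-nd',
           '-P', dest_dir, '--restrict-file-names=windows']
    if warc:
        # Same value as in the diff: the WARC file name reuses the
        # requisites path as its prefix.
        cmd.append('--warc-file=' + dest_dir)
    cmd.append(index_url)
    return cmd


def download_requisites(index_url, dest_dir, warc=False):
    """Run wget; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_wget_command(index_url, dest_dir, warc))
```

One design point to weigh: unlike `os.system`, `check_call` raises when wget exits non-zero, and wget may do so if any single requisite fails to download, so a caller would probably want to catch `subprocess.CalledProcessError` rather than abort the whole dump.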

+    wget_dir = 'wget -e robots=off -p -k -H -nd -P %s --restrict-file-names=windows'
+    wget_dir %= escaped_path
+    wget_warc = wget_dir + ' --warc-file=%s' % escaped_path
+    if config['resources'] == 'warc':
+        print 'Downloading index.php (Main Page) with all resources to requisites and requisites.warc.gz'
+        os.system(wget_warc + ' ' + escaped_index)
+        return
+    if config['resources'] == 'dir':
+        print 'Downloading index.php (Main Page) with all resources to requisites'
+        os.system(wget_dir + ' ' + escaped_index)
+        return
     if os.path.exists('%s/index.html' % (config['path'])):
         print 'index.html exists, do not overwrite'
     else: