posts.xml contains the htmlified versions of the posts, but posthistory.xml contains the markdown source.
]]>
Actually, by looking at posts.xml we're looking at the final rendered HTML for posts. That is, the Markdown syntax for including images that you mention has already been converted to standard HTML <img/> tags.
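If you do want to count the raw Markdown image markers, here is a minimal sketch. It assumes you search the Markdown source (posthistory.xml rather than posts.xml), and the function name count_markdown_images is made up for illustration. The -F flag makes grep treat the pattern as a fixed string, so nothing needs escaping, and the single quotes keep the shell from touching the "!".

```shell
# Hypothetical helper: count occurrences of the Markdown image marker "![" in a file.
# -F treats the pattern as a literal string, -o prints one line per occurrence.
count_markdown_images() {
  grep -o -F '![' "$1" | wc -l | tr -d ' '
}
```

Something like `count_markdown_images posthistory.xml` would then print a single number. Not every "![" is necessarily an image, so treat the count as an upper bound.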
]]>![ gives me 9 occurrences, which are not links. Either I am not escaping the pattern correctly (I never remember what to escape when using which tool :/ ) or the links are stored in one format and presented to the user (when editing, say) in another.
]]>
grep -o "&lt;img src=&quot;[^&]*&quot;" posts.xml | sed -e "s/.*&quot;\(.*\.\(png\|gif\|jpg\)\)&quot;/\1/" | xargs -n 1 wget
LATER: In fact, quite a few of the img tags point to latex.mathoverflow.net, which one does not want, so
grep -o "&lt;img src=&quot;[^&]*&quot;" posts.xml |
sed -e '/latex.mathoverflow.net/d' -e 's/&lt;img src=*&quot;\(.*\)&quot;/\1/' |
xargs -n 1 wget
is a better alternative.
By the way, with the last dump
grep -o "&lt;img src=&quot;[^&]*&quot;" posts.xml |
sed -e '/latex.mathoverflow.net/d' -e 's/&lt;img src=*&quot;\(.*\)&quot;/\1/' |
xargs -n 1 HEAD -d -t 3 |
sort |
uniq -c
(which uses a short timeout) returns
505 200 OK
40 204 No Content
55 403 Forbidden
19 404 Not Found
1 404 NOT FOUND
3 405 Method Not Allowed
4 500 Can't connect to cs.smith.edu:80 (connect: timeout)
1 500 Can't connect to img843.imageshack.us:80 (connect: timeout)
1 500 Can't connect to math.huji.ac.il:80 (connect: timeout)
3 500 Can't connect to maven.smith.edu:80 (connect: timeout)
1 500 Can't connect to upload.wikimedia.org:80 (connect: timeout)
5 500 Can't connect to www.freeimagehosting.net:80 (connect: timeout)
2 500 Can't connect to www.math.hawaii.edu:80 (connect: timeout)
1 500 Can't connect to www.maths.ed.ac.uk:80 (connect: Connection refused)
22 500 read timeout
5 501 Protocol scheme 'https' is not supported (Crypt::SSLeay or IO::Socket::SSL not installed)
(
]]>grep -o "&lt;a href=&quot;[^&]*&quot;" < posts.xml | sed -e "s/&lt;a href=&quot;\(.*\)&quot;/\1/"
will give you a list of all links. (Sorry, my bash scripting doesn't extend to awk, or whatever one is really meant to use here.) After that you'd want to choose things likely to be images, and download them. The miracle still comes later.
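For the "choose things likely to be images" step, a minimal sketch, assuming the link list arrives one URL per line on standard input. The function name filter_image_urls and the list of extensions are my own assumptions, not anything taken from the dump.

```shell
# Hypothetical helper: keep only URLs that end in a common image extension,
# optionally followed by a query string. Matching is case-insensitive.
# Reads one URL per line on stdin and writes the matching URLs to stdout.
filter_image_urls() {
  grep -i -E '\.(png|gif|jpe?g)(\?.*)?$'
}
```

Piping the link list through filter_image_urls and then into xargs -n 1 wget would download the candidates. Filtering by extension will miss images served from URLs without one, so this is only a first pass.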
]]>Of course the usual applies --- we have no control over the software we run, and, as discussed on another thread here, migrating to SE 2.0 looks unlikely for now.
]]>