* html files are now stored as follows:
  - If the html file is valid xml, store it as html/stuff.xml.
  - If it's not, store html/stuff.xml, which contains <html meta1="..." filename="stuff.html">, and html/stuff.html, which actually contains the contents.
  - Warn if the contents are not parseable with lxml's html parser, but don't error.
* for parseable html, strip out the html tag when storing, so that it isn't rendered into the middle of a page
* lots of backcompat to deal with paths. Can go away soon.
* fix output ordering in clean_xml
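The storage rule in the first bullet can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this commit: `plan_html_storage` and its return shape are hypothetical, the metadata attributes on the pointer element are left out, and stripping the html tag is not shown.

```python
import logging

from lxml import etree
from lxml import html as lxml_html

log = logging.getLogger(__name__)


def plan_html_storage(name, contents):
    """Hypothetical helper: map storage paths to the text written at each."""
    try:
        etree.fromstring(contents)
    except etree.XMLSyntaxError:
        pass
    else:
        # Valid xml: a single file is enough.
        return {'html/%s.xml' % name: contents}

    try:
        lxml_html.fromstring(contents)
    except (etree.ParserError, etree.XMLSyntaxError):
        # Warn, but don't error.
        log.warning('%s.html is not parseable by the lxml html parser', name)

    # Not valid xml: a pointer element plus the raw contents in a sibling file.
    pointer = etree.Element('html', filename='%s.html' % name)
    return {
        'html/%s.xml' % name: etree.tostring(pointer, encoding='unicode'),
        'html/%s.html' % name: contents,
    }
```

Calling `plan_html_storage('stuff', '<p>Hi <br> there')` should yield both paths, because the unclosed `<br>` makes the contents invalid as xml while lxml's html parser still accepts them.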
21 lines
708 B
Python
from itertools import chain

from lxml import etree


def stringify_children(node):
    '''
    Return all contents of an xml tree, without the outside tags.
    e.g. if node is parse of
        "<html a="b" foo="bar">Hi <div>there <span>Bruce</span><b>!</b></div></html>"
    should return
        "Hi <div>there <span>Bruce</span><b>!</b></div>"

    fixed from
    http://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml
    '''
    # etree.tostring() appends an element's tail by default; pass with_tail=False
    # so each tail is added exactly once, and encoding='unicode' so the
    # serialized children are text that can be joined with node.text.
    parts = ([node.text] +
             list(chain.from_iterable(
                 [etree.tostring(c, with_tail=False, encoding='unicode'), c.tail]
                 for c in node)))
    # filter removes possible Nones in texts and tails
    return ''.join(filter(None, parts))