XMLSlurper really slow reading/parsing html/xml file

Discussion:

mjfan80

2011-04-07 12:55:02 UTC

In my project I need to parse an HTML file (well-formed, as XHTML).
The file is not big: some styles, a table with many td cells, and
other stuff.
The HTML file is 8 KB.

The same file is parsed by the PDF plugin (so Flying Saucer) to make a PDF,
and this is really quick (less than one second, I think).

But if the same file is parsed by XmlSlurper it takes 80 seconds... yes,
80 seconds.
I tried XmlSlurper, XmlParser, and also the Java XMLStreamReader, and it
takes between 70 and 80 seconds.
I don't know why it is so slow.

The html file is stored locally on the server (so no time for download)

This is what I do to find, in the HTML file, a table
with the class set to "report" (then I will do something with this
table):

import groovy.util.slurpersupport.GPathResult

// XmlSlurper (not XmlParser) so that the result is a GPathResult,
// which is the type trovaTableReport expects.
def doc = new XmlSlurper().parse(urlFile)
def body = doc.body
def report = trovaTableReport(body)

// Recursively searches nodo for a table whose class attribute contains "report".
public GPathResult trovaTableReport(GPathResult nodo) {
    if (nodo == null) return null
    // Check the direct table children of this node first.
    def report = nodo.table.find { it.@class.text().contains("report") }
    if (report != null && !report.isEmpty()) return report
    // Otherwise descend into every child; as in the original, the last match wins.
    nodo.children().each { figlio ->
        def reportInterno = trovaTableReport(figlio)
        if (reportInterno != null && !reportInterno.isEmpty()) report = reportInterno
    }
    return (report != null && !report.isEmpty()) ? report : null
}

Can someone tell me why it takes so long to parse a simple HTML file?

--
View this message in context: http://grails.1312388.n4.nabble.com/XMLSlurper-really-slow-reading-parsing-html-xml-file-tp3433305p3433305.html
Sent from the Grails - user mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe from this list, please visit:

http://xircles.codehaus.org/manage_email

Wolfgang Schell

2011-04-08 08:03:40 UTC

Does your (X)HTML file have a DTD, an XML Schema declaration, or something
similar that contains external URLs? Maybe the parser reaches out to the Net
and tries to locate the DTDs or XML Schemas?

HTH,

Wolfgang
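
A minimal sketch of how one could test this theory without editing the file: tell the parser not to fetch the external DTD. It assumes the SAX parser behind XmlSlurper is Xerces (the JDK default), since the feature URI below is Xerces-specific. The sample document is made up; only its DOCTYPE comes from this thread.

```groovy
// XmlSlurper is auto-imported from groovy.util in older Groovy versions;
// on recent versions, where it moved, add: import groovy.xml.XmlSlurper

// Hypothetical sample document; only the DOCTYPE is taken from the thread.
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body><table class="report"><tr><td>x</td></tr></table></body></html>'''

// Non-validating, not namespace aware.
def slurper = new XmlSlurper(false, false)

// Xerces-specific feature: do not load the external DTD.
// Assumption: the underlying parser is Xerces; other parsers may reject this URI.
slurper.setFeature('http://apache.org/xml/features/nonvalidating/load-external-dtd', false)

def html = slurper.parseText(xhtml)
def report = html.body.table.find { it.@class.text().contains('report') }
println report.tr.td.text()
```

If the parse time drops to milliseconds with the feature switched off, the DTD download was most likely the bottleneck.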

mjfan80

2011-04-11 07:06:58 UTC

I tried deleting the DOCTYPE declaration (which has an external DTD
declaration), but nothing changed: the time is still 80 seconds.

Luis Muniz

2011-04-11 14:09:09 UTC

Is it this line that takes 80 s?
def docParser = new XmlParser().parse(urlFile)

Otherwise, maybe you can put some timing println statements in your code; it
could be one of the GPath expressions in your loops that takes so long.
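
A rough sketch of what such timing statements could look like. The helper name `timed` and the sample XML are made up for illustration; the idea is only to wrap the parse and the GPath lookup separately so the slow step shows up in the output.

```groovy
// XmlSlurper is auto-imported from groovy.util in older Groovy versions;
// on recent versions, where it moved, add: import groovy.xml.XmlSlurper

// Hypothetical helper: runs a closure, prints how long it took, returns its result.
def timed(String label, Closure work) {
    long start = System.currentTimeMillis()
    def result = work()
    println "${label}: ${System.currentTimeMillis() - start} ms"
    return result
}

// Hypothetical stand-in for the real file content.
def xml = '<html><body><table class="report"/></body></html>'

// Time the parse and the lookup separately.
def doc = timed('parse') { new XmlSlurper().parseText(xml) }
def report = timed('find report table') {
    doc.body.table.find { it.@class.text().contains('report') }
}
```

With the real file in place of the sample string, the two printed durations show whether the parse or the search is the expensive part.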

mjfan80

2011-04-11 14:27:34 UTC

Yes, it is that line.
I tried many println statements to monitor the time, and this is the line that
takes between 70 and 80 seconds,

with either parse(file) or parseText(a string with the file content).

Luis Muniz

2011-04-11 14:31:25 UTC

Then I'm afraid the only idea I have is to profile the process, if
there really is nothing special in the HTML file.

Another (laborious) approach would be to progressively remove complexity
from the HTML file and retry the parsing each time, to find out what causes
the delay.

mjfan80

2011-04-11 15:47:24 UTC

With this HTML file (3 KB) it takes 80 seconds:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Elenca Chiamate

Report elencazione chiamate

/HelpDeskGwt/images/logo_gestore_120X100.gif

Provincia di Varese - Lotto N° 2 (MANUTENCOOP)

Via dei Tigli 10

GALLARATE (VA)

tel 0331 793610 - fax 0331 791652

P.IVA - C.F.:

Report elencazione chiamate

/HelpDeskGwt/images/logo_gestore_50X40.gif

Provincia di Varese - Lotto N° 2 (MANUTENCOOP)

Via dei Tigli 10

GALLARATE (VA)

tel 0331 793610 - fax 0331 791652

P.IVA - C.F.:

Id Chiamata
Tipo
Grado Diss.
Stato
Compl
Impianto
Descr. Imp.
Ubicazione
Nome
Cognome
Apertura
Chiusura
Guasto
Note
Assegnazione
Note Ass.

2011/00469/C
IDRO-SANITARIA
ALTO
R
N
ED_027.A
I.I.S. "Gadda-Rosselli"
VIA DE ALBERTIS 3
Gaetana
Pellegrino
07/02/2011 - 11:08

PIAN TERRENO LATO SUD - NEI BAGNI FEMMINILI MANCANO 2 MANIGLIE PASSI
RAPIDI
ORARIO APERTUTA: 07.45-13.45

I then tried many times with various combinations, and I found out that the
problem is the DOCTYPE declaration. Deleting it (and then running
grails clean and clearing the browser cache) resolved the problem.

Thanks
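
For anyone who needs to keep the DOCTYPE in the file, a possible alternative sketch: give XmlSlurper an EntityResolver that answers every external lookup with an empty stream, so nothing is fetched over the network. The sample document is hypothetical. One caveat: with an empty DTD, named entities such as &nbsp; are no longer defined, so this only suits files that do not use them.

```groovy
import org.xml.sax.EntityResolver
import org.xml.sax.InputSource

// XmlSlurper is auto-imported from groovy.util in older Groovy versions;
// on recent versions, where it moved, also add: import groovy.xml.XmlSlurper

// Hypothetical sample document; only the DOCTYPE is taken from the thread.
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html><body><table class="report"><tr><td>ok</td></tr></table></body></html>'''

def slurper = new XmlSlurper()

// Answer every external DTD/entity request with empty content
// instead of letting the parser download it.
slurper.entityResolver = { String publicId, String systemId ->
    new InputSource(new StringReader(''))
} as EntityResolver

def html = slurper.parseText(xhtml)
def cell = html.body.table.tr.td.text()
println cell
```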

John Thompson

2011-04-12 10:37:08 UTC

After looking at the API, I wonder if making it not namespace aware would
make a difference.

something like:
def rootNode = new XmlSlurper(false, false).parseText(foo_string)

http://groovy.codehaus.org/api/groovy/util/XmlSlurper.html

-----
JT
jts-blog.com