<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Recent posts to Discussion</title><link>https://sourceforge.net/p/jerichohtml/discussion/</link><description>Recent posts to Discussion</description><atom:link href="https://sourceforge.net/p/jerichohtml/discussion/feed.rss" rel="self"/><language>en</language><lastBuildDate>Sun, 29 Sep 2024 04:47:33 -0000</lastBuildDate><atom:link href="https://sourceforge.net/p/jerichohtml/discussion/feed.rss" rel="self" type="application/rss+xml"/><item><title>OutOfMemory error when parsing</title><link>https://sourceforge.net/p/jerichohtml/discussion/350025/thread/01f1eed4c4/?limit=25#5d58</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi Andy. Thanks for reporting the issue.&lt;/p&gt;
&lt;p&gt;I see what you mean about SourceForge. I just noticed that they removed all of the documentation from my project's website a couple of months ago without notification. I have just fixed that. But no, I don't have any intention of moving the project to GitHub at this point in time.&lt;/p&gt;
&lt;p&gt;Firstly, you might like to try using the latest DEV version 3.5. There have been a few improvements and bug fixes to the Renderer class. You can download it here:&lt;br/&gt;
&lt;a href="http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip" rel="nofollow"&gt;http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip&lt;/a&gt;&lt;br/&gt;
The 3.5-dev version is always a release candidate and can be used as a reliable substitute for the last official 3.4 release.&lt;/p&gt;
&lt;p&gt;According to the release notes there are no bug fixes that look related to the issue you're experiencing, but it's worth a try.&lt;/p&gt;
&lt;p&gt;Another thing: you don't need the following line, as Source is already a subclass of Segment. Just use &lt;code&gt;new Renderer(htmlSource)&lt;/code&gt; instead.&lt;br/&gt;
&lt;code&gt;Segment htmlSeg = new Segment(htmlSource, 0, htmlSource.length());&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;If using 3.5-dev doesn't solve your problem, send me a link to the problematic source document and I'll take a look.&lt;/p&gt;
&lt;p&gt;Let me know how it goes.&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Martin Jericho</dc:creator><pubDate>Sun, 29 Sep 2024 04:47:33 -0000</pubDate><guid>https://sourceforge.net6c9acd2b435fbada7a88d2dd9af78c6188dbbac2</guid></item><item><title>Module System support</title><link>https://sourceforge.net/p/jerichohtml/discussion/350025/thread/a0c5ee3b9c/?limit=25#1ced</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi Ethan,&lt;/p&gt;
&lt;p&gt;Thank you for the suggestion. Yes, I already received a request for this last year:&lt;br/&gt;
&lt;a href="https://sourceforge.net/p/jerichohtml/bugs/93/"&gt;https://sourceforge.net/p/jerichohtml/bugs/93/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The biggest barrier at the moment is the fact that I implemented a new major feature a few years ago (a web crawler API) but it remains poorly documented, and could probably use a couple of minor enhancements before it is officially released. That means all bug fixes since then have just gone into the DEV release:&lt;br/&gt;
&lt;a href="http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip" rel="nofollow"&gt;http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The DEV release is very stable, with no known bugs, and recommended for production use. I just haven't released it officially yet because of the missing documentation.&lt;/p&gt;
&lt;p&gt;That DEV download probably includes updates that I never committed to the Bazaar repository, so you should consider that one the latest version. In fact, I don't even remember why I bothered with a source code repository in the first place, as external contributions have been negligible.&lt;/p&gt;
&lt;p&gt;Unfortunately, due to other priorities, I simply don't have time to look at doing a new official release. It might be time to hand the whole thing over to someone else, but I'm not sure anyone would be interested. And doing a handover would no doubt involve more work than an official release. So the project is sort of in limbo.&lt;/p&gt;
&lt;p&gt;This library doesn't actually have any external dependencies, so one workaround would be to just add the source code to your own project. If you notice a bug, you can always just file sync your copy with the one in the DEV zip file. Bug fixes these days are extremely rare and minor. Would that approach work for you?&lt;/p&gt;
&lt;p&gt;Cheers&lt;br/&gt;
Martin&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Martin Jericho</dc:creator><pubDate>Mon, 21 Aug 2023 08:38:51 -0000</pubDate><guid>https://sourceforge.net33b86dbf32defb8daaf84367db6c9a0653c20346</guid></item><item><title>Jericho 3.4/3.5-dev parsing bug(?)</title><link>https://sourceforge.net/p/jerichohtml/discussion/350025/thread/32c300cbf3/?limit=25#e6e4</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;P.S. When you want to include HTML  in your post, you need to enclose it in a code block, otherwise the HTML is parsed and doesn't show properly.&lt;/p&gt;
&lt;p&gt;For example, your sample document should look like this:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;head&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;http-equiv=&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;quot;Content-Type&amp;amp;quot;&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;quot;html;&lt;/span&gt; &lt;span class="na"&gt;charset=&lt;/span&gt;&lt;span class="s"&gt;UTF-8&amp;amp;quot;&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Martin Jericho</dc:creator><pubDate>Wed, 07 Sep 2022 01:41:31 -0000</pubDate><guid>https://sourceforge.net28db1abae5516f253df9a30fd5910e091823541f</guid></item><item><title>Jericho 3.4/3.5-dev parsing bug(?)</title><link>https://sourceforge.net/p/jerichohtml/discussion/350025/thread/32c300cbf3/?limit=25#16a4</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi Davy,&lt;/p&gt;
&lt;p&gt;The sample HTML you are feeding it doesn't specify a valid encoding, so the parser is handling it correctly. Because the quotes are encoded, they are included in the value of the content attribute, which is why the end quote is interpreted as part of the encoding name.&lt;/p&gt;
&lt;p&gt;You say that the sample content occurs when it is "inserted into an iframe". I assume you mean it appears as the value of the iframe srcdoc attribute.&lt;/p&gt;
&lt;p&gt;In that case, your sample document should be the HTML containing the iframe, not the value of the srcdoc attribute in isolation.&lt;/p&gt;
&lt;p&gt;If you run the parent HTML document through the parser, it should correctly pick up the encoding specified in the parent HTML document, and not try to interpret the encoding in the srcdoc attribute.&lt;/p&gt;
&lt;p&gt;If you want to parse the child document HTML from the iframe in isolation, you need to decode the value of the srcdoc attribute first. Then the parser will correctly detect the encoding of the child document.&lt;/p&gt;
&lt;p&gt;I hope that makes sense!&lt;/p&gt;
&lt;p&gt;Cheers&lt;br/&gt;
Martin&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Martin Jericho</dc:creator><pubDate>Wed, 07 Sep 2022 01:38:16 -0000</pubDate><guid>https://sourceforge.netda5496f7684a7cd10642d27eaf997299bc9fc64a</guid></item><item><title>Jericho removing Button elements</title><link>https://sourceforge.net/p/jerichohtml/discussion/350025/thread/8887a6506f/?limit=25#e7e7</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi Martin,&lt;/p&gt;
&lt;p&gt;Thanks a lot for patching this. I now get expected behaviour!&lt;/p&gt;
&lt;p&gt;Kind regards,&lt;br/&gt;
Remi&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Remi Rosenthal</dc:creator><pubDate>Wed, 29 Jun 2022 10:35:09 -0000</pubDate><guid>https://sourceforge.net4fd2a2eb1f588b74310dbd789fce694559dd9d89</guid></item><item><title>Jericho removing Button elements</title><link>https://sourceforge.net/p/jerichohtml/discussion/350025/thread/8887a6506f/?limit=25#43c6</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi Remi,&lt;/p&gt;
&lt;p&gt;I didn't document anywhere why I made the decision to remove the content of button elements. In general I was copying the behaviour of how some email clients create pure text versions of HTML emails. Maybe I just thought they should be removed because all other form elements (INPUT, TEXTAREA etc) are removed. Or maybe I just didn't think much about it!&lt;/p&gt;
&lt;p&gt;I've modified the Renderer class in version 3.5 to include the content of BUTTON elements.&lt;/p&gt;
&lt;p&gt;Until version 3.5 is officially released, the development version is available here:&lt;br/&gt;
&lt;a href="http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip" rel="nofollow"&gt;http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The development version is always pretty much as stable as an official release. It has been a long time since an official release because the new WebBot functionality hasn't been documented yet, and I simply don't have time to work on it.&lt;/p&gt;
&lt;p&gt;Cheers&lt;br/&gt;
Martin&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Martin Jericho</dc:creator><pubDate>Wed, 29 Jun 2022 10:17:02 -0000</pubDate><guid>https://sourceforge.net6b63938e60d7fd249626db54f3c39ffec976dcf3</guid></item><item><title>Jericho removing Button elements</title><link>https://sourceforge.net/p/jerichohtml/discussion/350025/thread/8887a6506f/?limit=25#731d</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi,&lt;/p&gt;
&lt;p&gt;I've noticed that the Jericho Renderer doesn't include Button elements in its &lt;code&gt;toString()&lt;/code&gt;. This is presumably because &lt;code&gt;button&lt;/code&gt; is mapped to a RemoveElementHandler in Renderer.&lt;br/&gt;
I would be interested to hear the rationale behind this, but more importantly, is there a way to override this behaviour on my end?&lt;/p&gt;
&lt;p&gt;You can reproduce with something as simple as:&lt;br/&gt;
&amp;lt;button&amp;gt;My Button&amp;lt;/button&amp;gt;&lt;br/&gt;
Which will result in an empty string.&lt;/p&gt;
&lt;p&gt;Many thanks&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Remi Rosenthal</dc:creator><pubDate>Wed, 29 Jun 2022 09:25:47 -0000</pubDate><guid>https://sourceforge.net84490ccd1b6e0bec3d3c9491d27fdb704d33ffb8</guid></item><item><title>3.5</title><link>https://sourceforge.net/p/jerichohtml/discussion/350024/thread/203f67def1/?limit=25#b5fd</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi Andrew,&lt;/p&gt;
&lt;p&gt;The release.txt file does mention "minor changes to Renderer behaviour" for version 3.5. The new behaviour is more consistent with browser behaviour so it is most likely an intended change.&lt;/p&gt;
&lt;p&gt;Cheers&lt;br/&gt;
Martin&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Martin Jericho</dc:creator><pubDate>Sun, 20 Feb 2022 16:44:58 -0000</pubDate><guid>https://sourceforge.neta9afe7c0fc9f35b0b3bdf1bad6fca50f37b897ba</guid></item><item><title>3.5</title><link>https://sourceforge.net/p/jerichohtml/discussion/350024/thread/203f67def1/?limit=25#7e0d</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi Martin,&lt;/p&gt;
&lt;p&gt;Thanks for responding so quickly. Since my last message, I've been trying out 3.5-dev as I was hoping to take advantage of the memory consumption improvements, but have come across a behaviour difference for the Renderer between 3.4 and 3.5.&lt;/p&gt;
&lt;p&gt;For example, &lt;code&gt;&amp;lt;p&amp;gt;Hello&amp;lt;/p&amp;gt;&amp;lt;p&amp;gt;&amp;lt;br&amp;gt;&amp;lt;/p&amp;gt;&amp;lt;p&amp;gt;There&amp;lt;/p&amp;gt;&lt;/code&gt; used to output  &lt;code&gt;Hello\r\n\r\nThere&lt;/code&gt;&lt;br/&gt;
But now in 3.5-dev it outputs  &lt;code&gt;Hello\r\n\r\n\r\nThere&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Is this an expected behaviour change? I have attached a screenshot of the Renderer configured&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Andrew Smith</dc:creator><pubDate>Fri, 04 Feb 2022 15:09:36 -0000</pubDate><guid>https://sourceforge.net7aeaacac851d5d4866b9c6b33b9decd1e5edf0cb</guid></item><item><title>3.5</title><link>https://sourceforge.net/p/jerichohtml/discussion/350024/thread/203f67def1/?limit=25#c437</link><description>&lt;div class="markdown_content"&gt;&lt;p&gt;Hi Andrew. Version 3.5 hasn't been officially released yet because the newest feature, a web crawler API, has not been fully documented yet.&lt;br/&gt;
The project is not dead, and minor improvements continue to make their way into the DEV version, but other time commitments have prevented the completion of the documentation and an official release for years.&lt;br/&gt;
The 3.5-dev version (http://jericho.htmlparser.net/temp/jericho-html-3.5-dev.zip) is always a release candidate and can be used as a reliable substitute for the last official 3.4 release.&lt;/p&gt;&lt;/div&gt;</description><dc:creator xmlns:dc="http://purl.org/dc/elements/1.1/">Martin Jericho</dc:creator><pubDate>Wed, 02 Feb 2022 15:48:13 -0000</pubDate><guid>https://sourceforge.netb4d01f80dbf7ac65ddd532b4f23e43c3382c4b55</guid></item></channel></rss>