Comparison of HTML parsers
HTML parsers are software for automated Hypertext Markup Language (HTML) parsing. They have two main purposes:
- HTML traversal: offer an interface that lets programmers easily access and modify the HTML source. Canonical example: DOM parsers.
- HTML cleaning: fix invalid HTML and improve the layout and indentation of the resulting markup. Canonical example: HTML Tidy.
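As a rough illustration of the traversal use case, the sketch below uses Python's standard-library html.parser (one of the parsers compared in the table) to collect link targets. Note that html.parser is event-based rather than a DOM parser, so it reports tags through callbacks instead of building a tree. The names LinkCollector and collect_links are hypothetical helpers introduced only for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, where value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def collect_links(html):
    """Hypothetical helper: return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links
```

A tree-building parser such as Beautiful Soup or jsoup would expose the same information through node queries instead of callbacks.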
Parser | License | Implementation language(s) | Latest date* | HTML parsing[1] | HTML5-compliant parsing | Clean HTML** | Update HTML*** |
---|---|---|---|---|---|---|---|
html.parser | Python S. F. L. | Python | 2016-06-27[2] | Yes | ? | No | No |
Html Agility Pack | Microsoft Public License | C# | 2016-07-14[3] | Yes | ? | No | ? |
Beautiful Soup | Python S. F. L. | Python | 2016-08-02[4] | Yes | Partial[5] | Yes | Yes |
Gumbo | Apache License 2.0 | C | 2015-05-01 | Yes | Yes | ? | ? |
html5ever | Apache License 2.0 | Rust | 2016-02-23 | Yes | Yes | ? | ? |
html5lib | MIT License | Python (and PHP, six years ago) | 2016-07-15[6] | Yes | Yes | Yes | No |
HTML::Parser | Perl license | Perl | 2013-03-28 | Yes | No[7] | ? | ? |
htmlPurifier | GNU Lesser GPL | PHP | 2009-03-25[8] | No | No | Yes | Yes |
HTML Tidy | W3C license | ANSI C | 2015-05-24[9] | No[10] | No | Yes[10] | Yes[10] |
HtmlUnit | Apache License 2.0 | Java | 2016-05-27[11] | Yes | ? | No | No |
HtmlCleaner | BSD License[12] | Java | 2015-08-24 | No | No | Yes | ? |
Hubbub | MIT License | C | 2016-02-16 | Yes | Yes[13] | ? | ? |
Jaunt API | Jaunt Beta License | Java | 2013-08-01 | Yes | ? | Yes | No |
Jericho HTML Parser | Eclipse Public License | Java | 2012-10-30[14] | No?? | ? | ? | ? |
jsdom | MIT license | JavaScript | 2013-07-21 | No | ? | ? | ? |
jsoup | MIT license | Java | 2016-10-23[15] | Yes | Yes[16] | Yes | Yes |
JTidy | JTidy License | Java | 2012-10-09[17] | No | ? | Yes | ? |
libxml2 HTMLparser | MIT License | C | 2012-09-11[18] | Yes | No | ? | ? |
NekoHTML | Apache License 2.0 | Java | 2014-06-02[19] | No | ? | ? | ? |
TagSoup | Apache License 2.0 | Java | 2011-07-07 | No | ? | ? | ? |
Validator.nu HTML Parser | MIT License | Java | 2012-06-05 | Yes | Yes | ? | ? |
PHP Simple HTML DOM Parser | MIT License | PHP | 2014-08-28 | Yes | ? | No | No |
The PHP DOMDocument-class | PHP License | PHP | 2014-10-04 | Yes | ? | No | No |
Nokogiri | MIT License | Ruby | 2016-10-03[20] | Yes | ? | No | No |
AVHTML | AGPL | C++ | 2015-08-27[21] | Yes | ? | No | Yes |
BrilliantHTML5Parser | Apache License 2.0 | Swift 3 | 2016-11-10 | Yes | ? | No | No |
MyHTML | LGPL | C | 2016-11-15 | Yes | Yes | No | No |
- * Date of the latest release with significant changes.
- ** Sanitize (generate standards-compliant pages, reduce spam, etc.) and clean (strip surplus presentational tags, remove XSS code, etc.) HTML code.
- *** Update HTML 4.x to XHTML or HTML5, converting deprecated tags (e.g. CENTER) to valid ones (e.g. DIV with style="text-align:center;").
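The "Update HTML" conversion described in the last note can be sketched with Python's standard-library html.parser. This is only a minimal illustration of the CENTER-to-DIV rewrite, not how any of the listed tools implement it. The class CenterRewriter and the function update_center_tags are hypothetical names, and all other markup is passed through essentially unchanged.

```python
from html.parser import HTMLParser


class CenterRewriter(HTMLParser):
    """Re-emits the input, replacing <center> with a styled <div>."""

    def __init__(self):
        # Keep character references as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "center":
            self.out.append('<div style="text-align:center;">')
        else:
            # Reuse the original source text of the start tag.
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append("</div>" if tag == "center" else "</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)


def update_center_tags(html):
    """Hypothetical helper: return html with CENTER elements rewritten."""
    rewriter = CenterRewriter()
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

Because the sketch only swaps tag names, it does not repair invalid nesting; a cleaning tool such as HTML Tidy would be the better fit for that job.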
References
- [1] 12.2 Parsing HTML documents — HTML Standard
- [2] Python 3.5.2
- [3] Nuget Html AgilityPack
- [4] Beautiful Soup 4.5.1
- [5] via html5lib
- [6] Releases · html5lib/html5lib-python
- [7] Bug #53300 for HTML-Parser: HTML 5
- [8] HTML Tidy for Windows
- [9] HTML Tidy release 4.9.30
- [10] What is Tidy?
- [11] HtmlUnit Release 2.22 Changes
- [12] HtmlCleaner is distributed under BSD License
- [13] according to project's home page
- [14] Jericho HTML Parser - Browse /jericho-html/3.3 at SourceForge.net
- [15] jsoup release 1.10.1
- [16] https://jsoup.org/ Per project homepage
- [17] JTidy - Browse /JTidy at SourceForge.net
- [18] libxml2 Releases
- [19] NekoHTML | Change History
- [20] Nokogiri release 1.6.8.1
- [21] Latest commit 8c0d99f on 27 Aug 2015
This article is issued from Wikipedia, version of 12/3/2016. The text is available under the Creative Commons Attribution/Share Alike license, but additional terms may apply to the media files.