
Java, SAX, XML: About special characters encoding

Contrary to what many people think, entities such as &eacute; are not part of the XML specification! And SAX follows this spec rigorously…

A short history: in HTML, non-ASCII characters, such as French ones (‘é’, ‘ë’, ‘è’, ‘à’, ‘ç’, etc.), were written as &eacute;, &euml;, etc. From there, most people assumed (and so did I) that these entities were as standard in XML as in HTML.

But when parsing an XML file containing &eacute; entities, the SAX parser raises the following exception:

[Fatal Error] toto.xml:3:59: The entity "eacute" was referenced, but not declared.
org.xml.sax.SAXParseException: The entity "eacute" was referenced, but not declared.
	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)

(By the way: the Eclipse XML plugin shows a warning for such an XML file.)
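
The failure described above can be reproduced without any file on disk. The following is a minimal sketch, assuming the default SAX parser bundled with the JDK; the class name `EntityDemo` and the helper `parses` are made up for illustration. It feeds the parser one document using the HTML-style entity and one using the decimal character reference for the same letter.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class EntityDemo {

    /** Returns true if the default SAX parser accepts the document, false on a parse error. */
    static boolean parses(String xml) throws Exception {
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)),
                    new DefaultHandler()); // DefaultHandler rethrows fatal errors
            return true;
        } catch (SAXParseException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // &eacute; is an HTML entity, undeclared in plain XML
        System.out.println(parses("<a>caf&eacute;</a>")); // false
        // &#233; is a numeric character reference, always allowed
        System.out.println(parses("<a>caf&#233;</a>"));   // true
    }
}
```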

Indeed, the only named entities predefined by the XML standard are the following:

  • &amp; (&)
  • &lt; (<)
  • &gt; (>)
  • &quot; (")
  • &apos; (')
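
To check that these five entities need no declaration, one can let the parser resolve them and print the resulting text. This is only a sketch under the same assumption as before (the JDK's default SAX parser); `PredefinedEntities` and `textOf` are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class PredefinedEntities {

    /** Parses the document and returns all character data reported by the parser. */
    static String textOf(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times for one text node, so accumulate
                text.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Prints the five resolved characters: & < > " '
        System.out.println(textOf("<a>&amp; &lt; &gt; &quot; &apos;</a>"));
    }
}
```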

To encode French, Swedish or Chinese characters, or even a simple equivalent of &nbsp; (non-breaking space), you have to use their decimal or hexadecimal character references, for example &#233; or &#xE9; for ‘é’. I use the following HashMap for the most common characters in French:

[java]
// Requires: import java.util.HashMap;
// Maps common French accented characters to their XML decimal character references.
public static HashMap<Character, String> XML_DECIMAL_ENCODING = new HashMap<Character, String>();
static
{
	// Upper-case letters
	XML_DECIMAL_ENCODING.put('À', "&#192;");
	XML_DECIMAL_ENCODING.put('Ä', "&#196;");
	XML_DECIMAL_ENCODING.put('Æ', "&#198;");
	XML_DECIMAL_ENCODING.put('Ç', "&#199;");
	XML_DECIMAL_ENCODING.put('È', "&#200;");
	XML_DECIMAL_ENCODING.put('É', "&#201;");
	XML_DECIMAL_ENCODING.put('Ë', "&#203;");
	XML_DECIMAL_ENCODING.put('Ï', "&#207;");
	XML_DECIMAL_ENCODING.put('Ñ', "&#209;"); // key was 'Æ' by mistake, which overwrote the Æ entry
	XML_DECIMAL_ENCODING.put('Ö', "&#214;");
	XML_DECIMAL_ENCODING.put('Ü', "&#220;");
	// Lower-case letters
	XML_DECIMAL_ENCODING.put('à', "&#224;");
	XML_DECIMAL_ENCODING.put('â', "&#226;");
	XML_DECIMAL_ENCODING.put('ä', "&#228;");
	XML_DECIMAL_ENCODING.put('æ', "&#230;");
	XML_DECIMAL_ENCODING.put('ç', "&#231;");
	XML_DECIMAL_ENCODING.put('è', "&#232;");
	XML_DECIMAL_ENCODING.put('é', "&#233;");
	XML_DECIMAL_ENCODING.put('ê', "&#234;");
	XML_DECIMAL_ENCODING.put('ë', "&#235;");
	XML_DECIMAL_ENCODING.put('î', "&#238;");
	XML_DECIMAL_ENCODING.put('ï', "&#239;");
	XML_DECIMAL_ENCODING.put('ñ', "&#241;");
	XML_DECIMAL_ENCODING.put('ô', "&#244;");
	XML_DECIMAL_ENCODING.put('ö', "&#246;");
	XML_DECIMAL_ENCODING.put('ù', "&#249;");
	XML_DECIMAL_ENCODING.put('û', "&#251;");
	XML_DECIMAL_ENCODING.put('ü', "&#252;");
	XML_DECIMAL_ENCODING.put('œ', "&#339;");
}
[/java]
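
Since a decimal character reference is simply the character's Unicode code point, the reference can also be computed instead of looked up. The sketch below is an alternative to the map above, not the approach used in this post; `XmlNumericEncoder` and `encode` are hypothetical names, and Java 8 or later is assumed for `String.codePoints()`. It leaves ASCII untouched, so `&`, `<` and `>` still have to be escaped separately.

```java
// Hypothetical helper class, not part of any library.
class XmlNumericEncoder {

    /** Replaces every non-ASCII code point with its decimal character reference. */
    static String encode(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // Iterate by code point so characters outside the BMP yield a single reference
        input.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);                  // plain ASCII: copy as-is
            } else {
                out.append("&#").append(cp).append(';');  // e.g. U+00E9 becomes &#233;
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the example independent of the source file encoding
        System.out.println(encode("\u00e9t\u00e9 \u0153uvre")); // &#233;t&#233; &#339;uvre
    }
}
```

Computing the reference covers Swedish or Chinese text without extending the table, at the cost of encoding every non-ASCII character rather than a chosen subset.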
