icy does git — grayfriday: 1bb1d0171c89c81a6af54672f70505857925b5f2

Fixed HTML entity regex (#453)

The old regex missed a lot of HTML entities, like long references
(from 6-character entities like &approx; to the somewhat rarer
&CounterClockwiseContourIntegral;) as well as numeric references
(decimal e.g. &#1234; or hex e.g. &#x13AF6;). This fixes that.

Rebecca Turner 637275@gmail.com

Thu, 24 May 2018 14:32:58 -0400

commit 1bb1d0171c89c81a6af54672f70505857925b5f2

parent 8c0d4cca9461dcee0d3220518532d81e5ae5ff0a

1 file changed, 16 insertions(+), 2 deletions(-)

M inline.go → inline.go

@@ -23,8 +23,22 @@ var (
 	urlRe    = `((https?|ftp):\/\/|\/)[-A-Za-z0-9+&@#\/%?=~_|!:,.;\(\)]+`
 	anchorRe = regexp.MustCompile(`^(<a\shref="` + urlRe + `"(\stitle="[^"<>]+")?\s?>` + urlRe + `<\/a>)`)
 
-	// TODO: improve this regexp to catch all possible entities:
-	htmlEntityRe = regexp.MustCompile(`&[a-z]{2,5};`)
+	// https://www.w3.org/TR/html5/syntax.html#character-references
+	// highest unicode code point in 17 planes (17 * 2^16 code points):
+	// U+10FFFF = 1,114,111d = 7 dec digits or 6 hex digits
+	// named entity references can be 2-31 characters with stuff like &lt;
+	// at one end and &CounterClockwiseContourIntegral; at the other. There
+	// are also sometimes numbers at the end, although this isn't inherent
+	// in the specification; there are never numbers anywhere else in
+	// current character references, though; see &frac34; and &blk12;, etc.
+	// https://www.w3.org/TR/html5/syntax.html#named-character-references
+	//
+	// entity := "&" (named group | number ref) ";"
+	// named group := [a-zA-Z]{2,31}[0-9]{0,2}
+	// number ref := "#" (dec ref | hex ref)
+	// dec ref := [0-9]{1,7}
+	// hex ref := ("x" | "X") [0-9a-fA-F]{1,6}
+	htmlEntityRe = regexp.MustCompile(`&([a-zA-Z]{2,31}[0-9]{0,2}|#([0-9]{1,7}|[xX][0-9a-fA-F]{1,6}));`)
 )
 
 // Functions to parse text within a block
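
A minimal sketch of how the old and new patterns compare on the entities named in the commit message. The two regular expressions are copied from the diff above; `oldEntityRe`, `isEntity`, and the sample list are hypothetical names introduced only for this illustration and are not part of inline.go.

```go
package main

import (
	"fmt"
	"regexp"
)

// Pattern added by this commit (copied from the diff).
var htmlEntityRe = regexp.MustCompile(`&([a-zA-Z]{2,31}[0-9]{0,2}|#([0-9]{1,7}|[xX][0-9a-fA-F]{1,6}));`)

// Pattern removed by this commit, kept here only for comparison.
// The name oldEntityRe is an assumption for this sketch.
var oldEntityRe = regexp.MustCompile(`&[a-z]{2,5};`)

// isEntity is a hypothetical helper: it reports whether re matches
// the whole of s, from the first byte to the last.
func isEntity(re *regexp.Regexp, s string) bool {
	loc := re.FindStringIndex(s)
	return loc != nil && loc[0] == 0 && loc[1] == len(s)
}

func main() {
	samples := []string{
		"&lt;",                              // short named reference
		"&approx;",                          // 6-letter name, too long for the old pattern
		"&CounterClockwiseContourIntegral;", // 31-letter name
		"&frac34;",                          // name with trailing digits
		"&#1234;",                           // decimal reference
		"&#x13AF6;",                         // hex reference
		"&;",                                // not an entity
		"&#xZZ;",                            // not a valid hex reference
	}
	for _, s := range samples {
		fmt.Printf("%-36s old=%-5t new=%t\n",
			s, isEntity(oldEntityRe, s), isEntity(htmlEntityRe, s))
	}
}
```

The new pattern is purely syntactic: it checks the shape of a reference, not whether the name is defined or the number is a valid code point, so a reference such as `&#9999999;` still matches even though it lies above U+10FFFF.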

blackfriday fork with a few changes