Allow GCI.escapeHTML to take a custom escape table#55
Conversation
| string = string.b | ||
| string.gsub!(/['&\"<>]/, TABLE_FOR_ESCAPE_HTML__) | ||
| if escape_table | ||
| string.gsub!(p Regexp.union(escape_table.keys), escape_table) |
There was a problem hiding this comment.
| string.gsub!(p Regexp.union(escape_table.keys), escape_table) | |
| string.gsub!(Regexp.union(escape_table.keys), escape_table) |
No check for the keys and the values too?
| pattern = "[".encode(enc) | ||
| escape_table.each_key do |key| | ||
| pattern << Regexp.escape(key).encode(enc) | ||
| end | ||
| pattern << "]".encode(enc) | ||
| string = string.gsub(/#{pattern}/, table) |
There was a problem hiding this comment.
The keys may be longer than one here?
No check for the value encodings?
If it is OK, why not Regexp.union too?
| pattern = "[".encode(enc) | |
| escape_table.each_key do |key| | |
| pattern << Regexp.escape(key).encode(enc) | |
| end | |
| pattern << "]".encode(enc) | |
| string = string.gsub(/#{pattern}/, table) | |
| string = string.gsub(Regexp.union(escape_table.keys), table) |
There was a problem hiding this comment.
The keys may be longer than one here?
Indeed. But is it really a problem?
There was a problem hiding this comment.
No check for the value encodings?
They are encoded too:
Hash[escape_table.map {|pair| pair.map {|s|s.encode(enc)}}]| dynamic_escape_html(VALUE str, VALUE rb_escape_table) | ||
| { | ||
| VALUE escape_table[UCHAR_MAX+1] = {0}; | ||
| long max_escape_length = build_escape_table(escape_table, rb_escape_table); |
There was a problem hiding this comment.
The key lengths can be changed the methods of the latter keys.
May be better to separate filling the table and calculating the max length.
There was a problem hiding this comment.
The key lengths can be changed the methods of the latter keys.
I don't think I understand what you are referring to.
|
Sorry this is an old patch, not very fresh in my head. I'll try to address your feedback. |
653b625 to
8f73bcf
Compare
8f73bcf to
57a61dc
Compare
In some cases you may want to escape a string in a different way than
the default behavior.
For instance, if you are trying to make some JSON save to include
in a `<script>` tag, you may want to escape less, and using JavaScript
codepoints:
```ruby
>> CGI.escapeHTML('Hello </script>', ">" => '\u003e', "<" => '\u003c', "&" => '\u0026')
=> "Hello \\u003c/script\\u003e"
```
Of course you can always use `gsub` for that, but `CGI.escapeHTML` being
specialized is able to be very significantly faster:
```
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
gsub 82.135k i/100ms
escapeHTML 221.405k i/100ms
Calculating -------------------------------------
gsub 821.890k (± 2.2%) i/s (1.22 μs/i) - 4.189M in 5.099152s
escapeHTML 2.330M (± 0.5%) i/s (429.22 ns/i) - 11.734M in 5.036770s
Comparison:
escapeHTML: 2329816.5 i/s
gsub: 821889.7 i/s - 2.83x slower
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
gsub 36.235k i/100ms
escapeHTML 171.347k i/100ms
Calculating -------------------------------------
gsub 359.528k (± 1.5%) i/s (2.78 μs/i) - 1.812M in 5.040422s
escapeHTML 1.812M (± 0.7%) i/s (551.84 ns/i) - 9.081M in 5.011762s
Comparison:
escapeHTML: 1812105.3 i/s
gsub: 359527.5 i/s - 5.04x slower
```
```ruby
require "benchmark/ips"
require "cgi"
ESCAPE = {
">" => '\u003e', "<" => '\u003c', "&" => '\u0026',
}
ESCAPE_B = {
">".b => '\u003e'.b,
"<".b => '\u003c'.b,
"&".b => '\u0026'.b,
}
ESCAPE_REGEX = Regexp.union(ESCAPE_B.keys)
str = ("a" * 1024).freeze
Benchmark.ips do |x|
x.report("gsub") do
b = str.b
b.gsub!(ESCAPE_REGEX, ESCAPE_B)
b.force_encoding(str.encoding)
end
x.report("escapeHTML") do
CGI.escapeHTML(str, ESCAPE)
end
x.compare!
end
str = (("a" * 1023) + "<").freeze
Benchmark.ips do |x|
x.report("gsub") do
b = str.b
b.gsub!(ESCAPE_REGEX, ESCAPE_B)
b.force_encoding(str.encoding)
end
x.report("escapeHTML") do
CGI.escapeHTML(str, ESCAPE)
end
x.compare!
end
```
57a61dc to
51256f6
Compare
In some cases you may want to escape a string in a different way than the default behavior.
For instance, if you are trying to make some JSON safe to include in a
<script>tag, you may want to escape less, and using JavaScript codepoints:Of course you can always use
gsubfor that, butCGI.escapeHTMLbeing specialized is able to be very significantly faster:NB: I haven't implemented the Java version, but can do it if there is interest in this feature.