Ruby: Patch to fix broken YAML.dump for multi-line strings (String#to_yaml)

I love YAML. It's portable. The majority of languages I work with already have libraries built to read/write it. It's more lightweight, more expressive, and easier to read than XML. I really love YAML.

Unfortunately, it turns out that Ruby's YAML library is a little incomplete for some edge cases. I just discovered that YAML.dump fails to generate valid YAML for certain multi-line strings. The specifics are difficult to explain up front, so we'll summarize them at the end after trying out some examples. Fire up irb and give the following a try:

require 'yaml'
s = "\n Something embedded.\nAnd on a new line."
YAML.load(YAML.dump(s))

You'll receive the following error (ruby 1.8.6 (2007-09-24 patchlevel 111); yes, I know this version is unpatched):

ArgumentError: syntax error on line 3, col 0: `And on a new line.'
	from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/yaml.rb:133:in `load'
	from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/yaml.rb:133:in `load'
	from (irb):73
	from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rational.rb:320

Note that the same problem exists in the latest patched Ruby Enterprise Edition (ruby 1.8.6 (2008-08-08 patchlevel 286)):

ArgumentError: syntax error on line 3, col 0: `And on a new line.'
	from /opt/ruby-enterprise-1.8.6-20080810/lib/ruby/1.8/yaml.rb:133:in `load'
	from /opt/ruby-enterprise-1.8.6-20080810/lib/ruby/1.8/yaml.rb:133:in `load'
	from (irb):3

So, what the heck is going on here? I mean, this isn't a very complex string by any stretch of the imagination. Let's try a few other variations in irb:

>> require 'yaml'
=> true
>> s1 = " Do I work?\nNo indent"
=> " Do I work?\nNo indent"
>> YAML.load(YAML.dump(s1))
=> " Do I work?\nNo indent"
>>
?>
?> s2 = " \n Do I work?\nNo indent"
=> " \n Do I work?\nNo indent"
>> YAML.load(YAML.dump(s2))
=> " \n Do I work?\nNo indent"
>>
?>
?> s3 = "\n Do I work?\nNo indent"
=> "\n Do I work?\nNo indent"
>> YAML.load(YAML.dump(s3))
ArgumentError: syntax error on line 3, col 0: `No indent'
	from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/yaml.rb:133:in `load'
	from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/yaml.rb:133:in `load'
	from (irb):9
	from /System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/rational.rb:323
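
To repeat this kind of check across a batch of candidate strings, the round trip can be wrapped in a small helper. This is only a sketch: round_trips? is a hypothetical name, not part of the YAML library, and it assumes a failure shows up either as an exception or as a value that comes back changed. Results depend on the Ruby version and the YAML backend it ships with, so no output is shown here.

```ruby
require 'yaml'

# Hypothetical helper (not part of the YAML library): returns true when a
# string survives YAML.dump followed by YAML.load unchanged.
def round_trips?(str)
  YAML.load(YAML.dump(str)) == str
rescue StandardError
  # The failure shown above surfaces as an ArgumentError on ruby 1.8.6;
  # other YAML backends may raise a different error class.
  false
end

samples = {
  's1' => " Do I work?\nNo indent",
  's2' => " \n Do I work?\nNo indent",
  's3' => "\n Do I work?\nNo indent"
}

# Report each sample by name so the failing variant is easy to spot.
samples.sort.each do |name, str|
  puts "#{name}: #{round_trips?(str) ? 'ok' : 'BROKEN'}"
end
```

Rescuing StandardError is deliberately broad, because the error class raised for invalid YAML is not the same across Ruby versions.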

OK, so it appears that our string has to start with a newline for the error to be triggered. Let's look at the YAML created by YAML.dump for each of these 3 strings:

>> require 'yaml'
=> true
>> s1 = " Do I work?\nNo indent"
=> " Do I work?\nNo indent"
>> YAML.dump(s1)
=> "--- \" Do I work?\\n\\\nNo indent\"\n"
>>
?>
?> s2 = " \n Do I work?\nNo indent"
=> " \n Do I work?\nNo indent"
>> YAML.dump(s2)
=> "--- \" \\n Do I work?\\n\\\nNo indent\"\n"
>>
?>
?> s3 = "\n Do I work?\nNo indent"
=> "\n Do I work?\nNo indent"
>> YAML.dump(s3)
=> "--- |-\n\n Do I work?\nNo indent\n"

Notice that the YAML produced for s3 differs substantially from the YAML produced for s1 and s2. Both s1 and s2 are encoded as double-quoted strings, with the newlines escaped. s3, however, is encoded as a block using the |- indicator. Let's take a look at some YAML documentation to better understand what's going on.
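
To tell the two encodings apart without eyeballing the output, the first character after the "--- " header is enough for these simple cases. The sketch below is an illustration only: scalar_style is a hypothetical helper, and it assumes the dumped document is a single top-level string, as in the examples above.

```ruby
# Hypothetical helper: guesses how a dumped top-level string was encoded by
# looking at the character that follows the "--- " document header.
def scalar_style(yaml_text)
  case yaml_text
  when /\A--- \|/ then :literal_block   # | keeps line breaks as-is
  when /\A--- >/  then :folded_block    # > folds line breaks
  when /\A--- "/  then :double_quoted   # escapes such as \n are allowed
  when /\A--- '/  then :single_quoted
  else :plain
  end
end

# The dump output for s1 and s3, copied from the irb session above.
dumped_s1 = "--- \" Do I work?\\n\\\nNo indent\"\n"
dumped_s3 = "--- |-\n\n Do I work?\nNo indent\n"

puts scalar_style(dumped_s1)
puts scalar_style(dumped_s3)
```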

The Blocks section defines how multi-line strings (using the | indicator) operate. It's not crystal clear, but from reading the whole section I've gathered that YAML requires the indentation of the first line of content (the spaces padding that line) to be maintained on all successive lines. In our failing example (s3), the first line with content started with an indent. When the second line turned up without that indent, YAML treated it as the end of the block and tried to read it as a separate attribute, throwing an error because it wasn't encoded as one (in other words, it didn't look like: some_other_attribute: some_value).
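
For contrast, here is a hand-written block in which every content line keeps the same indent. It is a minimal sketch, and the document text is made up for illustration rather than produced by YAML.dump.

```ruby
require 'yaml'

# A literal block (|-) whose content lines all share the same two-space indent.
well_formed = "--- |-\n  Do I work?\n  Still indented\n"

# The shared indent is stripped, the line break between the two lines is
# kept, and the - after | drops the final newline.
puts YAML.load(well_formed).inspect
# => "Do I work?\nStill indented"
```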

So, as promised, let's define the characteristics of a string that appear to confuse the Ruby YAML library (note that this is probably incomplete and possibly even incorrect, as I've only spent a few hours with this):

  • The string begins with a newline character
  • The first non-empty line begins with an indent (spaces, not tabs...this is YAML)
  • At least one successive line begins without this same indent

Great... so where's the patch? This is how I got around this problem:

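require 'yaml'

class String
  # Keep a handle on the original implementation so we can delegate to it.
  alias :old_to_yaml :to_yaml

  def to_yaml(options={})
    new_str = String.new("#{self}")
    # \A anchors at the very start of the string. A plain ^ would also match
    # at a blank line in the middle of the string and pad it needlessly.
    new_str = " #{new_str}" if new_str =~ /\A[\n\r]/
    new_str.old_to_yaml(options)
  end
end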

I want to point out that I think this patch is potentially very dangerous. I really don't recommend using it unless you absolutely have to and you're OK with actually changing the value of your string. If you can isolate the fix further up the line (without overriding String#to_yaml), I'd definitely recommend that instead.
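
As one way of isolating the change further up the line, the padding could be applied to the data just before it is dumped, leaving String untouched. The following is a sketch only: pad_leading_newlines is a hypothetical helper, it handles nothing beyond strings, arrays and hashes (custom objects pass through unchanged), and it still alters the affected strings in exactly the same way as the patch.

```ruby
require 'yaml'

# Hypothetical helper: returns a copy of the data in which every string that
# starts with a newline or carriage return has a single space prepended.
def pad_leading_newlines(value)
  case value
  when String
    # \A anchors at the very start of the string.
    value =~ /\A[\n\r]/ ? " #{value}" : value
  when Array
    value.map { |item| pad_leading_newlines(item) }
  when Hash
    padded = {}
    value.each { |key, item| padded[key] = pad_leading_newlines(item) }
    padded
  else
    value
  end
end

# Made-up sample data standing in for a nested structure with HTML content.
page = { 'title' => 'Example', 'body' => ["\n <p>Hello</p>\n<p>World</p>"] }
puts YAML.dump(pad_leading_newlines(page))
```

Because the helper builds new arrays and hashes rather than modifying them in place, the original data keeps its unpadded strings.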

For my situation, there are a few reasons why overriding String#to_yaml is a suitable solution. First, my problem strings (which come from external sources) are deep within an object hierarchy that I'm converting to YAML (not just a simple string like this example). Furthermore, I'm dealing with HTML content, so adding an extra space at the front has no effect whatsoever. Lastly, this is being used by an external and isolated script, so the overriding of String#to_yaml is done on a very localized level. I would never override String#to_yaml in this fashion in a Rails app.

Just for fun (and to prove it works), let's run the patch through irb:

>> s3 = "\n Do I work?\nNo indent"
=> "\n Do I work?\nNo indent"
>>
>>
>> YAML.dump(s3)
=> "--- |-\n\n Do I work?\nNo indent\n"
>>
>>
>> class String
>> alias :old_to_yaml :to_yaml
>> def to_yaml(options={})
>> new_str = String.new("#{self}")
>> new_str = " #{new_str}" if new_str =~ /^[\n\r]/
>> new_str.old_to_yaml(options)
>> end
>> end
=> nil
>>
>>
>> YAML.dump(s3)
=> "--- \" \\n Do I work?\\n\\\nNo indent\"\n"
>>
>>
>> YAML.load(YAML.dump(s3))
=> " \n Do I work?\nNo indent"

Note that by inserting a space prior to the initial newline, the YAML created is no longer using the |- format, but rather is enclosed by quotes.

I've got to admit that I still don't fully understand the situation here. If any readers have further insights, I would love to hear from you, because my workaround still leaves me a little uncomfortable. I was also fairly disconcerted that I couldn't find anyone else running into this problem when googling around. The closest thing I found was a bug in JRuby that at first looked promising, but turned out to be something else altogether.