I contribute when I can to the Ruby Style Guide maintained by Bozhidar Batsov. One of its most hotly debated guidelines is the one that keeps lines of code to eighty characters or fewer. Bozhidar even discussed maximum line length in his first blog post explaining the reasoning behind some of the guidelines. He and the commenters on that post bring up some points in support of the eighty-character limit, but I wanted to look at “representative” code and see what line length people naturally keep themselves within. I believe most good coders naturally keep to shorter lines except when they need a longer line to properly express an idea.
The first problem was how to find representative code. I naturally turned to GitHub to use their new API for search to get the list of the ten most popular Ruby projects. Most popular is defined as having the most “stars”, a mechanism on GitHub whereby people mark a project as a favorite of theirs. The more favorites, the more popular.
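For illustration, a query like this can be sketched against GitHub's repository search endpoint. This is not the code from my project: the method names and the `User-Agent` string are made up for the example, and it assumes the public `search/repositories` endpoint with the `language:ruby` qualifier sorted by stars.

```ruby
require "json"
require "net/http"
require "uri"

# Build the search URL for the most-starred Ruby repositories.
# (Hypothetical helper; the name is an assumption for this sketch.)
def top_ruby_repos_uri(count = 10)
  query = URI.encode_www_form(
    q: "language:ruby",
    sort: "stars",
    order: "desc",
    per_page: count
  )
  URI("https://api.github.com/search/repositories?#{query}")
end

# Fetch the repositories and return their "owner/name" strings.
# This makes a network request; unauthenticated calls are rate limited.
def top_ruby_repos(count = 10)
  uri = top_ruby_repos_uri(count)
  request = Net::HTTP::Get.new(uri)
  request["Accept"] = "application/vnd.github+json"
  request["User-Agent"] = "line-length-sketch" # placeholder value
  response = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
    http.request(request)
  end
  JSON.parse(response.body).fetch("items").map { |repo| repo["full_name"] }
end
```

Calling `top_ruby_repos` returns whatever is most starred at the moment of the call, so the list will drift over time and will not necessarily match the ten projects I measured.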
Now that I had a corpus to look at, I wanted a reasonably easy way to ensure that I was only measuring actual code. So I excluded all lines that:
- Consist solely of whitespace
- Consist solely of comments
- Contain only the keyword
- Exceed 200 characters in length [1]
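The exclusion rules above can be sketched as a single predicate. This is a simplified illustration rather than the code from my project: the method name and constant are invented for the example, the keyword to skip is passed in by the caller instead of being hard-coded, and only `#` line comments are recognized (`=begin`/`=end` blocks are not handled).

```ruby
# Lines longer than this are assumed to be data rather than code.
MAX_LENGTH = 200

# True when a line should be counted in the statistics.
# `keyword` is the bare keyword to skip; the caller supplies it.
def countable_line?(line, keyword:)
  text = line.chomp
  stripped = text.strip
  return false if stripped.empty?            # whitespace only
  return false if stripped.start_with?("#")  # comment only
  return false if stripped == keyword        # keyword only
  return false if text.length > MAX_LENGTH   # probably a string or encoded data
  true
end
```

A caller would collect the lengths with something like `lines.select { |l| countable_line?(l, keyword: "end") }.map { |l| l.chomp.length }`, where `"end"` is only an example value for the keyword.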
After some messing around I built a program to query GitHub, download the latest version of the code and parse all the files. You can find my project on GitHub: Line Length Miner. When I ran the script, these are the results I found:
- Count: 257,344 lines examined
- Mean: 44.73162 characters
- Standard Deviation: 42.37547
- 95th percentile: ~90 characters
- 99th percentile: ~123 characters
- 99.9th percentile: ~176 characters
- Percentile of 80 characters: 91.82262
- Percentile of 100 characters: 96.88238
- Percentile of 120 characters: 98.81598
- Percentile of 132 characters: 99.31609
And here is the histogram of the data:

[Figure: histogram of line lengths]
It would appear that line lengths do not strictly follow a normal distribution, though a normal distribution could still be used to approximate them. Because I wanted the numbers to be as accurate as possible, I calculated every percentile directly from the dataset rather than approximating from the mean and standard deviation.
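To make "directly from the dataset" concrete, here is a minimal sketch of how such figures can be computed from an array of line lengths. The method names are invented for this example, and two choices are assumptions on my part for the sketch: the standard deviation is the population form, and "percentile of N characters" is read as the share of lines that are N characters or shorter.

```ruby
# Arithmetic mean of the line lengths.
def mean(lengths)
  lengths.sum.to_f / lengths.size
end

# Population standard deviation of the line lengths.
def standard_deviation(lengths)
  avg = mean(lengths)
  Math.sqrt(lengths.sum { |len| (len - avg)**2 } / lengths.size)
end

# Percentage of lines whose length is at most `limit` characters.
def percentile_of(lengths, limit)
  100.0 * lengths.count { |len| len <= limit } / lengths.size
end

# Smallest length that at least `pct` percent of lines fit within
# (nearest-rank method, no interpolation).
def length_at_percentile(lengths, pct)
  sorted = lengths.sort
  rank = (pct * sorted.size / 100.0).ceil
  sorted[[rank, 1].max - 1]
end
```

With `lengths = [10, 20, 30, 40, 100]`, `percentile_of(lengths, 80)` is `80.0` and `length_at_percentile(lengths, 80)` is `40`. Neither helper assumes anything about the shape of the distribution.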
The most interesting thing to note is that representative Ruby code has a hard time staying under eighty characters. If we were to force all ten projects to stay within eighty characters, no exceptions, over eight percent of their code would need to be touched … over twenty thousand lines!
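The arithmetic behind that figure, using the count and the percentile of eighty characters reported above (the method name is just for this example):

```ruby
# Number of lines longer than a limit, given the total line count and
# the percentile (share of lines at or under the limit) for that limit.
def lines_over(total_lines, percentile)
  (total_lines * (100.0 - percentile) / 100.0).round
end

total_lines = 257_344
percentile_of_80 = 91.82262

puts format("%.2f%% of lines exceed 80 characters", 100.0 - percentile_of_80)
puts "#{lines_over(total_lines, percentile_of_80)} lines would need to change"
# prints 8.18% and 21044
```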
So if the line length limit shouldn’t be eighty characters, what should it be? The vast majority of code can stay within 132 characters, so I don’t think the limit should be any higher than that. Only the tiniest fraction less code can stay within 120 characters, so the extra twelve characters buy almost nothing. That makes it a choice between 100 and 120 characters for Ruby code: 100 so that those with small monitors can work more easily, or 120 to err on the side of permissiveness. I’ve reconfigured my editor for 100 characters.
[1] The assumption here is that lines longer than 200 characters aren’t actually lines of code, but strings or other encoded data.