String sorting in R appears to use different ordering from everyone else

April 12th, 2018 | Categories: C/C++, matlab, programming, python, R

Update
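A discussion on Twitter established that this is a locale issue: R's sort() collates strings according to the current locale, whereas the other languages shown below compare raw character codes by default. The practical upshot is that we can make R behave the same way as the others by running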

Sys.setlocale("LC_COLLATE", "C")

which may or may not be what you should do!
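The same locale effect can be reproduced outside R. The following is a minimal Python sketch, not a description of R's internals: it compares Python's default code-point ordering with collation via locale.strxfrm. The locale name "en_US.UTF-8" is an assumption and may not be installed on a given system, so that part is wrapped in a try block and its output is not shown here because it depends on the platform's collation tables.

```python
import locale

strings = ["#b", "-b", "-a", "#a", "a", "b"]

# Default Python ordering compares Unicode code points:
# '#' (35) < '-' (45) < 'a' (97).
codepoint_order = sorted(strings)
print(codepoint_order)  # ['#a', '#b', '-a', '-b', 'a', 'b']

# Under the "C" locale, collation of ASCII strings is expected to
# match plain code-point order, which is what the R fix relies on.
locale.setlocale(locale.LC_COLLATE, "C")
c_locale_order = sorted(strings, key=locale.strxfrm)
print(c_locale_order)

# A language-specific locale may collate punctuation differently.
# "en_US.UTF-8" is an assumed locale name; it may be missing.
try:
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
    print(sorted(strings, key=locale.strxfrm))
except locale.Error:
    print("en_US.UTF-8 locale not available on this system")
finally:
    # Restore the "C" collation so later sorts are unaffected.
    locale.setlocale(locale.LC_COLLATE, "C")
```

If the last sort prints a different order from the first two, that difference is the same kind of locale-dependent collation that R applies by default.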

Original post

While working on a project that involves multiple programming languages, I noticed some tests failing in one language but not in the others. Further investigation revealed that this was because R's default sort order for strings differs from everyone else's.

I have no idea how to say to R 'Use the sort order that everyone else is using'. Suggestions welcomed.

R 3.3.2

sort(c("#b","-b","-a","#a","a","b"))

[1] "-a" "-b" "#a" "#b" "a" "b"

Python 3.6

sorted(["#b","-b","-a","#a","a","b"])

['#a', '#b', '-a', '-b', 'a', 'b']


MATLAB 2018a

sort([{'#b'},{'-b'},{'-a'},{'#a'},{'a'},{'b'}])

ans =
1×6 cell array
{'#a'} {'#b'} {'-a'} {'-b'} {'a'} {'b'}

C++

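#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

int main(){

std::string mystrs[] = {"#b","-b","-a","#a","a","b"};
std::vector<std::string> stringarray(mystrs,mystrs+6);
std::vector<std::string>::iterator it;

// std::sort uses operator< on std::string, which compares character values
std::sort(stringarray.begin(),stringarray.end());

for(it=stringarray.begin(); it!=stringarray.end();++it) {
   std::cout << *it << " ";
}

return 0;
}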

Result:

#a #b -a -b a b
  1. May 24th, 2018 at 23:07

    There’s a whole analysis of R’s sorting implementation, along with trying a bunch of implementations out in Julia, at this post: https://www.codementor.io/zhuojiadai/julia-vs-r-vs-python-string-sort-performance-an-unfinished-journey-to-optimizing-julia-s-performance-f57tf9f8s

  2. P. Fonseca
    May 28th, 2018 at 08:14

    And in the Wolfram Language (WL):

    In[]:= Sort[{"#b", "-b", "-a", "#a", "a", "b"}]
    Out[]= {"#a", "-a", "a", "#b", "-b", "b"}