File character counts

Once in a while I need to know what characters are in a file and how often each appears.

One reason I might do this is to look for statistical anomalies.

Another reason might be to see whether a file has any characters it’s not supposed to have, which is often the case.

A few days ago Fatih Karakurt left an elegant solution to this problem in a comment:

    fold -w1 file | sort | uniq -c

The fold command breaks the content of a file into lines 80 characters long by default, but you can specify the line width with the -w option.

Setting that to 1 makes each character its own line.

Then sort prepares the input for uniq, and the -c option causes uniq to display counts.
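As a quick sketch of the pipeline in action, suppose we have a small sample file (the file name and contents here are made up for illustration):

```shell
# Create a tiny sample file. (Hypothetical file name and contents.)
printf 'abba\n' > sample.txt

# fold -w1 puts one character per line, sort groups identical
# characters together, and uniq -c counts each group.
fold -w1 sample.txt | sort | uniq -c
```

Each line of output is a count followed by the character it counts, so here each of a and b appears twice.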

This works on ASCII files but not on Unicode files: fold counts bytes rather than characters, so it can split a multi-byte UTF-8 character in the middle.

For a Unicode file, you might do something like the following Python code.

    import collections

    count = collections.Counter()
    file = open("myfile", "r", encoding="utf8")
    for line in file.readlines():
        for c in line.strip("\n"):
            count[ord(c)] += 1
    for p in sorted(list(count)):
        print(chr(p), hex(p), count[p])
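To see the same idea without a file, here is the counting logic applied to an in-memory string (the sample string is made up for the example):

```python
import collections

# Hypothetical sample text containing a non-ASCII character.
text = "naïve"

# Count code points rather than bytes, so multi-byte
# UTF-8 characters are counted correctly.
count = collections.Counter(ord(c) for c in text)

for p in sorted(count):
    print(chr(p), hex(p), count[p])
```

Printing the code point in hex alongside the character makes it easy to spot look-alike characters that differ only in encoding.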

