Word frequency

Note: much of the following exposition is no longer critical to the task as the requirements have been updated, but is left here for historical and informational reasons.

This is slightly trickier than it first appears. The task specifically states: "A word is a sequence of one or more contiguous letters", so contractions and hyphenated words are broken up. Initially we might reach for a regex matcher like /\w+/, but \w includes the underscore, which is not a letter but connector punctuation, and this text is full of underscores since that is how Project Gutenberg texts denote italicized text. The underscores are not actually part of the words; they are markup.
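As a small illustration of the underscore problem, the sketch below runs /\w+/ over a made-up line (the sample text is hypothetical, merely imitating the Gutenberg italics markup):

```raku
# Hypothetical sample line imitating Project Gutenberg's _italic_ markup.
my $line = "he said _never_ and it's over";

# \w treats the underscore as a word character, so the markup stays
# glued to the word, while the apostrophe splits the contraction.
say $line.comb(/ \w+ /);
```

This should print (he said _never_ and it s over): "_never_" would be tallied separately from a plain "never".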

We might try /[A-Za-z]+/ as a matcher, but this text is bursting with French words containing various accented glyphs. Those are letters, so words will be incorrectly split up. (Misérables will be counted as 'mis' and 'rables', probably not what we want.)

Actually, in this case /[A-Za-z]+/ returns very nearly the correct answer. Unfortunately, the name "Alèthe" appears once (only once!) in the text and gets incorrectly split into "Al" and "the", so the matcher reports 41089 occurrences of "the", one too many. The text also has several words like "Panathenæa", "ça", "aérostiers" and "Keksekça", so the counts for 'a' are off too. The other 8 of the top 10 are "correct" using /[A-Za-z]+/, but mostly by accident.
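The splitting is easy to reproduce on a few of the words mentioned above. This is only a minimal sketch; the words are already lowercased here, as they would be after the program lowercases the file:

```raku
# Lowercased sample words taken from the discussion above.
for <misérables alèthe ça> -> $word {
    # <[a..z]>+ stops at every accented letter, leaving ASCII fragments.
    say "$word => ", $word.comb(/ <[a..z]>+ /).join(' | ');
}
```

Each accented letter acts as a word boundary, so "alèthe" contributes a spurious "the" and "ça" a spurious "a".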

A more accurate regex matcher would be some kind of Unicode-aware /\w/ minus underscore. Depending on your requirements, it may also be useful to recognize contractions with embedded apostrophes, hyphenated words, and hyphenated words broken across lines.
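A rough sketch of what "\w minus underscore" buys us, on a hypothetical sample string. The second matcher, built on the Unicode general category L, is an assumption on my part about how a strict "letters only" reading of the task could be expressed; it is not one of the matchers compared in this entry:

```raku
# Hypothetical sample containing markup, an accent, a contraction and digits.
my $text = "_misérables_ can't be chapter 12";

# Word characters minus the underscore: accents survive and the markup is
# dropped, but digits still count because \w also matches them.
say $text.comb(/ <[\w]-[_]>+ /);

# Letters only, via the Unicode general category L (an alternative sketch,
# not a matcher taken from this entry).
say $text.comb(/ <:L>+ /);
```

The first call should yield (misérables can t be chapter 12) and the second the same list without "12". Which one is appropriate depends on whether numbers should count as words for your purposes.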

Here is a sample that shows the results of using several different matchers.

sub MAIN ($filename, $top = 10) {
    # Read the whole file, lowercase it, and rejoin hyphenated words that were broken across lines
    my $file = $filename.IO.slurp.lc.subst(/ (<[\w]-[_]>'-')\n(<[\w]-[_]>) /, {$0 ~ $1}, :g );
    my @matcher = (
        rx/ <[a..z]>+ /,    # simple 7-bit ASCII
        rx/ \w+ /,          # word characters with underscore
        rx/ <[\w]-[_]>+ /,  # word characters without underscore
        rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /   # word characters without underscore but with hyphens and contractions
    );
    for @matcher -> $reg {
        say "\nTop $top using regex: ", $reg.perl;
        .put for $file.comb( $reg ).Bag.sort(-*.value)[^$top];
    }
}

Passing in the file name and 10:

Output:

Top 10 using regex: rx/ <[a..z]>+ /
the     41089
of      19949
and     14942
a       14608
to      13951
in      11214
he      9648
was     8621
that    7924
it      6661

Top 10 using regex: rx/ \w+ /
the     41035
of      19946
and     14940
a       14577
to      13939
in      11204
he      9645
was     8619
that    7922
it      6659

Top 10 using regex: rx/ <[\w]-[_]>+ /
the     41088
of      19949
and     14942
a       14596
to      13951
in      11214
he      9648
was     8621
that    7924
it      6661

Top 10 using regex: rx/ <[\w]-[_]>+[["'"|'-'|"'-"]<[\w]-[_]>+]* /
the     41081
of      19930
and     14934
a       14587
to      13735
in      11204
he      9607
was     8620
that    7825
it      6535