split

Some aspects of the split command are documented in section 21.9 of the book Unix Power Tools.

The split command takes a large file and splits it into smaller files, usually of the same size. For instance, suppose that we want to split the file containing all of the donations to federal election campaigns in the year 2020 into smaller pieces, in case we need to import the data a little bit at a time, or just want to take a sample. We see that this file has more than 96 million donations:

wc /anvil/projects/tdm/data/election/itcont2020.txt

To split the file into 97 files of at most 1 million lines each, we first copy the file into our scratch directory (you will need to change x-mdw to your Anvil username, being sure to keep the x- at the start; for instance, Dr. Ward’s login is mdw, so his username is x-mdw):

cp /anvil/projects/tdm/data/election/itcont2020.txt /anvil/scratch/x-mdw/itcont2020.txt

We work in our scratch directory because this file is more than 18 GB in size, and the output files from split take up that much space again.

Then we split the file into 97 files. The -l option indicates the number of lines to put into each file.

cd /anvil/scratch/x-mdw/
split -l 1000000 itcont2020.txt

We can check that each file has 1000000 lines, except for the very last file, named xds, which has 467122 lines. Because the total number of lines in the original file, 96467122, is not exactly divisible by 1000000, the last file holds only the leftover lines.

wc x*
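The file count and the size of the last file follow from integer arithmetic on the numbers above. A small sketch (the variable names are only for illustration):

```shell
# Line count reported by wc for itcont2020.txt, and the value passed to split -l
total_lines=96467122
lines_per_file=1000000

# Ceiling division gives the number of output files
num_files=$(( (total_lines + lines_per_file - 1) / lines_per_file ))

# The remainder gives the number of lines in the last file
last_file_lines=$(( total_lines % lines_per_file ))

echo "files: $num_files"                    # files: 97
echo "lines in last file: $last_file_lines" # lines in last file: 467122
```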

By default, the split command uses x as the first letter of the output file names, followed by two more letters, in alphabetical order. So the first file is called xaa, followed by xab, xac, xad, xae, etc.
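To see the default naming without waiting on an 18 GB file, it can help to try split on a tiny file first. The sketch below assumes a throwaway directory and a hypothetical 10-line file called demo.txt:

```shell
# Work in a throwaway directory so the demo files do not mix with real data
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical 10-line input file containing the numbers 1 through 10
seq 1 10 > demo.txt

# 3 lines per file gives 4 output files; the last one holds the 1 leftover line
split -l 3 demo.txt

# List the pieces and count the lines in each one
ls x*
wc -l x*
```

The pieces should be named xaa, xab, xac, and xad, with the leftover line in xad.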

Sometimes we want to specify the names of the output files. For instance, suppose that we want each file name to start with smalldonationfile instead of x. Then we can run:

split -l 1000000 itcont2020.txt smalldonationfile

Notice that split now uses the prefix smalldonationfile with the same suffixes aa, ab, ac, etc., as before, so that the file names are smalldonationfileaa, smalldonationfileab, smalldonationfileac, etc.

If you prefer to index your files with numbers instead of aa and ab and ac, you can use the -d option:

split -d -l 1000000 itcont2020.txt smalldonationfile

This produces the files smalldonationfile00, smalldonationfile01, smalldonationfile02, etc., but notice that the last few files have unexpected file names, namely smalldonationfile9000 through smalldonationfile9006.
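The jump to four-digit suffixes can be reproduced on a small scale. The sketch below assumes the GNU coreutils version of split (other implementations may behave differently) and uses a hypothetical 95-line file and the made-up prefix part. When no -a option is given, GNU split appears to stop the two-digit names at 89 and continue with longer suffixes from 9000 onward, which keeps the names in sorted order:

```shell
# Work in a throwaway directory so the demo files do not mix with real data
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical 95-line input; one line per file gives 95 output files
seq 1 95 > demo.txt
split -d -l 1 demo.txt part

# Show the last few names, where the suffix length changes
ls part* | tail -n 7

# Count the output files
file_count=$(ls part* | wc -l)
echo "files created: $file_count"
```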

We can (instead) use:

split -a 2 -d -l 1000000 itcont2020.txt smalldonationfile

to specify that each file should have a 2-digit number extension, or

split -a 4 -d -l 1000000 itcont2020.txt smalldonationfile

to specify that each file should have a 4-digit number extension.
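As a quick check of the -a option, the same kind of small experiment can be run with a fixed suffix length. This is a sketch on a hypothetical 95-line file with the made-up prefix part, not the donation data:

```shell
# Work in a throwaway directory so the demo files do not mix with real data
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical 95-line input; one line per file gives 95 output files
seq 1 95 > demo.txt

# Fixed 4-digit numeric suffixes, so every name has the same length
split -a 4 -d -l 1 demo.txt part

first_file=$(ls part* | head -n 1)
last_file=$(ls part* | tail -n 1)
echo "$first_file $last_file"   # part0000 part0094
```

With a fixed suffix length, all of the names have the same width, so they sort in the same order in which they were written.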

Shuffling a file before splitting it

When we are taking samples of files, it is often desirable to randomize the lines in the file first. In this way, the data in our sample is more likely to be representative of the entire data set. We can use the shuf command to do this, for instance, like this:

shuf itcont2020.txt >myshuffledfile.txt

split -a 4 -d -l 1000000 myshuffledfile.txt myshuffleddonations.txt

Note that shuf will only work if you have enough memory in your Jupyter Lab session to read the entire file (which is more than 18 GB) into memory!
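If only a single random sample is needed, rather than a full set of shuffled pieces, shuf can also draw a fixed number of lines directly with its -n option, which avoids writing a shuffled copy of the whole file. The sketch below uses a hypothetical 1000-line file. Whether this needs less memory on the full donation file depends on the version of shuf and is not tested here.

```shell
# Throwaway directory and a hypothetical 1000-line input file
demo_dir=$(mktemp -d)
seq 1 1000 > "$demo_dir/demo.txt"

# Draw 100 lines at random; shuf does not repeat a line unless -r is given
shuf -n 100 "$demo_dir/demo.txt" > "$demo_dir/sample.txt"

# Count the sampled lines, and the distinct lines among them
sample_lines=$(wc -l < "$demo_dir/sample.txt")
unique_lines=$(sort -u "$demo_dir/sample.txt" | wc -l)
echo "sampled: $sample_lines, distinct: $unique_lines"
```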