split
Some aspects of the split command are documented in Section 21.9 of the Unix Power Tools book.
The split command is useful for taking a large file and splitting it into smaller files, usually of the same size. For instance, suppose that we want to take the file containing all of the donations to federal election campaigns in the year 2020 and split it into smaller pieces, for example because we need to import the data a little bit at a time, or because we just want to take a sample. We see that this file has more than 96 million donations (the first number that wc prints is the line count):
wc /anvil/projects/tdm/data/election/itcont2020.txt
To split the file into 97 files of at most 1 million lines each, we can first copy the file into our scratch directory (here, you will need to change x-mdw to your Anvil username, and be sure to put the x- at the start; for instance, Dr Ward's login is mdw, so his username is x-mdw):
cp /anvil/projects/tdm/data/election/itcont2020.txt /anvil/scratch/x-mdw/itcont2020.txt
We make a copy of the file to our scratch directory because this file is more than 18 GB in size.
Then we split the file into 97 files. The -l option is used to indicate the number of lines we want to put into each file.
cd /anvil/scratch/x-mdw/
split -l 1000000 itcont2020.txt
We can check that each file has 1000000 lines (except for the very last file, named xds, which has 467122 lines). Because the total number of lines in the original file, 96467122, is not exactly divisible by 1000000, we know that the last file will have a smaller number of lines.
wc x*
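The file count and the size of the last file follow from simple integer arithmetic on the numbers above. A minimal sketch using shell arithmetic (the variable names are only for illustration):

```shell
total=96467122      # total number of lines in itcont2020.txt, as reported by wc
per_file=1000000    # the value given to split -l

# Ceiling division: number of output files that split will create
files=$(( (total + per_file - 1) / per_file ))

# Remainder: number of lines left over for the final file
# (a remainder of 0 would mean the final file is full)
last=$(( total % per_file ))

echo "number of files: $files"
echo "lines in the last file: $last"
```

This prints 97 for the number of files and 467122 for the last file, matching the wc output described above.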
By default, the split command uses x as the first letter of the output files, followed by two more letters, in alphabetical order. So the first file is called xaa, and then come xab, xac, xad, xae, etc.
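To see the default naming without touching the 18 GB file, it can help to try split on a tiny file first. The sketch below assumes a throwaway directory made with mktemp and a hypothetical 10-line file called demo.txt:

```shell
# Work in a temporary directory so that no real files are overwritten
workdir=$(mktemp -d)
cd "$workdir"

# Build a 10-line file containing the numbers 1 through 10
seq 1 10 > demo.txt

# Split into pieces of 3 lines each, using the default prefix x
split -l 3 demo.txt

# List the pieces and count the lines in each one
ls x*
wc -l x*
```

Here the pieces are xaa, xab, xac, and xad. The first three hold 3 lines each, and xad holds the single leftover line.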
Sometimes we want to specify the names of the output files. For instance, suppose that we want each file name to start with smalldonationfile instead of x. Then we can run:
split -l 1000000 itcont2020.txt smalldonationfile
Notice that split now uses the prefix smalldonationfile with the same aa, ab, ac, etc., as before, so that the file names are smalldonationfileaa, smalldonationfileab, smalldonationfileac, etc.
If you prefer to index your files with numbers instead of aa, ab, ac, etc., you can use the -d option:
split -d -l 1000000 itcont2020.txt smalldonationfile
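This produces the files smalldonationfile00, smalldonationfile01, smalldonationfile02, etc., but notice that the last few files have unexpected file names, namely, smalldonationfile9000 through smalldonationfile9006. This happens because, when no suffix length is given, GNU split starts with 2-digit suffixes and switches to longer ones after 89, so that it never runs out of names.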
We can (instead) use:
split -a 2 -d -l 1000000 itcont2020.txt smalldonationfile
to specify that each file should have a 2-digit number extension, or
split -a 4 -d -l 1000000 itcont2020.txt smalldonationfile
to specify that each file should have a 4-digit number extension.
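The difference between the default suffixes and a fixed suffix length can be reproduced on a small file. This is only a sketch, assuming GNU split (as on Anvil) and a hypothetical 97-line file called demo.txt, so that splitting one line per piece gives 97 pieces, the same count as in the donation example. The prefixes auto and fixed are made up for the demonstration:

```shell
# Work in a temporary directory so that no real files are overwritten
workdir=$(mktemp -d)
cd "$workdir"

# 97 lines, so that -l 1 produces 97 pieces
seq 1 97 > demo.txt

# Without -a: GNU split starts with 2 digits and lengthens the suffix as needed
split -d -l 1 demo.txt auto
ls auto* | tail -n 3

# With -a 2: every suffix has exactly 2 digits (enough for up to 100 pieces)
split -a 2 -d -l 1 demo.txt fixed
ls fixed* | tail -n 3
```

With GNU split, the first listing should end in the 9000-style names described above, while the second ends with fixed96. If the chosen suffix length is too short for the number of pieces, split stops with an error rather than inventing longer names, so it is worth computing the number of pieces first.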
Shuffling a file before splitting it
When we are taking samples of files, it is often desirable to randomize the lines in the file first. In this way, the data in our sample is more likely to be representative of the entire data set. We can use the shuf command to do this, for instance, like this:
shuf itcont2020.txt >myshuffledfile.txt
split -a 4 -d -l 1000000 myshuffledfile.txt myshuffleddonations.txt
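Because split appends its suffix after the prefix, the output files are named myshuffleddonations.txt0000, myshuffleddonations.txt0001, etc.
Note that shuf will only work if you have enough memory in your Jupyter Lab session to read the entire file (which is more than 18 GB) into memory!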
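Before shuffling the full file, the shuffle-then-split workflow can be checked on a small example. The sketch below uses a hypothetical 100-line file called demo.txt and a made-up prefix, sample. It confirms that the pieces, taken together, still contain exactly the original lines:

```shell
# Work in a temporary directory so that no real files are overwritten
workdir=$(mktemp -d)
cd "$workdir"

# A 100-line file containing the numbers 1 through 100, in order
seq 1 100 > demo.txt

# Randomize the order of the lines, then split into pieces of 25 lines
shuf demo.txt > shuffled.txt
split -a 2 -d -l 25 shuffled.txt sample

# Recombine the pieces and sort them numerically;
# the result should be identical to the original file
cat sample* | sort -n > recombined.txt
cmp demo.txt recombined.txt && echo "no lines lost"

# Each of the 4 pieces should contain 25 lines
wc -l sample*
```

Any one of the pieces (for instance, sample00) is then a random sample of 25 lines from the original file.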