how to split a file into multiple files in linux

Quite often you encounter files, usually text files, that are just too big to handle. Even simple operations, such as opening one of these files in a text editor, can eat into the machine's memory and make it sluggish. Searching and other text processing inside such files also takes more time and resources.

Log files are a good example. Most of the time you only need part of the file for processing or analysis. We will look at the different options you have for splitting a large file into several smaller files.

The best option for this is split, which is part of GNU coreutils (it is a standalone program, not a bash feature). The split command lets you split by lines, by size, or by the number of smaller files you need. A related utility, csplit, can also be used.

Other options are text processing commands such as head, tail, awk or sed. None of these is as friendly as split, but they sometimes work better, depending on your requirements. We will deal mostly with the split command in this post, because it is the most useful and friendly tool for this job.

Here is the generic syntax of the split command. We will look at some specific options at the end, after going through the examples.

$ split [options] <source file> <output prefix>

Although I use text files as examples in this post, you can use these commands on any type of file, including binary files. Keep in mind that the line-based options only make sense for files that actually contain lines.

split files based on # of lines

Let's say we want to split the file into several files based on a predetermined number of lines. This works best if the file contains lines separated by the end-of-line character, as it usually does. Let's split our big file (e.g. bigfile.txt) into files with 275 lines each.

$ split -l 275 bigfile.txt

This will split the file into several files named xaa, xab, xac and so on, each of which contains 275 lines (the last one holds whatever is left over). If you don't specify the number of lines (-l), the default is 1000. The default output file prefix is x. I usually like to specify a prefix that ends with "-", but that is up to you.

$ split -l 275 bigfile.txt smallfile-

If you prefer numerical suffixes instead of the character suffixes in the output, use the -d option with the command.

$ split -l 275 -d bigfile.txt smallfile-

Sometimes you want a specific extension for your files, such as .txt. You can do this with the command line option --additional-suffix, which is available in GNU split.

$ split -l 275 -d bigfile.txt smallfile- --additional-suffix=.txt
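If you want to try the line-based options without touching a real file, here is a minimal sketch. It assumes GNU coreutils; the scratch directory and the 1000-line sample file generated with seq are stand-ins for illustration only.

```shell
# Work in a scratch directory so the output pieces are easy to inspect.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a 1000-line sample file standing in for bigfile.txt.
seq 1 1000 > bigfile.txt

# 275 lines per piece, numeric suffixes, .txt extension.
split -l 275 -d --additional-suffix=.txt bigfile.txt smallfile-

# Show the line count of every piece:
# three pieces of 275 lines and a last piece with the remaining 175.
wc -l smallfile-*.txt
```

The last piece is shorter because 1000 is not a multiple of 275.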

As I said before, some other Linux commands such as head or tail can be used as well, depending on your requirement. If you want only the first 275 lines of the file and do not care about the rest, then you could use the head command and redirect the output to a file.

$ head -n 275 bigfile.txt > smallfile.txt

Now, if you want only the last 275 lines of the file and do not care about the rest then you could use the tail command.

$ tail -n 275 bigfile.txt > smallfile.txt
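head and tail cover the two ends of a file. For a slice from the middle, sed (one of the other commands mentioned earlier) can print a line range. The following is only a sketch: the sample file and the output names middle.txt and middle2.txt are made up for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 1000-line sample file standing in for bigfile.txt.
seq 1 1000 > bigfile.txt

# Print only lines 276 through 550, then quit so sed does not read the rest.
sed -n '276,550p;550q' bigfile.txt > middle.txt

# The same slice with head and tail:
# keep the first 550 lines, then the last 275 of those.
head -n 550 bigfile.txt | tail -n 275 > middle2.txt

wc -l middle.txt middle2.txt
```

Both approaches give the same 275 lines; the sed version is easier to adapt when you think in terms of start and end line numbers.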

Another helpful utility is csplit. You can give it a line number, as in the example below, and the file will be broken into two output files named xx00 and xx01. The line number you give becomes the first line of the second file, so 276 puts the first 275 lines in xx00 and the rest of the file content in xx01.

$ csplit bigfile.txt 276
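It is easy to be off by one with csplit line numbers, because the number names the first line of the next piece rather than the last line of the current one. A quick sketch to check where the boundary falls, assuming a sample file generated with seq:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 1000-line sample file standing in for bigfile.txt.
seq 1 1000 > bigfile.txt

# A line-number argument names the first line of the NEXT piece,
# so 276 puts lines 1-275 in xx00 and lines 276-1000 in xx01.
# -s suppresses the byte counts csplit normally prints.
csplit -s bigfile.txt 276

wc -l xx00 xx01
head -n 1 xx01
```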

split files based on size

Many times you want to divide the file based on size and not on the number of lines. You can do that with the split command as well. The -b or --bytes option allows you to specify the size of the resulting files.

Let's say we want to divide the file into several files, each of which is 5k in size (split treats k as 1024 bytes, so each piece is 5120 bytes, with the remainder in the last one). You can specify the size in bytes, kilobytes, megabytes and so on.

$ split -b 5k bigfile.txt smallfile-

All the options that we saw before, such as -d and --additional-suffix, work with the size option as well.
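Here is a small sketch of the size-based split that prints the resulting piece sizes. It assumes GNU split and uses a generated sample file; the file names are placeholders.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample text file of a bit over 13 KB standing in for bigfile.txt.
seq 1 3000 > bigfile.txt

# 5k means 5 * 1024 = 5120 bytes per piece; the last piece gets the remainder.
split -b 5k -d bigfile.txt smallfile-

# Show the byte count of every piece next to the original.
wc -c smallfile-* bigfile.txt
```

Note that a byte-based split pays no attention to line endings, so a piece can end in the middle of a line.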

split into specific number of files

Sometimes you just want to split the file into a specific number of equally sized files, regardless of their size or length. The command line option -n or --number allows you to do this.

If you want to split the file into 2 equally sized files, then you can do something like this:

$ split -n 2 -d bigfile.txt smallfile-

Of course, to split it into more files you specify a larger number with the -n option. One issue with splitting like this is that the cut points fall on byte boundaries, so a line can end up split across two files. In most cases, you want each line to stay whole within a single output file.

$ split -n l/10 -d bigfile.txt smallfile-

The above example will split the file into 10 files of roughly equal size while preserving the lines, which means that no line is split between files. The sizes can differ a little because each cut is moved to a line boundary. Just to be clear, the l in the argument is a lowercase L.
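To convince yourself that no line gets cut, you can check that every piece ends with a newline. This is only a sketch, assuming GNU split and a generated sample file:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 1000-line sample file standing in for bigfile.txt.
seq 1 1000 > bigfile.txt

# Ten pieces, cut only at line boundaries.
split -n l/10 -d bigfile.txt smallfile-

for f in smallfile-*; do
    # tail -c 1 prints the last byte of the piece. Command substitution
    # strips a trailing newline, so an empty result means the piece
    # ends on a line boundary.
    if [ -z "$(tail -c 1 "$f")" ]; then
        echo "$f: ends on a line boundary"
    fi
done

# The byte counts are close to each other but not necessarily identical.
wc -c smallfile-*
```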

Sometimes you want just part of the file and not the entire file. For example, if you want to split the file into 4 equal parts but are only interested in the 3rd part, then you could do something like:

$ split -n l/3/4 bigfile.txt > myfile.txt

We use the redirection operator (>) to create and redirect the output to a file named myfile.txt which now contains only the 3rd part (out of 4) of the bigfile.txt.
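One way to check that the extracted part really is the third of four is to do the full four-way split as well and compare. A sketch, assuming GNU split; part3.txt and the quarter- prefix are names chosen just for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 1000-line sample file standing in for bigfile.txt.
seq 1 1000 > bigfile.txt

# Write only the 3rd of 4 line-aligned parts to standard output.
split -n l/3/4 bigfile.txt > part3.txt

# For comparison, write all four parts; with -d the third one is quarter-02.
split -n l/4 -d bigfile.txt quarter-

cmp part3.txt quarter-02 && echo "part3.txt matches the third of four pieces"
```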

split based on content

Another common use case is when you want to split based on the content of the file. This is a specialized use case, but can be very useful. The utility named csplit can be used to split files into sections determined by the context or content of the lines.

The generic syntax of the csplit command is

$ csplit [options] <source file> <pattern>...

So, as an example, if you want to split a file wherever a line starts with the text Error, then you could do:

$ csplit -k bigfile.txt '/^Error/' '{*}'

The above command will split the file whenever it finds a line that starts with the word Error. The argument '/^Error/' is the regular expression we are matching against. The next argument, '{*}', specifies how many times the match should be repeated; the * means it is repeated as many times as possible, until the end of the file. The -k option keeps the output files that were already written if csplit stops with an error.

You could also specify something like '{4}', which repeats the pattern 4 more times. The file is then split at the first 5 lines that start with Error, and everything from the fifth match onward goes into the last output file.
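Here is a small sketch of a content-based split on a tiny log. The log lines are invented for the example, and the '{*}' repeat count assumes GNU csplit.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A small hypothetical log with three lines that start with "Error".
cat > bigfile.txt <<'EOF'
Info: service started
Error: disk full
Info: retrying
Error: timeout
Info: recovered
Error: connection lost
Info: shutting down
EOF

# -s silences the byte counts, -k keeps the pieces if csplit stops early.
# Each piece ends just before the next line that starts with "Error".
csplit -s -k bigfile.txt '/^Error/' '{*}'

# xx00 holds everything before the first match;
# every later piece starts with an Error line.
wc -l xx*
```

With this input you get four pieces: the single line before the first Error, then one piece per Error line along with the lines that follow it.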

Advanced Options of Split

We have already seen some of these options in the examples above, but I will summarize them again along with some other useful options.

-l or --lines: the number of lines per output file.
-b or --bytes: the number of bytes per output file. You can add a unit suffix such as K or M (powers of 1024) or KB or MB (powers of 1000).
-a or --suffix-length: the number of characters to use in the suffix. The default is 2, and the suffixes run 'aa', 'ab', 'ac' and so on.
-d or --numeric-suffixes: use numeric suffixes instead of letters. The suffixes will be '00', '01', '02' and so on.
-n or --number: the number of files the source should be broken into. You can also use a format such as 'l/N' or 'l/K/N', which splits on line boundaries and, in the second form, prints only section K of N to standard output.
-t or --separator: use a different character as the record separator instead of the standard newline. This is helpful if you have a file where all the content is on a single line.
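The -a and -t options are the ones not shown in the earlier examples, so here is a small sketch that combines them. It assumes a reasonably recent GNU split (the -t option is not available in older versions), and the comma-separated sample data is invented for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# One long line of comma-terminated records, with no newlines at all.
printf 'a,b,c,d,e,f,g,' > records.txt

# Treat "," as the record separator, put 3 records in each piece,
# and use 3-digit numeric suffixes: rec-000, rec-001, rec-002.
split -t ',' -l 3 -d -a 3 records.txt rec-

# Print each piece; the separators stay in the output.
for f in rec-*; do
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
```

The seven records end up as three pieces of 3, 3 and 1 records.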

There are also several graphical utilities that can do the same, if you prefer not to use the command line. Some of the ones that I have seen are HJSplit, HOZ (Hacha Open Zource), and Gnome Split. You can give any of them a try. The reverse operation is possible too: if you already have several smaller files, you can merge or join them into a larger file in a similar fashion.
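On the command line, one common way to join the pieces back together is cat. This is a sketch rather than the only way to do it; it assumes the pieces were produced by split, and rejoined.txt is just a name picked for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Split a 1000-line sample file into pieces first.
seq 1 1000 > bigfile.txt
split -l 275 -d bigfile.txt smallfile-

# The shell expands smallfile-* in sorted order,
# which matches the order in which split wrote the pieces.
cat smallfile-* > rejoined.txt

# cmp is silent and returns success when the two files are identical.
cmp bigfile.txt rejoined.txt && echo "rejoined.txt is identical to bigfile.txt"
```

Because the suffixes sort in the order they were written, a plain glob is enough here. If you rename the pieces, make sure the names still sort correctly before joining them.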