Post

wc in Golang

So I recently came across codingchallenges.fyi through some Reddit post - if memory serves me right. As someone with the attention span of a skibidi-toilet kid, I found it to be perfect for me. Rather than being a full walkthrough about how to build something, the author merely provides a little overview, and the increments in which you should develop the project. The minimal handholding leaves it open for you to research on your own and approach the project as you wish without having to follow the author’s design. So I decided to dabble into it today, and completed the first project - the wc unix command. If you have ever messed around with Linux you’ll probably recognize it as one of the coreutils. Enough yapping done I guess, let’s move on to the implementation.

In case you just want the code, it’s available here on my github.

Step 0

Setting up the environment

I’m going to be using Golang for this project(duh), and also for all my future projects unless I can come over the Rust skill issue or sell my soul to Python. I’m using Go 1.22 (haha arch btw) but any version not built by Jesus himself would work. If you still want a newer version, you can use nix or brew(if you’re on mac), or maybe Docker. Or maybe just get the latest version from the website, gah. Also if you’re on Windows and not using wsl, just use wsl.

So, creating the project directory and initializing the go module.

1
2
3
~ ❭❭❭ cd Projects/
Projects ❭❭❭ mkdir gowc && cd gowc
gowc ❭❭❭ go mod init github.com/USERNAME/gowc

I use Goland but you may as well use VSCode, vim, helix or whatever fits you.

Create a .gitignore. I usually add .idea/ or .vscode/ to it. Download the sample text provided and place it in the root of your project. Add it to .gitignore if you wish.

Step 1

Outputting the number of bytes

So in the first iteration, all our tool needs to do is accept a flag (-c just like original wc) and output the number of bytes - easy right!

Here let’s just get this running so you can get a hang of how things work, we’ll refactor everything later.

First thing we need to do is, check the flags and only proceed if -c is provided.

Flags - to framework or not?

Here you have the option to either use something like cobra or ufave/cli in order to handle the cli flags, or just use flag from the stdlib.

Although the framework approach might be suitable for larger tool, I’d suggest just using the stdlib for our dead simple tool.

So let’s populate the main.go file. I usually place it in a directory named cmd. You do as you please, you can always reference the repo later on!

So starting off

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
)

const (
	SUCCESS = 0
	FAILURE = 1
)

func main() {
	//filepath check
	filePath := os.Args[len(os.Args)-1]
	fd, err := os.Open(filePath)
	if err != nil {
		fmt.Println("Invalid filePath")
		os.Exit(FAILURE)
	}

	//byte count
	c := flag.Bool("c", false, "gowc -c filename")

	flag.Parse()
	var bytes int64

	if *c {
		fileinfo, err := fd.Stat()
		if err != nil {
			fmt.Println(err)
			os.Exit(FAILURE)
		}
		bytes = fileinfo.Size()
	}
	fmt.Printf("%v %v\n", bytes, filepath.Base(filePath))
	os.Exit(SUCCESS)
}

That sure’s a lot of code! Breaking it down.

After importing our packages, we declare the exit codes. In the main function:

  • First we need to check whether the file provided exists or not. If it doesn’t, os.Open returns an error and we quit nice and simple.
  • c := flag.Bool("c", false, "gowc -c filename") here I used the Bool function of the flag package. What does it do? Any half decent text editor with the appropriate plugins should show it nice and clear, but in case yours doesn’t, you can always use the docs. But anyways, as the docs describe, the Bool function takes 3 arguments and returns a pointer to a bool. The first argument is the flag, c in our case. The second is the default value, and third is the help message for the flag. The bool returned would be true if the user uses the flag, false otherwise.

    In case you are not comfortable reading documentation, spend some time and make sure you are. It’s probably the best decision you’ll make!

  • flag.Parse() parses the flags from os.Args. Color me surprised!
  • *c dereferences the pointer and returns its value. Using the fd we got from os.Open, we get the file metadata using fd.Stat, check for errors and then get the file content size(in bytes), print it and exit nice and sweet. Easy!

And that is it! Our app is complete(not really!!)!

Now to test it

1
2
 gowc ❭❭❭  go run cmd/*.go -c test.txt
 342190 test.txt

It Works!!

Step 2 and 3

Refactoring first

Now we did get our little tool running, but the code isn’t all that good. In fact, it’s pretty shit. So let’s refactor it first before moving on with further features.

Let’s move the byte counting logic out of the main function and into its own separate file. I named the new file file-util.go and placed it in the cmd/ directory.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
package main

import (
	"flag"
	"fmt"
	"os"
)

const (
	SUCCESS = 0
	FAILURE = 1
)

type App struct {
	Fd *os.File
	//byte count
	C bool
}

func main() {
	//Appwide config
	app := App{}

	//filepath check
	filePath := os.Args[len(os.Args)-1]
	fd, err := os.Open(filePath)
	if err != nil {
		fmt.Println("Invalid filePath")
		os.Exit(FAILURE)
	}
	app.Fd = fd

	//byte count
	flag.BoolVar(&app.C, "c", true, "gowc -c pathToFile")

	flag.Parse()

	counts, err := app.Generate()
	if err != nil {
		fmt.Println(err)
		os.Exit(FAILURE)
	}
	fmt.Println(counts)
	os.Exit(SUCCESS)
}

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
package main

import "strconv"

func (a *App) Generate() (string, error) {
	ret := ""
	var count int64
	var err error

	if a.C {
		count, err = a.ByteCount()
		if err != nil {
			return ret, err
		}
		ret += strconv.Itoa(int(count)) + " "
	}

	ret += a.Fd.Name()
	return ret, nil
}

func (a *App) ByteCount() (int64, error) {
	fileInfo, err := a.Fd.Stat()
	if err != nil {
		return -1, err
	}
	bytes := fileInfo.Size()
	return bytes, nil
}

The code is pretty self-explanatory for the most part. We moved the byte-counting logic into the ByteCount() function, and create a Generate() function that simply checks for the flags and generates the final string. Notice that both these functions are actually receivers(or methods) on the App struct which is declared in the main.go file. The struct merely holds the booleans for the flags as well as the file descriptor. Notice I changed the function call in main from flag.Bool() to flag.BoolVar(). Nice and simple!

Adding the line count

We’re going to add the word count and line count together in one function because it’s better than having to loop over a huge file twice. So setup the appropriate flags in the App struct and main function.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
...
type App struct {
	Fd *os.File
	//byte count
	C bool
	//Line count
	L bool
	//Word count
	W bool

	//Length
	fileLen int
}
...
	flag.BoolVar(&app.C, "c", false, "gowc -c pathToFile")
	flag.BoolVar(&app.L, "l", false, "gowc -l pathToFile")
	//need to fix
	//flag.BoolVar(&app.W, "w", false, "gowc -w pathToFile
...

To find the number of lines, simply read the file byte by byte and find the number of \n characters. To find the number of words, you compare with the previous byte and increment the counter if the last byte was a space(or tab, or newline, or carriage return) and the current isn’t.

The word count implementation wasn’t correct at this stage. I fixed it later on.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
	Generate() 
	//add after if a.C 
	if a.L || a.W {
		lcount, wcount := a.OtherCounts()
		if a.L {
			ret += strconv.Itoa(lcount) + " "
		}
		if a.W {
			ret += strconv.Itoa(wcount) + " "
		}
	}
	...
	// OtherCounts counts and returns number of lines and words
func (a *App) OtherCounts() (int, int) {
	//b := make([]byte, 1)
	fr := bufio.NewReader(a.Fd)

	var b, prev byte
	var err error
	wcount := 0
	lcount := 0

	for {
		b, err = fr.ReadByte()
		if err != nil {
			break
		}
		if b == '\n' {
			lcount++
		} else if b == ' ' && prev != ' ' {
			wcount++
		}
		prev = b
	}
	return lcount, wcount
}

If you feel like it, try fixing the word count on your own :)

I had issues in the word count even having reimplementing it correctly. As it turned out, I forgot to run locale-gen while installing Arch and my system wasn’t even using utf-8!!

Adding Tests

I wasn’t particularly interested in doing TDD neither did I want to perform unit testing of the individual functions. So I just added a few tests that execute the final binary and check the output. You can compare the output with wc output to make it work with any input text sample, but I didn’t feel like going all that way.

Create a file main_test.go in tests/ directory.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
package tests

import (
	"os/exec"
	"strings"
	"testing"
)

func TestByteCount(t *testing.T) {
	cmd := exec.Command("../build", "-c", "../test.txt")

	out, err := cmd.Output()
	if err != nil {
		t.Error(err.Error())
	}
	s := string(out)
	bef, _, _ := strings.Cut(s, " ")
	if bef != "335001" {
		t.Fail()
	}
}

func TestLineCount(t *testing.T) {
	cmd := exec.Command("../build", "-l", "../test.txt")

	out, err := cmd.Output()
	if err != nil {
		t.Error(err.Error())
	}
	s := string(out)
	bef, _, _ := strings.Cut(s, " ")
	if bef != "7147" {
		t.Fail()
	}
}

func TestWordCount(t *testing.T) {
	cmd := exec.Command("../build", "-w", "../test.txt")

	out, err := cmd.Output()
	if err != nil {
		t.Error(err.Error())
	}
	s := string(out)
	bef, _, _ := strings.Cut(s, " ")
	if bef != "58070" {
		t.Fail()
	}
}

My values are different than what they should’ve been because wc was giving incorrect output due to the locale issue on my system.

You can run the tests using

1
gowc ❭❭❭ go test tests/*.go -v

Fixing the word count

I took reference from the wc coreutils code, and as it turns out it uses the isspace() function. Looking it up, turns out it flags spaces, newlines, carriage returns, tabs, form feed (\f) and whatever \v is.

So I added an isspace function to file-utils.go, refactored the word count code to use isspace, and enabled the -w flag again.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
//
func isSpace(b byte) bool {
	return b == ' ' || b == '\n' || b == '\t' || b == '\v' || b == '\f' || b == '\r'
}
//
func (a *App) OtherCounts() (int, int) {
	//b := make([]byte, 1)
	fr := bufio.NewReader(a.Fd)

	var b, prev byte
	var err error
	wcount := 0
	lcount := 0
	prev = ' '

	for {
		b, err = fr.ReadByte()
		if err != nil {
			break
		}
		if b == '\n' {
			lcount++
		} else if !isSpace(b) && isSpace(prev) {
			wcount++
		}
		prev = b
	}
	return lcount, wcount
}

It works perfectly fine with the locales fixed! Nice and simple!!

Adding character count - step 4

Wait, isn’t that what -c does?

Well yes but actually no. -c has our tool print the number of bytes. But a character is just a byte, right? right??

Sadly it’s not that simple. I won’t go much into the details, this blog post highlights it pretty well. But in short, depending on the locale, a single character can be one byte, or more.

1
2
3
4
5
6
package main

func main() {
	s := "🥺"
	fmt.Printf("% x\n", s)
}

On running it

1
2
~ ❭❭❭ go run bottom.go
f0 9f a5 ba

The bottom emoji, even though it’s a single character, is actually 4 bytes long! As such, a file containing the emoji will definitely have different character count and byte count.

To tackle stuff like this, a Rune is defined in Go, which basically refers to a single character. The bottom emoji is 4 bytes long, but only makes up one rune! The character ‘a’ is one byte long, and also equates to a single rune.

Thus finding the number of characters is pretty easy in Go, you can just count the number of runes. All you need to do is read the file contents into a []byte buffer and pass it to the RuneCount() function provided by the unicode/utf8 package.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
func (a *App) CharacterCount() (int, error) {
	if !a.C {
		f, err := a.ByteCount()
		if err != nil {
			return -1, err
		}
		a.fileLen = int(f)
	}

	b := make([]byte, a.fileLen)
	//if Line and Word count generated, file pointer at end
	_, err := a.Fd.ReadAt(b, 0)
	if err != nil {
		return -1, err
	}
	return utf8.RuneCount(b), nil
}

You can make the necessary changes to add the flags and in Generate() function yourselves, it’s the same as before. You can anyways refer to the repo if you’re having trouble anytime.

Step 5 - Printing by default

We need to add a default case. In case no flags were given by the user, the tool should print the line count, word count and character count - equivalent of wc -l -c -w file. Very simple to implement.

1
2
3
4
5
6
7
	//in case no flags provided
	if flag.NFlag() == 0 {
		app.C = true
		app.W = true
		app.L = true
	}

Add the snippet before calling the app.Generate() function. Pretty simple!

Final Step - accept from Stdin

If no file was provided as an input, our tool should read from stdin and work upon the data it gets. This will let us pipe output from other tools into ours nicely, as is the UNIX way :)

1
2
~ ❭❭❭ cat file.txt | gowc -l -c
1234 23 file.txt

This ended up being trickier than I expected. Without using a buffer, it didn’t seem possible to calculate character count the same old way. So I ended up moving all the counts into a single Count function. Now it uses a buffer with a kind of sliding window approach, where the buffer constantly reads the next max_buf bytes(configurable with the -b flag) and increments the counters accordingly.

The final main function should look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
type App struct {
	Fd *os.File
	//byte count
	C bool
	//Line count
	L bool
	//Word count
	W bool
	//Character count
	M bool
	//max buffer size
	Max_buf int
}

func main() {
	//Appwide config
	app := App{}

	//byte count
	flag.BoolVar(&app.C, "c", false, "gowc -c pathToFile")
	flag.BoolVar(&app.L, "l", false, "gowc -l pathToFile")
	flag.BoolVar(&app.W, "w", false, "gowc -w pathToFile")
	flag.BoolVar(&app.M, "m", false, "gowc -m pathToFile")
	flag.IntVar(&app.Max_buf, "b", 1048576, "gowc -w -b=1024 pathToFile") //1mb default
	flag.Parse()

	//in case no flags provided
	if flag.NFlag() == 0 {
		app.C = true
		app.W = true
		app.L = true
	}
	//whether file provided
	if flag.NArg() == 0 {
		app.Fd = os.Stdin
	} else {
		filePath := os.Args[len(os.Args)-1]
		fd, err := os.Open(filePath)
		if err != nil {
			fmt.Println("Invalid filePath")
			os.Exit(FAILURE)
		}
		app.Fd = fd
	}

	counts, err := app.Generate()
	if err != nil {
		fmt.Println(err)
		os.Exit(FAILURE)
	}
	fmt.Println(counts)
	os.Exit(SUCCESS)
}

And the final cmd/file-util.go

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
package main

import (
	"bufio"
	"strconv"
	"unicode/utf8"
)

func isSpace(b byte) bool {
	return b == ' ' || b == '\n' || b == '\t' || b == '\v' || b == '\f' || b == '\r'
}

func (a *App) Generate() (string, error) {
	ret := ""
	//var count int64
	//var err error

	lcount, wcount, ccount, rcount := a.Counts()
	if a.L {
		ret += strconv.Itoa(lcount) + " "
	}
	if a.W {
		ret += strconv.Itoa(wcount) + " "
	}
	if a.C {
		ret += strconv.Itoa(ccount) + " "
	}
	if a.M {
		ret += strconv.Itoa(rcount) + " "
	}

	ret += a.Fd.Name()
	return ret, nil
}

// Counts counts and returns number of lines, words, characters and bytes
func (a *App) Counts() (int, int, int, int) {
	fr := bufio.NewReader(a.Fd)
	buf := make([]byte, a.Max_buf)

	var prev byte
	wcount := 0
	lcount := 0
	ccount := 0
	rcount := 0
	prev = ' '

	for {
		n, err := fr.Read(buf)
		ccount += n
		rcount += utf8.RuneCount(buf[0:n])
		for i := 0; i < n; i++ {
			if buf[i] == '\n' {
				lcount++
			}
			if !isSpace(buf[i]) && isSpace(prev) {
				wcount++
			}
			prev = buf[i]
		}
		if err != nil {
			break
		}

	}
	return lcount, wcount, ccount, rcount
}

gowc-ing a large file! Less than a second! Fairly efficient if you ask me

Well that’s all! This is my first time doing technical writing, so bear with me if I make a few mistakes :P See ya next time around

P.S. probably gonna make the redis server next.

This post is licensed under CC BY 4.0 by the author.