Matt's Haskell Learnings: LYAH Chapter 9b

Files and streams

getChar is an I/O action that reads a single character from the terminal.

getLine is an I/O action that reads a line from the terminal. These two are pretty straightforward and most programming languages have some functions or statements that are parallel to them. But now, let's meet

getContents. getContents is an I/O action that reads everything from the standard input until it encounters an end-of-file character. Its type is getContents :: IO String. What's cool about getContents is that it does lazy I/O.

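These two programs do much the same job: both print their input in upper case. The first prompts and loops with getLine; the second reads everything lazily with getContents.

import Control.Monad
import Data.Char
main = forever $ do
    putStr "Give me some input: "
    l <- getLine
    putStrLn $ map toUpper l

The getContents version:

import Data.Char
main = do
    contents <- getContents
    putStr (map toUpper contents)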

$ cat haiku.txt | ./capslocker
I'M A LIL' TEAPOT
WHAT'S WITH THAT AIRPLANE FOOD, HUH?
IT'S SO SMALL, TASTELESS

A program that takes some input and prints out only those lines that are shorter than 10 characters.

main = do
    contents <- getContents
    putStr (shortLinesOnly contents)

shortLinesOnly :: String -> String
shortLinesOnly input =
    let allLines = lines input
        shortLines = filter (\line -> length line < 10) allLines
        result = unlines shortLines
    in  result

interact takes a function of type String -> String as a parameter and returns an I/O action that will take some input, run that function on it and then print out the function's result.

main = interact $ unlines . filter ((<10) . length) . lines

respondPalindromes = unlines . map (\xs -> if isPalindrome xs then "palindrome" else "not a palindrome") . lines
    where isPalindrome xs = xs == reverse xs

main = interact respondPalindromes

Reading and Writing Files

import System.IO
main = do
    handle <- openFile "girlfriend.txt" ReadMode
    contents <- hGetContents handle
    putStr contents
    hClose handle

openFile :: FilePath -> IOMode -> IO Handle takes a file path and an IOMode and returns an I/O action that will open a file and have the file's associated handle encapsulated as its result.

data IOMode = ReadMode | WriteMode | AppendMode | ReadWriteMode

hGetContents takes a Handle and returns an IO String — an I/O action that holds as its result the contents of the file.

hClose takes a handle and returns an I/O action that closes the file.

bracket :: IO a -> (a -> IO b) -> (a -> IO c) -> IO c (from Control.Exception): its first parameter is an I/O action that acquires a resource, such as a file handle. Its second parameter is a function that releases that resource; it gets called even if an exception has been raised. The third parameter is a function that also takes that resource and does something with it.

withFile name mode f = bracket (openFile name mode)
    (\handle -> hClose handle)
    (\handle -> f handle)

bracketOnError :: IO a -> (a -> IO b) -> (a -> IO c) -> IO c (from Control.Exception) performs the cleanup only if an exception has been raised.

withFile :: FilePath -> IOMode -> (Handle -> IO a) -> IO a takes a path to a file, an IOMode and then it takes a function that takes a handle and returns some I/O action. What it returns is an I/O action that will open that file, do something we want with the file and then close it. The result encapsulated in the final I/O action that's returned is the same as the result of the I/O action that the function we give it returns.

import System.IO
main = do
    withFile "girlfriend.txt" ReadMode (\handle -> do
        contents <- hGetContents handle
        putStr contents)

(\handle -> ... ) is the function that takes a handle and returns an I/O action and it's usually done like this, with a lambda.

hGetLine, hPutStr, hPutStrLn, hGetChar, etc. work just like their counterparts without the h, only they take a handle as a parameter and operate on that specific file instead of operating on standard input or standard output.
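As a small sketch of the handle-taking variants, the program below reads the first line of a file with hGetLine and prints it to standard error with hPutStrLn. The file name first.txt and the helper firstLine are made up for illustration.

```haskell
import System.IO

-- Hypothetical helper: read only the first line of a file through its handle.
-- hGetLine reads the line eagerly, so it is safe for withFile to close the
-- handle straight afterwards.
firstLine :: FilePath -> IO String
firstLine path = withFile path ReadMode hGetLine

main :: IO ()
main = do
  writeFile "first.txt" "line one\nline two\n"  -- illustrative file name
  l <- firstLine "first.txt"
  hPutStrLn stderr l  -- like putStrLn, but writes to the handle we pass in
```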

Reading Files as Strings

readFile :: FilePath -> IO String takes a path to a file and returns an I/O action that will read that file (lazily, of course) and bind its contents to something as a string. It's usually more handy than doing openFile and binding it to a handle and then doing hGetContents.

import System.IO
main = do
    contents <- readFile "girlfriend.txt"
    putStr contents

writeFile :: FilePath -> String -> IO () takes a path to a file and a string to write to that file and returns an I/O action that will do the writing.

import System.IO
import Data.Char
main = do
    contents <- readFile "girlfriend.txt"
    writeFile "girlfriendcaps.txt" (map toUpper contents)

appendFile has a type signature that's just like writeFile, only appendFile doesn't truncate the file to zero length if it already exists but it appends stuff to it.

ToDo App - Append

import System.IO
main = do
    todoItem <- getLine
    appendFile "todo.txt" (todoItem ++ "\n")

main = do
    withFile "something.txt" ReadMode (\handle -> do
        contents <- hGetContents handle
        putStr contents)

For text files, the default buffering is line-buffering usually - the smallest part of the file to be read at once is one line.

For binary files, the default buffering is usually block-buffering. That means that it will read the file chunk by chunk. The chunk size is some size that your operating system thinks is cool.

hSetBuffering controls how exactly buffering is done. It takes a handle and a BufferMode and returns an I/O action that sets the buffering.

BufferMode is a simple enumeration data type and the possible values it can hold are: NoBuffering, LineBuffering or BlockBuffering (Maybe Int). The Maybe Int is for how big the chunk should be, in bytes. If it's Nothing, then the operating system determines the chunk size. NoBuffering means that it will be read one character at a time. NoBuffering usually sucks as a buffering mode because it has to access the disk so much.

main = do
    withFile "something.txt" ReadMode (\handle -> do
        hSetBuffering handle $ BlockBuffering (Just 2048)
        contents <- hGetContents handle
        putStr contents)

hFlush takes a handle and returns an I/O action that will flush the buffer of the file associated with the handle.
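One place where hFlush commonly matters is a prompt that does not end in a newline: when stdout is line-buffered, the text may not show up until the buffer is flushed. A minimal sketch, where the prompt helper is a hypothetical name and not something from the chapter:

```haskell
import System.IO

-- Hypothetical helper: print a prompt without a newline, then read a reply.
prompt :: String -> IO String
prompt msg = do
  putStr msg
  hFlush stdout  -- push the prompt out before we block waiting for input
  getLine

main :: IO ()
main = do
  name <- prompt "Your name: "
  putStrLn ("Hello, " ++ name)
```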

ToDo App - Removing

import System.IO
import System.Directory
import Data.List
main = do
    handle <- openFile "todo.txt" ReadMode
    (tempName, tempHandle) <- openTempFile "." "temp"
    contents <- hGetContents handle
    let todoTasks = lines contents
        numberedTasks = zipWith (\n line -> show n ++ " - " ++ line) [0..] todoTasks
    putStrLn "These are your TO-DO items:"
    putStr $ unlines numberedTasks
    putStrLn "Which one do you want to delete?"
    numberString <- getLine
    let number = read numberString
        newTodoItems = delete (todoTasks !! number) todoTasks
    hPutStr tempHandle $ unlines newTodoItems
    hClose handle
    hClose tempHandle
    removeFile "todo.txt"
    renameFile tempName "todo.txt"

openTempFile (from System.IO) takes a path to a temporary directory and a template name for a file and opens a temporary file.

We could have also done mapM putStrLn numberedTasks

We ask the user which one they want to delete and wait for them to enter a number.

removeFile (System.Directory) takes a path to a file (not handle) and deletes it.

renameFile (System.Directory) takes a path to a file (not handle) and renames it.

Command line arguments

getArgs :: IO [String] (from System.Environment) is an I/O action that will get the arguments that the program was run with and have as its contained result a list with the arguments.

getProgName :: IO String is an I/O action that contains the program name.

import System.Environment
import Data.List
main = do
    args <- getArgs
    progName <- getProgName
    putStrLn "The arguments are:"
    mapM putStrLn args
    putStrLn "The program name is:"
    putStrLn progName

Full ToDo App

A dispatch association list maps command-line arguments to functions of type [String] -> IO (). Each function takes the argument list as a parameter and returns an I/O action that does the viewing, adding, deleting, etc.

import System.Environment
import System.Directory
import System.IO
import Data.List
dispatch :: [(String, [String] -> IO ())]
dispatch = [ ("add", add)
           , ("view", view)
           , ("remove", remove)
           ]

main = do
    (command:args) <- getArgs
    let (Just action) = lookup command dispatch
    action args

add :: [String] -> IO ()
add [fileName, todoItem] = appendFile fileName (todoItem ++ "\n")

view :: [String] -> IO ()
view [fileName] = do
    contents <- readFile fileName
    let todoTasks = lines contents
        numberedTasks = zipWith (\n line -> show n ++ " - " ++ line) [0..] todoTasks
    putStr $ unlines numberedTasks

remove :: [String] -> IO ()
remove [fileName, numberString] = do
    handle <- openFile fileName ReadMode
    (tempName, tempHandle) <- openTempFile "." "temp"
    contents <- hGetContents handle
    let number = read numberString
        todoTasks = lines contents
        newTodoItems = delete (todoTasks !! number) todoTasks
    hPutStr tempHandle $ unlines newTodoItems
    hClose handle
    hClose tempHandle
    removeFile fileName
    renameFile tempName fileName
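The pattern matches in main above fail at run time if no arguments are given or the command is not in the dispatch list. One possible way to tighten that up, sketched here under the assumption of a stand-in command table (the "echo" entry and the resolve helper are hypothetical, not part of the app), is to turn the lookup into an Either and only run the action when it succeeds:

```haskell
import System.Environment (getArgs)

-- Stand-in table for illustration; the real one would hold add, view and remove.
dispatch :: [(String, [String] -> IO ())]
dispatch = [ ("echo", mapM_ putStrLn) ]  -- hypothetical placeholder command

-- Hypothetical helper: pick the action to run, or explain why we can't.
resolve :: [String] -> Either String (IO ())
resolve [] = Left "usage: todo <command> [args]"
resolve (command:args) =
  case lookup command dispatch of
    Just action -> Right (action args)
    Nothing     -> Left ("Unknown command: " ++ command)

main :: IO ()
main = getArgs >>= either putStrLn id . resolve
```

Keeping resolve pure means the decision about what to run can be checked without performing any I/O.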

Randomness

random :: (RandomGen g, Random a) => g -> (a, g) (from System.Random)

The RandomGen typeclass is for types that can act as sources of randomness.

The Random typeclass is for things that can take on random values.

random takes a random generator (that's our source of randomness) and returns a random value and a new random generator.

StdGen is a type that is an instance of the RandomGen typeclass.

We can either make a StdGen manually or we can tell the system to give us one based on a multitude of sort of random stuff.

mkStdGen :: Int -> StdGen creates a random generator. It takes an integer and based on that, gives us a (hardly) random generator.

ghci> random (mkStdGen 100) :: (Int, StdGen)
(-1352021624,651872571 1655838864)

ghci> random (mkStdGen 949488) :: (Float, StdGen)
(0.8938442,1597344447 1655838864)
ghci> random (mkStdGen 949488) :: (Bool, StdGen)
(False,1485632275 40692)
ghci> random (mkStdGen 949488) :: (Integer, StdGen)
(1691547873,1597344447 1655838864)

randoms takes a generator and returns an infinite sequence of values based on that generator.

ghci> take 5 $ randoms (mkStdGen 11) :: [Int]
[-1807975507,545074951,-1015194702,-1622477312,-502893664]

We could make a function that generates a finite stream of numbers and a new generator like this:

-- Eq n is needed to pattern match against the literal 0 (Num no longer implies Eq)
finiteRandoms :: (RandomGen g, Random a, Eq n, Num n) => n -> g -> ([a], g)
finiteRandoms 0 gen = ([], gen)
finiteRandoms n gen =
    let (value, newGen) = random gen
        (restOfList, finalGen) = finiteRandoms (n-1) newGen
    in  (value:restOfList, finalGen)

randomR :: (RandomGen g, Random a) => (a, a) -> g -> (a, g) takes as its first parameter a pair of values that set the lower and upper bounds and the final value produced will be within those bounds.

ghci> randomR (1,6) (mkStdGen 359353)
(6,1494289578 40692)

randomRs produces a stream of random values within our defined ranges.

ghci> take 10 $ randomRs ('a','z') (mkStdGen 3) :: [Char]
"ndkxbvmomg"

I/O Random

getStdGen is an I/O action, which has a type of IO StdGen. When your program starts, it asks the system for a good random number generator and stores that in a so-called global generator. getStdGen fetches you that global random generator when you bind it to something.

import System.Random
main = do
    gen <- getStdGen
    putStr $ take 20 (randomRs ('a','z') gen)

Just performing getStdGen twice will ask the system for the same global generator twice.

newStdGen splits our current random generator into two generators. It updates the global random generator with one of them and encapsulates the other as its result.

import System.Random
main = do
    gen <- getStdGen
    putStrLn $ take 20 (randomRs ('a','z') gen)
    gen' <- newStdGen
    putStr $ take 20 (randomRs ('a','z') gen')

reads is a safer alternative to read: use it if you don't want your program to crash on erroneous input. It returns an empty list when it fails to read a string. When it succeeds, it returns a singleton list with a tuple that has our desired value as one component and a string with what it didn't consume as the other.
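A minimal sketch of how reads can be wrapped so that bad input becomes a Nothing rather than a crash. The helper name readInt is made up for illustration.

```haskell
-- Hypothetical helper built on reads: Nothing on bad input instead of a crash.
readInt :: String -> Maybe Int
readInt s = case reads s of
  [(n, "")] -> Just n   -- the whole string was consumed
  _         -> Nothing  -- no parse, or some characters were left over

main :: IO ()
main = do
  print (readInt "42")                         -- Just 42
  print (readInt "4x")                         -- Nothing
  print (reads "7 apples" :: [(Int, String)])  -- [(7," apples")]
```

Requiring the leftover string to be empty is a design choice: it rejects input like "4x" that read would also refuse.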

Bytestrings

Processing files as strings tends to be slow. That overhead doesn't bother us so much most of the time, but it turns out to be a liability when reading big files and manipulating them.

Bytestrings are sort of like lists, only each element is one byte (or 8 bits) in size. The way they handle laziness is also different.

Strict bytestrings reside in Data.ByteString and they do away with the laziness completely: they represent a series of bytes in an array, and there are no thunks (the technical term for promise) involved.

Lazy bytestrings reside in Data.ByteString.Lazy - they're lazy, but not quite as lazy as lists - they are stored in chunks, each chunk has a size of 64K. Data.ByteString.Lazy has a lot of functions that have the same names as the ones from Data.List, only the type signatures have ByteString instead of [a] and Word8 instead of a in them.

import qualified Data.ByteString.Lazy as B
import qualified Data.ByteString as S

pack :: [Word8] -> ByteString takes a list, which is lazy, and makes it less lazy, so that it's lazy only at 64K intervals.

Word8 is like Int but has a much smaller range, namely 0-255. It represents an 8-bit number. It's in the Num typeclass… e.g. 5 can take the type of Word8.

ghci> B.pack [99,97,110]
Chunk "can" Empty
ghci> B.pack [98..120]
Chunk "bcdefghijklmnopqrstuvwx" Empty

If you try to use a big number, like 336 as a Word8, it will just wrap around to 80.
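The wrap-around can be checked directly with fromIntegral, which reduces the value modulo 256 when converting to Word8. A small sketch (toByte is a hypothetical helper name):

```haskell
import Data.Word (Word8)

-- Hypothetical helper: convert an Int to a byte.
-- Values outside 0..255 wrap around modulo 256.
toByte :: Int -> Word8
toByte = fromIntegral

main :: IO ()
main = do
  print (toByte 336)         -- 80, because 336 - 256 = 80
  print (maxBound :: Word8)  -- 255
  print (toByte 255 + 1)     -- 0: Word8 arithmetic wraps as well
```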

Empty is like the [] for lists.

unpack is the inverse function of pack. It takes a bytestring and turns it into a list of bytes.

fromChunks takes a list of strict bytestrings and converts it to a lazy bytestring.

toChunks takes a lazy bytestring and converts it to a list of strict ones.

ghci> B.fromChunks [S.pack [40,41,42], S.pack [43,44,45], S.pack [46,47,48]]
Chunk "()*" (Chunk "+,-" (Chunk "./0" Empty))

This is good if you have a lot of small strict bytestrings and you want to process them efficiently without joining them into one big strict bytestring in memory first.
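To see fromChunks and toChunks side by side, here is a short sketch that builds a lazy bytestring from two strict chunks and takes it apart again. The name lazyBS is just for illustration.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as B

-- A lazy bytestring assembled from two small strict chunks.
lazyBS :: B.ByteString
lazyBS = B.fromChunks [S.pack [40,41,42], S.pack [43,44,45]]

main :: IO ()
main = do
  print (map S.unpack (B.toChunks lazyBS))  -- [[40,41,42],[43,44,45]]
  print (B.unpack lazyBS)                   -- [40,41,42,43,44,45]
```

The chunk boundaries survive the round trip, while unpack sees one continuous sequence of bytes.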

cons is the bytestring version of :. It takes a byte and a bytestring and puts the byte at the beginning. It's lazy though, so it will make a new chunk even if the first chunk in the bytestring isn't full.

cons' is the strict version of cons which is better to use if you're going to be inserting a lot of bytes at the beginning of a bytestring.

ghci> B.cons 85 $ B.pack [80,81,82,84]
Chunk "U" (Chunk "PQRT" Empty)
ghci> B.cons' 85 $ B.pack [80,81,82,84]
Chunk "UPQRT" Empty
ghci> foldr B.cons B.empty [50..60]
Chunk "2" (Chunk "3" (Chunk "4" (Chunk "5" (Chunk "6" (Chunk "7" (Chunk "8" (Chunk "9" (Chunk ":" (Chunk ";" (Chunk "<" Empty))))))))))
ghci> foldr B.cons' B.empty [50..60]
Chunk "23456789:;<" Empty

The bytestring modules have a load of functions that are analogous to those in Data.List and System.IO (only Strings are replaced with ByteStrings).

If you're using strict bytestrings and you attempt to read a file, it will read it into memory at once! With lazy bytestrings, it will read it into neat chunks.

Let's make a simple program that takes two filenames as command-line arguments and copies the first file into the second file. Note that System.Directory already has a function called copyFile, but we're going to implement our own file copying function and program anyway.

import System.Environment
import qualified Data.ByteString.Lazy as B
main = do
    (fileName1:fileName2:_) <- getArgs
    copyFile fileName1 fileName2

copyFile :: FilePath -> FilePath -> IO ()
copyFile source dest = do
    contents <- B.readFile source
    B.writeFile dest contents

We make our own function that takes two FilePaths (remember, FilePath is just a synonym for String) and returns an I/O action that will copy one file into another using bytestring. In the main function, we just get the arguments and call our function with them to get the I/O action, which is then performed.

$ runhaskell bytestringcopy.hs something.txt ../../something.txt

Notice that a program that doesn't use bytestrings could look just like this, the only difference is that we used B.readFile and B.writeFile instead of readFile and writeFile. Many times, you can convert a program that uses normal strings to a program that uses bytestrings by just doing the necessary imports and then putting the qualified module names in front of some functions. Sometimes, you have to convert functions that you wrote to work on strings so that they work on bytestrings, but that's not hard.

Whenever you need better performance in a program that reads a lot of data into strings, give bytestrings a try, chances are you'll get some good performance boosts with very little effort on your part. I usually write programs by using normal strings and then convert them to use bytestrings if the performance is not satisfactory.

Exceptions

Exceptions make more sense in I/O contexts, because the outside world is so unreliable.

Pure code can throw exceptions too, but they can only be caught in the I/O part of our code (when we're inside a do block that goes into main). That's because you don't know when (or if) anything will be evaluated in pure code, because it is lazy and doesn't have a well-defined order of execution, whereas I/O code does.

Earlier, we talked about how we should spend as little time as possible in the I/O part of our program.

The logic of our program should reside mostly within our pure functions, because their results are dependent only on the parameters that the functions are called with.

When dealing with pure functions, you only have to think about what a function returns, because it can't do anything else.

This makes your life easier.

Even though doing some logic in I/O is necessary (like opening files and the like), it should preferably be kept to a minimum.

Pure functions are lazy by default, which means that we don't know when they will be evaluated and that it really shouldn't matter.

However, once pure functions start throwing exceptions, it matters when they are evaluated.

That's why we can only catch exceptions thrown from pure functions in the I/O part of our code.

And that's bad, because we want to keep the I/O part as small as possible. However, if we don't catch them in the I/O part of our code, our program crashes. The solution?

Don't mix exceptions and pure code. Take advantage of Haskell's powerful type system and use types like Either and Maybe to represent results that may have failed.
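As a rough sketch of what that looks like in practice, here are two pure functions that report failure through their return type instead of throwing. Both safeDiv and safeHead are illustrative names, not functions from the chapter.

```haskell
-- Hypothetical example: division that reports failure in its type.
safeDiv :: Int -> Int -> Either String Int
safeDiv _ 0 = Left "division by zero"
safeDiv x y = Right (x `div` y)

-- The first element of a list, without the exception head throws on [].
safeHead :: [a] -> Maybe a
safeHead []    = Nothing
safeHead (x:_) = Just x

main :: IO ()
main = do
  print (safeDiv 10 2)            -- Right 5
  print (safeDiv 1 0)             -- Left "division by zero"
  print (safeHead ([] :: [Int]))  -- Nothing
```

The caller has to pattern match on the result, so the failure case can't be forgotten the way an uncaught exception can.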

I/O exceptions are exceptions that are caused when something goes wrong while we are communicating with the outside world in an I/O action that's part of main.

...
contents <- readFile fileName
...
$ runhaskell linecount.hs i_dont_exist.txt
linecount.hs: i_dont_exist.txt: openFile: does not exist (No such file or directory)

Our program crashes.

What if we wanted to print out a nicer message if the file doesn't exist?

doesFileExist :: FilePath -> IO Bool (from System.Directory) checks if a file exists.

import System.Environment
import System.IO
import System.Directory
main = do (fileName:_) <- getArgs
          fileExists <- doesFileExist fileName
          if fileExists
              then do contents <- readFile fileName
                      putStrLn $ "The file has " ++ show (length (lines contents)) ++ " lines!"
              else do putStrLn "The file doesn't exist!"

Another solution here would be to use exceptions. It's perfectly acceptable to use them in this context. A file not existing is an exception that arises from I/O, so catching it in I/O is fine and dandy.

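catch :: IO a -> (IOError -> IO a) -> IO a (from System.IO.Error) takes two parameters: the first is an I/O action, and the second is the so-called handler. If the first I/O action passed to catch throws an I/O exception, that exception gets passed to the handler, which then decides what to do. In current versions of the base library this function is exported from System.IO.Error under the name catchIOError, with the same type; the old catch is no longer exported from that module.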

IOError is a value that signifies that an I/O exception occurred that also carries information regarding the type of the exception that was thrown.

We can't inspect values of the type IOError by pattern matching against them - how this type is implemented depends on the implementation of the language itself.

We can use a bunch of useful predicates to find out stuff about values of type IOError as we'll learn in a second.

import System.Environment
import System.IO
import System.IO.Error
-- catchIOError is the current name for System.IO.Error's old catch
main = toTry `catchIOError` handler

toTry :: IO ()
toTry = do (fileName:_) <- getArgs
           contents <- readFile fileName
           putStrLn $ "The file has " ++ show (length (lines contents)) ++ " lines!"

handler :: IOError -> IO ()
handler e = putStrLn "Whoops, had some trouble!"

Just catching all types of exceptions in one handler is bad practice in Haskell just like it is in most other languages.

Modify our program to catch only the exceptions caused by a file not existing.

import System.Environment
import System.IO
import System.IO.Error
-- catchIOError is the current name for System.IO.Error's old catch
main = toTry `catchIOError` handler

toTry :: IO ()
toTry = do (fileName:_) <- getArgs
           contents <- readFile fileName
           putStrLn $ "The file has " ++ show (length (lines contents)) ++ " lines!"

handler :: IOError -> IO ()
handler e
    | isDoesNotExistError e = putStrLn "The file doesn't exist!"
    | otherwise = ioError e

Everything stays the same except the handler, which we modified to only catch a certain group of I/O exceptions. Here we used two new functions: isDoesNotExistError and ioError.

isDoesNotExistError :: IOError -> Bool (from System.IO.Error) is a predicate over IOErrors.

ioError :: IOException -> IO a takes an IOError (a type synonym for IOException) and produces an I/O action that will throw it. The I/O action has a type of IO a, because it never actually yields a result, so it can act as IO anything.

If the exception thrown in the toTry I/O action isn't one caused by a missing file, the otherwise = ioError e guard re-throws it.

More predicates:

· isAlreadyExistsError

· isDoesNotExistError

· isAlreadyInUseError

· isFullError

· isEOFError

· isIllegalOperation

· isPermissionError

· isUserError

userError is used for making exceptions from our code and equipping them with a string, e.g. ioError $ userError "remote computer unplugged!". That said, it's preferred that you use types like Either and Maybe to express possible failure instead of throwing exceptions yourself with userError.

So you could have a handler that looks something like this:

handler :: IOError -> IO ()
handler e
| isDoesNotExistError e = putStrLn "The file doesn't exist!"
| isFullError e = freeSomeSpace
| isIllegalOperation e = notifyCops
| otherwise = ioError e

Where notifyCops and freeSomeSpace are some I/O actions that you define. Be sure to re-throw exceptions if they don't match any of your criteria, otherwise you're causing your program to fail silently in some cases where it shouldn't.

System.IO.Error also exports functions that enable us to ask our exceptions for some attributes, like what the handle of the file that caused the error is, or what the filename is. These start with ioe and you can see a full list of them in the documentation. Say we want to print the filename that caused our error. We can't print the fileName that we got from getArgs, because only the IOError is passed to the handler and the handler doesn't know about anything else. A function depends only on the parameters it was called with. Instead, we can use the ioeGetFileName function, which has a type of ioeGetFileName :: IOError -> Maybe FilePath. It takes an IOError as a parameter and maybe returns a FilePath (which is just a type synonym for String, remember, so it's kind of the same thing). Basically, it extracts the file path from the IOError, if it can. Let's modify our program to print out the file path that's responsible for the exception occurring.

import System.Environment
import System.IO
import System.IO.Error
-- catchIOError is the current name for System.IO.Error's old catch
main = toTry `catchIOError` handler

toTry :: IO ()
toTry = do (fileName:_) <- getArgs
           contents <- readFile fileName
           putStrLn $ "The file has " ++ show (length (lines contents)) ++ " lines!"

handler :: IOError -> IO ()
handler e
    | isDoesNotExistError e =
        case ioeGetFileName e of
            Just path -> putStrLn $ "Whoops! File does not exist at: " ++ path
            Nothing   -> putStrLn "Whoops! File does not exist at unknown location!"
    | otherwise = ioError e

In the guard where isDoesNotExistError is True, we used a case expression to call ioeGetFileName with e and then pattern match against the Maybe value that it returned. Case expressions are commonly used when you want to pattern match against something without bringing in a new function.

You don't have to use one handler to catch exceptions in your whole I/O part. You can just cover certain parts of your I/O code with catch (catchIOError in current versions of base) or you can cover several of them and use different handlers for them, like so:

main = do toTry `catchIOError` handler1
          thenTryThis `catchIOError` handler2
          launchRockets

Haskell offers much better ways to indicate errors in pure code than reverting to I/O to catch them.

Even when gluing together I/O actions that might fail, I prefer to have their type be something like IO (Either a b), meaning that they're normal I/O actions but the result that they yield when performed is of type Either a b, meaning it's either Left a or Right b.
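One way to get an action of that shape, sketched here with tryIOError from System.IO.Error, is to convert the exception into a Left at the point where it can occur. The helper countLines and its error messages are assumptions made for illustration.

```haskell
import System.IO.Error (tryIOError, isDoesNotExistError)

-- Hypothetical helper: count the lines in a file, returning a message
-- in a Left instead of throwing when the file can't be opened.
countLines :: FilePath -> IO (Either String Int)
countLines path = do
  result <- tryIOError (readFile path)
  return $ case result of
    Left e
      | isDoesNotExistError e -> Left ("no such file: " ++ path)
      | otherwise             -> Left (show e)
    Right contents -> Right (length (lines contents))

main :: IO ()
main = do
  r <- countLines "i_dont_exist.txt"
  print r  -- a Left with the message above if the file is absent
```

Because readFile is lazy, this only captures errors raised while opening the file; errors that happen later, while the contents are actually being read, would still surface as exceptions.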

Questions

“That's why in this case it actually reads a line, prints it to the output, reads the next line, prints it, etc”

Why does this process the whole file and not a line at a time?

main1 = do
    handle <- openFile "abc.txt" ReadMode
    hSetBuffering handle LineBuffering
    contents <- hGetContents handle
    putStr $ reverse contents
    hClose handle

cons' is the strict version of cons, but it lives in the lazy module, Data.ByteString.Lazy.

Explain: Pure code can throw exceptions too, but they can only be caught in the I/O part of our code (when we're inside a do block that goes into main). That's because you don't know when (or if) anything will be evaluated in pure code, because it is lazy and doesn't have a well-defined order of execution, whereas I/O code does.

“we can't pattern match against values of type IO something“?

Friday, February 10, 2012

LYAH Chapter 9b - More Input and Output
