We Are Now Generating Info That Requires Persistence
We have been opening and using files for many installments in this Cooking with Python and KBpedia series. However, in the past few installments, our methods have begun to generate serious results, some of them thousands of lines long. It is time for us to learn the basic commands for writing and reading files. We will also bring a new Python import into our workspace, the CSV module. It will be important for us as we begin to import the basic N3 and CSV files that underlie KBpedia’s build process.
Another point we weave through today’s installment is the usefulness of following Python’s lead in composing the names of our files and methods in a hierarchical and logically named way. Such patterns mean we can more readily find and save the information we want to keep persistent. This patterning, along with some directory structure guidance we will address a few installments from now, helps set up a logical way to manage and utilize the information assets in KBpedia (or your own domain extensions of it). Logical organization helps in a system designed for subset selections, semantic technology analysis, and machine learning, where the system itself is built from large, external files and we roundtrip with extractions from the current state of the graph.
But reading and writing files is not a new subject for this series. Since our early exposure to owlready2 and Jupyter Notebook we have been loading and inspecting parts of KBpedia. You may recognize this call from CWPK #19 regarding the smaller KKO (KBpedia Knowledge Ontology), kko.owl. We have not set up this notebook page sufficiently yet in this installment, so if you run this cell you will get an error:
onto.save(file = "C:/1-PythonProjects/kbpedia/sandbox/kko-test.owl", format = "rdfxml")
File Object and Methods
To get started, let’s first focus on the file object and the methods that may be applied to it in Python. A ‘file object’ has the following components:
- A physical file address, which can be referenced by a variable name
- A stated form of text encoding, which we standardize on as UTF-8 in KBpedia
- A method to be used when opening, whether read or write or append or others noted below.
A ‘file object’ is given a variable name, which in this section let’s simply call f. The close() method, like the other methods below, applies only to file objects. Thus, one cannot simply ‘open’ a physical file address without first associating that physical file with a file object.
Once a ‘file object’ is defined by assigning it a variable name, besides f.close() certain other methods may be applied to the object:
- f.read(size) – returns up to ‘size’ characters of text (or bytes, in binary mode); if size is omitted (f.read()), the entire file is returned
- f.readline() – returns a single line from the file, including the newline character (\n) that ends it
- f.write(string) – writes the contents, which must be a string or a string variable, and returns the number of characters written
- f.tell() – returns an integer giving the file object’s current position in the file
- f.seek(offset, whence) – resets the file object’s position by an offset from a reference point, which by convention is the beginning of the file (0), the current position (1) or the end of the file (2), with 0 the default
Calling help() on an open file object, such as help(file), lists these methods and many other more obscure ones.
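Here is a minimal sketch that exercises the file object methods listed above. The scratch file name and its two lines of text are made up for illustration, and a temporary directory is used so that nothing in your KBpedia sandbox is touched:

```python
import os
import tempfile

# Hypothetical scratch file so the sketch does not touch real KBpedia files
path = os.path.join(tempfile.mkdtemp(), 'methods-demo.txt')

with open(path, 'w', encoding='utf-8') as f:
    chars_written = f.write('first line\nsecond line\n')  # returns the character count

with open(path, 'r', encoding='utf-8') as f:
    start = f.tell()              # position 0 at the beginning of the file
    first_five = f.read(5)        # reads at most 5 characters
    rest_of_line = f.readline()   # reads up to and including the next '\n'
    f.seek(0)                     # offset 0 from the start of the file (the default)
    whole_file = f.read()         # no size given, so the entire file is returned

print(chars_written, start, repr(first_five), repr(rest_of_line))
```

Note that f.readline() picks up wherever the previous read left off, which is why it returns only the remainder of the first line here.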
The print() Function
A quite different but major way to get output from a Python program is the print() built-in function. Because it is often the first statement we learn in a language (‘Hello KBpedia!’), we tend to overlook the complexity and power of this function.
The print() function provides two kinds of output from Python. The first is the result from an evaluated expression (or code block), say the calculation of a formula. The second kind is for strings (that is, text and written numbers), which can be manipulated in many ways. Either output type may be directed to a file object. If all of the values passed to a print function are strings (or converted to such prior), then virtually all string (str) functions are available to manipulate the entire list of value information passed to the function.
I encourage you to keep an eye out for the many ways the print() function is used in code examples you might inspect.
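As a small illustration of that flexibility, the sketch below sends print() output to a file-like object using the function's standard sep, end and file arguments. An in-memory io.StringIO buffer stands in for a real open file, which is an assumption made only to keep the example self-contained:

```python
import io

buffer = io.StringIO()   # in-memory stand-in for an open text file

total = 2 + 3
print('Sum:', total, file=buffer)                # values are joined with a space
print('KKO', 'KBpedia', sep='||', file=buffer)   # custom separator between values
print('no newline', end='', file=buffer)         # suppress the trailing '\n'
print('Hello KBpedia!'.upper(), file=buffer)     # str methods apply before printing

output = buffer.getvalue()
print(output)
```

The same calls work unchanged if buffer is replaced by a file object opened for writing.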
Open, Close and Reading Files
It may not be obvious, but the Owlready2 get_ontology(file) method is a wrapper around the standard Python file open method. Here is how the open method works with different formats:
file = open('C:/1-PythonProjects/kbpedia/sandbox/kko.owl','r', encoding='utf8')
print(file)
FileNotFoundError Traceback (most recent call last)
<ipython-input-1-70fb0322830f> in <module>
----> 1 file = open('C:/1-PythonProjects/kbpedia/sandbox/kko.owl','r', encoding='utf8')
FileNotFoundError: [Errno 2] No such file or directory: 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
Now, if you repeat the command above, but remove the encoding argument and run again, you are likely to get an output format that indicates encoding='cp1252'. This kind of default encoding assignment can be DISASTROUS in working with KBpedia, where all input files and all output files must be in the UTF-8 encoding. It is best practice to specify encoding directly whenever opening or writing to files.
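To see why the default matters, here is a minimal sketch that writes a short UTF-8 string to a hypothetical scratch file and then reads it back twice, once correctly and once with cp1252 forced in order to simulate a wrong default. The first line reports whatever your own platform would use when the encoding is omitted:

```python
import locale
import os
import tempfile

# The default used when encoding is omitted depends on platform and settings
print(locale.getpreferredencoding(False))

path = os.path.join(tempfile.mkdtemp(), 'encoding-demo.txt')  # hypothetical scratch file
with open(path, 'w', encoding='utf-8') as f:
    f.write('caf\u00e9')          # 'café', written as UTF-8 bytes

with open(path, 'r', encoding='utf-8') as f:
    correct = f.read()            # decoded with the encoding it was written in

with open(path, 'r', encoding='cp1252') as f:   # simulates a wrong default
    garbled = f.read()            # the two UTF-8 bytes of 'é' become two characters

print(ascii(correct), ascii(garbled))
```

No error is raised in the second read, which is what makes this kind of mismatch so easy to miss until the corrupted text shows up downstream.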
Here is a slightly different format for opening a file, now using a logical filename and the file object method of read():
filename = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
file = open(filename, 'r', encoding='utf8', errors='ignore')
file.read()
print(file)
And, here is a third format using the ‘with open’ pattern of nested statements:
with open('C:/1-PythonProjects/kbpedia/sandbox/kko.owl') as file:
    file.read()
    print(file)
Oops! That did not work. Because we did not specify our encoding, we again get the default. We need to make the encoding explicit. Another thing to look out for is the separation of arguments by commas, which, if missing, will throw another error:
with open('C:/1-PythonProjects/kbpedia/sandbox/kko.owl', encoding='utf-8') as file:
    file.read()
    print(file)
The with open statement above is the PREFERRED option because the ‘with’ format includes a graceful ‘close’ of the file, whether the with open routine completes normally or something interrupts your work. Under the other options, a file can be left in an open state when a program terminates unexpectedly, possibly affecting the file integrity.
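We can check this graceful close for ourselves. The sketch below uses a hypothetical scratch file and a deliberately raised error (standing in for any interruption) and then inspects the file object's closed attribute:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'with-demo.txt')  # hypothetical scratch file

with open(path, 'w', encoding='utf-8') as file:
    file.write('Hello KBpedia!\n')
closed_after_block = file.closed          # True: leaving the block closed the file

try:
    with open(path, 'r', encoding='utf-8') as file:
        raise RuntimeError('simulated interruption')
except RuntimeError:
    closed_after_error = file.closed      # True: closed even though an error occurred

print(closed_after_block, closed_after_error)
```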
In looking across the options above, let’s make a couple of other points besides the need to specify the encoding. The first is that the file is opened by default in 'r' read-only mode. You can see that mode specified explicitly in some of the examples above. Use 'w' when you wish to write to or overwrite the file, and 'a' for append when you wish to add to the end of a file. Here are the main file mode options:
- 'r' – opens a file for reading only
- 'r+' – opens a file for both reading and writing
- 'rb' – opens a file for reading only in binary format
- 'rb+' – opens a file for both reading and writing in binary format
- 'w' – opens a file for writing only; the file is created if it does not exist, and any existing contents are overwritten
- 'a' – opens a file for appending; the file is created if it does not exist
- 'a+' – opens a file for reading and appending; the file is created if it does not exist.
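The practical difference between these modes is easiest to see by running them in sequence. This is only a sketch against a hypothetical scratch file, but it shows 'w' discarding earlier contents, 'a' adding to the end, and 'r+' allowing both reading and writing on an existing file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'mode-demo.txt')  # hypothetical scratch file

with open(path, 'w', encoding='utf-8') as f:    # 'w' creates the file
    f.write('one\n')
with open(path, 'w', encoding='utf-8') as f:    # 'w' again: earlier contents are discarded
    f.write('two\n')
with open(path, 'a', encoding='utf-8') as f:    # 'a': new text goes at the end
    f.write('three\n')
with open(path, 'r+', encoding='utf-8') as f:   # 'r+': read and write; file must exist
    before = f.read()
    f.seek(0, 2)                                # move to the end of the file (whence=2)
    f.write('four\n')
with open(path, 'r', encoding='utf-8') as f:
    after = f.read()

print(repr(before), repr(after))
```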
Another thing to observe is that Python may accept more than one name alias for the encoding. Our examples above, for example, use both 'utf8' and 'utf-8' for the encoding argument.
Also, as I admonished early in this CWPK series, try to always assign logical names to your physical file paths. As I noted earlier, there are tricky ways that Windows handles file names versus other operating systems and keeping (and then testing) proper file recognition in a separate assignment means you can develop and work with your code without worrying about file locations and paths. You may also do things programmatically to update or change the file referent for these logical names such that the actual file opened may be specified to a different physical location depending on context.
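One way to keep logical names separate from physical locations is sketched below with the standard pathlib module. The variable names and the describe() helper are hypothetical, chosen only for illustration; the base directory is the sandbox path used throughout this installment:

```python
from pathlib import Path

# Hypothetical layout: one base directory, with logical names built from it
base_dir = Path('C:/1-PythonProjects/kbpedia/sandbox')
kko_file = base_dir / 'kko.owl'
write_test_file = base_dir / 'write-test.txt'

def describe(path):
    """Report whether a logical name currently points at a real file."""
    return f"{path.name}: {'found' if path.is_file() else 'NOT found'}"

# Test file recognition once, up front, before the rest of the code relies on it
for logical_name in (kko_file, write_test_file):
    print(describe(logical_name))
```

Changing base_dir in one place then redirects every logical name that was built from it.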
A last thing to notice is that things like encoding or mode need not be specified as arguments in a given method command. When a default value is given at the time the method is defined (notably something to inspect for ‘built-in’ Python methods such as open), that argument can be left out of what is actually written in code, with the default assignment being used. It is thus important to understand the commands you use, the options you may assign directly as arguments, and the defaults they have. Whenever you get into trouble, first try to understand the full scope of the statements and their arguments available to you using the help() function.
Proper exiting of an application or writing to file generally requires you to close() the files you have opened. Again, if you open with the with open pattern, you should generally close gracefully. Nonetheless, here is the formal command, taking advantage of the fact we gave the physical file the logical name of ‘file’:
file.close()
Well, apparently we have the KKO file object loaded, and we have seen the system recognize the file, but we still see nothing about what the file contains. Generally, of course, we need not inspect contents so long as our programs can access and use the data in them. But in some cases, like now when we are developing routines and we are validating steps, we want to make sure everything has opened properly for reading or to receive our outputs.
At this point, we believe by running the cells above, that we have the kko.owl file in memory using the UTF-8 encoding. Let’s test this premise.
filename = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
file = open(filename, 'r', encoding='utf8')
print(file.read())
Again, while we specified the 'r' reading option, that was strictly unnecessary since that is the default for the argument. But, if in doubt, there is no harm in specifying again.
Here is another format, this time looping through a file line-by-line, now using an explicit for loop and using a logical filename for our physical file address:
filename = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
with open(filename, encoding='utf-8') as file:
    lines = file.readlines()
    for line in lines:
        print(line)
Hmmm, now that is interesting. The output seems to skip every other line. That is because each line read from the file keeps the newline character at its end, as we noted above under the file discussion, and print() then adds a newline of its own. There is a string method for taking care of that, .rstrip(), which we add to our routine:
filename = 'C:/1-PythonProjects/kbpedia/sandbox/kko.owl'
with open(filename, encoding='utf8') as file:
    lines = file.readlines()
    for line in lines:
        print(line.rstrip())
The latter iteration option results in us being able to manipulate a string object in the line-by-line display, which means we may invoke many str options. Besides the example, two related methods are .lstrip() to remove leading whitespace and .strip() to remove both leading and trailing whitespace.
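The three stripping methods are easy to compare side by side. In this sketch an in-memory io.StringIO object stands in for a text file, and the two sample lines are invented for illustration:

```python
import io

# In-memory stand-in for a text file; each line keeps its trailing '\n'
sample = io.StringIO('  <owl:Class>  \n\t<rdfs:label>KKO</rdfs:label>\n')

raw_lines = sample.readlines()
right_stripped = [line.rstrip() for line in raw_lines]   # trailing whitespace removed
left_stripped = [line.lstrip() for line in raw_lines]    # leading whitespace removed
fully_stripped = [line.strip() for line in raw_lines]    # both ends removed

for raw, clean in zip(raw_lines, fully_stripped):
    print(repr(raw), '->', repr(clean))
```

If indentation or trailing spaces matter, line.rstrip('\n') removes only the newline and leaves other whitespace alone.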
There are as many ways to iterate through the lines of a text file as there are ways to specify loops and iteration sequences in Python using for, while and other iteration forms. Also, there are many ways to conduct string manipulations including case changes, substitutions, counts, character manipulation, etc. To see some of these string (str) options, let’s try the dir() command again:
dir(str)
Like the options for reading a file, there are a number of ways to write output to a file.
In the ‘write’ examples below I have switched our file variable name from ‘file’ to ‘my_file’. There are some Python keywords that you may not use as variable names (they will throw an error if so used), and ‘file’ is NOT one of them (search on ‘Python keywords’ to find a listing of them). However, ‘file’ is also a not uncommon argument name for some methods, including for the print() function, so a distinct variable name helps avoid confusion.
Some programmers shorten such variable references to single letters, as we did ourselves in the last installment (where we shortened a variable prefix to a_). That style is OK for generic routines and ones perhaps using internal standards, but more descriptive variable names are helpful when your code is being used for learning or heuristic purposes, as is the case here.
OK, so let’s look at some of these writing options:
filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'
my_file = open(filename, 'w', encoding='utf8')
print('Hello KBpedia!', file=my_file)
my_file.close()
Note in this form we are continuing to specify the encoding and have changed the default 'r' argument switch to 'w' because we now want to be able to write to the file. (Note also we have changed the filename to something other than our existing files so that we do not inadvertently overwrite them.) We also need a close() statement to complete the write action and to properly close the file. After you run this cell, go to your standard directory where you first stored your local knowledge graphs and see the print statement in the new file.
The next format uses our preferred form (though if the file is only being created and opened for immediate writing the above form is fine):
filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'
with open(filename, 'w', encoding='utf8') as my_file:
    print('Hello KBpedia, again!', file=my_file)
Once you have made your file declarations, you may also just write your statements as generated to the file. Notice for the third write statement below that we needed to mix our single and double quotes in order to include a possessive apostrophe in the statement.
filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'
my_file = open(filename, 'w', encoding='utf8')
my_file.write('Slipping in a reference to KBpedia.')
# More Python stuff
my_file.write('And, then, another reference to KBpedia.')
# More Python stuff
my_file.write("Because I can't stop talking about this stuff!")
# More Python stuff
my_file.close()
But, when we run this cell, we find the file has its text all on one line. Since we don’t want that, we make modifications to the output statement, similar to what we might do for print():
filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'
my_file = open(filename, 'w', encoding='utf8')
my_file.write('Slipping in a reference to KBpedia.', '\n')
# More Python stuff
my_file.write('And, then, another reference to KBpedia.', '\n')
# More Python stuff
my_file.write("Because I can't stop talking about this stuff!", "\n")
# More Python stuff
my_file.close()
Grrr, I guess the .write method does not work the same as print(). But the type error indicates we can only pose a single argument to the statement, so we need to get rid of the second argument designated by the comma. Since we are working only with strings here, we can concatenate to get our statement down to a single argument (because what is first evaluated is between the parentheses, which results in a single value passed to the call):
filename = 'C:/1-PythonProjects/kbpedia/sandbox/write-test.txt'
my_file = open(filename, 'w', encoding='utf8')
my_file.write('Slipping in a reference to KBpedia.' + '\n')
# More Python stuff
my_file.write('And, then, another reference to KBpedia.' + '\n')
# More Python stuff
my_file.write("Because I can't stop talking about this stuff!" + "\n")
# More Python stuff
my_file.close()
Better, that is more like it as our output file is formatted as we desire.
You can also generate write statements that join together strings in various ways (this snippet does not work alone):
my_file.write(' '.join(('Hello KBpedia!', str(var2), 'etc')))
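For a version of the join approach that does run on its own, here is a minimal sketch. The value of var2 is hypothetical, and an in-memory io.StringIO buffer stands in for a file opened with 'w':

```python
import io

my_file = io.StringIO()   # in-memory stand-in for a file opened with 'w'
var2 = 42                 # hypothetical non-string value

# Space-joined fields on one line, with an explicit newline at the end
my_file.write(' '.join(('Hello KBpedia!', str(var2), 'etc')) + '\n')

# Several lines at once: join with '\n', then add the final newline
lines = ['Slipping in a reference to KBpedia.',
         "Because I can't stop talking about this stuff!"]
my_file.write('\n'.join(lines) + '\n')

written = my_file.getvalue()
print(written)
```

Because join() only accepts strings, non-string values such as var2 must be converted with str() first.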
This brief overview points to either file object methods or the print function as two ways to get output out of your programs. Further, within each of these two major ways there are many styles and approaches that might be taken to get to your desired output goal.
In the case of KBpedia where we use flat CSV files as our canonical exchange form, which are themselves by definition built from strings, we will tend to use the write() function as our preferred way to prepare our strings for output. However, when reading the external files, we tend to use the file object read methods.
Let me offer one final note on output considerations: Since we have only a relatively few generic processing steps for either extracting or building KBpedia, but ones that repeat across multiple modules or semantic object types, we will try to find ways to compose our file names from meaningful building block elements for consistency and understandability. We will start to see this in a fragmented way initially with our function and output definitions. When we get to the project-wide considerations, though, toward the concluding installments, we will be consolidating these fragments and building block considerations in a way that hopefully makes overall sense.
Using the CSV Module
Though CSV files are easy to generate, manage, and inspect, and there is a formal specification in RFC 4180, actual implementation is more like the Wild West. Delimiters other than commas or tabs may be used (semi-colons, etc.) to separate values. Specific purposes may add local specifications, such as the ‘double pipe’ (‘||’) convention we have adopted for multiple entries in a cell. Treatment of quoted strings, including what to quote and how to quote, may differ between applications. We have also discussed the importance of standard encodings; failing to use them may lead to disastrous file corruption.
CSV files in implementation have a standard layout of rows and columns, which is good, sometimes with headers and sometimes not. Though Microsoft Excel is a huge application for CSV files, Excel does not use UTF-8 as its standard and sometimes does other interesting things to its cell contents. It would be nice, for example, to have recognized templates that would enable us to move from one CSV environment to another. At minimum, we want to impose rigor and consistency to how we handle CSV files to prevent encoding mismatches or other discontinuities.
To help overcome some of these challenges we are using the Python csv module. Let’s first look and explore what functions this module has:
import csv
dir(csv)
Here are some of the attractive features of using the CSV module as our intermediary for data exchange. The CSV module:
- Uses the same file object functions as standard Python, including an expanded
- Recognizes ‘dialects’, which are templates of processing specifications that can be defined or linked to existing applications like Excel
- Has a sniffer function to discover dialect regularities in new, wild files
- Allows different quoting stringency levels to be set (all strings, multi-word strings, etc.)
- Allows different delimiters to be set
- Allows headers to be used or not
- Recognizes field names for specific data columns
- Enables Python dictionaries to mediate field names to master data.
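A few of these features can be seen together in a short sketch. The semicolon-delimited sample rows are invented for illustration, and the sniffer is told which candidate delimiters to consider; with messier real-world files its guess may differ, so treat the result as a suggestion to verify:

```python
import csv
import io

# Hypothetical 'wild' sample that uses semicolons rather than commas
wild_sample = (
    'id;prefLabel\n'
    'kko:Generals;Generals\n'
    'kko:Particulars;Particulars\n'
)

# Let the sniffer work out the dialect from a restricted set of candidates
dialect = csv.Sniffer().sniff(wild_sample, delimiters=';,\t')
rows = list(csv.reader(io.StringIO(wild_sample), dialect))

# Re-write the rows as comma-delimited text with every field quoted
buffer = io.StringIO(newline='')
writer = csv.writer(buffer, delimiter=',', quoting=csv.QUOTE_ALL)
writer.writerows(rows)
normalized = buffer.getvalue()

print(repr(dialect.delimiter))
print(normalized)
```

When working with real files instead of in-memory buffers, the csv documentation recommends opening them with newline='' (and, for KBpedia, encoding='utf-8') so that line endings are handled by the module itself.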
Though we have not yet come to our ingestion (build) steps, when we do we will need some fields to hold multiple items under a single field name, and to split those items using a different delimiter (‘||’). This and the master data dictionary aspects look promising.
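As a rough preview of how that might look, the sketch below reads rows with csv.DictReader and splits one column on the ‘||’ delimiter. The column names and the two sample rows are hypothetical and are not taken from the actual KBpedia build files:

```python
import csv
import io

# Hypothetical extract: 'altLabel' holds several values separated by '||'
csv_text = (
    'id,prefLabel,altLabel\n'
    'rc:Mammal,Mammal,mammals||Mammalia\n'
    'rc:Bird,Bird,\n'
)

records = []
# With a real file, open(filename, encoding='utf-8', newline='') would replace StringIO
for row in csv.DictReader(io.StringIO(csv_text, newline='')):
    # Split the multi-valued cell into a list; an empty cell becomes an empty list
    row['altLabel'] = row['altLabel'].split('||') if row['altLabel'] else []
    records.append(row)

for record in records:
    print(record['id'], record['altLabel'])
```

The header row supplies the dictionary keys, so each field can be addressed by name rather than by column position.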
Saving from Jupyter Notebook
To get output from your Jupyter Notebook, pick File → Download as to get full notebook outputs in nine different text-based formats and PDF. Of course, individual cells may also have their code blocks outputted via the Python functions discussed above.
Reading List and Additional Documentation
There are many fine online series and books and many excellent printed ones with basic Python documentation. Citations at the bottom of many of these CWPK installments have links to some of them.
If you are to follow this series closely I heartily recommend that you do so with a printed Python manual by your side that you can consult for specific commands or functions. I have spent some time looking, and have yet to find a single ‘go-to’ source for Python information. My most frequent sources are:
Mark Lutz, Learning Python, 3rd Edition, 2008. O’Reilly Media, Inc., Sebastopol, CA, 706 pp. ISBN: 978-0-596-51398-6
David Beazley and Brian K. Jones, Python Cookbook: Recipes for Mastering Python 3, 3rd Edition, 2013. O’Reilly Media, Inc., Sebastopol, CA, 692 pp. ISBN: 978-1-449-34037-7
Bill Lubanovic, Introducing Python: Modern Computing in Simple Packages, 2nd Edition, 2020. O’Reilly Media, Inc., Sebastopol, CA, 602 pp. ISBN: 978-1-492-05136-7
Here are additional links useful to today’s CWPK installment: