Improving my Information Infrastructure
I find it inspiring to read about how people manage their personal information infrastructure. From the elaborate setup that Stephen Wolfram has built to the simple system that Jethro Kuan uses, I find each one instructional.
I enjoy bespoke setups, in which components of the infrastructure have been customized to closely match the needs of the individual using it. In particular, I like setups that use FAIR data principles1, i.e., Findable, Accessible, Interoperable, Reusable. For an example, see Kieran Healy’s guide to plain text social science (PDF).
Why am I interested in things like this? I take a lot of notes, collect snippets and store many TBs of text and media files. I need a setup that is fast and can navigate my digital hoard, and I want both the setup and the data to be under my control.
Why not use all that wonderful cloud-based software? It was designed exactly for these needs, right?
The problem with BigCorp
While there are commercial set-ups that offer fairly comprehensive infrastructure of their own for needs like this, these conveniences come at a price, and I don’t mean just high subscription costs. Here is a non-exhaustive list of the issues I have with the software services that I depended on:
- Proprietary Formats: The only place you can view and edit your data is from within these systems. This is reasonable when the software is so unique that it needs its own file format, but quite often, companies force formats onto users only to keep them trapped within their own ecosystems.
- Interoperability: Playing nice with other software in the user’s workflow is critical, but competitors rarely play nicely with each other, and customers are caught between their turf wars and miserly API strategies.
- Other types of Lock-in: The previous two points are just the tip of the iceberg. There are ripple effects to tech lock-ins. When a platform locks its users in (I’m looking at you LinkedIn, you swine), the lock-in extends into psychological and cultural spaces. I have to log in to the platform to be part of my group, otherwise I will be excluded and forgotten. (Instead of repeated congressional hearings, shouldn’t governments make distributed social networks a mandatory framework for social products?)
- Dark Patterns and Self-Serving Workflows: Many services use Dark Patterns to trick you into doing things a certain way, almost always to the benefit of the company. When a service prescribes a certain workflow, it is almost inevitably to cross-sell another product in their stable, and not because it simplifies things for the user. There is also the behavioural side – you get used to the workflow of the service, instead of building your own. When the service goes out of your life, your workflow evaporates, leaving you with bupkis. Ask anyone who used Wunderlist or, say, Google Notebook.
- Snooping on, tampering with, losing user data: Service providers have blatantly tampered with user files (Amazon deleting ebooks off users’ devices), locked paying customers out of their own data (Google locking users out for trivial reasons), and changed policies to allow snooping on user data (the Evernote controversy) – and this barely scratches the surface of the atrocities committed against users over the years. See Karl Voit’s post for a damn good collection of examples. It blows my mind that these are not considered serious crimes.
- Simply disappearing one day: I’ve lost count of the number of services that I invested time and energy into, learning keyboard shortcuts, adapting workflows, storing my hard-won data, only to wake up one day and find them… just gone. Some go out of business, others are acquired and then shut down. Some just die because a tech giant killed a thriving open format, and they no longer have a reason to exist.
There are many more reasons, and while commercial setups often provide incredibly useful software that is beautifully designed and convenient, they are sometimes also a Faustian bargain.
Building my own FAIR system
My own personal information management needs are easy to describe. Broadly, there are two types of information that I manage day to day – short-term and long-term.
Short-term information is about things that I will not care about in a year or two. Five years for some of it, probably, just to be safe. Examples are project notes and plans, business CRM information, meeting notes, emails and all the detritus that accumulates as we wade through life.
Long-term information is about things with a longer horizon, some of which I will care about for the rest of my life. Examples are learning and upskilling notes, reading notes, my physical and digital libraries, my personal CRM, journals and the sort of things that will shape me over time.
I prefer to keep these separate, both as concerns and physically, in different parts of my file system.
Plain Text as the foundation
Sometime around 2010, I began to migrate away from database-driven information storage to the files-in-a-folder paradigm wherever possible. For example, instead of storing all my notes in Evernote, I saved each note as a plain text file in a folder. Another example is email, for which I use the Maildir format which stores each email I receive as a plain text file in a local folder.
Now in 2024, large parts of my Personal Information Infrastructure exist either in plain text, or in some lightweight format like XML or JSON. This includes both my short-term and long-term information. All of it is indexed in some form or the other, which means I can retrieve what I need reasonably fast. The benefits of my setup as of today are:
- Privacy and Security – almost all of my infrastructure is no longer on the cloud, or in the databases of commercial providers.
- Easy to back up and version control – it’s all just text files, with no risk of database corruption.
- Portable and Interoperable – my information is accessible to me on any device I own, it’s lightweight, and it’s very easy to move between applications – no import/export hassles.
- Easy to query and retrieve data – I have many different ways to find and retrieve data fairly fast.
- I own my data completely – but I’m also responsible for managing and securing it.
This is a massive leap forward, and solves most of the problems that I had with commercial providers. In fact, not only have I replicated the features that I need, my setup has in some cases outperformed these original services. Still, I feel like I can do better.
Here, searching for the term Procedural Memory using ripgrep fetches results even as I type, from a folder that contains thousands of notes files. There is no wait time. This is a good example of why my system is better today. Being able to add the best tool for each of my needs, beats having to live with mediocre ones in ‘feature rich’ commercial software.
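For a sense of how little machinery this kind of search needs, here is a rough (and much slower) Python sketch of what ripgrep is doing: walk a notes folder recursively and collect lines that contain a term. The folder layout and the `*.txt` glob are assumptions for illustration.

```python
from pathlib import Path

def search_notes(root, term):
    """Return (path, line_number, line) tuples for lines containing `term`."""
    hits = []
    for path in Path(root).rglob("*.txt"):
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file; skip it
        for number, line in enumerate(text.splitlines(), start=1):
            if term.lower() in line.lower():
                hits.append((path, number, line.strip()))
    return hits
```

In practice ripgrep is the right tool for thousands of files; the sketch only shows that the underlying idea is simple enough to own.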
The Goal
The goal is to prepare a corpus of personal information in text file format, that I can use, once personal LLMs like Khoj become a little easier to set up. Having a personal LLM is going to be amazing, and all of this will hopefully culminate in having my own little Jarvis answering questions using my own data – for an audience of one.
Can I do better?
While I wait for personal LLMs to become more commonplace, here is a list of minor improvements I want to make in the meantime.
Learn Touch Typing
With all this talk of text this and text that, I still hunt and peck. Learning touch typing has been an uphill task for me so far, because I’m pushing against 30 years of bad typing habits. I’m now on my third attempt at learning it.
Get better at using Text Editors, particularly learn about the Modal Editing paradigm
My first text editor was Notepad, on Windows, and when I moved to Linux around 2005, I moved to Gedit. I love text editors because they are so fast! I can be in and out before a word processor has even created a new file. I have since tried Atom (now dead), Sublime and more recently VS Code. (Nice software, smooth experience, but oh god, the UI clutter!) However, I’d always fall back on Gedit, which I had packed with useful plugins and snippets. It wasn’t until I discovered Emacs that I understood what I was really missing out on. Emacs lets me perform surgery on text without ever having to touch my mouse. Using macros to transform large bodies of text (manually converting a csv file into a ledger file, for instance) was a revelation. Ditto for niceties such as undo within a selected region, amazing search and replace, and so on.
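That csv-to-ledger transform can also be done as a small script instead of a keyboard macro. A minimal Python sketch follows; the three-column layout (date, payee, amount) and the account names are assumptions for illustration, and a real bank export would need adjusting.

```python
import csv
import io

def csv_to_ledger(csv_text):
    """Turn 'date,payee,amount' rows into plain-text ledger entries."""
    entries = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        date, payee, amount = row
        entries.append(f"{date} {payee}\n"
                       f"    Expenses:Misc    {amount}\n"
                       f"    Assets:Checking\n")
    return "\n".join(entries)
```

The macro approach still wins for one-off, irregular files; a script pays off once the same transform recurs every month.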
Surgical strike on text. A contrived example using bad poetry to illustrate how I can jump around in a screenful of text in Emacs, transpose words, find and replace characters, cut sections and paste them elsewhere and so on. I don’t need to touch the mouse.
Now I’ve been using Emacs for about 3 years and I’m curious about modal editing, as offered by editors like vi and Kakoune. Those who use modal editing speak highly of it, and yet it seems so counter-intuitive. Why on earth would I need to jump in and out of modes just to insert text? Isn’t it natural to edit and type in just one editing interface? Still, modal editing allows a composition of commands that I have never encountered before, and I should at least try to learn a little more. (See this video.)
For now, I know about 5 vim commands out of necessity. I can just about create a new file or edit an existing one, jump into and out of insert mode, edit my text, save and quit. That’s it.
In 2024, I’d like to learn a little more of Vim, and once I’m a little comfortable, I’d like to try Doom Emacs, which combines vim-style modal editing with all the power of Emacs. Given how disruptive this could be to my current system, I consider modal editing a low-priority need – a novelty that I may explore if I have the time.
Search my old mail archives
I currently use Gmail for both my personal and business email. The email for our foundation is hosted elsewhere. I use mbsync to download all my email in Maildir format, and use mu to index it.
This is what my email looks like in mu4e. It’s all locally indexed which means I can search thousands of emails to find what I need quickly. Processing email is also very fast, as you can see in this example – I have the search results of a bunch of marketing emails. I open one to see if it is useful, then delete the rest of them lickety-split.
I archive my email every 5 years or so, which means I have old Maildir directories lying around that I’d like to search occasionally. I think I could use Evolution and point it at the directories, but what I’d like is to use notmuch. I already use mu4e on my active maildir, and I’m eager to see if I can set up notmuch to search my archived ones.
Getting better at the Terminal
I’ve been afraid of the terminal for a long time, which is ironic given that my computing life started at a DOS prompt in 1993. I only ever jump into the command line to do administrative tasks, like starting or stopping certain services, but I am intimidated by it. So, this is something that I’d like to overcome. I have started navigating the file system in the CLI, searching and reading text files using commands like grep and cat. I’ve also installed fzf to make working in the terminal a little easier. My goal is to be comfortable searching, reading, editing and operating on text files in the CLI.
Syncing, Backups and Version Control
Ever since I moved to Syncthing a couple of years ago, my inter-device data syncs have been pretty sweet. I moved when I realized that Dropbox would set me back by Rs.15k a year, and my paid-for Google Drive storage was also creeping up to full. Today, Syncthing does all my syncing for me. For backups, I use rsync to back up to an external HDD every night. I’m happy with the setup so far. For version control, I’ve experimented with Bazaar in the past. I don’t really need version control except for some edge cases, such as long-form articles or complex proposal writing. I’d like to learn how to use Git, because it’s everywhere I look.
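The nightly rsync step is also a gentle first target for the scripting I want to learn. A sketch of it as a thin Python wrapper; the source and destination paths are placeholders, and note that `--archive` plus `--delete` makes the destination an exact mirror, so it must point at the right disk.

```python
import subprocess

def build_backup_cmd(source, destination):
    # --archive keeps permissions and timestamps; --delete mirrors removals.
    return ["rsync", "--archive", "--delete", "--human-readable",
            source, destination]

def run_backup(source, destination):
    """Run rsync and return its exit code (0 means success)."""
    return subprocess.run(build_backup_cmd(source, destination)).returncode
```

Wrapping the command this way makes it easy to later add logging, or a check that the external HDD is actually mounted before mirroring into an empty mount point.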
Scripting for Automation
I like to automate or at least semi-automate a bunch of things in my information setup. For instance, I use text-expansion to do a lot of things, from generating file names to creating email replies. I use helper tools like Kupfer to do things like append to text files, run searches from anywhere and so on.
The benefit of semi-automating file names is in their consistency. This is a view of my general notes and trivia folder. I can search and retrieve files rapidly because each type of file (notes, project plans, financial documents etc.) has unique metadata2 such as a label, an immutable timestamp, a category code and ‘tags’ alongside the actual filename. Something like this is a pain to enter manually each time. Text expansion makes it easy.
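Generating these names is another natural scripting exercise. A minimal Python sketch of the pattern from note 2; the label and category codes passed in are illustrative, not my full scheme.

```python
import datetime
import re

def make_filename(label, category, title, date=None):
    """Build a metadata-rich name like PRJ_240325_CLIENT_some_title.txt."""
    date = date or datetime.date.today()
    # Lowercase the title and collapse anything non-alphanumeric to underscores.
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower()).strip("_")
    return f"{label}_{date:%y%m%d}_{category}_{slug}.txt"
```

This does what my text expansion snippets do today, but a script could also batch-rename existing files into the same convention.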
However, all of this is possible because I use GUI apps, and I’d like to peek behind the curtain. I think if I learn a little bash and python scripting, just enough to automate things such as batch conversions of file formats, triggering certain actions and so on, I should be golden. This is a big gray area for me at the moment, considering I have zero programming experience.
A Better Website
I update this site so rarely because of what a pain it has become. I first started using WordPress around 2005, when you had to hand-roll your blog. It was a zippy little thing then. It is no longer a blogging framework – it is a gigantic CMS with the proverbial kitchen sink3. It feels like there are too many steps between my writing something and its appearing on this website. I’ve been eyeing Hugo and its simplicity for some time now. There is a slight learning curve, but I’m looking forward to the experience of working with a static site generator.
So, that’s my list of things to work on for this year. I’m certain that as I learn and implement some of these ideas, I’m going to discover a lot more connected concepts.
Notes
1. FAIR data principles are especially important in academia and research, where information and knowledge need to flow seamlessly between collaborators, without the artificial restrictions imposed by commercial vendors, such as tech lock-ins. See this article on how a scientist uses Emacs, a text editor, to ensure adherence to FAIR data principles. ↩︎
2. Over the years, I developed my own filenaming system through trial and error. Each broad area (project files, research, plans…) has its own label. The one for Projects looks like this: PRJ_240325_CLIENT_project_title_comes_here.txt ↩︎
3. In an interview with Tim Ferriss, Derek Sivers had this to say about WordPress:
“… WordPress is like, I think last time I counted 38 billion lines of code. And it does way more than what you need. So it’s kind of like if you said, ‘I need some scissors,’ and somebody handed you the contents of an entire hardware store. You’re like, ‘No, I really just need to cut this.’”
Source ↩︎