On Needles in Haystacks
The conference room was chilly, the meeting was definitely over, and I could almost taste the chai I usually hanker for after excessively long client meetings. The holdup was that the client wanted to show me a presentation she used last week, which she thought was fantastic, and could I please wait a few minutes while she fished it out from the depths of her laptop.
It was painful to sit there, staring at the large screen projection of her file system, as she clicked in and out of folders. Some 15 minutes later, she gave up the search.
Annoying as that was, I sympathise with her. Our computers make it easy to save stuff without a second thought about how we plan to find it later. According to a McKinsey study I found, people lose about an hour and a half a day simply looking for files.
In the year 2000, I worked as the head of Information Architecture at a hospital in the U.S., designing efficient search and retrieval for a Patient Information Portal. The way we approached this was to build controlled vocabularies for the navigation and search system. The effort involved my I.A. team, a consulting team from Arthur Andersen, and a content team on the payroll of the hospital. With all this firepower, it still took us close to two years to complete the project, and even then, I question how effective we were with the outcome.
Building effective search and retrieval systems only gets you so far. The problem that actually needs solving is bad content and its metadata. Cory Doctorow wrote a piece called Metacrap in which he described this pillaging of the commons. “Meta-utopia is a world of reliable metadata. When poisoning the well confers benefits to the poisoners, the meta-waters get awfully toxic in short order.”
Things aren’t much better at the local level, meaning here on my own hard disk. While I have consulted on Information Architecture projects for clients, I haven’t always applied this sort of rigor to my own data, at least not until a decade ago. I started my computing life on a DOS platform, so I learned very early that it helps to have decently named files if I mean to find them later. I have always been careful with naming, but there never was a method. It was all arbitrary.
These days, I have some rules in place to ensure I can find what I need efficiently (enough).
- Sensible folder structures
- Consistent and self-evident file and folder names
- Time stamping everything
- Adding Tags and Keywords wherever possible
- Linking between things wherever possible
- Using better search software
Sensible folder structures
Once upon a time, back when I used Windows as my OS, I stored my data all over the file system. The reason I did this was to save whatever I was working on quickly, and just go back to the document. The downside? Search and retrieval was a pain.
Today, there is a top-level folder, inside which I try to maintain no more than 10 sub-folders. This helps me maintain a mental map of all the sections that matter. Search and retrieval becomes easier because I can drill down to the sections that matter before I initiate the search, and not be bombarded with false positives. This also makes syncing and maintaining backups easy. I only need to sync the top-level folder to have my data across devices, or just back up that one folder, and I’m done.
Consistent and self-evident file and folder names
It is always worth the few seconds it takes to write a decentish file name. The trick is to imagine yourself looking for this file a month from now, when you no longer have situational context. What are you likely to search for? So instead of writing something like:
Proposal final 3 final final.pdf
I try to maintain a consistent structure for the file name – something like:
250217T095411_client_projectname_proposal--tag1-tag2.pdf
The first bit is a date and time stamp[1], which lets me search by date if I need to. It also acts as a versioning system. The latest date and time indicate the most recent version. The next section provides client and project context, and finally, I add tags right in the filename because tags are great for providing additional context. If the latest version of the document has updated tables, then the tag would read as: tables-updated.
This makes search and retrieval efficient because in many cases, I remember the rough time period in which the document was created, so a search by date narrows down the results. Adding client and project detail narrows it down even more, often giving me the exact result I was looking for.
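As a rough illustration, a name in this pattern could be assembled with a few lines of Python. This is only a sketch: the function name `build_filename` and its parameters are made up for the example, and the separators simply mirror the sample filename above.

```python
from datetime import datetime


def build_filename(client, project, doc_type, tags, extension, when=None):
    """Build a name like 250217T095411_client_projectname_proposal--tag1-tag2.pdf.

    The function and its parameters are hypothetical; only the pattern
    comes from the example filename in the text.
    """
    when = when or datetime.now()
    stamp = when.strftime("%y%m%dT%H%M%S")  # date and time stamp, e.g. 250217T095411
    stem = "_".join([stamp, client, project, doc_type])
    if tags:
        # Tags follow a double hyphen and are joined with single hyphens.
        stem += "--" + "-".join(tags)
    return f"{stem}.{extension}"


print(build_filename("client", "projectname", "proposal",
                     ["tag1", "tag2"], "pdf",
                     when=datetime(2025, 2, 17, 9, 54, 11)))
# prints 250217T095411_client_projectname_proposal--tag1-tag2.pdf
```

Because the stamp runs from year down to second, sorting names alphabetically also sorts them chronologically within a given century, which is what lets the prefix double as a versioning system.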
Time stamping everything
I covered timestamping in the previous section. The long and short of it is that I timestamp everything, not just what I create. Even downloads get a timestamp, inserted at the head of the filename. I use AutoKey, a text expansion tool, to create date and timestamps in just a couple of key presses.
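For files that already exist, such as downloads, the same prefix could also be added by a small script instead of a text expander. The sketch below is an assumption about how that might look, not a description of the AutoKey setup; `stamped_name` and `stamp_file` are hypothetical helper names.

```python
from datetime import datetime
from pathlib import Path


def stamped_name(path, when=None):
    """Return the path with a yymmddTHHMMSS_ prefix added to the file name."""
    path = Path(path)
    when = when or datetime.now()
    stamp = when.strftime("%y%m%dT%H%M%S")
    return path.with_name(f"{stamp}_{path.name}")


def stamp_file(path, when=None):
    """Rename the file on disk so its name starts with a timestamp."""
    path = Path(path)
    target = stamped_name(path, when)
    path.rename(target)
    return target


print(stamped_name("report.pdf", datetime(2025, 2, 17, 9, 54, 11)).name)
# prints 250217T095411_report.pdf
```

By default the stamp is the moment of renaming, which matches stamping a file as it arrives rather than reading a date out of the file itself.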
Adding Tags and Keywords wherever possible
I use plain text[2] for any text-related work. Eventually my text may end up as a Google doc, or a PDF or HTML or whatever, but it always starts life in plain text. I add tags and keywords to the body of the text file. The tags will typically look like @ref_thisisatag or, if I’m in Emacs Org-mode, it may look like this :thisisatag:
I also add a line of comma-separated keywords at the end of the document. The reason I add tags and keywords is to help with full-text search, if I ever need to run one. Generally, when I search, my file explorer searches just the file names. Sometimes, I need to dig deeper and search the content as well. The tags and keywords help with that.
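To show why in-body tags help a content search, here is a small Python sketch that pulls both tag styles and the trailing keyword line out of a text. The regular expressions are simplified assumptions (real Org-mode tags allow more characters, and anything shaped like :word: will match), and the function names are hypothetical.

```python
import re

# Simplified patterns: @ref_tag and Org-style :tag: (also handles :a:b:).
REF_TAG = re.compile(r"@ref_(\w+)")
ORG_TAG = re.compile(r":(\w+)(?=:)")


def extract_tags(text):
    """Return the sorted, de-duplicated tags found anywhere in the text."""
    return sorted(set(REF_TAG.findall(text)) | set(ORG_TAG.findall(text)))


def extract_keywords(text):
    """Return the comma-separated keywords on the last non-empty line."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    return [word.strip() for word in lines[-1].split(",") if word.strip()]


sample = (
    "Notes on the proposal @ref_pricing\n"
    "* Heading :budget:\n"
    "\n"
    "proposal, pricing, client\n"
)
print(extract_tags(sample))      # prints ['budget', 'pricing']
print(extract_keywords(sample))  # prints ['proposal', 'pricing', 'client']
```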
For media files such as photographs and video, I use TagSpaces, which is pretty great. It lets me tag photos en masse, and inserts the tags right into the file name. This is really useful. For instance, I run a sketching community that meets once a week, and I take pictures at each event. When I dump them on my computer later (I don’t use Google Photos), these images are tagged with the term penciljam and also the name of the location we sketched at. When I run a search for these tags, I can find penciljam photographs for the entire year or by month, or I can run a boolean search that lets me filter by time, location and so on. Very handy!
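Once tags live in the file name, a boolean filter is little more than string matching. The sketch below assumes names that start with a yymmdd stamp and carry tags as separate words. It does not reproduce the exact format TagSpaces writes, and the location names are invented for the example.

```python
import re


def matches(name, tags=(), month=None):
    """True if the name contains every tag and, optionally, starts with yymm."""
    if month and not name.startswith(month):
        return False
    # Split on anything that is not a letter or digit to get whole words.
    words = set(re.split(r"[^a-z0-9]+", name.lower()))
    return all(tag.lower() in words for tag in tags)


# Hypothetical file names; "lakeside" and "oldmarket" are made-up locations.
photos = [
    "250208T101500_penciljam_lakeside.jpg",
    "250215T100000_penciljam_oldmarket.jpg",
    "250309T094500_penciljam_lakeside.jpg",
    "250310T181200_family_lakeside.jpg",
]

# penciljam AND lakeside AND taken in February 2025
february = [p for p in photos if matches(p, ["penciljam", "lakeside"], month="2502")]
print(february)  # prints ['250208T101500_penciljam_lakeside.jpg']
```

Matching on whole words rather than substrings keeps a short tag from matching inside a longer, unrelated one.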
Linking between things wherever possible
I do my writing mostly in Emacs Org-mode and sometimes in Zim-wiki. Both allow me to create clickable links between my documents. This is great because I can jump around to related documents when I’m working. In the context of search, even if I only find a related document, I can click through to the actual document I need.
Using better search software
In the early 2000s I used a program called Google Desktop. This let me search my entire system from a widget that sat on my desktop. It wasn’t fast, but it sort of worked. Then Google killed it, as it generally does when people find something useful in its stable.
I think full text search is amazing, and everyone needs at least some type of full text search option on hand. (I even have one on my Android phone.)
Nemo, the default file manager on Linux Mint, allows full text search, but it isn’t particularly fast. For quick results, I prefer standalone software, so when I’m in Emacs, I use ripgrep to do full text searches, and everywhere else, I use the amazing Recoll.
Recoll in particular is fabulous because it can chew through a huge variety of file formats and when it returns hits, it lets me see the results across text files, PDF documents, images and email messages. This is really handy because I can see the email exchanges, the charts I created and the documents that I wrote for a project in one place.
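For readers without either tool, the idea of full text search can be sketched in plain Python. This is a naive line-by-line scan, not how ripgrep or Recoll work internally, and it only reads plain-text formats. The function name and the default list of suffixes are assumptions for the example.

```python
from pathlib import Path


def search_text(root, term, suffixes=(".txt", ".org", ".md")):
    """Yield (path, line_number, line) for every line that contains term.

    The match is case-insensitive and limited to files whose suffix is
    listed in suffixes; everything else is skipped.
    """
    needle = term.lower()
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file, move on
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line.lower():
                yield path, number, line.strip()
```

Calling `search_text` on one sub-folder rather than the whole top-level folder is the scripted version of drilling down before searching: fewer files to read and fewer false positives in the results.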
I really like my current setup. I can generally find what I’m looking for fast enough to not break flow (most times). I still occasionally get stumped, but it is nowhere near as bad as it used to be a decade ago. Even so, these rules and methods may not matter anymore, because with A.I. you don’t need to search and retrieve.
You can go for the punchline directly. Instead of searching for a proposal from last year, I can ask an agent to simply summarise it for me and work with the generated output instead. I’m still trying to figure out how to train an LLM on my corpus of documents, but I think it is only a matter of time before we are all inundated with consumer-level products that will do this for us, whether we want them to or not.
Notes
[1] My date prefix for files follows a ymd_filename.extension pattern (which looks like 250217_blog_post.org). I have started using bits of the logic used by Protesilaos for his Denote note-taking system, which follows a ymdTHMS--filename__keywords.extension pattern (250217T193157--blog-post__pim_pkm.org), which I find useful in some cases, specifically cases in which the filenames and keywords are the same. The keywords (or tags) are a particularly nice touch.
[2] Using Plain Text for all things text is a personal life choice. When I was first introduced to the concept, I really didn’t get it, and I didn’t want to leave my rich-text world of Evernote and OneNote and other fancy writing software. Zim-wiki is what made me make the leap some 15 years ago, and now I’m a die-hard fan. It’s not just me though. Check out websites like Plain Text World, Plain Text Project, Plain Text Productivity, or even Plain Text Accounting.