How I Back Up Google Photos
Using ZFS and Photoprism
- Prerequisites and Setup
- Step 1 - Create Takeout Archives
- Step 2 - Downloading Takeout Archives
- Step 3 - Extracting Takeout Archives
- Step 4 - Deduplicating Media Files
- Cleanup
Google Photos is wonderful. It backs up my photos and videos, tags them, and makes them available to share with my family.
But I also don’t trust that Google will never accidentally “lose” my photos.
This guide covers the process I use to back up my Google Photos content, in its original quality and with sidecar metadata, to a self-hosted instance of Photoprism.
Prerequisites and Setup
My Photoprism instance runs on a TrueNAS Scale server as a Kubernetes app. While yours doesn’t have to run in exactly the same setup, it’s worth describing. Photoprism has three primary storage locations: “Originals”, “Import”, and a data path for sidecar files, multimedia caches, and the database file (if you’re using the built-in SQLite storage, though I’d strongly caution you to set up an external MariaDB instance if you have a large photo library). Photoprism treats “Originals” as essentially a read-only store of media - anything you put in there will be indexed by Photoprism as-is. Meanwhile, any media put in “Import” will be ingested by Photoprism and moved to the Originals folder; however, you will have no control over how those imported media are organized. For our purposes, I strongly recommend putting your Google Photos media in “Originals” to preserve the folder structure Takeout outputs.
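For reference, here is roughly the folder structure a Takeout export unpacks to - the album and file names below are only illustrative, but the per-album duplicate copies are real, and they’re what we’ll deduplicate in step 4 (this layout is also why the extraction script in step 3 strips the first two path components):
Takeout/
└── Google Photos/
    ├── Photos from 2021/
    │   ├── IMG_1234.jpg
    │   └── IMG_1234.jpg.json   (sidecar metadata)
    └── Some Album/
        ├── IMG_1234.jpg        (the same photo again, once per album)
        └── IMG_1234.jpg.json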
In my TrueNAS setup the Originals and Import folders are mounted onto ZFS datasets on the host, which allows me to
access and administer the media files while Photoprism isn’t running. Specifically, “Originals” is mounted to
/mnt/Bulk Storage/Photoprism/Originals, which we’ll just call $ORIGINALS from here on. First and most importantly,
we will be using the immutable extended attribute to ensure that any files already in our “Originals” folder won’t
be overwritten when we extract the Takeout GZip archives. To do this, we can run the following command:
# Find all regular files in our Originals folder and mark them as +i (immutable).
find "$ORIGINALS" -type f -print0 \
| sudo xargs -0 -n 1 chattr +i
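The rest of the commands in this post assume $ORIGINALS is exported in your current shell. As a quick sketch (using my path from above, and a made-up file name), you can also spot-check that the flag took hold with lsattr:
export ORIGINALS="/mnt/Bulk Storage/Photoprism/Originals"
# The attribute column should now contain an "i" (immutable) for existing files:
lsattr "$ORIGINALS/Photos from 2021/IMG_1234.jpg"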
We’ll also be using a few tools not distributed with TrueNAS Scale (or most Linux distros, for that matter). Most can be installed via package managers, and the rest can be built from source:
- pigz - parallel GZip implementation (homepage).
- progress - Coreutils’ progress viewer (not necessary, just nice; homepage).
- jdupes - Jody Bruchon’s enhanced fdupes clone (homepage).
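On a regular Debian or Ubuntu box all three appear to be packaged these days, so something like the following should get you most of the way (on TrueNAS Scale itself, which discourages installing packages on the host, you’ll likely need to build from source instead):
sudo apt install pigz progress jdupes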
Step 1 - Create Takeout Archives
- Go to Google Takeout.
- Click “Deselect all”.
- Scroll down to “Google Photos” and select its checkbox.
- Scroll to the bottom and click “Next step”.
- For “File type” select .tgz, and for “File size” select 50GB.
- Click “Create export”.
Step 2 - Downloading Takeout Archives
Once your Google Takeout files are ready you will get an email (you can also check the Takeout site for completed archives). Depending on the size of your Google Photos library you may have multiple ~50GB files; I usually have around 10-15. I prefer to download one file at a time (depending on free disk space). Clicking a file will start a browser download, but I like to pause that download, right-click it and “copy download URL”, and then hand the URL to a download tool like Aria2.
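For what it’s worth, here’s roughly how you might hand that copied URL to aria2 - the URL and output name are placeholders, and the connection flags are just sensible defaults rather than anything Takeout-specific:
aria2c --continue=true --split=8 --max-connection-per-server=8 \
  --out=takeout-001.tgz "PASTE_COPIED_DOWNLOAD_URL_HERE"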
Step 3 - Extracting Takeout Archives
Once I have some Takeout archive files downloaded (you don’t have to download them all at once, or even in order!) we’ll start extracting them into our “Originals” folder. Your Photoprism instance shouldn’t index Originals unless you tell it to, but I usually shut Photoprism down while performing these steps just to be safe.
Here's the script I use. It uses pigz to decompress the specified archive in parallel (for speed) and tar to extract
the Google Photos files into our “Originals” folder:
#!/usr/bin/env sh
# Usage:
# sudo ./extract-to-originals.sh takeout-archive.tgz
if [ "$(id --user)" -ne 0 ]; then
echo "Please run with sudo!" 1>&2;
exit 1;
fi
pigz --decompress <"$1" \
| tar --extract --strip-components=2 --skip-old-files --directory="$ORIGINALS" \
& progress --wait --monitor --command pigz;
# Make sure the backgrounded pigz | tar pipeline has fully finished before fixing ownership.
wait;
chown -R photoprism:photoprism "$ORIGINALS";
Let’s break down what’s happening here:
if [ "$(id -u)" -ne 0];- checks if the script was run asroot(or withsudo).pigz --decompress <"$1"- decompress the GZip file specified in the first passed to the script.tar --extract… - extract files from the Tar archive thatpigzdecompressed. There’s a few extra options:- …
--strip-components=2…: remove some unnecessary path components that Takeout adds. - …
--skip-old-files…: skip files that already exist on disk. - …
--directory: extract files into our$ORIGINALSfolder.
- …
progress --wait --monitor --command pigz- monitor the progress of thepigzcommand in decompressing the input file.chown -R photoprism:photoprism "$ORIGINALS"- Becausetarattempts to preserve the ownership of the files it extracts, we need to correct this after extraction. Obviously update theuser:groupto suit your own needs.
Because the files already in our “Originals” folder are marked immutable, tar can’t overwrite them - and we also
provide tar with the --skip-old-files option to avoid a lot of error output about not being able to extract
existing files. This means that the next time you back up your Google Photos, any files that were already extracted
during a previous backup aren’t re-written, and we can safely skip extracting them from the archive. This will
save a ton of time.
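A sketch of how you might invoke the script, assuming $ORIGINALS was exported earlier - the archive names are placeholders, and --preserve-env needs a reasonably recent sudo, since sudo normally strips custom variables from the environment:
# Run the extraction for every Takeout archive currently on disk.
for archive in takeout-*.tgz; do
    sudo --preserve-env=ORIGINALS ./extract-to-originals.sh "$archive";
done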
Step 4 - Deduplicating Media Files
Repeat step 3 for all of the Takeout archives - you can delete each archive once it has been extracted. Once you’re
done, we can see how many new files we have by counting the files that are not immutable.
Here's a little script to do just that.
#!/usr/bin/env sh
if [ "$(id -u)" -ne 0 ]; then
echo "Please run with sudo!" 1>&2;
exit 1;
fi
find "$ORIGINALS" -type f -print0 \
| xargs --null lsattr \
| grep -- "----------------------" \
| wc --lines;
Breaking it down:
if [ "$(id -u)" -ne 0 ]- again, checks that we’re running asrootor withsudo.find "$ORIGINALS" -type f -print0- finds files in our originals folder. We use-print0to safely handle filenames with weird characters in them.xargs --null lsattr- for each file thatfindoutputs we want to runlsattr- this will list any extended attributes on each file.grep -- "----------------------"- for each line (file) of output fromlsattr, we’re only interested in lines (files) that have no extended attributes set (files that haveimmutable set would show as----i-----------------).wc --lines- count the number of lines (files).
Neat! Now let’s deduplicate all of these files. As mentioned in the aside above, Takeout will include multiple copies
of each piece of media, as each one can appear in multiple albums. To deduplicate this and save you a bunch of hard
drive space we’ll use jdupes. This tool will search for files that are identical and (in our case) turn them into
filesystem hardlinks - which means that the same data on disk can be pointed to by multiple file names.
First we need to remove the immutable flag from all of the files. This can be done quickly with sudo chattr -R -i "$ORIGINALS".
Then we run jdupes, which has a lot of options, but the ones we want are:
sudo jdupes --link-hard --recurse "$ORIGINALS"
Aka:
- --link-hard - hard-link all identical files together.
- --recurse - consider all files within subdirectories of $ORIGINALS recursively.
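If you’d like a preview before letting jdupes modify anything, its summary mode (which, as far as I know, is purely read-only) is handy:
# Report how many duplicate sets and bytes exist without changing any files.
sudo jdupes --summarize --recurse "$ORIGINALS"
Once you’re happy with what it reports, run the hard-linking command above for real.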
When this is done we’ll have potentially saved a lot of disk space!
Want to find out how much?
Here’s a short command that will tell you exactly how much space has been saved!
find "$ORIGINALS" -type f -links +1 -printf '%i %s\n' \
| awk 'a[$1]++{sum+=$2}END{print sum}' \
| numfmt --to=iec-i --suffix=B
Here’s the gist of what’s happening:
find "$ORIGINALS" -type f… - find allfiles in$ORIGINALS;- …
-links +1… - that have more than one hardlink; - …
-printf '%i %s\n'- and output the file’sinodenumber and file size (in bytes).
- …
awk 'a[$1]++{sum+=$2}END{print sum}'- this line is a bit heavy, but here’s what theawkscript is doing:-
a[$1]++Increment the value stored in
a[$1](where$1is going to be theinodenumber from each line). Theinodeis the unique data on disk - multiple file paths hardlinked to the same data will have the sameinodenumber. -
{sum+=$2}The
{sum+=$2}part is a conditional clause - that means that if the value before it is “truthy” it will execute. The first time a uniqueinodeis encountereda[$1]++will post-increment to 1, returning 0. This is falsy, so the conditional clause won’t run. Each time an inode is encountered after that, however, the value with be 1 or more, which is a truthy value and the conditional clause will run. All the clause actually does (when run) is increment the valuesum(which starts at 0) by$2(the file size in bytes). -
END{print sum}- when the script ends, print the value insum.
-
numfmt --to=iec-i --suffix=B- this converts the value ofsumoutput byawkto an IEC size (e.g.,1024would be converted to1.0KiB).
So as an example, say our “Originals” folder only contains three files:
- 🖼️ a.jpg
- 🖼️ b.jpeg
- 📄 sidecar.json
And a.jpg and b.jpeg are both hardlinks to the same 1MiB of data, and sidecar.json is a 1KiB standalone file. When
we run our script, the find portion will output three lines:
12345 1048576
12345 1048576
67890 1024
When awk gets these lines, it will perform the following:
- Line 1: Post-increment a[12345] (from 0 to 1), returning 0. 0 is falsy, so {sum+=1048576} doesn’t run.
- Line 2: Post-increment a[12345] (from 1 to 2), returning 1. 1 is truthy, so {sum+=1048576} runs - sum is now 1048576.
- Line 3: Post-increment a[67890] (from 0 to 1), returning 0. 0 is falsy, so {sum+=1024} doesn’t run.
- End of input: END{print sum} runs, outputting 1048576.
Finally, numfmt converts 1048576 to 1.0MiB, and we see our space savings is 1 MiB! This is correct: if
b.jpeg weren’t hardlinked to a.jpg, it would take up an additional 1 MiB on disk.
Cleanup
Finally, we can re-mark our files as immutable:
find "$ORIGINALS" -type f -print0 \
| sudo xargs -0 -n 1 chattr +i
Now we turn Photoprism back on, and tell it to re-index new files in the Originals folder. It will even intelligently use the sidecar JSON files that Takeout exports to enrich each media item with additional metadata.
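If you prefer the command line over the web UI for that last step, Photoprism ships an index command. Here’s a sketch assuming a Docker Compose style deployment (adjust the exec part for a TrueNAS app):
# Re-index the Originals folder from the CLI instead of the web UI.
docker compose exec photoprism photoprism index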