Other metadata scrapes
If you are interested in mirroring this dataset for archival or LLM training purposes, please contact us.
Overview from the datasets page:
Source: Other metadata scrapes
Metadata: 👩‍💻 Anna’s Archive manages scrapes of metadata from other sources.
Last updated: Varies

Various smaller or one-off metadata scrapes.

Collection Notes
cerlalc: Data leak from CERLALC, a consortium of Latin American publishers, which included lots of book metadata. The original data (scrubbed of personal information) can be found in isbn-cerlalc-2022-11-scrubbed-annas-archive.sql.zst.torrent. Special thanks to the anonymous group that worked hard on this.
czech_oo42hcks: Metadata extracted from CSV and Excel files, corresponding to “upload/misc/oo42hcksBxZYAOjqwGWu” in the “upload” dataset. Original files can be found through the Codes Explorer.
edsebk

Scrape of EBSCOhost’s eBook Index (edsebk; “eds” = “EBSCOhost Discovery Service”, “ebk” = “eBook”). The scraper code was written by our volunteer “tc”. This is a fairly small ebook metadata index, but it still contains some unique files. If you have access to the other EBSCOhost databases, please let us know, since we’d like to index more of them.

The filename of the latest release (annas_archive_meta__aacid__ebscohost_records__20240823T161729Z--Wk44RExtNXgJ3346eBgRk9.jsonl) is incorrect (the timestamp should be a range, and there should not be a uid). We’ll correct this in the next release.
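As a rough illustration of the problem, the sketch below checks whether a filename ends in a pair of timestamps separated by “--”. The timestamp shape is an assumption inferred from the filename quoted above, and the “corrected” name is purely hypothetical; the next release may be named differently.

```python
import re

# Assumed timestamp shape, inferred from the quoted filename (e.g. 20240823T161729Z).
_TS = r"\d{8}T\d{6}Z"
# A range is taken to be START--END, followed by an extension or the end of the name.
_RANGE = re.compile(rf"__({_TS})--({_TS})(?:\.|$)")


def has_timestamp_range(filename: str) -> bool:
    """Return True if the name carries a START--END timestamp pair."""
    return _RANGE.search(filename) is not None


released = (
    "annas_archive_meta__aacid__ebscohost_records__"
    "20240823T161729Z--Wk44RExtNXgJ3346eBgRk9.jsonl"
)
# Hypothetical corrected name, for illustration only.
hypothetical = (
    "annas_archive_meta__aacid__ebscohost_records__"
    "20240823T161729Z--20240823T161729Z.jsonl"
)

print(has_timestamp_range(released))
print(has_timestamp_range(hypothetical))
```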

isbndb

ISBNdb is a company that scrapes various online bookstores to find ISBN metadata. We made an initial scrape in 2022, with more information in our blog post “ISBNdb dump, or How Many Books Are Preserved Forever?”. Future releases will be made in the AAC format.

Release 1 (2022-10-31)

This is a dump of a large number of calls to isbndb.com made during September 2022. We tried to cover all ISBN ranges. The dump contains about 30.9 million records. Their website claims that they actually have 32.6 million records, so we may have missed about 1.7 million, or their own count could be off.

The JSON responses are pretty much raw from their server. One data quality issue we noticed: for ISBN-13 numbers that start with a prefix other than “978-”, they still include an “isbn” field that is simply the ISBN-13 number with the first 3 digits chopped off (and the check digit recalculated). This is obviously wrong, but it is how they seem to do it, so we didn’t alter it.
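To make the issue concrete, here is a minimal sketch of the derivation described above: drop the first 3 digits, keep the next 9, and recompute the check digit using the standard ISBN-10 rule. The function names are our own for illustration, and this only mirrors the behavior as described; it is not ISBNdb’s actual code.

```python
def isbn10_check_digit(first_nine: str) -> str:
    """Standard ISBN-10 check digit: weights 10 down to 2, modulo 11."""
    total = sum(int(d) * w for d, w in zip(first_nine, range(10, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def chopped_isbn(isbn13: str) -> str:
    """Drop the 3-digit prefix and the old check digit, then recompute."""
    digits = isbn13.replace("-", "")
    core = digits[3:12]
    return core + isbn10_check_digit(core)


def isbn_field_is_meaningful(isbn13: str) -> bool:
    """Only 978-prefixed ISBN-13 numbers have a real ISBN-10 equivalent."""
    return isbn13.replace("-", "").startswith("978")


print(chopped_isbn("9780306406157"))              # 0306406152
print(chopped_isbn("9791234567896"))              # 123456789X
print(isbn_field_is_meaningful("9791234567896"))  # False
```

When processing the dump, a check like `isbn_field_is_meaningful` can be used to ignore the “isbn” field as an ISBN-10 for records whose “isbn13” does not start with 978.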

Another potential issue is that the “isbn13” field has duplicates, so you cannot use it as a primary key in a database. The “isbn13” and “isbn” fields combined do seem to be unique.
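One way to handle this when loading the records is a composite primary key over both fields. The sketch below uses SQLite; the table name, column names, and sample records are made up for illustration, and it assumes every record has both an “isbn13” and an “isbn” field.

```python
import json
import sqlite3


def load_records(conn: sqlite3.Connection, records: list) -> None:
    """Store each record under the composite key (isbn13, isbn)."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS isbndb (
            isbn13   TEXT NOT NULL,
            isbn     TEXT NOT NULL,
            raw_json TEXT NOT NULL,
            PRIMARY KEY (isbn13, isbn)
        )
        """
    )
    # INSERT OR IGNORE skips rows whose (isbn13, isbn) pair already exists.
    conn.executemany(
        "INSERT OR IGNORE INTO isbndb VALUES (?, ?, ?)",
        [(r["isbn13"], r["isbn"], json.dumps(r)) for r in records],
    )
    conn.commit()


# Made-up sample records: the first two share an isbn13 but differ in isbn.
sample = [
    {"isbn13": "9781111111111", "isbn": "1111111111"},
    {"isbn13": "9781111111111", "isbn": "111111111X"},
    {"isbn13": "9781111111111", "isbn": "1111111111"},  # exact repeat
]

conn = sqlite3.connect(":memory:")
load_records(conn, sample)
row_count = conn.execute("SELECT COUNT(*) FROM isbndb").fetchone()[0]
print(row_count)  # 2
```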

gbooks: Large Google Books scrape, though still incomplete. By volunteer “j”.
goodreads: Goodreads scrape by volunteer “tc”.
isbngrp: ISBN Global Register of Publishers scrape. Thanks to volunteer “g” for doing this: “using the URL https://grp.isbn-international.org/piid_rest_api/piid_search?q="{}"&wt=json&rows=150 and recursively filling in the q parameter with all possible digits until the result is less than 150 rows.” It’s also possible to extract this information from certain books.
libby: Libby (OverDrive) scrape by volunteer “tc”.
rgb: Scrape of the Russian State Library (Российская государственная библиотека; RGB) catalog, the third largest (regular) library in the world. Thanks to volunteer “w”.
trantor: Metadata dump from the “Imperial Library of Trantor” (named after the fictional library), corresponding to the “trantor” subcollection in the “upload” dataset. Converted from a MongoDB dump.
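
The recursive approach quoted for isbngrp above can be sketched roughly as follows. This is not the volunteer’s actual code: the search function is an in-memory stand-in rather than a real HTTP call, and it assumes the q parameter behaves like a prefix match, which the quote does not confirm. All names here are hypothetical.

```python
from typing import Callable, Iterator, List

PAGE_LIMIT = 150  # the rows=150 value from the quoted URL


def crawl(
    prefix: str,
    search: Callable[[str], List[str]],
    max_depth: int = 13,
) -> Iterator[str]:
    """Extend the query one digit at a time until a result is not truncated."""
    rows = search(prefix)
    # Fewer rows than the limit means nothing was cut off for this prefix.
    # max_depth is a safety stop; at that depth rows may still be truncated.
    if len(rows) < PAGE_LIMIT or len(prefix) >= max_depth:
        yield from rows
        return
    for digit in "0123456789":
        yield from crawl(prefix + digit, search, max_depth)


def make_fake_search(all_ids: List[str]) -> Callable[[str], List[str]]:
    """Stand-in for the remote API: prefix match, capped at PAGE_LIMIT rows."""
    def search(prefix: str) -> List[str]:
        return [i for i in all_ids if i.startswith(prefix)][:PAGE_LIMIT]
    return search


fake_ids = [f"{n:03d}" for n in range(1000)]  # made-up identifiers
found = sorted(crawl("", make_fake_search(fake_ids)))
print(len(found))  # 1000
```

Here the empty query is truncated at 150 rows, so the crawl descends to the ten one-digit prefixes, each of which returns 100 rows and is therefore complete.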

Resources