Data for Research
A Text Mining service from JSTOR

Technical Specifications

This page documents how we package, zip, and deliver files for both self-service and large/full-text datasets available through Data for Research. Specifications and examples for examining the structure of the metadata, full-text, and n-grams file types are also included.

File Delivery

If you requested a dataset through the self-service Data for Research site (www.jstor.org/dfr/), an email message with URL(s) for downloading the datasets will be sent to the email address registered with your account. Large/full-text datasets are handled by the JSTOR Support team and will be sent to you once the necessary agreement is completed and the dataset has been processed by our staff.

Each zip file (1.5 GB maximum) resulting from the dataset parameters will have its own access URL. The URL does not require a login to access the zip files and will look like: http://data.jstor.org/deliver/#unique-string

Please note that you will have 60 days to download the zip files once they are created. After 60 days, the links will expire and a new request must be submitted.

Zip File Creation

A zip file can contain a mix of all content types (journals, books, research reports, pamphlets), depending on the dataset you have requested. Multiple zip files may be necessary for larger datasets.

Dataset zips are named using a “receipt” identifier generated by our system. Within the zip, files are grouped by type into separate folders, using the following naming conventions:


General structure of the file types:

  Output type     Directory      File structure
  Metadata        metadata/      <content-type>-<item-doi>.xml
  N-Grams         ngrams[1-3]/   <content-type>-<item-doi>.NGRAMS[1-3].txt
  OCR/Full-text   ocr/           <content-type>-<item-doi>.txt

The following document types will be included (if desired):
  • Journal documents (articles, book reviews, miscellaneous)
  • Book documents (chapters)*
  • Research report documents
  • Pamphlets

*Note that we do not provide XML metadata at the chapter level, so if your dataset includes chapters you will get the full book XML each time a chapter record is processed.
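
As a sketch of working with this layout, the following Python groups a delivered zip's members by their top-level folder (metadata/, ngrams1/, ocr/, and so on). The zip path and file names used in any test are hypothetical, not actual JSTOR deliveries:

```python
# Sketch: group the members of a DfR dataset zip by top-level folder.
import zipfile
from collections import defaultdict

def members_by_folder(zip_path):
    """Return {folder: [filenames]} for a dataset zip."""
    groups = defaultdict(list)
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            folder, _, filename = name.partition("/")
            if filename:  # skip bare directory entries
                groups[folder].append(filename)
    return dict(groups)
```

From the returned dict you can, for example, pair each metadata/ file with its ocr/ counterpart by the shared <content-type>-<item-doi> stem.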

Metadata Scheme

The XML files are presented:
  • In UTF-8, with only amp, gt, lt, and quot used as character entity references when necessary
  • Without a doctype declaration
  • Well-formed and schema-valid with respect to the appropriate XML Schema

Journal and Pamphlet Metadata

Journal articles follow the JATS 1.0 Archiving tag set (as retrieved in its XSD form in May 2013). Each file contains an /article/front/journal-meta element with brief details about the journal, and an /article/front/article-meta element describing the article.

The journal-meta element contains elements that describe the journal, most significantly:
  • journal-id (journal-id-type="jstor" or "publisher-id"), a unique system identifier for that journal.
  • issn elements, for the ISSN values that apply to the journal.
  • journal-title, as it was included in the source, not necessarily the currently-publishing title.
  • publisher-name, as it was included in the source, not necessarily the publisher of the currently-publishing title.
The article-meta element can contain a variety of data. Elements that are reliably present:
  • article-id
    • When pub-id-type="doi", this is a registered DOI
    • When pub-id-type="jstor", this is NOT a registered DOI
    • IDs that are not labeled as DOI have not been registered with CrossRef by JSTOR and should not be used with DOI resolution services.
  • article-categories may be present, describing or grouping this article with others.
  • article-title, if there was a title given in the source.
  • contrib-group, for authors and possibly other creator roles identified in the source, as contrib elements within
  • At least one pub-date element, with as much date information as the print source or electronic source contained. Some journals at times have multiple dates on a given issue, and these are all listed.
  • An fpage ("first page") where the source provided it.
  • Where the article is a review article, there may be one or more product elements describing the works reviewed.
  • A permissions element containing whatever copyright-statement or license statement the source provided.
  • A self-uri element with an xlink:href containing a URL to the landing page for that publication at JSTOR.
  • A kwd-group is included if the source supplied one.
  • A custom-meta-group containing a custom-meta representing the languages present in the article. JSTOR's digitized backfile metadata tracks the important languages present, allowing for more than one, and they are listed here roughly in order of quantity.
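
A minimal sketch of pulling a few of these reliably-present fields out of a JATS file with Python's standard library. The element paths follow the list above; the sample document used for testing is hypothetical, not actual JSTOR output:

```python
# Sketch: extract basic fields from a JATS article metadata file.
import xml.etree.ElementTree as ET

def article_summary(xml_text):
    """Return the JSTOR id, DOI (if registered), and title of an article."""
    root = ET.fromstring(xml_text)
    meta = root.find("front/article-meta")
    return {
        "jstor_id": meta.findtext("article-id[@pub-id-type='jstor']"),
        "doi": meta.findtext("article-id[@pub-id-type='doi']"),  # None if absent
        "title": meta.findtext("title-group/article-title"),
    }
```

Per the note above, treat only the pub-id-type="doi" value as a resolvable DOI.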

Book and Research Report Metadata

Note that we do not provide metadata files for individual book chapters, so if your dataset includes book chapters you will get the full book XML each time a chapter record is processed. However, we do provide full-text and n-grams at the chapter level.

The book files follow the draft BITS 0.2 tag set (as retrieved in the DTD form available in May 2013). They were made by transforming NLM Books 3 source into BITS 0.2, and sometimes include a generated table of contents for the book when the source did not include an explicit one.

Each book file has a /book/collection-meta, a /book/book-meta, and a /book/book-body.

The book-meta element reliably includes:
  • A book-id book-id-type="jstor", containing a string that is NOT a registered DOI.
  • At least one subject-group element, the subject sub-elements of which will have a content-type attribute. The value of the content-type attribute describes the data in the subject:
    • call-number gives the LOC Call Number for the book
    • lcsh gives LOC subject headings captured from the verso of the book itself.
    • discipline gives a string corresponding to the browse-level subjects or discipline names used on JSTOR.org.
  • A book-title-group/book-title
  • A contrib-group, containing contrib elements. The BITS content model for contrib allowed us to use name elements on our source data, so name is usually present.
  • A pub-date element
  • One or more isbn elements, with the ISBN values JSTOR has captured for the book.
  • A publisher element with the publisher-name and publisher-loc.
  • A permissions element with copyright or license information.
  • A self-uri element with an xlink:href containing a URL to the landing page on JSTOR.org for the book.
  • A counts element, listing the number of pages in the actual book.
  • A custom-meta-group for languages present, similar to that in journals above.
The book-body will contain the supplied or generated TOC, including:
  • A tree of book-part elements, the leaf book-parts corresponding to some portion of the whole book as defined by JSTOR or the publisher, in the order they appear in the book.
  • The leaf book-part elements may contain:
    • A title-group/title
    • A contrib-group, if the portion of the book has authors different from the book as a whole
    • An fpage
    • An abstract (The abstract-type attribute, when present, designates what kind of text is in the abstract. "Extract" means that the text is just the first 100 words.)
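
A sketch of walking that book-part tree to list the leaf parts (the chapter-level units) in document order, assuming the nesting described above; the sample document in the test is hypothetical, not actual JSTOR output:

```python
# Sketch: collect the titles of leaf book-part elements from a BITS book file.
import xml.etree.ElementTree as ET

def leaf_part_titles(xml_text):
    """Return titles of book-parts that contain no nested book-part."""
    root = ET.fromstring(xml_text)
    titles = []
    for part in root.iter("book-part"):
        nested = [p for p in part.iter("book-part") if p is not part]
        if not nested:  # a leaf: no book-part descendants
            titles.append(part.findtext(".//title-group/title"))
    return titles
```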

N-grams

N-grams in Data for Research datasets are generated using the Apache Lucene analysis chain, with these components:
  • StandardTokenizer: splits the text into individual tokens
  • LowerCaseFilter: normalizes all tokens to lower case
  • StopFilter with ENGLISH_STOP_WORD_SET: removes the stop words {"a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if", "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that", "the", "their", "then", "there", "these", "they", "this", "to", "was", "will", "with"}

N-grams are delivered as a tab-delimited .txt file, with one gram and its count per row, sorted in descending order by count.
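
As a sketch, loading such a file into a dict in Python, assuming the "gram<TAB>count" row layout described above:

```python
# Sketch: load a DfR n-gram file (gram<TAB>count per line) into a dict.
def load_ngrams(path):
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            gram, _, count = line.rstrip("\n").partition("\t")
            if count:  # skip blank or malformed lines
                counts[gram] = int(count)
    return counts
```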

OCR/Full-text

The document full-text is delivered in a separate .txt file. The markup indicates each page, such as <page sequence="1"></page>, and the full contents are wrapped in a <plain_text> tag.
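
Given that layout, one way to split the text into pages in Python. This is a sketch: a regular expression is used rather than an XML parser, since raw OCR text may contain characters that are not valid XML:

```python
# Sketch: split a DfR full-text file into pages keyed by page sequence number.
import re

PAGE_RE = re.compile(r'<page sequence="(\d+)">(.*?)</page>', re.DOTALL)

def pages(full_text):
    """Return {sequence_number: page_text} from a <plain_text> document."""
    return {int(seq): body for seq, body in PAGE_RE.findall(full_text)}
```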

It’s important to note that JSTOR incorporates metadata and OCR text from many different sources. We work to mitigate any inconsistencies, but researchers may encounter issues with datasets, such as duplicate OCR, missing OCR, or incomplete metadata. We are unable to assist with the clean-up of datasets due to OCR errors or missing data.