We at Perseus would like to alert you to the Indiegogo campaign for Alpheios.
This is an effort we strongly support. Please help, if you can.
Earlier this year I blogged about why I think Perseus, and the digital humanities in general, needs infrastructure. In that post I discussed one strategy we’ve been following at Perseus – that of participating in the efforts of the Research Data Alliance (RDA).
I was recently fortunate to be able to attend the Triangle Scholarly Communications Institute, an Andrew W. Mellon Foundation funded workshop whose theme was “Validating and valuing digital scholarship.” While we did not spend much time talking specifically about “infrastructure,” it was implicit in all the solutions we discussed, from specific data models for representing scholarly assertions as a graph, to taxonomies for crediting work, to approaches for assessing quality. Cameron Neylon, a researcher at Curtin University and one of the institute participants, blogged some thoughts upon returning home, including the following:
“Each time I look at the question of infrastructures I feel the need to go a layer deeper, that the real solution lies underneath the problems of the layer I’m looking at. At some level this is true, but it’s also an illusion. The answers lie in finding the right level of abstraction and model building (which might be in biology, chemistry, physics or literature depending on the problem). Principles and governance systems are one form of abstraction that might help but it’s not the whole answer. It seems like if we could re-frame the way we think about these problems, and find new abstractions, new places to stand and see the issues we might be able to break through at least some of those that seem intractable today.” (Cameron Neylon, “Abundance Thinking”)
I think this neatly sums up what I hope to gain from participation in the multidisciplinary community of RDA. It’s easy, especially when time and resources are constrained, to get locked into thinking that our problems are unique and that we need to design custom solutions, but when we examine the problem from other perspectives, the abstractions begin to rise to the surface.
One of the challenges I took on when I agreed to serve as a liaison between RDA and the Alliance of Digital Humanities Organizations (ADHO) was to try to engage humanities researchers in participating in designing solutions for data sharing infrastructure. Collaboration is not easy, even when you’re all working on the same team. We have been struggling with this even in our own small group of developers at Perseus and our sister community in Leipzig, the Open Philology Project. In a recent developers’ meeting we talked about how hard it is to find the time to consider another developer’s solution before going off and designing your own.
But the rewards might be great. Imagine if, by adding support to your project for acquiring Persistent Identifiers for your data objects and for communicating the types of data those identifiers represent, you got for free the ability to interoperate, with the same code, with data objects from thousands of other projects. Or if implementing support for building and managing collections of objects meant that your data could combine seamlessly with the collections of other projects. These are not abstract use cases. The New York Times just reported on the millions of dollars museums worldwide are spending to digitize their collections. What if we had a standard approach to managing digital objects in collections that allowed us to easily write software that could build new virtual museums from these collections? And are the requirements for such a solution very different from the needs of the Perseids project to manage collections of diverse types of annotations?
A disillusioned colleague said to me not long ago that nothing we build is going to save the world. While this might be true, I think that we all work too hard to keep reinventing the wheel. It’s going to take us a little longer to build a solution for managing our collections of annotations if we do it in the context of an RDA working group, but to me the benefit of having the expertise and perspective of colleagues from other communities and disciplines, such as researchers from DKRZ (climate science), NoMaD (materials science) and the PECE project (ethnography) is worth the effort. Even if we don’t save the collections of the world, at the very least we’re certain to end up with a solution that is a little better than one we would have built on our own. I urge everyone to be part of the conversation and the solutions. Feel free to start by commenting on the RDA Research Data Collections Working Group case statement or subscribing to contribute to the effort!
Dear Colleagues,
We are very pleased to invite you to attend the conference and workshops
Altertumswissenschaften in a Digital Age: Egyptology, Papyrology and Beyond
Leipzig, November 4-6, 2015
Hashtag: #DHEgypt15
Annotated Corpora | 3D | Input of Hieroglyphics, Demotic, Greek, Coptic
Felix-Klein-Hörsaal, Paulinum (4th floor)
Augustusplatz 10, 04109 Leipzig, Germany
http://www.dh.uni-leipzig.de/wo/dhegypt15/
http://www.gko.uni-leipzig.de/aegyptisches-museum/veranstaltungen/2015.html
Call for Collaboration
Gregory Crane
Leipzig and Tufts Universities
September 1, 2015
This is a preliminary call for comment and for participation.
I expect to be teaching an advanced Greek course in Spring 2016, quite possibly on Thucydides. I would like to explore the possibility of coordinating that teaching with others so that our students can interact and, ideally, collaborate across institutions and even across countries. Courses may also be in advanced Greek but can be on history, archaeology, or any other subject relevant to the period. This model of collaboration can be applied quite broadly, and others may pursue such collaborations in many subjects. My particular goal is to get something started in North America that focuses on Fifth-Century Greek History and that could feed into the Sunoikisis DC efforts that my colleague Monica Berti began in 2015.
The goals of this collaboration are
(1) to connect students, who may be in small and somewhat isolated classes or in larger lecture classes and who often have little sense that they are participating in a larger community of learners.
(2) to enable students to contribute something as they learn and to leave behind contributions with their names attached. Larger lecture classes could, for example, contribute by analyzing people and places in English translations and thus participate in social network and/or geospatial analysis. Advanced language courses could, for example, contribute by treebanking texts.
(3) to link courses in Europe, North America and elsewhere by exploiting the differing academic schedules in different countries. Students in North America who begin their semesters in January 2016 can aim to develop course projects and presentations that will feed into courses that begin at Leipzig in April, with US students presenting via videoconference to introduce topics and methods to students in Germany. In January and early February 2017, European students, who began their classes in October 2016, can reciprocate, presenting topics and methods to North American students, who can, in turn, present in April to the next semester of European students. Here we hope not only to help students develop ties across national boundaries but also to recognize that learning is an ongoing and cumulative process.
One method of collaboration would be to participate in the 2016 version of http://sunoikisisdc.github.io/SunoikisisDC/ (Sunoikisis Digital Classics), which in turn builds upon the long-term efforts of the http://sunoikisis.org/ program that Harvard’s Center for Hellenic Studies supports. Collaborations can, however, take various forms. Different classes could, for example, focus upon a single task (e.g., treebanking selections of Greek historical sources or comprehensively treebanking a particular author). Different classes might create shared discussion lists or complementary projects (e.g., one class focusing on language and another on the material record). I particularly welcome anyone from the Boston area who would be interested in the possibility of having our students meet jointly in person one or more times.
I welcome both public discussions in venues such as the Digital Classicist and private inquiries (which can be sent to gcrane2008@gmail.com).
August 8, 2015
http://tinyurl.com/nlvhy9b
Comments to munson@dh.uni-leipzig.de
Federico Boschetti, CNR, Pisa
Gregory Crane, Leipzig/Tufts
Matt Munson, Leipzig/Tufts
Bruce Robertson, Mount Allison
Nick White, Durham (UK) (and Tufts during 2014)
A first stab at producing OCR-generated Greek and Latin for the complete Patrologia Graeca (PG) is now available on GitHub at https://github.com/OGL-PatrologiaGraecaDev. This release provides raw textual data that will be of service to those with programming expertise and to developers with an interest in Ancient Greek and Latin. The Patrologia Graeca has as much as 50 million words of Ancient Greek produced over more than 1,000 years, along with an even larger amount of scholarship and accompanying translations in Latin.
Matt Munson started a new organization for this data because it is simply too large to put into the existing OGL organization. Each volume can contain 250MB or more of .txt and .hocr files, so it is impossible to put everything in one repository or even several dozen repositories. He therefore decided to create a new organization in which all the OCR results for each volume are contained within their own repository. This will also allow us to add more OCR data as necessary (e.g., from Bruce Robertson of Mount Allison University, or from nidaba, our own OCR pipeline) at the volume level.
The repositories are being created and populated automatically by a Python script, so if you notice any problems or strange happenings, please let us know, either by opening an issue on the individual volume repository or by sending us an email. This is our first attempt at pushing this data out. Please let us know what you think.
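The script itself is not included in this post, but the workflow is easy to reconstruct. The sketch below is a hypothetical illustration rather than the actual script: it assumes a GitHub personal access token in the GITHUB_TOKEN environment variable, an organization named as in the link above, and one local directory per volume under volumes/.

```python
# Hypothetical sketch of a one-repository-per-volume workflow; the actual
# script used for the PG data may differ. Assumes a token in GITHUB_TOKEN
# and one local directory per volume (e.g. volumes/pg_vol_001/).
import os
import subprocess
import requests

ORG = "OGL-PatrologiaGraecaDev"
API = "https://api.github.com"
TOKEN = os.environ["GITHUB_TOKEN"]

def create_repo(name: str) -> str:
    """Create an empty repository in the organization and return its clone URL."""
    r = requests.post(
        f"{API}/orgs/{ORG}/repos",
        headers={"Authorization": f"token {TOKEN}"},
        json={"name": name, "auto_init": False},
    )
    r.raise_for_status()
    return r.json()["clone_url"]

def push_volume(volume_dir: str) -> None:
    """Initialize a git repository for one volume and push its OCR files."""
    name = os.path.basename(volume_dir.rstrip("/"))
    clone_url = create_repo(name)
    subprocess.run(["git", "init"], cwd=volume_dir, check=True)
    subprocess.run(["git", "add", "."], cwd=volume_dir, check=True)
    subprocess.run(["git", "commit", "-m", "Initial OCR data"], cwd=volume_dir, check=True)
    subprocess.run(["git", "remote", "add", "origin", clone_url], cwd=volume_dir, check=True)
    # Push whatever the local default branch is.
    subprocess.run(["git", "push", "-u", "origin", "HEAD"], cwd=volume_dir, check=True)

if __name__ == "__main__":
    for vol in sorted(os.listdir("volumes")):
        push_volume(os.path.join("volumes", vol))
```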
Available data includes:
Greek and Latin text generated by two open source OCR engines, OCRopus (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr). For work done optimizing OCRopus, see http://heml.mta.ca/lace. For work done optimizing Tesseract, see http://ancientgreekocr.org/. The output format for both engines is hOCR (https://en.wikipedia.org/wiki/HOCR), a format that contains links to the coordinates on the original page image from which the OCR was generated. (A short sketch of reading this format appears below, after this list of available data.)
OCR results for as many scans of each volume of the Patrologia Graeca as we could find in the HathiTrust. We discovered that the same OCR engine applied to scans of different copies of the same book would generate different errors (even when the scans seemed identical to most human observers). This means that if OCR applied to copy X incorrectly analyzed a particular word, there was a good chance that the same word would be correctly analyzed when the OCR engine was applied to copy Y. A preliminary study of this phenomenon is available here: http://tinyurl.com/ppyfdfj. In most cases, the OCRopus/Lace OCR contains results for four different scanned copies, while the Tesseract/AncientGreekOCR output contains results for up to 10 different copies. All of the Patrologia Graeca volumes are old enough that HathiTrust members in Europe and North America can download the PDFs for further analysis. Anyone should be able to see the individual pages used for OCR via the public HathiTrust interface.
Initial page-level metadata for the various authors and works in the PG, derived from the core index at columns 13-114 of Cavallera’s 1912 index to the PG (which Roger Pearse cites at http://www.roger-pearse.com/weblog/patrologia-graeca-pg-pdfs/). A working TEI XML transcription, which has begun capturing the data within the print source, is available for inspection at: https://www.dropbox.com/s/mldhu4okpq4i7r8/pg_index2.xml. All figures are preliminary and subject to modification (that is one motivation for posting this call for help), but we do not expect that the figures will change much at this point. At present, we have identified 658 authors and 4,287 works. The PG contains extensive introductions, essays, indices, etc., and we have tried to separate these out by scanning for keywords (e.g., praefatio, monitum, notitia, index). We estimate that there are 204,129 columns of source text and 21,369 columns of secondary sources, representing roughly 90% and 10% respectively. Since a column in Migne contains about 500 words and since the Greek texts (almost) always have accompanying Latin translations, the PG contains up to 50 million words of Greek text, but many authors have extensive Latin notes and in some cases no Greek text, so there should be even more Latin. For more information, see http://tinyurl.com/ppyfdfj.
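For readers who have not worked with hOCR before: each recognized word is an HTML element (conventionally with class ocrx_word) whose title attribute records the bounding box of the word on the page image. The sketch below shows one way to pull out words and coordinates; the file name is a placeholder, and the exact markup can vary slightly between engine versions.

```python
# Sketch: extract words and their page-image coordinates from an hOCR file.
# The ocrx_word class and the "bbox x0 y0 x1 y1" title syntax are the
# conventional hOCR markup emitted by Tesseract and OCRopus.
from bs4 import BeautifulSoup

def hocr_words(path):
    with open(path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    for span in soup.find_all(class_="ocrx_word"):
        title = span.get("title", "")
        # title looks like "bbox 230 859 416 902; x_wconf 91"
        bbox_part = next((p for p in title.split(";") if p.strip().startswith("bbox")), None)
        if bbox_part is None:
            continue
        x0, y0, x1, y1 = (int(n) for n in bbox_part.split()[1:5])
        yield span.get_text(strip=True), (x0, y0, x1, y1)

# Placeholder file name for illustration.
for word, box in hocr_words("pg_vol_001_page_0005.hocr"):
    print(word, box)
```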
Next Steps
Developing high-recall searching by combining the results for each scanned page of the PG. This entails several steps. First, we need to align the OCR pages with each other: page 611 in one volume may correspond to page 605 in another, depending upon how the front matter is treated and upon pages that one scan may have missed. Second, we need to create an index of all forms in the OCR-generated text available for each page in each PG volume. Since one of the two OCR engines applied to multiple scans of the same page is likely to produce a correct transcription, a unified index of all the text for all the scans of a page will capture a very high percentage of the words on that page.
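As a rough illustration of the second step, the sketch below builds a unified index as the union of the forms produced by any engine for any scan of a page. The directory layout, the tokenization, and the assumption that pages have already been aligned across scans are all simplifications.

```python
# Sketch: high-recall index as the union of forms from every OCR run.
# Assumed (hypothetical) layout: ocr/<engine>/<volume>/<page>.txt,
# where files for the "same" page have already been aligned across scans.
import glob
import os
import re
from collections import defaultdict

TOKEN = re.compile(r"[\w\u0370-\u03FF\u1F00-\u1FFF]+")  # Latin letters plus Greek ranges

index = defaultdict(set)  # form -> set of (volume, page) locations

for path in glob.glob("ocr/*/*/*.txt"):
    parts = path.split(os.sep)
    volume, page = parts[-2], os.path.splitext(parts[-1])[0]
    with open(path, encoding="utf-8") as f:
        for form in TOKEN.findall(f.read().lower()):
            index[form].add((volume, page))

# A query now returns every page on which *any* scan or engine produced
# the form, which is what yields the high recall.
print(sorted(index.get("θουκυδίδης", set())))
```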
Running various forms of text mining and analysis over the PG. Many text mining and analysis techniques work by counting frequently repeated features. Such techniques can be relatively insensitive to error rates in the OCR (i.e., you get essentially the same results whether your texts are 96% accurate or 99.99% accurate). Many methods for topic modelling and stylistic analysis should produce immediately useful results.
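By way of illustration, a topic model can be fitted directly to the noisy OCR output. The scikit-learn sketch below shows the general shape of such an experiment; the paths and parameters are placeholders, not part of our pipeline, and in practice one would separate Greek from Latin pages first.

```python
# Sketch: topic modelling over noisy OCR text with scikit-learn.
# Rare nonsense forms produced by OCR errors are largely filtered out by
# the min_df threshold, which is one reason moderate error rates matter little.
import glob
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = []
for path in sorted(glob.glob("ocr/tesseract/pg_vol_*/*.txt")):  # placeholder paths
    with open(path, encoding="utf-8") as f:
        docs.append(f.read())

vectorizer = CountVectorizer(min_df=5, max_df=0.5)
X = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=20, random_state=0)
lda.fit(X)

terms = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [terms[j] for j in topic.argsort()[-10:][::-1]]
    print(f"topic {i}: {' '.join(top)}")
```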
Using the multiple scans to identify and correct errors and to create a single optimized transcription. In most cases, bad OCR produces nonsense forms that are not legal Greek or Latin. When one OCR run has a valid Greek or Latin word and others do not, that valid word is usually correct. Where two different scans produce valid Greek or Latin words (e.g., the common confusion of eum and cum), we can use the hOCR feature that allows us to include multiple possibilities. We can do quite a bit to encode the confidence that we have in the accuracy of each transcribed word.
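The selection rule described here can be sketched as follows: prefer readings that are valid Greek or Latin, fall back to a majority vote, and keep the remaining readings as lower-confidence alternatives. The lexicon lookup is a placeholder; in practice one would use a word list or a morphological analyzer, and the candidates are assumed to have already been aligned word by word across scans.

```python
# Sketch: choose the best transcription of one aligned word slot from
# several OCR runs. is_valid_form() stands in for a real Greek/Latin
# lexicon or morphological lookup.
from collections import Counter

def is_valid_form(form: str, lexicon: set) -> bool:
    """Placeholder validity check against a word list."""
    return form in lexicon

def best_reading(candidates: list, lexicon: set):
    """Return (chosen_form, alternatives) for one aligned word slot."""
    valid = [c for c in candidates if is_valid_form(c, lexicon)]
    pool = valid if valid else candidates
    chosen, _ = Counter(pool).most_common(1)[0]
    # Keep the other readings so they can be recorded as hOCR alternatives
    # with lower confidence.
    alternatives = sorted(set(candidates) - {chosen})
    return chosen, alternatives

lexicon = {"eum", "cum", "est"}
# Three runs read a valid form, one garbles it; the majority valid form wins.
print(best_reading(["eum", "cum", "eurn", "eum"], lexicon))
```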
Providing a public error correction interface. One error correction interface already exists and has been used to correct millions of words of OCR-generated Greek, but two issues face us. First, we need to address the fact that we cannot ourselves serve page images from HathiTrust scans. HathiTrust members could use the system that we have by downloading the scans of the relevant volumes to their own servers, but that does not provide a general solution. Second, our correction environment deals with the OCR for one particular scanned copy. Ideally, the correction environment would allow readers to draw upon the various scans from different copies and different OCR engines.
Gregory Crane
Professor of Classics and Winnick Family Chair of Technology and Entrepreneurship
Tufts University
Alexander von Humboldt Professor of Digital Humanities
Open Access Officer
University of Leipzig
Comments to gcrane2008@gmail.com
This link points to a list and short summaries of essays that I have written in 2014 and 2015 on Digital Classics and Digital Humanities. While most of these essays concentrate on Greco-Roman studies, I consider Digital Classics to include all Classical languages (and in practice I would include all historical languages, whether they are considered Classical or not), and the distinction between Greco-Roman and Classical Studies emerges as a periodic subtheme. The general theme of these essays is the challenge that professional students of Greco-Roman culture face in opening up their field so that it can serve the needs of our students and participate in a global dialogue among civilizations.
Gregory Crane
Comments to gcrane2008@gmail.com
[Draft as of July 28, 2015]
Alexander von Humboldt Professor of Digital Humanities
Leipzig University
Professor of Classics
Winnick Family Chair of Technology and Entrepreneurship
Tufts University
This is a short piece reflecting on the need to distinguish publication that actually maximizes the degree to which we make academic work public from the traditional practice of handing control over to commercial, often for-profit, companies that make money by restricting publication. Publishing under an open license has no necessary relation to any particular editorial workflow: researchers can, of course, carry out peer review or any other mechanism they choose. Certainly professors have wide latitude in deciding what does and does not count in a field, and we can reward editorial work if we so choose. (I don’t know many academic administrators or members of the public who feel that we produce too few publications or that we could not divert some of that labor to editorial work.) The point is that publishing should (in my view) mean that we have chosen the instrument that will help us advance the intellectual life of society as broadly and as deeply as possible.
I suggest, somewhat tongue-in-cheek, using the term “privashing” instead of “publishing” for this traditional habit. Of course, it is an ugly term, but that certainly suits the traditional practice in a digital age. The full (not very long) text is here.
I have integrated a number of comments and clarifications from my colleagues in Dariah-DE and TextGrid in revising “The Big Humanities, National Identity and the Digital Humanities in Germany”; a good deal more has gone into making Dariah-DE international than had been clear to me when I first wrote this draft. I have also integrated information from Domenico Fiormonte’s essay, “Towards monocultural (digital) Humanities?” This essay outlines some of the problems of relying upon the Scopus database and rightly emphasizes the challenge of maintaining linguistic (as well as cultural and intellectual) diversity. I have no good answers to this last problem except to say that we have our work cut out for us. For more on the Fiormonte essay, see here. This document remains subject to revision in light of comments and suggestions. — GRC July 20, 2015.
Gregory Crane
gcrane2008@gmail.com
[Draft as of July 20, 2015]
Alexander von Humboldt Professor of Digital Humanities
Leipzig University
Professor of Classics
Winnick Family Chair of Technology and Entrepreneurship
Tufts University
[This is hopefully the last essay, at least for a while, in a series that I have published on Digital Classics.]
Abstract: Who is the audience for the work that we professional researchers conduct on Greco-Roman culture? Frequently heard remarks, observed practices, and published survey results indicate that most of us still assume that only specialists and revenue-generating students really matter. If the public outside of academia does not have access to up-to-date data about the Greco-Roman world, whose problem is it? If we specialists do not believe that we have a primary responsibility to open up the field as is now possible in a digital age, then I am not sure why we should expect support from anyone other than specialists or the students who enroll in our classes. If we do believe that we have an obligation to open up the field, then that has fundamental implications for our daily activities, for our operational theory justifying the existence of our positions, and for the hermeneutics (following a term that is still popular in Germany) that we construct about who can know what.
The full text is here.
Gregory Crane
gcrane2008@gmail.com
July 17, 2015
I would very much recommend reading Domenico Fiormonte’s piece “Towards a monocultural (digital) Humanities” (with reference as well to the plea by Miran Hladnik for linguistic diversity). First, Fiormonte’s piece gives some very specific correctives to the Scopus data that I used when I wrote “The Big Humanities, National Identity and the Digital Humanities in Germany” a month ago. Second, this piece articulates, in its title and its content, the fraught question of how we support linguistic and cultural diversity in a transnational space. The question before us is, I think, how we maintain our various linguistic and cultural identities while communicating with one another in a dynamic and enriching dialogue across boundaries of language and culture.
For more on this, look here.