
Grouped data now available

With four Convert-a-Card projects complete, we now have some large datasets that between them contain the details of thousands of catalogue records that are nearly ready to be ingested. Of course, some cleaning and checking of this data is required before these new records are created.

We began by using tools such as OpenRefine to organise the data, separating tasks by the number of matching shelfmarks and OCLC numbers. While it is certainly a handy piece of software, it quickly became clear that this wasn't going to be sustainable in the long run. So, another PyBossa plugin has been written to split the data programmatically, also making it publicly available on the LibCrowds data page. Not only does this help us organise the data but it also gives our volunteers a better indication of what is going on behind the scenes! After all, it's our volunteers that are putting in all the hard work, so we're keen to reveal as much as possible about the whole process. Indeed, if there is anything else you would like to know just head over to the LibCrowds Community for a chat!

The Convert-a-Card process
The grouped data that is now available for download will assist with points six and seven above.

If you download a grouped dataset, you will currently find four documents. There are also a number of enhancements in scope, such as including a PDF with a full set of statistics regarding each project. The grouped datasets are updated daily.

{project_name}_matched.csv

Contains the data that can be ingested directly, after the relevant spot checks are completed (see below). The criteria for a matched record are as follows:

  1. Two or more volunteers selected the same WorldCat record.
  2. Two or more volunteers entered the same shelfmark.
  3. No additional comments were given.
  4. The OCLC number is not duplicated in another task.
  5. The shelfmark is not duplicated in another task.
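
The five criteria above could be checked along the following lines. This is only a sketch: the task structure and the field names (`answers`, `oclc`, `shelfmark`, `comments`) are assumptions made for illustration and are not necessarily how the PyBossa plugin stores or processes its data. It also assumes that a duplicate means the same agreed value appearing in more than one task.

```python
from collections import Counter


def agreed_value(answers, key):
    """Return the value for `key` given by two or more volunteers, else None."""
    counts = Counter(a[key] for a in answers if a.get(key))
    if not counts:
        return None
    value, n = counts.most_common(1)[0]
    return value if n >= 2 else None


def find_matched(tasks):
    """Return the tasks that meet all five matched criteria.

    `tasks` is a hypothetical list of dicts, each with an "answers" list
    holding one dict per volunteer contribution.
    """
    candidates = []
    for task in tasks:
        answers = task["answers"]
        oclc = agreed_value(answers, "oclc")              # criterion 1
        shelfmark = agreed_value(answers, "shelfmark")    # criterion 2
        has_comments = any(a.get("comments") for a in answers)  # criterion 3
        candidates.append((task, oclc, shelfmark, has_comments))

    # Count agreed values across all tasks to spot duplicates (criteria 4 and 5).
    oclc_counts = Counter(o for _, o, _, _ in candidates if o)
    shelf_counts = Counter(s for _, _, s, _ in candidates if s)

    return [
        task
        for task, oclc, shelfmark, has_comments in candidates
        if oclc
        and shelfmark
        and not has_comments
        and oclc_counts[oclc] == 1
        and shelf_counts[shelfmark] == 1
    ]
```

A task where volunteers agree on both values but someone left a comment, or whose OCLC number also turns up in another task, would be left out of the returned list.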

{project_name}_spot_checks.pdf

Contains a random sample of 20 cards taken from the matched dataset. Each page contains an image of the card, the associated shelfmark, a link to the WorldCat record, and some additional data.
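
If you would like to draw a similar sample yourself from a downloaded matched file, something like the following would do it. This is a sketch rather than the code behind the spot checks PDF: the function name and the `seed` parameter are our own additions for illustration, and no particular column names are assumed.

```python
import csv
import random


def spot_check_sample(matched_csv_path, sample_size=20, seed=None):
    """Return up to `sample_size` randomly chosen rows from a matched CSV file."""
    with open(matched_csv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    rng = random.Random(seed)  # pass a seed to make the sample repeatable
    # Sample without replacement; take every row if fewer than sample_size exist.
    return rng.sample(rows, min(sample_size, len(rows)))
```

Sampling without replacement means no card appears twice in the same set of spot checks.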

{project_name}_not_matched.pdf

Contains those records where three people failed to locate a matching WorldCat record. This data is formatted in such a way that it can be directly ingested back into the LibCrowds system and will provide the tasks for an alternative type of project.

{project_name}_ambiguous.pdf

Contains those records that fall into neither of the previous categories. So, this will include records where people have selected different WorldCat records, the shelfmarks don't match, there are additional comments to be considered, or there are duplicates involved. This data will require further checking, which will be performed by the relevant British Library curators. The data is formatted in such a way that it can be ingested back into the LibCrowds system and will provide the tasks for a special staff project. This project will allow the curators to compare each card image against all possible matches and accept any that are appropriate. The output from this will be a further set of successful and unsuccessful matches.
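
Put together, every task ends up in exactly one of the three groups described above. A minimal sketch of that decision, assuming we already know whether a task met the matched criteria and how many volunteers reported finding no WorldCat record (both parameter names are hypothetical):

```python
def group_for(is_matched, no_match_count, required_no_matches=3):
    """Decide which grouped dataset a task belongs to."""
    if is_matched:
        return "matched"
    if no_match_count >= required_no_matches:
        # Three people failed to locate a matching WorldCat record.
        return "not_matched"
    # Everything else needs checking by a curator.
    return "ambiguous"
```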

We are really very close to ingesting the first new records into our database, while ensuring we have a solid process in place to reduce the time between project completion and visible results, so watch this space!

Head over to LibCrowds to help create more electronic records and improve access to the collections of the British Library.
Copyright © 2015 LibCrowds, All rights reserved.