From 54d6a0614719bb8a4543ebb4a5016ec06dc65510 Mon Sep 17 00:00:00 2001
From: Cleo Young
Date: Mon, 25 Nov 2024 12:15:56 -0600
Subject: [PATCH] first documentation update for the archubs recipe

---
 docs/recipes/R-01_arcgis-hubs.md | 93 +++++++++++++++++++++++---------
 1 file changed, 69 insertions(+), 24 deletions(-)

diff --git a/docs/recipes/R-01_arcgis-hubs.md b/docs/recipes/R-01_arcgis-hubs.md
index 7d53d0d8..21f8d1cc 100644
--- a/docs/recipes/R-01_arcgis-hubs.md
+++ b/docs/recipes/R-01_arcgis-hubs.md
@@ -43,24 +43,21 @@ We maintain a list of active ArcGIS Hub sites in GBL Admin.
 
 1. Go to the Admin (https://geo.btaa.org/admin) dashboard
 2. Filter for items with these parameters:
-   - Resource Class: Websites
-   - Accrual Method: DCAT US 1.1
+    - Resource Class: Websites
+    - Accrual Method: DCAT US 1.1
 3. Select all the results and click Export -> CSV
 4. Download the CSV and rename it `arcHubs.csv`
 
-!!! info
-
-    Exporting from GBL Admin will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
-
-    * **ID**: Unique code assigned to each portal. This is transferred to the "Is Part Of" field for each dataset.
-    * **Title**: The name of the Hub. This is transferred to the "Provider" field for each dataset
-    * **Publisher**: The place or administration associated with the portal. This is applied to the title in each dataset in brackets
-    * **Spatial Coverage**: A list of place names. These are transferred to the Spatial Coverage for each dataset
-    * **Member Of**: a larger collection level record. Most of the Hubs are either part of our [Government Open Geospatial Data Collection](https://geo.btaa.org/catalog/ba5cc745-21c5-4ae9-954b-72dd8db6815a) or the [Research Institutes Geospatial Data Collection](https://geo.btaa.org/catalog/b0153110-e455-4ced-9114-9b13250a7093)
-
-    However, it is not necessary to take extra time and manually remove the extra fields, because the Jupyter Notebook code will ignore them.
+!!! info
+    Exporting from GBL Admin will produce a CSV containing all of the metadata associated with each Hub. For this recipe, the only fields used are:
+
+    - **ID**: Unique code assigned to each portal. This is transferred to the "Is Part Of" field for each dataset.
+    - **Title**: The name of the Hub. This is transferred to the "Provider" field for each dataset.
+    - **Publisher**: The place or administration associated with the portal. This is added in brackets to the title of each dataset.
+    - **Spatial Coverage**: A list of place names. These are transferred to the Spatial Coverage field for each dataset.
+    - **Member Of**: A larger collection-level record. Most of the Hubs are part of either our [Government Open Geospatial Data Collection](https://geo.btaa.org/catalog/ba5cc745-21c5-4ae9-954b-72dd8db6815a) or the [Research Institutes Geospatial Data Collection](https://geo.btaa.org/catalog/b0153110-e455-4ced-9114-9b13250a7093).
+
+    There is no need to spend extra time manually removing the other fields; the Jupyter Notebook code will ignore them (a sketch of this trimming follows below).
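+
+A minimal pandas sketch of that trimming, for reference only; the column headers are assumptions, so match them to your GBL Admin export:
+
+```python
+import pandas as pd
+
+# Load the full GBL Admin export; extra columns are simply ignored downstream.
+hubs = pd.read_csv("arcHubs.csv")
+
+# The only fields this recipe uses (header names assumed to match the export).
+used_fields = ["ID", "Title", "Publisher", "Spatial Coverage", "Member Of"]
+print(hubs[used_fields].head())
+```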
 
 -------------------
 
@@ -97,14 +94,62 @@
 ## Step 3: Validate and Clean
 
 Although the harvest notebook will produce valid metadata for most of the items, there may still be some errors. [Run the cleaning script](clean.md) to ensure that the records are valid before we try to ingest them into GBL Admin.
-
-
-## Step 4: Upload all records
-
-1. Review the previous upload. Check the Date Accessioned field of the last harvest and copy it.
-2. Upload the new CSV file. This will overwrite the Date Accessioned value for any items that were already present.
-3. Use the old Date Accessioned value to search for the previous harvest date. This example uses 2023-03-07: (https://geo.btaa.org/admin/documents?f%5Bb1g_dct_accrualMethod_s%5D%5B%5D=ArcGIS+Hub&q=%222023-03-07%22&rows=20&sort=score+desc)
-4. Unpublish the ones that have the old date in the Date Accessioned field
-5. Record this number in the GitHub issue for the scan under Number Deleted
-6. Look for records in the uploaded batch that are still "Draft" - these are new records.
-7. Publish them and record this number in the GitHub issue under Number Added
+
+1. Move the scanned records CSV into the R-00_clean recipe folder.
+2. In Jupyter Notebook, navigate to the R-00_clean folder.
+3. In the first cell of the notebook, replace the CSV filename with the current file's name.
+4. Run all cells. This takes less than a minute.
+
+## Step 4: Upload the CSV
+
+>See also: [General Import Instructions](https://gin.btaa.org/geoblacklight_admin_docs/import/)
+
+1. In GBL Admin, select **Imports** in the **Admin Tools** menu at the top of the page, then **New Import**.
+2. Enter the Name "[GitHub Issue Number] ArcGIS Hubs scan YYYY-MM-DD" and the Type "BTAA CSV."
+3. Click **Choose File** and upload the cleaned scanned records from today's date (likely still located in the **R-00_clean recipe** folder).
+4. Click **Create Import**, then wait; the page may not change immediately.
+5. Briefly review the Field Mappings to make sure none of the fields are blank. No changes should be needed. Click **Create Mapping** at the bottom of the page.
+> Optional verification: check that the **CSV Row Count** matches the actual row count of your CSV. (In Excel, select any column header and read the count at the bottom of the window; it will be 1 greater than the number of records because it includes the header row.) A scripted alternative is sketched after this list.
+6. Click **Run Import**, then wait; this can take 30 minutes or more. Refresh the page to see how many records have been imported so far. The import is complete when the Total count under **Imported Documents** matches the **CSV Row Count**; there is no notification.
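+
+For the optional row-count verification, a scripted alternative to Excel; this is just a sketch, and the filename is hypothetical:
+
+```python
+import pandas as pd
+
+# pandas does not count the header row, so this number should equal
+# the CSV Row Count that GBL Admin reports for the import.
+rows = len(pd.read_csv("arcHubs_cleaned.csv"))
+print(f"{rows} data rows")
+```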
+
+## Step 5: Publish and republish records
+
+#### Convert new records from Draft to Published:
+
+1. Click the GBL Admin logo to return to the [Documents view](https://geo.btaa.org/admin/documents).
+2. Click **Imports** at the top of the left column menu and click on today's ArcGIS Hub import.
+3. Under **Publication State**, click **Draft**.
+4. Record the number of Draft records in a comment on the GitHub issue. This is the "New" stat used in the last step.
+5. Check the box to select all, click **Select all results that match this search**, select **Bulk Actions** > **Published**, and click **Run Bulk Action**. Reload the page to see the results.
+
+#### Republish previously unpublished records that have returned:
+
+1. Return to the [Documents view](https://geo.btaa.org/admin/documents), click **Imports**, and select today's import.
+2. Under **Publication State**, click **Unpublished**.
+    - These are items that were found in past scans, then not found and therefore removed in subsequent scans, but have now been found once again in today's scan.
+3. In a comment on the GitHub issue, record the number of unpublished records. This is the "Republished" stat used in the last step.
+4. Check the box to select all, click **Select all results that match this search**, select **Bulk Actions** > **Published**, and click **Run Bulk Action**. Again, reload the page to see the results.
+5. Return to the [Documents view](https://geo.btaa.org/admin/documents), click **Imports**, and select today's import.
+6. Under **Publication State**, confirm that "Published" is now the only category and that the number of records matches the row count of your CSV upload.
+7. Click the red X next to the title of today's import so that all records are shown.
+
+## Step 6: Unpublish retired records
+
+>The purpose of this step is to unpublish from the Geoportal any records that have been removed from their source Hub by their owners. When an existing record is found and imported in a new scan, its Date Accessioned is updated. We want to keep records found in the two most recent scans (including today's) but unpublish everything older than that. This should remove intentionally retired records while leaving any that were just temporarily unavailable.
+
+1. If you're not there already, return to the [Documents view](https://geo.btaa.org/admin/documents) so you're viewing *all* records. Scroll down to **Accrual Method** and click **ArcGIS Hub**.
+2. Click **Published** under **Publication State** to see all published Hub records.
+3. Under **Date Accessioned**, identify any results with a date earlier than the two most recent dates, including today's. (There might be only one such date, or none.) Note that the facet is sorted by number of records, not by date! A scripted double-check is sketched at the end of this step.
+4. In a comment on the GitHub issue, record the total number of records on those earlier dates as "Retired."
+5. For each of these earlier dates: select the date, check the box to select all records, click **Select all results that match this search**, select **Bulk Actions** > **Unpublished**, and click **Run Bulk Action**. Again, reload the page to see the results.
+
+>If you see weird behavior during this step, scroll down to "Notes and troubleshooting."
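+
+The sketch mentioned in step 3, for double-checking which Date Accessioned values are older than the two most recent scans. It assumes a CSV export of the published Hub records; the filename and the "Date Accessioned" column header are assumptions to match against your export:
+
+```python
+import pandas as pd
+
+records = pd.read_csv("published_hub_records.csv")  # hypothetical export
+scan_dates = sorted(records["Date Accessioned"].unique())
+
+# Keep the two most recent scan dates (including today's);
+# everything older is a candidate for unpublishing.
+older = scan_dates[:-2]
+retired = records[records["Date Accessioned"].isin(older)]
+print(f"Retire {len(retired)} records from scans on: {older}")
+```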
+
+## Step 7: Record what has changed
+
+1. Optional but helpful: save a GitHub comment with the numbers you recorded for new, republished, and retired records.
+2. Mandatory: in the right sidebar, expand **Collections** under **Projects**. Put the sum of new + republished in the **Records Added** field, and fill in the **Records Retired** field.
+
+>This information is exported into a monthly report that showcases how our collection fluctuates. As a general rule of thumb, the total change in records shouldn't be much more than 100; if it is a lot more, try to work out why or ask for help.
+
+## Notes and troubleshooting
+
+- Sometimes records appear in the list when you click **Published** under **Publication State** and then click one of the older Date Accessioned saved queries, yet show a red "Unpublished" label in the results list. (The mismatch can also happen in reverse: a record is unpublished but won't publish, or other records fail to update and sync.)
+- Each record lives in two places: GBL Admin (the database back end) and the Geoportal (the Solr front end). If a record is corrupted, it "freezes" in the front end: no changes you make to it will apply, and it may display incorrect values.
+- This is almost always caused by the bounding box, but it can also be caused by the date range.
+
+To resolve:
+
+1. Click on the record title.
+2. Click **Admin view**.
+3. Scroll down to the Spatial section and either clear or fix the Bounding Box and Geometry fields.
+    - If you don't know the correct values, it's fine to leave both fields blank.
+    - If you're going to add corrected values, it's fine to fill in only the Bounding Box field; when you click save, the geometry will be generated automatically from the bounding box.
+    - If you know the specific geometry that should be applied, you can add it, but it's not required.
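+
+To pre-screen a batch for the malformed bounding boxes that cause this freezing, a rough sanity check can help. The comma-separated W,S,E,N order is an assumption; confirm how the Bounding Box field is stored in your CSV before relying on it:
+
+```python
+def bbox_ok(bbox: str) -> bool:
+    """Rough check that a "W,S,E,N" bounding box string is well formed."""
+    try:
+        w, s, e, n = (float(part) for part in bbox.split(","))
+    except (ValueError, AttributeError):
+        return False  # empty or malformed value
+    return -180 <= w < e <= 180 and -90 <= s < n <= 90
+
+print(bbox_ok("-97.5,43.4,-89.0,49.4"))  # True: a plausible box
+print(bbox_ok("49.4,-97.5,43.4,-89.0"))  # False: values out of order
+```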