OK: fix vote scraping logic #4773

Merged
merged 9 commits into from
Jan 16, 2024

braykuka
Contributor

@braykuka braykuka commented Jan 4, 2024

Description

I've updated the vote scraping logic.

@braykuka braykuka force-pushed the ok-fix-vote-scraping-logic branch from 6ffb1ff to 8c5edd6 Compare January 4, 2024 05:58
@braykuka braykuka force-pushed the ok-fix-vote-scraping-logic branch from 8c5edd6 to a66d1c3 Compare January 4, 2024 06:02
@braykuka
Contributor Author

braykuka commented Jan 4, 2024

@jessemortenson I am not sure about the motion text. Could you please explain it?

http://webserver1.lsb.state.ok.us/cf/2023-24%20SUPPORT%20DOCUMENTS/votes/House/SB4_VOTES.HTM
http://webserver1.lsb.state.ok.us/cf/2023-24%20SUPPORT%20DOCUMENTS/votes/Senate/SB4_VOTES.HTM

From my review of the code, the House vote's motion text should be "DO PASS" and the Senate's motion text should be empty.
When the motion text is empty, should we skip such votes?

I guess the general vote's motion text was "house passage" or "senate passage".

@jessemortenson
Contributor

jessemortenson commented Jan 4, 2024

Sure, let me try to provide some context and please feel free to ask more questions if it is not clear. motion_text is pretty loose, it is just meant to be the best possible (somewhat short) description of the vote. What was the nature of the motion/proposal that was being voted on? Here are some of the most common ones currently in our data, to give examples:

+-------------------------------------------------------------------------+-----+
|motion_text                                                              |count|
+-------------------------------------------------------------------------+-----+
|Third Reading                                                            |39837|
|Do pass.                                                                 |18860|
|Do Pass                                                                  |14251|
|Do pass as amended.                                                      |11421|
|Passage                                                                  |10473|
|Second Reading                                                           |9234 |
|Do pass and be re-referred to the Committee on Appropriations.           |8256 |
|On Third Reading                                                         |8212 |
|Passed                                                                   |7676 |
|Floor Vote                                                               |7604 |
|Do pass, but re-refer to the Committee on Appropriations.                |7194 |
|passed 3rd reading                                                       |6839 |
|Reported favorably out of committee                                      |6736 |
|PASSAGE                                                                  |6246 |
|Do pass as amended, and re-refer to the Committee on Appropriations.     |6171 |
|Assembly Vote                                                            |6076 |
|Do pass as amended and be re-referred to the Committee on Appropriations.|6041 |
|BILL                                                                     |5868 |
|Placed on suspense file                                                  |5838 |
|Do Concur                                                                |5268 |
+-------------------------------------------------------------------------+-----+

http://webserver1.lsb.state.ok.us/cf/2023-24%20SUPPORT%20DOCUMENTS/votes/House/SB4_VOTES.HTM

On this page, I think the motion_text value should be "DO PASS". It looks like a committee is voting on whether to pass a bill out of committee.

http://webserver1.lsb.state.ok.us/cf/2023-24%20SUPPORT%20DOCUMENTS/votes/Senate/SB4_VOTES.HTM

On this page, I actually see two vote events:

  • "THE OKLAHOMA STATE SENATE" took a vote on "THIRD READING" of "SENATE BILL 4" (SB 4). motion_text here should be THIRD READING
  • "COMMITTEE ON TOURISM & WILDLIFE" took a vote on a "DO PASS" motion on "SENATE BILL 4" (SB 4). motion_text here should be DO PASS. I think the reason that "Motion:" appears to be blank is because this is a very common motion... I see that there is a separate line that indicates "Do Pass Motion: Pemberton" ... so I would just guess that if the motion is "DO PASS" then that line is filled out, and the "Motion:" line is blank. Just a guess on that, though.
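
The guess above (a blank "Motion:" line plus a filled-in "Do Pass Motion:" line implies a DO PASS motion) could be sketched roughly as follows. This is only an illustration, not the scraper's actual code: `derive_motion_text` is a hypothetical helper, and the exact line labels are assumptions based on the two pages linked above.

```python
import re
from typing import List, Optional


def derive_motion_text(lines: List[str]) -> Optional[str]:
    """Return a short motion description for one vote section, or None.

    Hypothetical helper: assumes each section is a list of text lines and
    that the labels "Motion:" and "Do Pass Motion:" appear as line prefixes.
    """
    explicit = None
    has_do_pass_line = False
    for raw in lines:
        # The pages use non-breaking spaces, so normalize before matching.
        line = raw.replace("\xa0", " ").strip()
        if re.match(r"^do pass motion:", line, re.IGNORECASE):
            has_do_pass_line = True
            continue
        match = re.match(r"^motion:\s*(.*)$", line, re.IGNORECASE)
        if match and match.group(1).strip():
            explicit = match.group(1).strip()
    if explicit:
        return explicit
    # A blank "Motion:" with a "Do Pass Motion:" line is treated as DO PASS.
    if has_do_pass_line:
        return "DO PASS"
    return None
```

Returning None when nothing matches leaves it to the caller to decide whether to skip the vote or fall back to a generic description.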

@braykuka
Contributor Author

braykuka commented Jan 6, 2024

@jessemortenson please review it.

@jessemortenson
Contributor

Thank you. I think there is some more work to do to fix motion_text; I found two examples where it seems incorrect to me. I only spot-checked a few bills, so I suspect there are more examples. Happy to look at more if you need more guidance.

SB169
http://www.oklegislature.gov/BillInfo.aspx?Bill=SB169&session=2300

SJR 22

http://www.oklegislature.gov/BillInfo.aspx?Bill=SJR22&session=2300

@braykuka
Contributor Author

@jessemortenson I've fixed the motion text issue.

@jessemortenson
Contributor

I can see how this is getting challenging :) When I ran the full scrape (/home/jesse/.cache/pypoetry/virtualenvs/openstates-scrapers-YBlM38KZ-py3.9/bin/python -m openstates.cli.update --scrape --fastmode ok bills ), I was running into a smattering of errors, for example:

00:21:56 INFO openstates: save bill SB2 in 2024 as bill_e272f49f-b112-11ee-89f0-29a5f35db3d4.json
00:21:56 INFO scrapelib: GET - 'http://www.oklegislature.gov/BillInfo.aspx?Bill=SB3&session=2400'
00:21:56 INFO scrapelib: GET - 'http://webserver1.lsb.state.ok.us/cf/2023-24%20SUPPORT%20DOCUMENTS/votes/Senate/SB3_VOTES.HTM'
Traceback (most recent call last):
  File "/home/jesse/.pyenv/versions/3.9.13/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jesse/.pyenv/versions/3.9.13/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jesse/.cache/pypoetry/virtualenvs/openstates-scrapers-YBlM38KZ-py3.9/lib/python3.9/site-packages/openstates/cli/update.py", line 551, in <module>
    sys.exit(main())
  File "/home/jesse/.cache/pypoetry/virtualenvs/openstates-scrapers-YBlM38KZ-py3.9/lib/python3.9/site-packages/openstates/cli/update.py", line 541, in main
    report = do_update(args, other, juris)
  File "/home/jesse/.cache/pypoetry/virtualenvs/openstates-scrapers-YBlM38KZ-py3.9/lib/python3.9/site-packages/openstates/cli/update.py", line 337, in do_update
    report["scrape"] = do_scrape(juris, args, scrapers, active_sessions)
  File "/home/jesse/.cache/pypoetry/virtualenvs/openstates-scrapers-YBlM38KZ-py3.9/lib/python3.9/site-packages/openstates/cli/update.py", line 124, in do_scrape
    partial_report = scraper.do_scrape(**scrape_args, session=session)
  File "/home/jesse/.cache/pypoetry/virtualenvs/openstates-scrapers-YBlM38KZ-py3.9/lib/python3.9/site-packages/openstates/scrape/base.py", line 233, in do_scrape
    for obj in self.scrape(**kwargs) or []:
  File "/home/jesse/repo/openstates/braykuka/openstates-scrapers/scrapers/ok/bills.py", line 49, in scrape
    yield from self.scrape_chamber(chamber, session, only_bills)
  File "/home/jesse/repo/openstates/braykuka/openstates-scrapers/scrapers/ok/bills.py", line 95, in scrape_chamber
    yield from self.scrape_bill(chamber, session, bill_id, link.attrib["href"])
  File "/home/jesse/repo/openstates/braykuka/openstates-scrapers/scrapers/ok/bills.py", line 209, in scrape_bill
    yield from self.scrape_votes(bill, self.urlescape(link.attrib["href"]))
  File "/home/jesse/repo/openstates/braykuka/openstates-scrapers/scrapers/ok/bills.py", line 349, in scrape_votes
    committee_motion += ": " + do_pass_motion
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'

Process finished with exit code 1

I seem to be able to get past those errors by editing in some quick checks for things being None, xpath not returning anything, etc. But I am not sure if these edits are actually productive, or if I'm hurting your intended logic. So I'd love for you to look at my diff below and see if you want to put some of this in place, or if it points to issues you can solve more effectively:

diff --git a/scrapers/ok/bills.py b/scrapers/ok/bills.py
index 7af09a5b5..1e49d98fb 100644
--- a/scrapers/ok/bills.py
+++ b/scrapers/ok/bills.py
@@ -284,15 +284,19 @@ class OKBillScraper(Scraper):
             else:
                 chamber = "upper"
 
-            rcs_p = header.xpath(
+            rcs_xpath = header.xpath(
                 "following-sibling::p[contains(., '***')][1]/preceding-sibling::p[contains(., 'RCS#')][1]"
-            )[0]
-            rcs_line = rcs_p.xpath("string()").replace("\xa0", " ")
-            rcs = re.search(r"RCS#\s+(\d+)", rcs_line).group(1)
-            if rcs in seen_rcs:
-                continue
+            )
+            if rcs_xpath:
+                rcs_p = rcs_xpath[0]
+                rcs_line = rcs_p.xpath("string()").replace("\xa0", " ")
+                rcs = re.search(r"RCS#\s+(\d+)", rcs_line).group(1)
+                if rcs in seen_rcs:
+                    continue
+                else:
+                    seen_rcs.add(rcs)
             else:
-                seen_rcs.add(rcs)
+                continue
 
             committee_motion = None
             committees = [
@@ -324,7 +328,7 @@ class OKBillScraper(Scraper):
                 "Subcommittee",
             ]
 
-            motion_text = motions.get(rcs)
+            motion_text = motions.get(rcs, '')
             committee_motion = None
 
             if "Do Pass" in motion_text or "Committee" in motion_text:
@@ -345,6 +349,9 @@ class OKBillScraper(Scraper):
                             continue
 
                     if "motion by Senator" in line_text:
+                        if committee_motion is None:
+                            # we are assuming committee motion below is a string, lets be sure
+                            committee_motion = ''
                         do_pass_motion = line_text.strip().title()
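
As an alternative to scattering None checks, the two failure points in the traceback and diff above (a missing RCS# match and `+=` on a None `committee_motion`) could be isolated in small helpers. This is a minimal sketch, not the code that was merged: `extract_rcs` and `join_motion_parts` are hypothetical names, and only the `RCS#\s+(\d+)` pattern is taken from the diff.

```python
import re
from typing import Optional


def extract_rcs(line: str) -> Optional[str]:
    """Return the RCS number from a line, or None if the line has none."""
    match = re.search(r"RCS#\s+(\d+)", line.replace("\xa0", " "))
    return match.group(1) if match else None


def join_motion_parts(committee_motion: Optional[str], detail: Optional[str]) -> str:
    """Join the non-empty parts with ': ' without assuming either is a string."""
    parts = [p.strip() for p in (committee_motion, detail) if p and p.strip()]
    return ": ".join(parts)
```

With helpers like these, the caller can `continue` when `extract_rcs` returns None and never concatenates onto None.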

@braykuka
Contributor Author

@jessemortenson Thanks for your update. I've fixed the issue. Please review it.

@jessemortenson
Contributor

Thanks Bray - first spot check I did showed an error that I think we saw before: the second vote (Committee vote) on a given page incorrectly reflects the counts and votes of the full body vote. Details:

HB2943
http://www.oklegislature.gov/BillInfo.aspx?Bill=HB2943&session=2400

vote_event_7bfd61b1-b166-11ee-89f0-29a5f35db3d4.json

vote_event_7bfd61af-b166-11ee-89f0-29a5f35db3d4.json

vote_event_7bfd61ae-b166-11ee-89f0-29a5f35db3d4.json

  • this looks correct

vote_event_7bfd61b0-b166-11ee-89f0-29a5f35db3d4.json

  • this looks correct

vote_event_7bfd61ae-b166-11ee-89f0-29a5f35db3d4.json
vote_event_7bfd61af-b166-11ee-89f0-29a5f35db3d4.json
vote_event_7bfd61b0-b166-11ee-89f0-29a5f35db3d4.json
vote_event_7bfd61b1-b166-11ee-89f0-29a5f35db3d4.json

So I think there is at least that bug to fix. Also attaching my full scrape output directory for convenience
ok-2023-full-run-2024-01-11.zip
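
One way to avoid a committee vote picking up the full-body counts is to split the page into per-vote sections first and only count within each section. The sketch below is an illustration under assumptions, not the scraper's logic: it assumes each vote starts at an "RCS#" line and that tallies look like "YEAS: 45", and both function names are hypothetical.

```python
import re
from typing import Dict, List


def split_vote_sections(lines: List[str]) -> List[List[str]]:
    """Start a new section at every line containing 'RCS#'."""
    sections: List[List[str]] = []
    for line in lines:
        if "RCS#" in line or not sections:
            sections.append([])
        sections[-1].append(line)
    return sections


def count_votes(section: List[str]) -> Dict[str, int]:
    """Read YEAS/NAYS tallies from a single section only."""
    counts: Dict[str, int] = {}
    for line in section:
        match = re.search(r"(YEAS|NAYS):\s*(\d+)", line)
        if match:
            counts[match.group(1).lower()] = int(match.group(2))
    return counts
```

Because `count_votes` only ever sees one section, the second vote on a page cannot inherit the first vote's totals.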

@braykuka
Contributor Author

@jessemortenson I've fixed it. Please review it again.

@jessemortenson
Contributor

Thanks, I'm running the scrape now and will check

@jessemortenson
Contributor

Looks good, thanks for the fix! Merging in

@jessemortenson jessemortenson merged commit e61049b into openstates:main Jan 16, 2024
1 check passed