cms-2016-simulated-datasets: add folder, file list, categorisation #163

Merged on Oct 26, 2023 (3 commits)

Conversation

jmhogan
Contributor

@jmhogan jmhogan commented Apr 5, 2023

Adding the cms-2016-simulated-datasets folder. The inputs subfolder contains the dataset listing as of late March 2023, and the categorisation script has been extensively updated to handle everything.

Member

@katilp katilp left a comment

@tiborsimko @jmhogan Maybe just merge the folder structure, dataset lists in inputs (although it will be updated closer to the release), interface.py and the categorisation.py
@nancyhamdan and @joudmas will work on updating the other python scripts and then upload.

@tiborsimko
Member

Maybe just merge the folder structure, dataset lists in inputs (although it will be updated closer to the release), interface.py and the categorisation.py

Yes, I can do that. A quick diff shows that the printer.py file has also changed slightly:

$ git checkout pr-163
$ cd cms-2016-simulated-datasets
$ colordiff -ru code/ ../cms-YYYY-simulated-datasets/code
Only in code/: categorisation.py~
diff -ru code/printer.py ../cms-YYYY-simulated-datasets/code/printer.py
--- code/printer.py     2023-10-26 15:59:00.896461652 +0200
+++ ../cms-YYYY-simulated-datasets/code/printer.py      2022-11-02 17:39:55.955236462 +0100
@@ -49,7 +49,9 @@
           ' the rules in the categorisation script should be adjusted and'
           ' the script rerun.')
     print('')
-    print('See [#157](https://github.com/cernopendata/data-curation/issues/157) for more context.')
+    print('See [#1229](https://github.com/cernopendata/opendata.cern.ch/issues/1229)'
+          ' and [this page](https://demo.codimd.org/s/BkoBknkqQ#)'
+          ' for more context.')
     print('')
     print('Generated on', datetime.datetime.now().strftime("%d-%m-%Y %H:%M:%S"))
     print('')
Only in ../cms-YYYY-simulated-datasets/code: __pycache__
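
For anyone who prefers to run this kind of comparison from Python rather than with colordiff, here is a minimal sketch using the standard-library filecmp and difflib modules. The function name report_tree_diff is hypothetical and not part of this repository. The sketch assumes all files are plain text, and filecmp.dircmp skips __pycache__ by default, so its output is not identical to the listing above.

```python
import difflib
import filecmp
from pathlib import Path


def report_tree_diff(left, right):
    """Return a recursive, diff-like comparison of two directories as text lines.

    Hypothetical helper for illustration; assumes every file is plain text.
    """
    lines = []
    # dircmp ignores names such as __pycache__ and .git by default and
    # compares files shallowly (type, size, mtime) before looking at content.
    cmp = filecmp.dircmp(left, right)
    for name in sorted(cmp.left_only):
        lines.append(f"Only in {left}: {name}")
    for name in sorted(cmp.right_only):
        lines.append(f"Only in {right}: {name}")
    for name in sorted(cmp.diff_files):
        old = Path(left, name).read_text().splitlines()
        new = Path(right, name).read_text().splitlines()
        lines.extend(
            difflib.unified_diff(
                old,
                new,
                fromfile=str(Path(left, name)),
                tofile=str(Path(right, name)),
                lineterm="",
            )
        )
    # Recurse into subdirectories present on both sides.
    for sub in sorted(cmp.common_dirs):
        lines.extend(report_tree_diff(str(Path(left, sub)), str(Path(right, sub))))
    return lines
```

Calling print("\n".join(report_tree_diff("code", "../cms-YYYY-simulated-datasets/code"))) from inside cms-2016-simulated-datasets would then list the files that exist on one side only, followed by a unified diff for each file that differs.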

@tiborsimko
Member

@jmhogan BTW your starting point for the categorisation changes was the cms-YYYY-simulated-datasets categorisation work, but there were some changes in cms-2015-simulated-datasets categorisation done by Osama for the 2015 MC release, so that file may be a bit more up to date. Please see the differences:

$ git checkout master
$ colordiff -uw cms-YYYY-simulated-datasets/code/categorisation.py cms-2015-simulated-datasets/code/categorisation.py

I'm appending them for convenience below:

--- cms-YYYY-simulated-datasets/code/categorisation.py	2023-10-26 16:02:53.268938768 +0200
+++ cms-2015-simulated-datasets/code/categorisation.py	2023-08-18 15:13:13.661327989 +0200
@@ -74,7 +74,6 @@
         re.search(r'/branon', title_lower) or  # extra-dimensions, brane models
         re.search(r'/stringball', title_lower) or
         re.search(r'/qbh', title_lower) or  # Quantum Black Hole
-        re.search(r'/unpart', title_lower) or
         re.search(r'blackhole', title_lower)):  # Quantum Black Hole also??
         return 'Exotica/Extra Dimensions'
 
@@ -84,7 +83,6 @@
           re.search(r'/dmz', title_lower) or  # darkmatter Z?
           re.search(r'/dms', title_lower) or  # darkmatter scalar
           re.search(r'/dmv', title_lower) or  # darkmatter vector
-          re.search(r'/Monotop', title) or
           re.search(r'DMJets', title)):       # darkmatter Jets?
         return 'Exotica/Dark Matter'
 
@@ -136,9 +134,7 @@
           re.search(r'/wrto',       title_lower) or
           re.search(r'/monolepton', title_lower) or
           re.search(r'/spin0plus',  title_lower) or
-          re.search(r'/spin2ph',    title_lower) or
-          re.search(r'/extendedweakisospin',    title_lower) or
-          re.search(r'/hscp',    title_lower)):
+          re.search(r'/spin2ph',    title_lower)):
         return 'Exotica/Miscellaneous'
 
     elif ('susy' in title_lower or
@@ -179,8 +175,6 @@
           re.search(r'primejettoth', title_lower) or  # TprimeJetToTH FIXME: SM Higgs from T' is here?
           re.search(r'hminus', title_lower) or
           re.search(r'sms[-]?higgs', title_lower) or  # sms higgs
-          re.search(r'spin0to', title_lower) or
-          re.search(r'xxto', title_lower) or
           re.search(r'hplus', title_lower)):
         return 'Higgs Physics/Beyond Standard Model'
 
@@ -200,12 +194,6 @@
             return 'Higgs Physics/Standard Model'
             # FIXME gravitino going to SM Higgs ctegory.
 
-    elif ('_HInt_' in title   or
-          'ttHJetTo' in title or
-          'Hincl' in title or
-          'GluGluHTo' in title):   
-        return 'Higgs Physics/Standard Model'
-
     elif (re.search('GammaGammaTo(E|Mu|Tau)*_(Inel|Elastic|SingleDiss)', title) or  # gamma gamma -> mu+ mu- etc reactions which involve elastically scattered protons
                                                                                     # SingleDiss, means Single Diffractive Dissociation
           re.search('/singlediffractive[zw]?', title_lower) or
@@ -217,13 +205,7 @@
     elif (re.search('/minbias', title_lower)):
         return 'Standard Model Physics/Minimum Bias'
 
-    elif (re.search(r'gun', title_lower) or # particle gun
-          re.search(r'/single', title_lower) or
-          re.search(r'/double', title_lower) or
-          re.search(r'/muminus', title_lower) or
-          re.search(r'/muplus', title_lower) or
-          re.search(r'/doubleelectron', title_lower) or  # Is this an electron gun?
-          re.search(r'/singlepi', title_lower)): 
+    elif (re.search(r'gun', title_lower)):  # particle gun
         return 'Physics Modelling'
 
     elif re.search(r'/dy', title_lower):
@@ -259,21 +241,6 @@
           re.search(r'/wminusto', title_lower) or  # W- to
           re.search(r'/wmto', title_lower) or      # W- to
           re.search(r'/z*to', title_lower) or      # ZZ To
-          re.search(r'/eeg', title_lower) or
-          re.search(r'/photonindbkg', title_lower) or
-          re.search(r'/wgjjto', title_lower) or
-          re.search(r'/wzjto', title_lower) or
-          re.search(r'/zzj', title_lower) or
-          re.search(r'/wlljjto', title_lower) or
-          re.search(r'/wzjj', title_lower) or
-          re.search(r'/glugluwwto', title_lower) or
-          re.search(r'/mumug', title_lower) or
-          re.search(r'/vvto', title_lower) or
-          re.search(r'/wpwp', title_lower) or
-          re.search(r'/wmwm', title_lower) or
-          re.search(r'/wbjets', title_lower) or
-          re.search(r'/zllg', title_lower) or
-          re.search(r'/znunug', title_lower) or
           re.search(r'/[wz]to[emunu]*', title_lower)):  #  W/Z to E,Mu,Nu
         return 'Standard Model Physics/ElectroWeak'
 
@@ -293,15 +260,15 @@
           'tt_mtt-1000' in title_lower or
           'tt_mtt-700' in title_lower or
           re.search(r'/[tbar]*_.+_[stuw]+-channel', title_lower) or  # T_bla_s/t/u/w-channel
-          re.search(r'/ST_', title) or
-          re.search(r'/TTZTo', title) or
-          re.search(r'/tZq', title) or
-          re.search(r'/TTTo', title) or
-          re.search(r'/ttwjets', title_lower) or
-          re.search(r'/ttbb', title) or
           re.search(r'/t+_', title_lower)):
         return 'Standard Model Physics/Top physics'
 
+    elif (re.search(r'/muminus', title_lower) or
+          re.search(r'/muplus', title_lower) or
+          re.search(r'/doubleelectron', title_lower) or  # Is this an electron gun?
+          re.search(r'/singlepi', title_lower)):  # is this right? FIXME
+        return 'Standard Model Physics/Miscellaneous'
+
     elif ('Heavy-Ion Physics' in title or
           re.search('reggegribov_', title_lower)):
         return 'Heavy-Ion Physics'
@@ -316,10 +283,6 @@
           re.search(r'/bsto', title_lower) or
           re.search(r'/chib0', title_lower) or
           'etabto' in title_lower or  # Eta_b To
-          'InclusivebtoMu' in title or
-          'InclusivectoMu' in title or
-          'DsToTau' in title or
-          'DStar' in title or
           'xibstar0' in title_lower):
         return 'B physics and Quarkonia'
 

Do you think some of the above changes may be interesting to "replay" on top of your branch? Or have you covered everything already?
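
When replaying such changes, it may help to keep in mind that the categorisation is a first-match-wins chain, so where a pattern sits matters as much as whether it is present. In the diff above, for example, `/singlepi` is handled by the 'Physics Modelling' branch on one side and by a later 'Standard Model Physics/Miscellaneous' branch on the other. The following is a minimal sketch of that behaviour, not the actual script. The RULES table, the categorise function and the dataset titles are all made up for illustration, and every pattern is matched against the lower-cased title, whereas the real script mixes title and title_lower.

```python
import re

# Hypothetical, heavily reduced rule table. The real categorisation.py is a
# long if/elif chain with many more patterns per category.
RULES = [
    ("Physics Modelling", [r"gun", r"/single", r"/double"]),
    ("Standard Model Physics/Top physics", [r"/st_", r"/ttto", r"/t+_"]),
    ("Standard Model Physics/Miscellaneous", [r"/muminus", r"/muplus", r"/singlepi"]),
]


def categorise(title, rules=RULES):
    """Return the category of the first rule whose pattern matches, else None."""
    title_lower = title.lower()
    for category, patterns in rules:
        if any(re.search(pattern, title_lower) for pattern in patterns):
            return category
    return None


# Made-up dataset titles, for illustration only.
print(categorise("/SinglePiPt10/Example/AODSIM"))
print(categorise("/ST_t-channel_top/Example/AODSIM"))
# The same title lands in a different category once the rule order changes.
print(categorise("/SinglePiPt10/Example/AODSIM", rules=list(reversed(RULES))))
```

With the table as written, the first title is caught by `/single` and never reaches the `/singlepi` rule, while reversing the table sends it to 'Standard Model Physics/Miscellaneous'. Separately, in a pattern such as `/z*to` the `z*` part matches zero or more `z` characters, so it also matches any title containing `/to`.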

jmhogan and others added 3 commits October 26, 2023 16:24
Fixes several miscellaneous cases in the dataset categorisation.
Updates categorisation for 2016 simulated data.
Adds initial structure for CMS 2016 simulated dataset curation work.

Adds categorisation scripts.
@tiborsimko
Member

dataset lists in inputs (although it will be updated closer to the release)

Actually this was not part of the pull request files, so I'm merging without inputs, just the categorisation scripts and 2016 MD results....

Member

@tiborsimko tiborsimko left a comment

Thanks, rebased and amended as discussed to keep only categorisation changes. I have also added you to the list of authors with a link to your ORCID profile.

@tiborsimko tiborsimko merged commit 3611937 into cernopendata:master Oct 26, 2023
3 checks passed