Add Wikipedia as data source#167

Merged
TimidRobot merged 16 commits into
creativecommons:mainfrom
oree-xx:wikipedia
Oct 21, 2025

Conversation

@oree-xx

@oree-xx oree-xx commented Oct 10, 2025

Contributor

Fixes

Description

Added wikipedia_fetch.py to implement Wikipedia as a data source. It currently counts the number of articles across the different language instances of Wikipedia.
Wikipedia uses the Creative Commons Attribution-ShareAlike 4.0 license (CC BY-SA 4.0) as its primary license. We can fetch the following data:

  • Count of articles by language: This shows how widely CC BY-SA 4.0 is used across the different language instances of Wikipedia.
  • Count of articles by category (English Wikipedia): A breakdown of the use of CC BY-SA 4.0 by category.
  • Count of page views by category: We can identify the most viewed articles in each category.

Technical details

  • I used the structure of github_fetch.py to implement wikipedia_fetch.py. Similar functions are used, and I leveraged the code sample in the Wikipedia API documentation to structure the parameters for querying Wikipedia.
  • The parameters include rightsinfo, to get information about the license used, and statistics, to get the count of articles associated with that license.
  • To get the count of articles across all Wikipedia languages, I used the sitematrix API action.
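
A minimal sketch of the queries described above, not the code from this PR. It assumes the public MediaWiki Action API, where `rightsinfo` and `statistics` are values of `siprop` under `meta=siteinfo` and the language list comes from `action=sitematrix`. The function names, the endpoint template, and the `User-Agent` string are hypothetical, and treating a `closed` key as the marker for closed wikis is an assumption about the response shape.

```python
import json
import urllib.parse
import urllib.request

# Assumption: endpoint pattern of the public MediaWiki Action API.
API_URL_TEMPLATE = "https://{code}.wikipedia.org/w/api.php"


def build_siteinfo_params():
    """Query parameters asking for license info and site statistics."""
    return {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "rightsinfo|statistics",
        "format": "json",
    }


def parse_siteinfo(payload):
    """Extract the default license and article count from a siteinfo response."""
    query = payload.get("query", {})
    rights = query.get("rightsinfo", {})
    stats = query.get("statistics", {})
    return {
        "license_text": rights.get("text", ""),
        "license_url": rights.get("url", ""),
        "articles": int(stats.get("articles", 0)),
    }


def wikipedia_sites(sitematrix_payload):
    """Yield (language_code, site_url) for each open Wikipedia listed."""
    matrix = sitematrix_payload.get("sitematrix", {})
    for key, group in matrix.items():
        # Language groups use numeric string keys; "count" and "specials"
        # are skipped here.
        if not key.isdigit():
            continue
        for site in group.get("site", []):
            # Assumption: closed wikis carry a "closed" key.
            if site.get("code") == "wiki" and "closed" not in site:
                yield group.get("code", ""), site.get("url", "")


def fetch_json(url, params, timeout=30):
    """Hypothetical helper: perform one GET request and decode the JSON body."""
    request = urllib.request.Request(
        f"{url}?{urllib.parse.urlencode(params)}",
        headers={"User-Agent": "example-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

In this sketch, one sitematrix request yields the list of language sites, and one siteinfo request per site yields that site's default license and article count. Parsing is kept separate from fetching so it can be checked against a saved response without network access.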

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI.
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@TimidRobot TimidRobot left a comment

Member

This is a great start!

I expect that more meaningful data is possible (in addition to default license and total articles):

  • How do licenses differ across articles? Does the API expose multiple licenses or just the primary/default one?
  • How do licenses and counts compare across the different instances of Wikipedia (different languages)?
  • etc.

Comment thread scripts/1-fetch/wikipedia_fetch.py
Comment thread scripts/1-fetch/wiki_fetch.py Outdated
Comment thread scripts/1-fetch/Wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/Wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/Wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/Wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/Wikipedia_fetch.py Outdated
@oree-xx oree-xx marked this pull request as ready for review October 12, 2025 12:06
@oree-xx oree-xx requested review from a team as code owners October 12, 2025 12:06
@oree-xx oree-xx requested review from TimidRobot and possumbilities and removed request for a team October 12, 2025 12:07
@oree-xx

oree-xx commented Oct 14, 2025

Contributor Author

@TimidRobot I was trying to get the count of articles by category, but it seems a bit tricky because the category structure is hierarchical. I would have to loop through each subcategory recursively. Do you think I should give it a try?

@TimidRobot TimidRobot self-assigned this Oct 14, 2025
@TimidRobot

Member

@TimidRobot I was trying to get the count of articles by category, but it seems a bit tricky because the category structure is hierarchical. I would have to loop through each subcategory recursively. Do you think I should give it a try?

Please give an example of the categories.

@oree-xx

oree-xx commented Oct 14, 2025

Contributor Author

@TimidRobot These categories: https://en.wikipedia.org/wiki/Category:Main_topic_classifications.
But I have already done the count of articles by language in my PR.

@oree-xx oree-xx changed the title [WIP] Added wikipedia as data source Added wikipedia as data source Oct 14, 2025
@TimidRobot

Member

@TimidRobot These categories: https://en.wikipedia.org/wiki/Category:Main_topic_classifications. But I have already done the count of articles by language in my PR.

I don't think the categories provide meaningful information on how the CC Legal Tools are being used and can be skipped.

@TimidRobot TimidRobot changed the title Added wikipedia as data source Add wikipedia as data source Oct 15, 2025
@TimidRobot TimidRobot changed the title Add wikipedia as data source Add Wikipedia as data source Oct 15, 2025
@oree-xx

oree-xx commented Oct 15, 2025

Contributor Author

@TimidRobot Oh, okay. Since Wikipedia uses only one tool, do you have any idea how I could analyse the license? Maybe I can think in that direction.
Also, any comments on the count of articles by language?

Comment thread .pre-commit-config.yaml Outdated
Comment thread data/2025Q4/1-fetch/wikipedia_count_by_languages.csv Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py
@TimidRobot

Member

@TimidRobot Oh, okay. Since Wikipedia uses only one tool, do you have any idea how I could analyse the license? Maybe I can think in that direction. Also, any comments on the count of articles by language?

Count by language is good.

I don't currently have any ideas about further analysis. Please resolve outstanding comments and then I'll re-review.

@oree-xx

oree-xx commented Oct 15, 2025

Contributor Author

@TimidRobot I have made the changes. I also followed the instructions for running scripts. I don't quite understand what you meant when you said I should make the script executable.

@TimidRobot

TimidRobot commented Oct 15, 2025

Member

@TimidRobot I have made the changes. I also followed the instructions for running scripts. I don't quite understand what you meant when you said I should make the script executable.

@oree-xx

Please see:

If a file is not executable, you have to specify the interpreter yourself, for example:

pipenv run python ./scripts/1-fetch/github_fetch.py -h

An executable file can be run directly, for example:

pipenv run ./scripts/1-fetch/github_fetch.py -h

The first line (called the shebang) tells the shell what to use to execute it (the interpreter):

#!/usr/bin/env python
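
A minimal sketch of what "executable" means in practice, assuming a POSIX system. It checks for a shebang and sets the execute permission bits, roughly what `chmod a+x` does from a shell. The helper names and the example path in the usage note are hypothetical.

```python
import os
import stat


def has_shebang(path):
    """Return True if the file's first two bytes are '#!'."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"#!"


def make_executable(path):
    """Add execute permission for user, group, and others."""
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
```

For example, calling `make_executable("scripts/1-fetch/wikipedia_fetch.py")` on a file that passes `has_shebang` should allow it to be run directly, without naming the interpreter.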

@oree-xx

oree-xx commented Oct 15, 2025

Contributor Author

@TimidRobot Thank you for the explanation! I think I am able to do it now. I just made a push, please check if I did it right.

Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated

@TimidRobot TimidRobot left a comment

Member

Keep up the good work, nearly there.

Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/shared.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
@oree-xx

This comment was marked as outdated.

@TimidRobot TimidRobot left a comment

Member

Looking good!

Including the names of the languages in English will make reporting clearer.

Comment thread scripts/1-fetch/wikipedia_fetch.py
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/shared.py Outdated
Comment thread scripts/shared.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
@oree-xx

oree-xx commented Oct 20, 2025

Contributor Author

@TimidRobot I have made the changes.

@TimidRobot TimidRobot left a comment

Member

At this point it does everything it needs to. Now we're just making it easier to use.

Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
Comment thread scripts/1-fetch/wikipedia_fetch.py Outdated
@oree-xx

oree-xx commented Oct 20, 2025

Contributor Author

@TimidRobot I have updated the logic for when language_name_en is None or empty and I tested it.
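
The fallback itself is not shown in this conversation. A minimal sketch of one way to handle a missing `language_name_en`, assuming the native name and the language code are available as fallbacks; the function name, the `native_name` parameter, and the `"Unknown"` default are hypothetical.

```python
def display_language_name(language_code, language_name_en, native_name=""):
    """Pick the first usable label: English name, native name, then code."""
    for candidate in (language_name_en, native_name, language_code):
        # Treat None, empty, and whitespace-only strings as missing.
        if candidate and candidate.strip():
            return candidate.strip()
    return "Unknown"
```

Ordering the candidates this way keeps every row labeled in the output even when the English name is absent.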

@TimidRobot TimidRobot left a comment

Member

Fantastic work! Thank you 🙏🏻

@TimidRobot TimidRobot merged commit b023018 into creativecommons:main Oct 21, 2025
@github-project-automation github-project-automation Bot moved this from In review to Done in TimidRobot Oct 21, 2025
@oree-xx

oree-xx commented Oct 21, 2025

Contributor Author

@TimidRobot Thank you for the input as well!

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

Add Wikipedia as a new data source

3 participants