EmptikBest wrote: ↑09/09/2023, 6:07
Greetings to all fellow members,
I gathered a bunch of databases (I merged some with "type", so there are probably a LOT of doubles) to create what I call the "Ultimate Database". It includes:
- "Complete-10min+6sec" from some website I can't remember :(
- "Complete-60min+15sec" from some website I can't remember :(
- Lichess Elite Database, thanks to nikonoel! (Note that it is 38GB uncompressed because doubles were not removed; I don't know how to do that)
- "Top40-1min-23.12.2022" from some website I can't remember :(
- "Turnier-NN-60+0.6_gesamt-03.06.2022" from some website I can't remember :(
Link:
https://pixeldrain.com/u/s2rtpS94
Do not be fooled by the 6.86GB compressed size (It took ~40 minutes to compress at maximum compression level using 7-Zip on 28 Threads and 24GB RAM), it is 61.8 GB uncompressed...
P.S.: If somebody could DM me about how to remove doubles from a PGN file, and how to merge files with something faster than "type", that would be great. Then I would upload a cleaned DB and probably add ICCF, FICS, etc.
Believe it or not, there wouldn't be many doubles. The CCRL, Chesscom, FGRL (which is what I think that is), and Lichess stuff are all separate entities on their own, so most of these would stay. The Caissabase is the Millionbase, the KingBase, TWIC and PGN Mentor combined, so I guess that would get you a lot.
You would need to use PGN Split, which can be found at the Lichess Open Database website, to break the PGN into pieces of perhaps 10 GB each, so you can open them in Scid. Each one should import in a few minutes (as opposed to ChessBase, which will take about an hour per gig), and you can search for "twins" (as they call them) in there, via the maintenance menu. There's a lot of stuff that you can do in Scid with the context menu that isn't readily apparent. If it finds doubles, it will automatically select all of them, at which point you should right-click and choose to negate the filter, so that it shows only the games that were not flagged as doubles. Then right-click and choose to copy the filter games to PGN. Afterward you can put the pieces back together.
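If you would rather script it than click through a GUI, here is a rough Python sketch of the same merge / remove doubles / split idea. To be clear, this is not how Scid finds twins and it is not the PGN Split tool: it is just an illustration, and every name in it (`iter_games`, `game_key`, `merge_dedupe_split`, the output file naming) is made up for the example. It assumes two games are doubles only when the White name, the Black name and the moves match exactly after comments, NAGs and move numbers are stripped, so the same game with differently spelled player names will slip through. It does not handle variations in parentheses, and it keeps one 20-byte hash per game in memory, which adds up over tens of millions of games.

```python
import hashlib
import re
from typing import Iterable, Iterator, List

# A PGN tag line looks like: [White "Carlsen, Magnus"]
# (a wrapped comment such as [%clk 0:03:00] does not match this pattern)
TAG_RE = re.compile(r'^\[\w+\s+"')


def iter_games(lines: Iterable[str]) -> Iterator[str]:
    """Yield one game (tags + movetext) at a time from a stream of PGN lines."""
    buf: List[str] = []
    seen_moves = False
    for line in lines:
        is_tag = bool(TAG_RE.match(line))
        # A tag line that follows movetext starts the next game.
        if is_tag and seen_moves:
            yield "".join(buf).strip() + "\n"
            buf, seen_moves = [], False
        if line.strip() and not is_tag:
            seen_moves = True
        buf.append(line)
    if "".join(buf).strip():
        yield "".join(buf).strip() + "\n"


def game_key(game: str) -> bytes:
    """Hash of players + normalized moves (assumption: exact match = double)."""
    tags = dict(re.findall(r'^\[(\w+)\s+"(.*)"\]\s*$', game, flags=re.M))
    moves = "\n".join(l for l in game.splitlines() if not TAG_RE.match(l))
    moves = re.sub(r"\{[^}]*\}", " ", moves)  # drop {comments}
    moves = re.sub(r"\$\d+", " ", moves)      # drop NAGs such as $1
    moves = re.sub(r"\d+\.+", " ", moves)     # drop move numbers: "12." "12..."
    moves = " ".join(moves.split())           # collapse whitespace
    raw = "|".join([tags.get("White", ""), tags.get("Black", ""), moves])
    return hashlib.sha1(raw.encode("utf-8")).digest()


def merge_dedupe_split(inputs: List[str], out_prefix: str,
                       games_per_file: int = 1_000_000) -> int:
    """Stream all inputs, skip repeated games, write numbered output pieces.

    Returns the number of unique games written.
    """
    seen = set()
    written = 0
    part = 0
    out = None
    for path in inputs:
        with open(path, encoding="utf-8", errors="replace") as src:
            for game in iter_games(src):
                key = game_key(game)
                if key in seen:
                    continue
                seen.add(key)
                # Start a new output piece every games_per_file games.
                if written % games_per_file == 0:
                    if out:
                        out.close()
                    part += 1
                    out = open(f"{out_prefix}_{part:03d}.pgn", "w",
                               encoding="utf-8")
                out.write(game + "\n")
                written += 1
    if out:
        out.close()
    return written
```

You would call it as something like `merge_dedupe_split(["a.pgn", "b.pgn"], "ultimate")` (file names are placeholders). It reads everything line by line, so it never loads a whole 60 GB file at once, and the numbered pieces it writes can then be imported into Scid for a proper twin check.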
One good thing about ChessBase, even though it's tough to get stuff in there, is that you can differentiate in two unique ways: you can filter out everything but the strong games, and you can add beauty scores and then filter based on those. That's what the Elegance DB is. Filtering that way gets rid of about 95% of the games.
Ultimately I tend to split DBs up according to OTB, Online, Corr, and Engine. This is sometimes important because if a database is being used as material for an opening book, the book author may not want to mix those different types. But there are lots of reasons to make a DB, and this sounds like a really interesting project.